Building VoiceKick: voice stream + Whisper
Combining a voice stream with the OpenAI Whisper model through logit manipulation
When it comes to speech recognition and natural language processing, achieving precision in transcriptions and responses is as much an art as it is a science. In voice-whisper, one of the key techniques for refining outputs and ensuring accurate recognition is logit boosting and penalizing. This nuanced approach transforms raw predictions into meaningful, structured results, making it a cornerstone of the implementation.
What are logits?
Before diving into boosting, let’s start with logits. In machine learning, logits represent the raw, unnormalized scores a model produces for each possible token in its vocabulary. These scores indicate the model’s confidence in predicting a specific token. However, without intervention, logits can be influenced by:
• Repetition bias: models may over-predict certain tokens, such as prevalent words or patterns.
• Noise: unwanted tokens (like filler words or punctuation) may overshadow relevant ones.
• Uncertainty: subtle nuances in speech or ambiguous contexts can dilute predictions.
This is where boosting and penalizing come into play: fine-tuning these logits to emphasize or suppress specific tokens based on context.
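To make the effect concrete, here is a minimal, self-contained illustration (not voice-whisper code) of why nudging a logit changes the outcome: softmax converts logits into probabilities, so adding a constant to one logit shifts probability mass toward that token.

// Minimal illustration: softmax turns raw logits into probabilities,
// so adding a constant to one logit raises its share of the mass.
fn softmax(logits: &[f32]) -> Vec<f32> {
    // Subtract the max logit for numerical stability before exponentiating.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

fn main() {
    let mut logits = vec![2.0, 1.0, 0.5];
    println!("{:?}", softmax(&logits)); // ~[0.63, 0.23, 0.14]
    logits[2] += 2.0; // boost the third token's logit
    println!("{:?}", softmax(&logits)); // ~[0.33, 0.12, 0.55]
}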
Logit boosting in Whisper
voice-whisper employs logit boosting to direct the model’s focus during decoding. By strategically adjusting the scores of specific tokens, the system ensures:
1. Clarity: amplifying tokens relevant to the task (e.g. numbers, commands).
2. Relevance: suppressing noise-like tokens or unwanted sequences.
3. Adaptability: tailoring outputs based on predefined priorities (e.g., penalizing overused tokens or boosting domain-specific terms).
In Whisper, boosting is applied during the masking phase of decoding. Here’s how it works:
1. Boost tokens for specific contexts
Certain tasks require prioritizing specific words or phrases. For example:
• Numbers like “one,” “two,” and “three” are critical in transcriptions of commands or sequences.
• Action-oriented terms like “start,” “stop,” “left,” and “right” are essential for voice-driven interfaces.
Applying a boost value to these tokens increases their logits, making them more likely to be selected during decoding. In the masking phase this amounts to writing boost_value into the mask tensor m at each boosted token’s position:
for &token in self.boost_tokens.iter().filter(|&&t| t < dims1) {
    // Write the boost value into the mask at this token's position,
    // raising its logit so it is more likely to be selected.
    m = m.slice_assign(
        &[token as usize..=token as usize],
        &Tensor::new(&[boost_value], mel.device())?,
    )?;
}
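The boost list itself is just a set of token ids. Here is a hypothetical sketch of how it might be assembled with the tokenizers crate (token_ids_for is an illustrative helper, not part of voice-whisper); note that a single word can encode to several sub-word tokens, all of which get boosted:

use tokenizers::Tokenizer;

// Hypothetical helper: resolve a word list to the token ids Whisper uses.
fn token_ids_for(tokenizer: &Tokenizer, words: &[&str]) -> Vec<u32> {
    words
        .iter()
        .flat_map(|w| {
            tokenizer
                .encode(*w, false) // no special tokens
                .map(|enc| enc.get_ids().to_vec())
                .unwrap_or_default()
        })
        .collect()
}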
2. Suppress noise and unwanted tokens
On the flip side, tokens that frequently cause noise or detract from clarity (e.g., excessive punctuation, filler words) are penalized so they cannot win during decoding. Here the mask removes them outright:
for &token in self.penalty_tokens.iter().filter(|&&t| t < dims1) {
    // Set the logit to negative infinity: after softmax this token's
    // probability is exactly zero, so it can never be emitted.
    m = m.slice_assign(
        &[token as usize..=token as usize],
        &Tensor::new(&[f32::NEG_INFINITY], mel.device())?,
    )?;
}
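Note the asymmetry between the two passes: boost tokens receive a finite nudge through boost_value, while penalty tokens are hard-masked to negative infinity, which drives their post-softmax probability to exactly zero. If a token should merely be discouraged rather than eliminated, a finite negative value could be written into the mask instead.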
3. Dynamic adjustments
voice-whisper also employs repetition penalties, dynamically reducing the logits of tokens that appear too frequently. This prevents the model from “hallucinating” repeated phrases or getting stuck on specific tokens.
if *frequency > repetition_frequency {
    // The penalty compounds with every occurrence beyond the threshold.
    let penalty = repetition_penalty.powi(*frequency as i32);
    // Push the logit toward "less likely" regardless of its sign:
    // multiply negative logits, divide positive ones.
    logits_vec[token_idx] = if logits_vec[token_idx] < 0.0 {
        logits_vec[token_idx] * penalty
    } else {
        logits_vec[token_idx] / penalty
    };
}
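Pulled out of the decoder, the same rule can be sketched as a self-contained function over a plain logits buffer. This is a minimal sketch, not the voice-whisper API; the function name and parameters are illustrative:

use std::collections::HashMap;

// Apply a compounding repetition penalty across a plain logits buffer.
fn apply_repetition_penalty(
    logits_vec: &mut [f32],
    generated_tokens: &[u32],
    repetition_penalty: f32,     // > 1.0; larger values penalize harder
    repetition_frequency: usize, // occurrences tolerated before penalizing
) {
    // Count how often each token has already been emitted.
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &t in generated_tokens {
        *counts.entry(t).or_insert(0) += 1;
    }
    for (&token, &frequency) in &counts {
        if frequency > repetition_frequency {
            let penalty = repetition_penalty.powi(frequency as i32);
            let token_idx = token as usize;
            if token_idx < logits_vec.len() {
                // Push the logit toward "less likely" regardless of sign.
                logits_vec[token_idx] = if logits_vec[token_idx] < 0.0 {
                    logits_vec[token_idx] * penalty
                } else {
                    logits_vec[token_idx] / penalty
                };
            }
        }
    }
}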
Why boosting works
Logit boosting aligns the model’s raw predictions with the desired output structure. By amplifying relevant tokens and suppressing noise:
• Speech commands become more reliable.
• Transcriptions gain higher accuracy, especially in noisy environments.
• The model becomes context-aware, prioritizing terms relevant to the application.
This method empowers Whisper to function seamlessly in real-time systems, where precision and efficiency are paramount.
Practical applications of logit boosting
• Command recognition: boosting ensures action words like “stop,” “go,” “left,” and “right” are accurately captured in voice-controlled systems.
• Numerical transcription: amplifying numbers improves the model’s ability to transcribe sequences or perform calculations.
• Multilingual support: language-specific tokens are boosted to guide the model when decoding multilingual inputs.
• Real-time adaptability: boosting can be dynamically adjusted based on user preferences or environmental conditions, as sketched below.
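As an illustration of that adaptability (all names here are hypothetical, not voice-whisper API), the boost vocabulary can simply be swapped per application context before the token ids are resolved:

// Hypothetical sketch: pick a boost vocabulary per application context.
enum AppContext {
    Commands,
    Numbers,
}

fn boost_words(ctx: &AppContext) -> &'static [&'static str] {
    match ctx {
        // Action words for voice-driven interfaces.
        AppContext::Commands => &["start", "stop", "left", "right"],
        // Digit words for transcribing sequences.
        AppContext::Numbers => &["one", "two", "three", "four", "five"],
    }
}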
Striking a balance
While boosting enhances precision, it’s important to strike the right balance. Over-boosting can lead to:
• Bias: forcing the model to prioritize irrelevant tokens in certain contexts.
• Rigidity: reducing the model’s flexibility and creative problem-solving ability.
By carefully tuning boost values and penalties, voice-whisper maintains a balance between structured outputs and adaptability.
Conclusion
Logit boosting is more than just a tweak; it’s a strategic approach to decoding that amplifies Whisper's strengths while addressing its weaknesses. By emphasizing clarity, relevance, and adaptability, boosting transforms raw logits into polished, context-aware outputs, making voice-whisper a powerful tool for real-time speech recognition and beyond.