Building VoiceKick: voice stream
Capturing voice from the microphone’s real-time sound stream
This is going to be a series about building VoiceKick. As the name hints, it has something to do with voice and sidekicks, but I will get to that later in the series. Before we get into the details, let’s pause to explore the foundation of it all: voice, sound, silence, and noise.
Silence is often misunderstood. It’s not the total absence of sound but the deliberate absence of unnecessary noise. It’s the calm before a word is spoken, the void that invites attention, and the space where meaning can take root. Silence is not emptiness - it’s potential.
Sound is the canvas on which voice paints. It is vibration made perceptible, ranging from the subtle rustle of leaves to the thunder of a jet engine. Sound is neutral, neither inherently meaningful nor distracting until it’s given context.
Noise, by contrast, is a sound that competes for attention but offers no substance. It’s chaotic, disorganized, and disruptive - a signal without a purpose. Noise drowns clarity, masking what truly matters and creating a barrier to understanding.
Voice, however, is sound imbued with intent. It’s a uniquely human ability to transform air and vibration into communication, emotion, and expression. Voice is the sound we listen for, the one that cuts through the chaos of noise and stands out in a sea of silence. It carries purpose and connection, making it far more than just a sound - it’s a message.
Capturing voice brings the interplay of these elements - silence, sound, noise, and voice - to the center. It’s about capturing the essence of voice in a world filled with noise, finding clarity in chaos, and using sound not just as a medium but as a tool for action.
This series will take you through the journey of building VoiceKick, exploring how these elements create something meaningful.
To move from the conceptual to the technical, let’s see how we can capture and process a stream of sound. Capturing voice is about harnessing sound while minimizing noise and framing it within the constraints of available hardware. This is where tools like the cpal library come into play, providing a flexible and powerful way to interact with audio devices.
An audio stream consists of raw audio data - waves of sound sampled at specific rates to create a digital representation. The quality of this stream depends on several factors (see the configuration sketch after this list):
Sample rate: how often the sound is measured per second (e.g., 16 kHz, 48 kHz).
Channels: number of audio streams (e.g., mono or stereo).
Buffer size: the chunk of data processed at a time, influencing latency and performance.
Format: the data type for audio samples, such as floating-point values (F32).
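These parameters map directly onto cpal’s StreamConfig. As a minimal sketch, a speech-oriented configuration might look like the following; mono, 16 kHz, and a fixed 512-frame buffer are one plausible choice, not the only valid one:

```rust
use cpal::{BufferSize, SampleRate, StreamConfig};

/// One plausible stream configuration for speech capture:
/// mono, 16 kHz, with a fixed 512-frame buffer (roughly 32 ms of audio).
fn speech_stream_config() -> StreamConfig {
    StreamConfig {
        channels: 1,
        sample_rate: SampleRate(16_000),
        buffer_size: BufferSize::Fixed(512),
    }
}
```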
On average, one word in speech takes about 250 to 300 milliseconds (0.25 to 0.3 seconds) to pronounce. This estimate varies depending on the language, speech rate, and the type of words being spoken. Key factors influencing the duration are:
Average speech rate:
The average rate of speech for English speakers is around 120 to 160 words per minute (wpm).
At 150 wpm, each word would take roughly 400 milliseconds (or 0.4 seconds).
At 160 wpm, each word would take about 375 milliseconds.
Factors affecting the duration of a word:
Shorter words (like “the” and “is”) take less time to pronounce, typically around 200-300 milliseconds.
Longer words (like “computer” and “organization”) may take closer to 500 milliseconds or more.
Audio sampling considerations:
In terms of audio processing, at a typical sample rate (e.g., 16 kHz), each second of audio contains 16,000 samples.
For a word lasting about 300 milliseconds, we’d have roughly 4,800 samples (300 ms × 16 kHz = 4,800 samples).
These estimates provide a good baseline for real-time audio processing and for working with speech-related data.
VoiceKick will use these foundational capabilities to capture voice streams efficiently. By leveraging cpal, it ensures compatibility across a wide range of devices - from noise-cancelling headphones to built-in microphones and wireless earbuds. The next step is designing a pipeline to process the captured data, distinguishing voice from noise, and turning raw sound into actionable insights.
Before we can get to voice, we must first capture the incoming sound to prepare it in a form suitable for voice analysis. This involves a series of transformations to ensure the sound data is clear, uniform, and compatible with downstream processing like speech recognition or audio analysis.
Capturing sound from the microphone
Microphones provide raw audio data in various formats depending on the device. These formats include differences in:
Sample format: integer or floating-point representation.
Sample rate: the frequency at which sound is digitized (e.g., 16,000 Hz, 44,100 Hz).
Channels: mono (1 channel) or stereo (2 or more channels).
The cpal library acts as a bridge between the application and the hardware, abstracting the complexities of device-specific configurations. Using cpal, we can (see the sketch after this list):
Query supported configurations for each device.
Select optimal parameters for capturing a clean voice stream.
Handle platform-specific nuances seamlessly.
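As a rough sketch, querying the default input device and printing its supported configurations looks something like this (error handling kept deliberately simple):

```rust
use cpal::traits::{DeviceTrait, HostTrait};

/// Print what the default input device supports: a rough sketch of the kind
/// of query behind the device listings shown below.
fn main() {
    let host = cpal::default_host();
    let device = host
        .default_input_device()
        .expect("no input device available");

    println!("Device: {}", device.name().unwrap_or_default());

    let configs = device
        .supported_input_configs()
        .expect("failed to query supported configurations");

    for config in configs {
        println!(
            "channels: {}, sample rate: {} Hz - {} Hz, buffer size: {:?}, format: {:?}",
            config.channels(),
            config.min_sample_rate().0,
            config.max_sample_rate().0,
            config.buffer_size(),
            config.sample_format(),
        );
    }
}
```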
For example, here are the device settings cpal detects for three devices: noise-cancelling headphones, a MacBook Pro M1, and Apple AirPods.
Noise-cancelling headphones
Channels: 1 (Mono)
Sample Rate: Fixed at 16,000 Hz
Buffer Size: Ranges from 5 to 4096
Format: Floating Point (F32)
With its low, fixed sample rate and compact buffer sizes, this setup is tailored for voice clarity, making it ideal for speech-focused applications.
MacBook Pro M1
Channels: 1 (Mono)
Sample Rates: Supported rates of 44,100 Hz, 48,000 Hz, 88,200 Hz, and 96,000 Hz
Buffer Size: Ranges from 15 to 4096
Format: Floating Point (F32)
The MacBook Pro offers higher sample rates, making it versatile for both voice and high-fidelity sound capture.
Apple AirPods
Channels: 1 (Mono)
Sample Rate: Fixed at 24,000 Hz
Buffer Size: Ranges from 8 to 4096
Format: Floating Point (F32)
AirPods strike a balance with their mid-range sample rate, optimized for clear voice input and wireless transmission.
Converting samples into F32 format
Speech recognition models typically work with 32-bit floating-point (F32) audio data. Many devices, however, output integer formats like 16-bit PCM (Pulse Code Modulation). The first step is to normalize these values into the floating-point range of -1.0 to 1.0. This ensures:
Consistent data representation.
Sufficient precision for audio processing.
Compatibility with machine learning models.
This is easily done with an IntoF32 trait.
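The trait itself isn’t shown here, so the following is a minimal sketch of what an IntoF32 conversion trait could look like; the actual trait in VoiceKick may be shaped differently:

```rust
/// Convert a raw sample into a normalized f32 in the range [-1.0, 1.0].
pub trait IntoF32 {
    fn into_f32(self) -> f32;
}

impl IntoF32 for i16 {
    fn into_f32(self) -> f32 {
        // i16 ranges from -32768 to 32767; dividing by 32768 maps the most
        // negative sample exactly to -1.0.
        f32::from(self) / 32_768.0
    }
}

impl IntoF32 for f32 {
    fn into_f32(self) -> f32 {
        // Already floating point; pass through unchanged.
        self
    }
}
```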
Resampling to 16,000 Hz
Microphones output audio at various sample rates, often much higher than what’s needed for speech recognition. Speech recognition models typically operate at 16,000 Hz (16 kHz) because:
It strikes a balance between clarity and efficiency.
Human speech does not typically exceed 8 kHz in frequency, so 16 kHz covers all necessary details.
Most ML models are trained on this standard rate.
Resampling involves interpolating or downsampling the audio to 16,000 Hz while preserving its original characteristics. Libraries like rubato in Rust handle this efficiently.
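Here is a rough sketch of converting a 48 kHz mono chunk down to 16 kHz with rubato’s FftFixedIn. The exact constructor and process signatures vary slightly between rubato versions, and in a real pipeline the resampler would be created once and reused rather than rebuilt per chunk:

```rust
use rubato::{FftFixedIn, Resampler};

/// Resample one mono chunk from 48 kHz down to 16 kHz.
/// Illustrative only: in practice, create the resampler once and reuse it.
fn resample_to_16k(chunk_48k: &[f32]) -> Vec<f32> {
    let mut resampler = FftFixedIn::<f32>::new(
        48_000,          // input sample rate
        16_000,          // output sample rate
        chunk_48k.len(), // frames per input chunk
        2,               // sub-chunks per FFT
        1,               // mono
    )
    .expect("valid resampler configuration");

    // rubato operates on per-channel buffers, so wrap the mono chunk in a Vec.
    let mut output = resampler
        .process(&[chunk_48k.to_vec()], None)
        .expect("resampling failed");

    output.remove(0)
}
```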
Converting Multi-Channel to Mono
If the input is stereo (2 channels) or has even more channels, it must be down-mixed to mono. This simplifies processing and ensures uniformity, as speech recognition typically only requires one channel of audio. A common approach is to average the channels out.
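Down-mixing is simple enough to do by hand; a minimal sketch for interleaved samples:

```rust
/// Average each interleaved frame down to a single mono sample.
/// Assumes `samples.len()` is a multiple of `channels`.
fn downmix_to_mono(samples: &[f32], channels: usize) -> Vec<f32> {
    samples
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}
```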
Buffering Samples
Once the audio data is formatted as floating-point, resampled to 16 kHz, and converted to mono, it is preferable to buffer the samples into chunks suitable for real-time processing. For speech recognition, short samples (10–100 ms) are ideal. This corresponds to buffer sizes of 160 to 1600 samples at 16 kHz.
We buffer at least 512 samples, corresponding to 32 ms of audio at 16 kHz, as this size aligns nicely with and is preferred by the Silero voice detection model (a buffering sketch follows this list). This is sufficient for real-time use cases like:
Voice recognition: short samples are ideal for identifying speech patterns quickly.
Segmentation: breaking down continuous speech into manageable chunks.
Spectral Analysis: applying Short-Time Fourier Transform (STFT) or other techniques to analyze sound frequencies over time.
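Here is a minimal sketch of such a 512-sample frame buffer; the type and method names are illustrative rather than VoiceKick’s actual API:

```rust
use std::collections::VecDeque;

/// 512 samples at 16 kHz is roughly 32 ms of audio.
const FRAME_SIZE: usize = 512;

struct FrameBuffer {
    pending: VecDeque<f32>,
}

impl FrameBuffer {
    fn new() -> Self {
        Self { pending: VecDeque::new() }
    }

    /// Push freshly captured samples and drain any complete frames.
    fn push(&mut self, samples: &[f32]) -> Vec<Vec<f32>> {
        self.pending.extend(samples.iter().copied());
        let mut frames = Vec::new();
        while self.pending.len() >= FRAME_SIZE {
            frames.push(self.pending.drain(..FRAME_SIZE).collect());
        }
        frames
    }
}
```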
Detecting voice
The steps above are crucial for real-time applications because they address clarity, compatibility, and efficiency. Noise and inconsistencies in raw audio can confuse speech recognition models, so converting samples to the F32 format, resampling to a standard 16 kHz rate, and downmixing to mono ensure clean and normalized data for processing. Additionally, many speech recognition models and spectral analysis tools are designed to work with this standard format, making it essential for compatibility. By buffering smaller chunks of audio, latency is minimized, enabling systems like VoiceKick to respond quickly and effectively in real-time scenarios.
To filter actual voice samples from a continuous audio stream, we need to distinguish between noise, silence, and true voice activity. This is where VoiceKick employs a two-model approach: WebRTC VAD for detecting noise and Silero VAD for detecting voice. Each model operates on specific assumptions and requirements, which we integrate to create a robust pipeline for voice activity detection.
Step 1: Converting samples for WebRTC VAD
The WebRTC voice activity detection model is designed for low-latency environments and operates at an 8 kHz sample rate, requiring only 240 samples per evaluation. Since our input stream is at 16 kHz, we need to downsample the audio to 8 kHz. This step reduces computational overhead and aligns the sample rate with the model’s requirements while preserving the essential characteristics of sound and noise.
The WebRTC VAD’s primary role is to detect any sound - whether it’s noise, voice, or a mixture of both. By identifying areas of potential activity, we can avoid processing large portions of silence, making the pipeline more efficient.
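Because the stream is already at 16 kHz, the 8 kHz input for WebRTC VAD can be produced with simple decimation. The naive sketch below keeps every other sample, which is fine as an illustration; a proper resampler or a low-pass filter before decimation avoids aliasing:

```rust
/// Naive 16 kHz -> 8 kHz downsampling by keeping every other sample.
fn downsample_16k_to_8k(samples_16k: &[f32]) -> Vec<f32> {
    samples_16k.iter().step_by(2).copied().collect()
}
```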
Step 2: Using WebRTC and Silero VAD Together
While WebRTC VAD identifies sound, it does not differentiate between voice and non-voice noises (e.g., background chatter, keyboard clicks, or wind). This is where Silero VAD comes in. Silero is a neural network-based model specifically designed to predict the likelihood that a given sample contains a human voice. Silero works natively with both 16 kHz and 8 kHz sample rates, allowing us to reuse the original 16 kHz stream without additional downsampling. The model returns a voice prediction score between 0 and 1. Based on empirical testing, a threshold of 0.01 has proven effective for detecting voice while minimizing false positives in noisy environments.
By combining the outputs of these models:
WebRTC VAD ensures we only process regions with “noise”.
Silero VAD confirms whether the detected “noise” is actually human voice.
Step 3: Handling detected noise and voice
Once both models have evaluated a sample batch, the results are matched into a tuple (is_noise, is_voice) to determine the next step (sketched in code after the list below):
1. (true, true): Sound and voice are detected.
The audio is classified as containing voice activity.
These samples are accumulated in a buffer, which stores voice segments until a logical boundary (e.g., silence) is reached.
2. (true, false): Sound but no voice is detected.
The audio is classified as noise.
The buffer is cleared, as noise cannot contribute to meaningful voice data.
3. (false, _): Silence is detected.
Silence acts as a logical separator in the stream.
If the buffer contains accumulated voice samples, they are returned as a completed segment.
If the buffer is empty, no action is taken.
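Here is a minimal sketch of that decision logic; the enum and function names are illustrative assumptions rather than VoiceKick’s actual API:

```rust
/// Outcome of processing one frame through both VAD models.
enum VadOutcome {
    /// A completed voice segment, returned when silence follows accumulated voice.
    Segment(Vec<f32>),
    /// Nothing to emit yet.
    Pending,
}

fn handle_frame(
    voice_buffer: &mut Vec<f32>,
    frame: &[f32],
    is_noise: bool,
    is_voice: bool,
) -> VadOutcome {
    match (is_noise, is_voice) {
        // Sound and voice detected: accumulate the frame.
        (true, true) => {
            voice_buffer.extend_from_slice(frame);
            VadOutcome::Pending
        }
        // Sound but no voice: treat as noise and discard anything buffered.
        (true, false) => {
            voice_buffer.clear();
            VadOutcome::Pending
        }
        // Silence: if voice was accumulated, return it as a finished segment.
        (false, _) => {
            if voice_buffer.is_empty() {
                VadOutcome::Pending
            } else {
                VadOutcome::Segment(std::mem::take(voice_buffer))
            }
        }
    }
}
```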
Why this approach works
This combination of noise and voice detection creates a robust filtering mechanism:
Efficiency: WebRTC VAD reduces the processing burden by skipping silent regions entirely.
Accuracy: Silero VAD focuses on identifying actual human voice, ensuring that the output is relevant for speech-based applications.
Noise resilience: by clearing the buffer when noise is detected, we avoid mixing voice with irrelevant sound, preserving the quality of the extracted voice segments.
Flexibility: the use of thresholds and model combinations allows the system to adapt to various environments, from quiet rooms to noisy coffee shops.
Shortcomings of This Approach
While the combination of WebRTC VAD and Silero VAD offers a robust solution for filtering voice activity, there are inherent shortcomings:
Speech modulation or pauses clearing the buffer
Speech often includes natural modulations, such as changes in tone, pitch, and intensity, as well as brief pauses between words or phrases. These can occasionally cause:
WebRTC VAD to misclassify quieter speech as silence.
The buffer to clear prematurely, cutting off voice segments in the middle of a sentence or thought.
This is particularly problematic for speech that includes significant dynamics, like expressive conversations or storytelling, where pauses are integral to communication.
Background noise during pauses
In noisy environments, background noise (e.g., traffic, office chatter) might be classified as “sound” by WebRTC VAD, even if it contains no meaningful voice. This can disrupt the buffer-clearing mechanism, leading to:
Unintended noise accumulation in the buffer.
Difficulty in cleanly segmenting voice from environmental noise.
Latency in segment completion
The approach relies on detecting silence to finalize and return buffered voice segments. In cases where the silence threshold isn’t met, the buffer might accumulate more samples than necessary, introducing latency in returning the processed voice segment.
Threshold sensitivity
The empirically tested 0.01 threshold for Silero VAD may work well in most environments but could fail in edge cases:
Too high a threshold may miss softer voices.
Too low a threshold may allow non-voice sounds to be misclassified as speech.
Resource utilization
Running two VAD models simultaneously (WebRTC for noise and Silero for voice) can be computationally expensive in real-time systems, especially on resource-constrained devices. Optimizations may be required to reduce this overhead.
Possible Mitigations
To address these shortcomings, several improvements can be considered:
Buffer management enhancements: implementing a mechanism to preserve the buffer during natural speech pauses, perhaps by setting a minimum voice duration before the buffer is cleared (a small sketch follows this list).
Dynamic thresholding: adapting the Silero VAD threshold based on the environment’s noise level, which could be assessed dynamically using WebRTC VAD.
Noise suppression preprocessing: introducing a noise suppression step before VAD evaluation to improve the accuracy of both WebRTC and Silero predictions.
Hybrid silence detection: combining WebRTC VAD with energy-based silence detection (measuring amplitude levels) to reduce false positives and improve segmentation.
Model consolidation: exploring single-model solutions that can handle both noise and voice detection, reducing computational overhead.
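As a sketch of the first mitigation above, the accumulator below tolerates a short run of silent frames (a grace period) before it finalizes a segment, so natural pauses don’t cut a sentence in half. The 10-frame grace period (about 320 ms at 32 ms per frame) is an assumed tuning value, not something tested in VoiceKick:

```rust
/// Tolerate short pauses: only finalize after this many consecutive silent frames.
const GRACE_FRAMES: usize = 10; // ~320 ms at 32 ms per frame (assumed tuning value)

struct VoiceAccumulator {
    buffer: Vec<f32>,
    silent_frames: usize,
}

impl VoiceAccumulator {
    fn new() -> Self {
        Self { buffer: Vec::new(), silent_frames: 0 }
    }

    /// Voice detected: keep accumulating and reset the silence counter.
    fn on_voice(&mut self, frame: &[f32]) {
        self.silent_frames = 0;
        self.buffer.extend_from_slice(frame);
    }

    /// Silence detected: only return a finished segment once the pause
    /// has lasted longer than the grace period.
    fn on_silence(&mut self) -> Option<Vec<f32>> {
        self.silent_frames += 1;
        if self.silent_frames >= GRACE_FRAMES && !self.buffer.is_empty() {
            self.silent_frames = 0;
            Some(std::mem::take(&mut self.buffer))
        } else {
            None
        }
    }
}
```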
Voice stream
Finally, the samples are returned as a segment of audio that has been filtered for noise, silence, and irrelevant sound. This segment represents meaningful speech, ready for downstream processing. Whether it’s fed into a speech recognition system, analyzed for voice-based commands, or used for real-time feedback, the returned buffer serves as a reliable and high-quality representation of human voice activity. This ensures that subsequent applications - whether for transcription, feature extraction, or command execution - can focus entirely on relevant data without the distractions of noise or gaps in the voice stream.