Building VoiceKick: modern desktop application

Graphical user interface combining voice stream and Whisper

Jan 11, 2025

When I started working on VoiceKick, one of the first decisions I faced was whether to build a terminal-based application (TUI) or a modern desktop application with a graphical user interface (GUI). Terminal interfaces are lightweight and often preferred by developers, but I realized a desktop application would be more convenient and accessible for a broader audience, especially non-technical users.

After deciding on a GUI, the next step was choosing the right framework to bring VoiceKick to life. This required striking a balance between performance, ease of development, and user experience. Below is a short video of the application; the source is in voicekick-dioxus.

Exploring the Framework Landscape

I researched several frameworks, and a few stood out as the most promising options for building modern desktop applications in Rust:

Tauri: lightweight and fast, using web technologies like HTML, CSS, and JavaScript for the frontend while leveraging Rust for the backend. It seemed like a natural choice for small, fast apps.
Dioxus: inspired by React, Dioxus offers a declarative approach to building GUIs in Rust. It felt more aligned with my preference for modern, component-driven designs.
Iced: a robust framework for building native GUIs, offering a clean API but somewhat limited in terms of styling and modern UI capabilities.
Slint: designed for building sleek, custom user interfaces, but its focus seemed more niche, aimed at applications requiring intricate visuals.

First iteration: Tauri

Initially, I chose Tauri, drawn by its promise of creating lightweight desktop apps with a strong Rust backend. The idea of building GUIs using familiar web technologies was appealing. I quickly set up the first version of VoiceKick, which allowed basic functionality like selecting input devices and displaying waveforms.

However, I ran into issues almost immediately. While Tauri is great for some use cases, debugging turned out to be a frustrating experience. After half a day of troubleshooting with minimal feedback or actionable insights, I decided it wasn’t worth the time investment, at least not for this project.

Switching to Dioxus

After leaving Tauri behind, I turned to Dioxus, which had some compelling advantages:

Extensive documentation: the documentation is comprehensive and developer-friendly, making it easier to troubleshoot and experiment.
Familiar concepts: it borrows from React’s declarative component model, which I found intuitive and efficient for building UIs.
Flexibility and performance: Dioxus reuses some of Tauri’s better ideas but focuses on performance and a more ergonomic developer experience.

With Dioxus, I quickly recreated the basic structure of VoiceKick and even added features without the earlier debugging headaches. The framework’s component-based architecture allowed me to iterate faster and stay organized.

Building VoiceKick’s core features

For the first iteration of the VoiceKick desktop application, I implemented two main pages:

Page 1: Voice configuration and waveforms

Input device selection: users can choose the audio input device (e.g., a specific microphone).
Voice detection threshold: a slider allows fine-tuning the voice detection threshold, balancing sensitivity and noise filtering.
Waveform visualization: the page displays real-time audio waveforms, giving users instant feedback on their input. This visualization is crucial for understanding how the app interprets sound.
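
To make the threshold and waveform ideas concrete, here is a minimal sketch in plain Rust. It is not VoiceKick's actual code: it assumes samples arrive as f32 values in the range -1.0 to 1.0, uses a simple RMS energy gate as a stand-in for whatever voice detection the app really performs, and the function names (rms, is_voice, waveform_peaks) are hypothetical.

```rust
/// Root-mean-square level of a chunk of samples (assumed range: -1.0..=1.0).
fn rms(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = samples.iter().map(|s| s * s).sum();
    (sum_sq / samples.len() as f32).sqrt()
}

/// Hypothetical voice gate: a chunk counts as voice when its RMS level
/// reaches the threshold chosen with the slider.
fn is_voice(samples: &[f32], threshold: f32) -> bool {
    rms(samples) >= threshold
}

/// Reduce a chunk to at most `buckets` (min, max) pairs, one pair per
/// column of a waveform display.
fn waveform_peaks(samples: &[f32], buckets: usize) -> Vec<(f32, f32)> {
    if samples.is_empty() || buckets == 0 {
        return Vec::new();
    }
    // Ceiling division so that every sample lands in some bucket.
    let size = (samples.len() + buckets - 1) / buckets;
    samples
        .chunks(size)
        .map(|chunk| {
            let min = chunk.iter().copied().fold(f32::INFINITY, f32::min);
            let max = chunk.iter().copied().fold(f32::NEG_INFINITY, f32::max);
            (min, max)
        })
        .collect()
}
```

A higher threshold rejects more background noise but may also cut off quiet speech, which is the trade-off the slider exposes. Drawing one (min, max) pair per column keeps the waveform cheap to render regardless of how many samples arrive per frame.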

Page 2: Whisper configuration

Model selection: users can select the Whisper model to use for transcription. The default is TinyEn, optimized for lightweight and fast transcription tasks.
Language settings: users can specify the language, ensuring accurate transcription for multilingual setups.
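
The two settings on this page could be modeled roughly as follows. This is a sketch under assumptions, not the types used in voicekick-dioxus: only the TinyEn default comes from the description above, while the other variant names, the WhisperSettings struct, and effective_language are hypothetical. It relies on the fact that Whisper's ".en" models are English-only, so the language setting only matters for multilingual models.

```rust
/// Whisper model choices. Only `TinyEn` is named in the post;
/// the other variants are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum WhisperModel {
    #[default]
    TinyEn,
    Tiny,
    Base,
    Small,
}

impl WhisperModel {
    /// The ".en" models only transcribe English.
    fn is_english_only(self) -> bool {
        matches!(self, WhisperModel::TinyEn)
    }
}

/// Hypothetical settings backing the Whisper configuration page.
#[derive(Debug, Clone, PartialEq, Eq)]
struct WhisperSettings {
    model: WhisperModel,
    language: String,
}

impl Default for WhisperSettings {
    fn default() -> Self {
        Self {
            model: WhisperModel::default(),
            language: "en".to_string(),
        }
    }
}

impl WhisperSettings {
    /// Language actually passed to transcription: English-only models
    /// ignore the user's language choice.
    fn effective_language(&self) -> &str {
        if self.model.is_english_only() {
            "en"
        } else {
            &self.language
        }
    }
}
```

Keeping the model and language in one struct makes it easy to catch mismatches, such as a non-English language paired with an English-only model, before transcription starts.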

Both pages focus on simplicity and functionality, ensuring the app is user-friendly and practical for real-world use cases.

Summary

Switching frameworks early in development was an easy decision and ultimately the right one. Here are a few takeaways from the process:

1. Start simple: the first version contains a bunch of unwrap and expect calls; it just needs to work. This helped me focus on core features before worrying about polish.

2. Prioritize DevEx: a framework that simplifies debugging and iteration saves significant time and frustration.

3. Flexibility matters: choosing a flexible framework like Dioxus made it easier to adapt as the project evolved.

4. Keep the user in mind: choosing a GUI over a TUI may have been more work initially, but it made VoiceKick more accessible to a wider audience.
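
As an illustration of the first takeaway, the sketch below contrasts the quick expect-based style with a later, more polished version that returns a Result. The example is hypothetical and not taken from VoiceKick: parsing a threshold from text, the 0.0 to 1.0 range, and the ThresholdError type are all assumptions made for the demonstration.

```rust
use std::fmt;
use std::num::ParseFloatError;

/// Hypothetical error type for the polished version.
#[derive(Debug)]
enum ThresholdError {
    NotANumber(ParseFloatError),
    OutOfRange(f32),
}

impl fmt::Display for ThresholdError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ThresholdError::NotANumber(e) => write!(f, "not a number: {e}"),
            ThresholdError::OutOfRange(v) => write!(f, "{v} is outside 0.0..=1.0"),
        }
    }
}

impl std::error::Error for ThresholdError {}

// Lets the `?` operator convert parse errors automatically.
impl From<ParseFloatError> for ThresholdError {
    fn from(e: ParseFloatError) -> Self {
        ThresholdError::NotANumber(e)
    }
}

/// First-iteration style: fine for a prototype, but panics on bad input.
fn parse_threshold_quick(raw: &str) -> f32 {
    raw.trim().parse().expect("threshold should be a number")
}

/// Polished style: the caller decides how to handle bad input.
fn parse_threshold(raw: &str) -> Result<f32, ThresholdError> {
    let value: f32 = raw.trim().parse()?;
    if (0.0..=1.0).contains(&value) {
        Ok(value)
    } else {
        Err(ThresholdError::OutOfRange(value))
    }
}
```

Starting with the quick version and migrating to the Result-based one later is a small, mechanical change, which is why deferring it rarely hurts in an early prototype.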
