Realtime Voice Assistant
Source: 11_realtime_voice_assistant
A voice assistant that transcribes microphone audio with local Whisper and sends each segment to a language model for a response. Two entrypoints let you choose between a remote LLM (streaming tokens) and a fully local Mistral 7B (no API key needed).
Running
With a remote LLM (GPT-4o):
melodium run 11_realtime_voice_assistant/Compo.toml --openai_key sk-...
Fully local (no API key, requires ~14 GB RAM):
melodium run 11_realtime_voice_assistant/Compo.toml localonly
[…] info: assistant: ready — speak into the microphone
[…] info: you: What time is it in Tokyo?
[…] info: assistant: Tokyo is in Japan Standard Time (JST), which is UTC+9…
How it works
Both entrypoints share the same Whisper loading sequence. The difference lies only in which LLM backend is used downstream.
main — local Whisper + remote LLM
Each transcribed segment fans out to two consumers simultaneously:
asrDecode.transcribed -> logQuestion.messages
asrDecode.transcribed -> remoteAnswer.question
remoteAnswer uses llmStream, which emits tokens one by one as a Stream<string>. Each token is printed to the log as it arrives, without waiting for the full response.
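To illustrate the fan-out and token-by-token logging outside Mélodium, here is a minimal Python sketch. It is not the example's actual code: transcribed_segments, fake_token_stream and the queue-based fan_out are hypothetical stand-ins for the speech-to-text output, the streaming LLM call and the two connections shown above.

```python
import asyncio
from typing import AsyncIterator, List


async def transcribed_segments() -> AsyncIterator[str]:
    # Hypothetical stand-in for the speech-to-text output.
    for text in ["What time is it in Tokyo?"]:
        yield text


async def fan_out(source: AsyncIterator[str], outputs: List[asyncio.Queue]) -> None:
    # Copy every segment to every consumer queue, then send None as an end marker.
    async for segment in source:
        for queue in outputs:
            await queue.put(segment)
    for queue in outputs:
        await queue.put(None)


async def log_question(queue: asyncio.Queue, log: List[str]) -> None:
    # First consumer: log each transcribed segment as the user's question.
    while (segment := await queue.get()) is not None:
        log.append(f"you: {segment}")


async def fake_token_stream(question: str) -> AsyncIterator[str]:
    # Hypothetical stand-in for a streaming LLM call (the question is ignored here).
    for token in ["Tokyo", " is", " in", " JST"]:
        await asyncio.sleep(0)  # hand control back, as a network read would
        yield token


async def remote_answer(queue: asyncio.Queue, log: List[str]) -> None:
    # Second consumer: log every token as soon as it is produced.
    while (question := await queue.get()) is not None:
        async for token in fake_token_stream(question):
            log.append(f"assistant: {token}")


async def main() -> List[str]:
    log: List[str] = []
    questions, prompts = asyncio.Queue(), asyncio.Queue()
    await asyncio.gather(
        fan_out(transcribed_segments(), [questions, prompts]),
        log_question(questions, log),
        remote_answer(prompts, log),
    )
    return log


if __name__ == "__main__":
    for entry in asyncio.run(main()):
        print(entry)
```

Both consumers receive the same segment independently, so logging the question never waits on the model, and the answer appears in the log one token at a time.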
localonly — local Whisper + local Mistral
The localOnly entrypoint loads Mistral 7B in parallel with Whisper. Its localAnswer sub-treatment uses generate instead of llmStream, but exposes the same Stream<string> output interface — the fan-out and logging logic in the entrypoint is unchanged.
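The benefit of loading both models in parallel can be sketched in Python as follows. This is only an illustration under stated assumptions: load_model is a hypothetical loader that sleeps to simulate reading weights, and the model names and durations are placeholders, not measurements from this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict


def load_model(name: str, seconds: float) -> str:
    # Hypothetical loader: sleeping simulates reading model weights from disk.
    time.sleep(seconds)
    return f"{name} ready"


def load_all() -> Dict[str, str]:
    # Submit both loads at once so neither waits for the other to finish.
    with ThreadPoolExecutor(max_workers=2) as pool:
        stt = pool.submit(load_model, "whisper", 0.2)
        llm = pool.submit(load_model, "mistral-7b", 0.3)
        return {"stt": stt.result(), "llm": llm.result()}


if __name__ == "__main__":
    start = time.perf_counter()
    print(load_all())
    print(f"loaded in {time.perf_counter() - start:.2f}s")
```

With parallel loading, the total startup time is governed by the slower of the two loads rather than by their sum.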
Shared interface, different backends
remoteAnswer and localAnswer both accept Stream<string> and emit Stream<string>. The entrypoint does not know or care which one it calls — swapping backends is purely a model-level concern.
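The same idea, expressed as a minimal Python sketch: two interchangeable backends share a stream-in, stream-out shape, and the caller is written once. The function names and reply strings are hypothetical and only mirror the roles of remoteAnswer and localAnswer; they are not the Mélodium implementation.

```python
from typing import Callable, Iterable, Iterator

# Shared shape: a stream of questions in, a stream of text chunks out.
Answerer = Callable[[Iterable[str]], Iterator[str]]


def remote_answer(questions: Iterable[str]) -> Iterator[str]:
    # Hypothetical streaming backend: yields several small chunks per question.
    for question in questions:
        for word in f"remote reply to: {question}".split(" "):
            yield word + " "


def local_answer(questions: Iterable[str]) -> Iterator[str]:
    # Hypothetical blocking backend: produces the whole reply, then yields it once.
    for question in questions:
        yield f"local reply to: {question}"


def entrypoint(segments: Iterable[str], answer: Answerer) -> str:
    # The caller only relies on the shared shape, not on which backend it got.
    return "".join(answer(segments)).strip()


if __name__ == "__main__":
    print(entrypoint(["hi"], remote_answer))
    print(entrypoint(["hi"], local_answer))
```

Because entrypoint depends only on the Answerer shape, switching from one backend to the other is a one-argument change at the call site.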
Dependencies
[dependencies]
std = "0.10.1" # core flows, logging, data structures
audio = "0.10.1" # audio decode / encode / resample
record = "0.10.1" # microphone capture
ml = "0.10.1" # LLM, STT, TTS and local model inference