
Realtime Voice Assistant

Source: 11_realtime_voice_assistant

A voice assistant that transcribes microphone audio with local Whisper and sends each transcribed segment to a language model for a response. Two entrypoints let you choose between a remote LLM (streaming tokens) and a fully local Mistral 7B (no API key needed).

Running

With a remote LLM (GPT-4o):

melodium run 11_realtime_voice_assistant/Compo.toml --openai_key sk-...

Fully local (no API key, requires ~14 GB RAM):

melodium run 11_realtime_voice_assistant/Compo.toml localonly
[…] info: assistant: ready — speak into the microphone
[…] info: you: What time is it in Tokyo?
[…] info: assistant: Tokyo is in Japan Standard Time (JST), which is UTC+9…

How it works

Both entrypoints share the same Whisper loading sequence. The difference lies only in which LLM backend is used downstream.

main — local Whisper + remote LLM


Each transcribed segment fans out to two consumers simultaneously:

asrDecode.transcribed -> logQuestion.messages
asrDecode.transcribed -> remoteAnswer.question
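The fan-out pattern can be illustrated outside Mélodium with a small Python sketch. It is not Mélodium code and does not use any Mélodium API: `fan_out`, `logged`, and `questions` are hypothetical names, and the consumers are called one after another here, whereas the connections above deliver to both consumers simultaneously.

```python
from typing import Callable, Iterable, List


def fan_out(segments: Iterable[str], consumers: List[Callable[[str], None]]) -> None:
    """Deliver every segment to every consumer, preserving segment order."""
    for segment in segments:
        for consume in consumers:
            consume(segment)


# Hypothetical stand-ins for the two downstream consumers:
# one records a log line, the other collects questions for the LLM.
logged: List[str] = []
questions: List[str] = []

fan_out(
    ["What time is it in Tokyo?"],
    [lambda s: logged.append(f"you: {s}"), questions.append],
)
```

Both lists receive the same segment, which mirrors how a single `transcribed` output feeds the logger and the answering treatment without either one knowing about the other.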

remoteAnswer uses llmStream, which emits tokens one by one as a Stream<string>. Each token is printed to the log as it arrives, without waiting for the full response.
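As a rough analogy for token-by-token streaming, the Python sketch below uses a generator in place of a Stream<string>. It is an illustration only: `stream_tokens` and `print_as_received` are hypothetical helpers, the "tokens" are simply whitespace-separated words, and no real LLM is called.

```python
from typing import Iterator, List


def stream_tokens(answer: str) -> Iterator[str]:
    """Yield an answer piece by piece, the way a streaming backend would."""
    for word in answer.split():
        yield word + " "


def print_as_received(tokens: Iterator[str]) -> str:
    """Print each token as soon as it arrives and return the assembled text."""
    parts: List[str] = []
    for token in tokens:
        print(token, end="", flush=True)  # visible before the stream has ended
        parts.append(token)
    print()
    return "".join(parts).strip()


full = print_as_received(stream_tokens("Tokyo is in Japan Standard Time"))
```

The consumer never needs the complete answer up front; it handles each piece as it is produced, which is the property that lets the log update in real time.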

localonly — local Whisper + local Mistral


The localonly entrypoint loads Mistral 7B in parallel with Whisper. Its localAnswer sub-treatment uses generate instead of llmStream but exposes the same Stream<string> output interface, so the fan-out and logging logic in the entrypoint is unchanged.

Shared interface, different backends

remoteAnswer and localAnswer both accept Stream<string> and emit Stream<string>. The entrypoint does not know or care which one it calls — swapping backends is purely a model-level concern.
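The idea of two backends behind one interface can be sketched in Python as two functions with the same signature. This is only an analogy under stated assumptions: `Answerer`, `remote_answer`, `local_answer`, and `run_assistant` are hypothetical names, the replies are canned strings, and nothing here reflects how the Mélodium treatments are actually implemented.

```python
from typing import Callable, Iterable, Iterator, List

# Shared shape: a stream of questions in, a stream of text out.
Answerer = Callable[[Iterable[str]], Iterator[str]]


def remote_answer(questions: Iterable[str]) -> Iterator[str]:
    """Hypothetical streaming backend: emits several small pieces per question."""
    for question in questions:
        for piece in ("remote", "reply", "to:", question):
            yield piece


def local_answer(questions: Iterable[str]) -> Iterator[str]:
    """Hypothetical local backend: emits one complete reply per question."""
    for question in questions:
        yield f"local reply to: {question}"


def run_assistant(questions: Iterable[str], answer: Answerer) -> List[str]:
    """Caller that depends only on the shared signature, not on the backend."""
    return list(answer(questions))


remote_out = run_assistant(["hi"], remote_answer)
local_out = run_assistant(["hi"], local_answer)
```

`run_assistant` is identical for both calls; only the function passed in changes, which is the sense in which the entrypoint does not need to know which backend it is wired to.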

Dependencies

[dependencies]
std = "0.10.1"     # core flows, logging, data structures
audio = "0.10.1"   # audio decode / encode / resample
record = "0.10.1"  # microphone capture
ml = "0.10.1"      # LLM, STT, TTS and local model inference