Speech Transcription

Source: 02_speech_transcription

Transcribes audio to text using a local Whisper model, downloaded automatically from Hugging Face on first run. The example provides two entrypoints: main for live microphone input, and fromfile for transcribing an existing audio file.

Running

Live transcription from the microphone:

melodium run 02_speech_transcription/Compo.toml

Transcribe an audio file:

melodium run 02_speech_transcription/Compo.toml fromfile -- --input_file speech.wav

Expected output:

[…] info: transcription: Hello, this is a test.
[…] info: transcription: The model is running locally.

How it works

Two models are declared at the top of each entrypoint:

model Hub() : HfHub { repo_id = "openai/whisper-tiny" }
model Speech() : Whisper {}

Hub points to the Hugging Face repository for Whisper tiny (openai/whisper-tiny). Speech is an empty Whisper model configuration, so default parameters are used.

Model loading sequence

The connections enforce that audio capture only starts once the model is ready:

fetch downloads weights and tokenizer; load initialises the model; load.loaded simultaneously gates both decode.ready and the audio source. No synchronisation primitive is needed — the dataflow itself enforces the ordering.
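
For readers more used to imperative code, the gating can be pictured with a rough Python analogy. This is not Mélodium code and says nothing about how the engine is implemented: the function names (load_model, audio_source, decode) and the asyncio.Event standing in for load.loaded are assumptions made purely for illustration. The sketch shows the explicit primitive an imperative program would need in order to get the ordering that the connections express directly.

```python
import asyncio


async def load_model(loaded: asyncio.Event, log: list) -> None:
    # Hypothetical stand-in for fetch + load: pretend to download and initialise.
    await asyncio.sleep(0.01)
    log.append("model loaded")
    loaded.set()  # plays the role of load.loaded


async def audio_source(loaded: asyncio.Event, queue: asyncio.Queue, log: list) -> None:
    # Capture is gated: nothing is produced before the model is ready.
    await loaded.wait()
    log.append("capture started")
    for chunk in ("chunk-1", "chunk-2"):
        await queue.put(chunk)
    await queue.put(None)  # end-of-stream marker


async def decode(loaded: asyncio.Event, queue: asyncio.Queue, log: list) -> None:
    # Decoding is gated by the same signal (decode.ready in the example).
    await loaded.wait()
    while (chunk := await queue.get()) is not None:
        log.append(f"decoded {chunk}")


async def run_pipeline() -> list:
    log: list = []
    loaded = asyncio.Event()
    queue: asyncio.Queue = asyncio.Queue()
    # All three tasks start together; only the event decides who may proceed.
    await asyncio.gather(
        audio_source(loaded, queue, log),
        decode(loaded, queue, log),
        load_model(loaded, log),
    )
    return log


if __name__ == "__main__":
    for entry in asyncio.run(run_pipeline()):
        print(entry)
```

In this sketch the first log entry is always "model loaded", even though the capture and decode tasks are started first. In the Mélodium example that guarantee comes from the connections alone.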

Fan-out to log and file

Once Whisper produces a transcribed segment, it is forwarded to two outputs at once using the --> double-arrow fan-out:

decode.transcribed --> log.messages
decode.transcribed --> write.text

Both operations run concurrently.
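
As a loose Python analogy (again, not Mélodium code), fan-out amounts to copying each item from one stream into several independent consumers. The names fan_out, collect and run_fan_out are hypothetical, and the two lists merely stand in for the log and the output file.

```python
import asyncio


async def fan_out(source: asyncio.Queue, sinks: list) -> None:
    # Copy every item, including the end marker (None), to each sink.
    while True:
        item = await source.get()
        for sink in sinks:
            await sink.put(item)
        if item is None:
            return


async def collect(queue: asyncio.Queue, prefix: str, out: list) -> None:
    # Hypothetical consumer: appends each segment, optionally prefixed.
    while (item := await queue.get()) is not None:
        out.append(f"{prefix}{item}")


async def run_fan_out(segments: list) -> tuple:
    source: asyncio.Queue = asyncio.Queue()
    to_log: asyncio.Queue = asyncio.Queue()
    to_file: asyncio.Queue = asyncio.Queue()
    logged: list = []
    written: list = []
    for segment in segments:
        source.put_nowait(segment)
    source.put_nowait(None)  # end-of-stream marker
    # Both consumers run as separate tasks and receive the same segments.
    await asyncio.gather(
        fan_out(source, [to_log, to_file]),
        collect(to_log, "info: transcription: ", logged),
        collect(to_file, "", written),
    )
    return logged, written


if __name__ == "__main__":
    print(asyncio.run(run_fan_out(["Hello, this is a test."])))
```

Each consumer sees every segment, and neither waits on the other, which is the behaviour the two connections above describe.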

fromfile entrypoint

The fromfile entrypoint replaces recordMono with readLocal + decodeMono. The decodeMono(hint="wav") treatment handles container format detection transparently, so the same pipeline works for WAV, MP3, FLAC, and other formats.
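
To give a feel for what container detection from content can look like, here is a small Python sketch that inspects the leading bytes of a file. It is an illustration only and makes no claim about how decodeMono works internally. The function name sniff_audio_container is hypothetical, and the checks cover only a few well-known signatures.

```python
def sniff_audio_container(header: bytes) -> str:
    """Guess an audio container from the first bytes of a file.

    Illustrative only: real decoders probe far more thoroughly.
    """
    # WAV: RIFF chunk whose form type is WAVE.
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    # FLAC streams start with the "fLaC" marker.
    if header[:4] == b"fLaC":
        return "flac"
    # Ogg pages start with the "OggS" capture pattern.
    if header[:4] == b"OggS":
        return "ogg"
    # MP3 with an ID3v2 tag in front of the audio frames.
    if header[:3] == b"ID3":
        return "mp3"
    # Bare MPEG audio frame: 11 sync bits set at the start.
    if len(header) >= 2 and header[0] == 0xFF and (header[1] & 0xE0) == 0xE0:
        return "mp3"
    return "unknown"


if __name__ == "__main__":
    with open("speech.wav", "rb") as f:
        print(sniff_audio_container(f.read(12)))
```

A detector of this kind is one reason a hint can stay a hint: the content itself usually identifies the format.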

Dependencies

[dependencies]
std = "0.10.1"    # core flows, logging, data structures
fs = "0.10.1"     # local file I/O
audio = "0.10.1"  # audio decode / encode / resample
record = "0.10.1" # microphone capture
ml = "0.10.1"     # LLM, STT, TTS and local model inference