Full Voice Pipeline

Source: 08_full_voice_pipeline

A complete speech-in / speech-out loop using three cloud APIs. The pipeline reads an audio file, transcribes it with OpenAI Whisper, generates a response with GPT-4o, synthesises the response as speech with ElevenLabs, and writes the output audio file.

Running

melodium run 08_full_voice_pipeline/Compo.toml \
  --input_file question.wav \
  --openai_key sk-... \
  --elevenlabs_key el-... \
  --elevenlabs_voice JBFqnCBsd6RMkjVDRZzb
[…] info: pipeline: starting voice pipeline…
[…] info: pipeline: answer written

How it works

Three models cover the three API stages:

model stt: Stt(openai_key=openai_key)
model llm: Llm(openai_key=openai_key)
model tts: Tts(elevenlabs_key=elevenlabs_key, voice=elevenlabs_voice)

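The main treatment wires them sequentially through sub-treatments: the audio file is transcribed, the transcript is passed to the LLM, and the response is synthesised and written to the output file. The treatment graph can be viewed in Compositeur Studio.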

Block/Stream boundary at STT output

transcribe (remote STT) returns a Block<string>: one value for the whole audio file. The downstream llmRespond treatment expects a Stream<string> input, so the stream<string>() adapter bridges the two:

transcribe.transcript -> transcriptAsStream.block,stream -> llmRespond.question
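
The role of the adapter can be illustrated outside Mélodium. The Python sketch below is only an analogy, not Mélodium code: `block_to_stream` and `llm_respond` are hypothetical names, and a generator stands in for a Stream<string>. It shows a single whole-file value being lifted into a one-element stream so that a stream-only stage can consume it.

```python
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def block_to_stream(block: T) -> Iterator[T]:
    """Lift one complete value (a 'block') into a stream that yields it once."""
    yield block


def llm_respond(questions: Iterable[str]) -> Iterator[str]:
    """Hypothetical stream-consuming stage: emits one reply per question."""
    for question in questions:
        yield f"(reply to) {question}"


# The transcript arrives as one value for the whole audio file...
transcript = "What is the capital of France?"
# ...and is adapted to a stream before reaching the stream-only stage.
answers = list(llm_respond(block_to_stream(transcript)))
```

Without the adapter, the stage would receive a bare string rather than a stream of strings, which is the mismatch the real stream<string>() adapter resolves at the type level.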

Three sub-treatments, one interface each

Each stage is isolated in its own sub-treatment with a clean Stream<T> in / Stream<T> out signature. This keeps error handling co-located with the stage that can produce it, and makes each stage independently replaceable.
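
To illustrate why a uniform stream-in / stream-out signature makes stages replaceable, here is a minimal Python sketch. It is an analogy rather than Mélodium code: the stage functions (`fake_transcribe`, `fake_respond`, `fake_synthesize`, `shouting_respond`) and `run_pipeline` are hypothetical stand-ins that call no real API, and generators play the role of streams.

```python
from typing import Callable, Iterable, Iterator


def fake_transcribe(audio: Iterable[bytes]) -> Iterator[str]:
    """Stand-in STT stage: consumes audio chunks, emits one transcript."""
    data = b"".join(audio)
    yield f"<transcript of {len(data)} bytes>"


def fake_respond(questions: Iterable[str]) -> Iterator[str]:
    """Stand-in LLM stage: emits one answer per question."""
    for question in questions:
        yield f"answer to {question}"


def fake_synthesize(texts: Iterable[str]) -> Iterator[bytes]:
    """Stand-in TTS stage: emits placeholder bytes for each text."""
    for text in texts:
        yield text.encode("utf-8")


def run_pipeline(
    audio: Iterable[bytes],
    stt: Callable[[Iterable[bytes]], Iterator[str]] = fake_transcribe,
    llm: Callable[[Iterable[str]], Iterator[str]] = fake_respond,
    tts: Callable[[Iterable[str]], Iterator[bytes]] = fake_synthesize,
) -> bytes:
    """Chain the three stages; each one only sees its neighbour's stream."""
    return b"".join(tts(llm(stt(audio))))


def shouting_respond(questions: Iterable[str]) -> Iterator[str]:
    """Alternative LLM stage with the same signature."""
    for question in questions:
        yield question.upper()


default_out = run_pipeline([b"ab", b"cd"])
swapped_out = run_pipeline([b"ab", b"cd"], llm=shouting_respond)
```

Because `shouting_respond` has the same signature as `fake_respond`, swapping it in requires no change to the other two stages.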

TTS output

synthesize from RemoteTts emits audio bytes as a Stream<byte>, written directly to the output file. The audio format (MP3 by default for ElevenLabs) is determined by the TTS backend.
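
As a rough illustration of writing a byte stream straight to a file, the Python sketch below copies chunks to disk as they arrive without buffering the whole audio in memory. It is an analogy, not Mélodium code: `write_stream`, `fake_tts_chunks`, and the `answer.mp3` file name are hypothetical, and the bytes are placeholders rather than real MP3 data.

```python
import os
import tempfile
from typing import Iterable, Iterator


def write_stream(chunks: Iterable[bytes], path: str) -> int:
    """Write each chunk as it arrives and return the total bytes written."""
    total = 0
    with open(path, "wb") as out:
        for chunk in chunks:
            out.write(chunk)
            total += len(chunk)
    return total


def fake_tts_chunks() -> Iterator[bytes]:
    """Hypothetical stand-in for TTS output; placeholder bytes, not real audio."""
    yield b"\x00\x01"
    yield b"\x02\x03\x04"


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "answer.mp3")
    written = write_stream(fake_tts_chunks(), target)
    with open(target, "rb") as f:
        content = f.read()
```

The bytes are written unchanged, so the output file's extension should match whatever format the TTS backend actually produces.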

Dependencies

[dependencies]
std = "0.10.1"     # core flows, logging, data structures
fs = "0.10.1"      # local file I/O
audio = "0.10.1"   # audio decode / encode / resample
record = "0.10.1"  # microphone capture
ml = "0.10.1"      # LLM, STT, TTS and local model inference