Full Voice Pipeline
Source: 08_full_voice_pipeline
A complete speech-in / speech-out loop built on three cloud API calls. The pipeline reads an audio file, transcribes it with OpenAI Whisper, generates a response with GPT-4o, synthesises that response as speech with ElevenLabs, and writes the resulting audio to an output file.
Running
melodium run 08_full_voice_pipeline/Compo.toml \
--input_file question.wav \
--openai_key sk-... \
--elevenlabs_key el-... \
--elevenlabs_voice JBFqnCBsd6RMkjVDRZzb
[…] info: pipeline: starting voice pipeline…
[…] info: pipeline: answer written
How it works
Three models cover the three API stages:
model stt: Stt(openai_key=openai_key)
model llm: Llm(openai_key=openai_key)
model tts: Tts(elevenlabs_key=elevenlabs_key, voice=elevenlabs_voice)
The main treatment wires them sequentially through sub-treatments: transcription, then LLM response, then speech synthesis.
Block/Stream boundary at STT output
transcribe (remote STT) returns a Block<string> — one value for the whole audio file. The downstream llmRespond treatment expects a Stream<string> input. The stream<string>() adapter bridges this:
transcribe.transcript -> transcriptAsStream.block,stream -> llmRespond.question
Three sub-treatments, one interface each
Each stage is isolated in its own sub-treatment with a clean Stream<T> in / Stream<T> out signature. This keeps error handling co-located with the stage that can produce it, and makes each stage independently replaceable.
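The two ideas above (adapting a single Block into a Stream, and giving every stage a uniform stream-in / stream-out shape) can be illustrated outside Melodium with a minimal Python sketch. This is not Melodium code and calls no real API: a Block is modelled as a plain value, a Stream as an iterator, and `block_to_stream`, `transcribe`, `llm_respond`, `synthesize`, and `run_pipeline` are hypothetical stand-ins chosen for illustration.

```python
from typing import Iterator


def block_to_stream(block: str) -> Iterator[str]:
    """Adapt a single value (a 'block') into a one-item stream."""
    yield block


def transcribe(audio: bytes) -> str:
    # Stand-in for remote STT: one result for the whole audio file.
    return f"<transcript of {len(audio)} bytes>"


def llm_respond(questions: Iterator[str]) -> Iterator[str]:
    # Stand-in for the LLM stage: stream of questions in, stream of answers out.
    for question in questions:
        yield f"answer to: {question}"


def synthesize(texts: Iterator[str]) -> Iterator[bytes]:
    # Stand-in for TTS: stream of text in, stream of byte chunks out.
    # Real audio bytes are replaced by UTF-8 text for illustration.
    for text in texts:
        yield text.encode("utf-8")


def run_pipeline(audio: bytes) -> bytes:
    transcript = transcribe(audio)           # a block: one value
    questions = block_to_stream(transcript)  # block -> stream adapter
    answers = llm_respond(questions)         # stream -> stream
    return b"".join(synthesize(answers))     # stream -> collected bytes
```

Because every stage after the adapter takes and returns an iterator, any one of them can be swapped for another function with the same shape without touching its neighbours, which is the replaceability property described above.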
TTS output
synthesize from RemoteTts emits audio bytes as a Stream<byte>, written directly to the output file. The audio format (MP3 by default for ElevenLabs) is determined by the TTS backend.
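As a rough illustration of writing a byte stream straight to a file, here is a minimal Python sketch (again not Melodium code). The `write_stream` helper is hypothetical. It writes each chunk as it arrives rather than buffering the whole response, and it does not inspect or convert the audio: whatever format the backend produced is what ends up on disk.

```python
from pathlib import Path
from typing import Iterable


def write_stream(chunks: Iterable[bytes], path: Path) -> int:
    """Write byte chunks to `path` as they arrive; return total bytes written."""
    total = 0
    with path.open("wb") as out:
        for chunk in chunks:
            # No decoding or re-encoding: bytes are passed through unchanged.
            total += out.write(chunk)
    return total
```

Since the bytes are passed through untouched, the output file's extension should match the format the TTS backend actually emits.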
Dependencies
[dependencies]
std = "0.10.1" # core flows, logging, data structures
fs = "0.10.1" # local file I/O
audio = "0.10.1" # audio decode / encode / resample
record = "0.10.1" # microphone capture
ml = "0.10.1" # LLM, STT, TTS and local model inference