Distributed LLM Inference

Source: 15_distributed_llm_inference

An HTTP server that accepts plain-text prompts and streams LLM responses back. The LLM call runs on a Mélodium cloud runner, so the ml package only needs to be available on the runner; the front-end machine requires no ML dependencies at all.

Running

    melodium run 15_distributed_llm_inference/Compo.toml \
        --api_token "my-api-token" \
        --openai_key sk-...

    $ curl -X POST http://127.0.0.1:8080/chat -d "Explain Mélodium in one sentence."
    Mélodium is a dataflow programming language…
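
The curl call above can also be made from a script. The following is a minimal Python sketch, not part of the example project, that posts a prompt and yields text as it arrives. The helper names (decode_chunks, read_chunks, stream_chat) are hypothetical, and it assumes the server treats the raw request body as the prompt, as the curl call suggests. Response bytes are decoded incrementally because a multi-byte UTF-8 character can be split across two network reads.

```python
import codecs
import urllib.request
from typing import Iterable, Iterator


def decode_chunks(chunks: Iterable[bytes]) -> Iterator[str]:
    """Decode UTF-8 byte chunks incrementally.

    A multi-byte character (the "é" in "Mélodium", for instance) may be split
    across two reads, so each chunk cannot safely be decoded on its own.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Raises UnicodeDecodeError if the stream ended in the middle of a character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


def read_chunks(response, size: int = 4096) -> Iterator[bytes]:
    """Yield whatever bytes are available instead of waiting for the full body."""
    while True:
        chunk = response.read1(size)
        if not chunk:
            return
        yield chunk


def stream_chat(url: str, prompt: str) -> Iterator[str]:
    """POST the prompt and yield decoded pieces of the streamed response."""
    # Assumption: the body is the prompt itself. No Content-Type is set, so
    # urllib falls back to the same default that curl -d uses.
    request = urllib.request.Request(url, data=prompt.encode("utf-8"), method="POST")
    with urllib.request.urlopen(request) as response:
        yield from decode_chunks(read_chunks(response))


# Usage, with the server from the Running section listening locally:
#
#   for text in stream_chat("http://127.0.0.1:8080/chat",
#                           "Explain Mélodium in one sentence."):
#       print(text, end="", flush=True)
```

Reading with read1 rather than read returns as soon as some data is available, which is what lets the tokens show up progressively on the client side.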

How it works

The Assistant model and the inferText treatment run on the remote runner. The front-end only needs the http, distrib, and work packages:

    model distributor: DistributionEngine(
        treatment = "distributed_llm_inference/main::inferText",
        version = "0.1.0"
    )

Passing const parameters to the remote treatment

inferText needs the openai_key to configure its Assistant model, but const parameters cannot be passed through streams. They are sent via the distribution engine’s start call:

    distribStart: start[distributor=distributor](
        params = |map([|entry<string>("openai_key", openai_key)])
    )

On the remote side, inferText declares:

    treatment inferText(const openai_key: string)
      model llm: Assistant(openai_key=openai_key)

The const is set once when the runner starts and shared across all invocations of that treatment.
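
As a rough analogy only, and not Mélodium code, the split between start-time parameters and per-invocation streams can be modelled in Python with a closure. Everything here (FakeAssistant, start, infer_text) is a hypothetical stand-in: the key is bound once when start is called, and every later invocation receives only stream data while sharing the same configured model.

```python
from typing import Callable, Dict, Iterable, Iterator


class FakeAssistant:
    """Hypothetical stand-in for the Assistant model; it only records its key."""

    def __init__(self, openai_key: str) -> None:
        self.openai_key = openai_key
        self.calls = 0

    def reply(self, prompt: str) -> str:
        self.calls += 1
        return f"reply #{self.calls} to {prompt!r}"


def start(params: Dict[str, str]) -> Callable[[Iterable[str]], Iterator[str]]:
    """Model of the start call: const parameters arrive here, exactly once."""
    llm = FakeAssistant(openai_key=params["openai_key"])  # configured once

    def infer_text(prompts: Iterable[str]) -> Iterator[str]:
        # Each invocation sees only stream data; the key never travels with it.
        for prompt in prompts:
            yield llm.reply(prompt)

    return infer_text


infer_text = start({"openai_key": "sk-test"})  # placeholder key for the sketch
first = list(infer_text(["hello"]))
second = list(infer_text(["again", "and again"]))
```

The call counter keeps increasing across the two invocations, which illustrates that both share the model configured at start time.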

Token streaming end-to-end

On the remote side, chat emits response tokens as Stream<string>. They are encoded to bytes, sent back through recvStream<byte>, and forwarded directly into connection.data on the front-end. Tokens therefore appear in the HTTP response as they are generated, with no intermediate buffering.
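
To illustrate what "no intermediate buffering" means, here is a small Python model of the same three stages using generators. It is a sketch of the dataflow, not the Mélodium implementation: chat, encode, forward and the token list are hypothetical stand-ins. The log records the order of events, showing that each token is sent before the next one is generated.

```python
from typing import Callable, Iterable, Iterator, List


def chat(prompt: str, log: List[str]) -> Iterator[str]:
    """Hypothetical stand-in for the remote LLM: yields tokens one at a time."""
    for token in ["Mélodium ", "is ", "a ", "dataflow ", "language"]:
        log.append(f"generated {token!r}")
        yield token


def encode(tokens: Iterable[str]) -> Iterator[bytes]:
    """Encode each token to UTF-8 bytes as soon as it is produced."""
    for token in tokens:
        yield token.encode("utf-8")


def forward(chunks: Iterable[bytes], write: Callable[[bytes], None]) -> None:
    """Hand each chunk to the connection immediately, without collecting them."""
    for chunk in chunks:
        write(chunk)


def run_pipeline(prompt: str) -> List[str]:
    log: List[str] = []
    forward(
        encode(chat(prompt, log)),
        lambda chunk: log.append(f"sent {len(chunk)} bytes"),
    )
    return log


for event in run_pipeline("Explain Mélodium in one sentence."):
    print(event)
```

A buffered variant would show every "generated" line first and the "sent" lines afterwards; here the two alternate.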

Dependencies

    [dependencies]
    std = "0.10.1"       # core flows, logging, data structures
    http = "0.10.1"      # HTTP server and client
    net = "0.10.1"       # IP address helpers
    encoding = "0.10.1"  # UTF-8 encode / decode
    work = "0.10.1"      # cloud runner provisioning
    distrib = "0.10.1"   # stream distribution across runners
    ml = "0.10.1"        # LLM, STT, TTS and local model inference