---
title: Kokoro ONNX TTS (SSE)
emoji: 🗣️
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: false
license: apache-2.0
---
# Kokoro-ONNX TTS – FastAPI + SSE (Docker)
A minimal streaming TTS API using FastAPI and kokoro-onnx, with a Server-Sent Events (SSE) endpoint. Includes:

- `/v1/tts.sse` – streams base64-encoded PCM16 frames (24 kHz mono)
- `/v1/voices` – lists voices
- `/healthz` – health check
- A simple HTML client (buffers, then plays after generation) and a Python client that streams to ffplay for real-time playback
- A Dockerfile using uv (Astral) as the package manager
Note: Browsers can't easily play raw PCM from SSE without an AudioWorklet and resampling. For reliable live playback, use the Python client + `ffplay` pipeline below.
## 1) Get model + voices

Download the Kokoro ONNX model and voices file and place them under `models/`:
```bash
mkdir -p models
# Example (update URLs to the latest release if needed):
curl -L -o models/kokoro-v1.0.int8.onnx "https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.int8.onnx"
curl -L -o models/voices-v1.0.bin "https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/voices-v1.0.bin"
```
Set env vars if you use different paths:

- `KOKORO_MODEL` (default: `models/kokoro-v1.0.int8.onnx`)
- `KOKORO_VOICES` (default: `models/voices-v1.0.bin`)
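To sanity-check the downloaded files, you can exercise kokoro-onnx directly. A minimal sketch, assuming the `Kokoro(model, voices)` / `create(...)` API of the kokoro-onnx package:

```python
import os

from kokoro_onnx import Kokoro

# Paths fall back to the defaults above; override via the env vars.
model_path = os.getenv("KOKORO_MODEL", "models/kokoro-v1.0.int8.onnx")
voices_path = os.getenv("KOKORO_VOICES", "models/voices-v1.0.bin")

kokoro = Kokoro(model_path, voices_path)

# create() returns float samples plus the sample rate (24000 for Kokoro).
samples, sample_rate = kokoro.create(
    "Hello from Kokoro!", voice="af_sarah", speed=1.0, lang="en-us"
)
print(f"{len(samples)} samples at {sample_rate} Hz")
```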
## 2) Run locally (without Docker)
```bash
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

# Start
uv run uvicorn app:app --host 0.0.0.0 --port 8000

# Test SSE → ffplay (streaming)
curl -G --data-urlencode 'text=Hello from Kokoro SSE!' "http://localhost:8000/v1/tts.sse?voice=af_sarah&speed=1.0&lang=en-us" | python3 clients/py_play_ffplay.py
# (This script writes raw PCM16 to stdout; pipe to ffplay)
# Example:
# curl -G --data-urlencode 'text=Hello!' "http://localhost:8000/v1/tts.sse" | python3 clients/py_play_ffplay.py | ffplay -nodisp -autoexit -f s16le -ar 24000 -ac 1 -
```
## 3) Build & run with Docker
```bash
# Build (ensure models/ exists in the build context if you want to COPY them)
docker build -t kokoro-sse .

# Run
docker run --rm -p 8000:8000 \
  -e KOKORO_MODEL=/app/models/kokoro-v1.0.int8.onnx \
  -e KOKORO_VOICES=/app/models/voices-v1.0.bin \
  kokoro-sse
```
If you don't want to bake the models into the image, you can mount a host folder containing the model files to `/app/models` with `-v`, e.g. `docker run --rm -p 8000:8000 -v "$PWD/models:/app/models" kokoro-sse`.
## 4) Endpoints

- `GET /v1/tts.sse?text=...&voice=af_sarah&speed=1.0&lang=en-us`
  Returns `text/event-stream` with events:
  `{"seq": 0, "sr": 24000, "ch": 1, "format": "s16le", "pcm16": "<base64>"}`
  Final message: `event: done`, `data: {"total_chunks": N, "total_samples": M}`
- `GET /v1/voices` → `{"voices": [...]}`
- `GET /healthz` → `{"status": "ok", "model": "..."}`
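For orientation, the SSE producer can be implemented with FastAPI's `StreamingResponse`. This is a sketch of the approach, not the exact contents of `app.py`; the ~100 ms chunk size is illustrative:

```python
import base64
import json
import os

import numpy as np
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from kokoro_onnx import Kokoro

app = FastAPI()
kokoro = Kokoro(os.getenv("KOKORO_MODEL", "models/kokoro-v1.0.int8.onnx"),
                os.getenv("KOKORO_VOICES", "models/voices-v1.0.bin"))

@app.get("/v1/tts.sse")
def tts_sse(text: str, voice: str = "af_sarah", speed: float = 1.0, lang: str = "en-us"):
    def events():
        # Generate float samples, convert to little-endian PCM16.
        samples, sr = kokoro.create(text, voice=voice, speed=speed, lang=lang)
        pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2").tobytes()
        chunk = sr // 10 * 2  # ~100 ms of s16le mono per event
        total = 0
        for off in range(0, len(pcm), chunk):
            payload = {"seq": total, "sr": sr, "ch": 1, "format": "s16le",
                       "pcm16": base64.b64encode(pcm[off:off + chunk]).decode()}
            yield f"data: {json.dumps(payload)}\n\n"
            total += 1
        done = {"total_chunks": total, "total_samples": len(pcm) // 2}
        yield f"event: done\ndata: {json.dumps(done)}\n\n"

    return StreamingResponse(events(), media_type="text/event-stream")
```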
## 5) Simple HTML client

Open `static/client.html` in a browser. It buffers SSE chunks and plays the audio after the stream completes.
## 6) Python client for real-time playback (recommended)

```bash
# Linux / macOS:
curl -G --data-urlencode 'text=Live streaming via SSE' "http://localhost:8000/v1/tts.sse" | python3 clients/py_play_ffplay.py | ffplay -nodisp -autoexit -f s16le -ar 24000 -ac 1 -
```
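The core of such a client is just a few lines: parse `data:` lines from the SSE stream on stdin, base64-decode the PCM payloads, and write raw bytes to stdout for ffplay. A sketch of the approach (not necessarily the exact contents of `clients/py_play_ffplay.py`):

```python
import base64
import json
import sys

# Read SSE lines from stdin (piped from curl), emit raw s16le bytes on stdout.
for raw in sys.stdin:
    line = raw.strip()
    if not line.startswith("data:"):
        continue  # skip "event:" lines, comments, and blank keep-alives
    msg = json.loads(line[len("data:"):])
    if "pcm16" in msg:  # the final "done" payload carries no audio
        sys.stdout.buffer.write(base64.b64decode(msg["pcm16"]))
        sys.stdout.buffer.flush()
```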
## 7) Notes

- Output sample rate is 24 kHz (mono), PCM16 LE.
- If you need a WebSocket endpoint (binary frames) or an OpenAI-compatible API shim, you can extend `app.py` easily; see the sketch below.
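For instance, a binary-frame WebSocket endpoint could look roughly like this. A sketch under the same assumptions as above; the route name `/v1/tts.ws`, the one-utterance-per-message protocol, and the ~100 ms framing are illustrative choices, not part of the current app:

```python
import os

import numpy as np
from fastapi import FastAPI, WebSocket
from kokoro_onnx import Kokoro

app = FastAPI()
kokoro = Kokoro(os.getenv("KOKORO_MODEL", "models/kokoro-v1.0.int8.onnx"),
                os.getenv("KOKORO_VOICES", "models/voices-v1.0.bin"))

@app.websocket("/v1/tts.ws")
async def tts_ws(ws: WebSocket):
    await ws.accept()
    text = await ws.receive_text()  # one utterance per text message
    samples, sr = kokoro.create(text, voice="af_sarah", speed=1.0, lang="en-us")
    pcm = (np.clip(samples, -1.0, 1.0) * 32767).astype("<i2").tobytes()
    chunk = sr // 10 * 2  # ~100 ms of s16le mono per binary frame
    for off in range(0, len(pcm), chunk):
        await ws.send_bytes(pcm[off:off + chunk])
    await ws.close()
```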