---
title: Kokoro ONNX TTS (SSE)
emoji: 🗣️
colorFrom: indigo
colorTo: purple
sdk: docker
pinned: false
license: apache-2.0
---

Kokoro-ONNX TTS — FastAPI + SSE (Docker)

A minimal streaming TTS API using FastAPI and kokoro-onnx, with a Server-Sent Events (SSE) endpoint. Includes:

  • /v1/tts.sse → streams base64-encoded PCM16 frames (24 kHz mono)
  • /v1/voices → lists available voices
  • /healthz → health check
  • A simple HTML client (buffers then plays after generation) and a Python client that streams to ffplay for real-time playback
  • Dockerfile using uv (Astral) as the package manager

Note: Browsers can't easily play raw PCM from SSE without an AudioWorklet and resampling. For reliable live playback, use the Python client + ffplay pipeline below.

1) Get model + voices

Download the Kokoro ONNX model and voices file and place them under models/:

mkdir -p models
# Example (update URLs to the latest release if needed):
curl -L -o models/kokoro-v1.0.int8.onnx "https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.int8.onnx"
curl -L -o models/voices-v1.0.bin      "https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/voices-v1.0.bin"

Set env vars if you use different paths:

  • KOKORO_MODEL (default: models/kokoro-v1.0.int8.onnx)
  • KOKORO_VOICES (default: models/voices-v1.0.bin)
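For reference, the server presumably resolves these paths with the documented defaults along these lines (a sketch only; the real lookup lives in app.py):

```python
import os

# Fall back to the documented defaults when the env vars are unset.
MODEL_PATH = os.environ.get("KOKORO_MODEL", "models/kokoro-v1.0.int8.onnx")
VOICES_PATH = os.environ.get("KOKORO_VOICES", "models/voices-v1.0.bin")
```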

2) Run locally (without Docker)

uv venv
source .venv/bin/activate
uv pip install -r requirements.txt

# Start
uv run uvicorn app:app --host 0.0.0.0 --port 8000

# Test SSE → ffplay (streaming)
# clients/py_play_ffplay.py decodes the SSE events and writes raw PCM16 to stdout; pipe it into ffplay for live playback:
curl -G --data-urlencode 'text=Hello from Kokoro SSE!'   "http://localhost:8000/v1/tts.sse?voice=af_sarah&speed=1.0&lang=en-us" | python3 clients/py_play_ffplay.py | ffplay -nodisp -autoexit -f s16le -ar 24000 -ac 1 -
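The decoding half of that pipeline is small: read the text/event-stream line by line, parse each data: payload as JSON, base64-decode the pcm16 field, and write the raw bytes out. A sketch of what a client like clients/py_play_ffplay.py might do (the script itself is not reproduced here, so names are illustrative):

```python
import base64
import json
from typing import IO, Iterator

def pcm_chunks(stream: IO[str]) -> Iterator[bytes]:
    """Yield raw PCM16 bytes from an SSE text stream of Kokoro events."""
    for line in stream:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and "event:" lines
        payload = json.loads(line[len("data:"):].strip())
        if "pcm16" in payload:  # the final "done" event carries no audio
            yield base64.b64decode(payload["pcm16"])

# Usage (writing each chunk to stdout as it arrives):
#   import sys
#   for chunk in pcm_chunks(sys.stdin):
#       sys.stdout.buffer.write(chunk)
#       sys.stdout.buffer.flush()  # flush per chunk so ffplay starts early
```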

3) Build & run with Docker

# Build (ensure models/ exists in build context if you want to COPY them)
docker build -t kokoro-sse .

# Run
docker run --rm -p 8000:8000   -e KOKORO_MODEL=/app/models/kokoro-v1.0.int8.onnx   -e KOKORO_VOICES=/app/models/voices-v1.0.bin   kokoro-sse

If you don't want to bake the models into the image, mount a host folder containing the model files at /app/models instead, e.g. docker run --rm -p 8000:8000 -v "$(pwd)/models:/app/models" kokoro-sse.

4) Endpoints

  • GET /v1/tts.sse?text=...&voice=af_sarah&speed=1.0&lang=en-us
    Returns text/event-stream with events:

    {"seq": 0, "sr": 24000, "ch": 1, "format": "s16le", "pcm16": "<base64>"}
    

    Final message: event: done, data: {"total_chunks": N, "total_samples": M}

  • GET /v1/voices → {"voices": [...]}

  • GET /healthz → {"status": "ok", "model": "..."}
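Producing an event body is just packing samples as little-endian int16 and base64-encoding them. A sketch of the framing (illustrative, not the actual app.py code):

```python
import base64
import json
import struct

def make_event(seq: int, samples: list[float], sr: int = 24000) -> str:
    """Pack float samples in [-1, 1] as s16le, base64 the bytes, and emit
    the JSON event body shape used by /v1/tts.sse."""
    ints = [max(-32768, min(32767, round(s * 32767))) for s in samples]
    pcm = struct.pack(f"<{len(ints)}h", *ints)  # little-endian int16
    return json.dumps({
        "seq": seq,
        "sr": sr,
        "ch": 1,
        "format": "s16le",
        "pcm16": base64.b64encode(pcm).decode("ascii"),
    })
```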

5) Simple HTML client

Open static/client.html in a browser. It buffers SSE chunks and plays after the stream completes.
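A script can take the same buffer-then-play approach: collect every chunk, then wrap the raw PCM in a WAV header with the stdlib wave module so any media player can open it (a sketch; the helper name is illustrative):

```python
import wave

def write_wav(path: str, pcm: bytes, sr: int = 24000) -> None:
    """Wrap raw s16le mono PCM in a WAV container for easy playback."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)   # mono
        wf.setsampwidth(2)   # 16-bit samples
        wf.setframerate(sr)  # 24 kHz, matching the SSE stream
        wf.writeframes(pcm)
```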

6) Python client for real-time playback (recommended)

# Linux / macOS:
curl -G --data-urlencode 'text=Live streaming via SSE' "http://localhost:8000/v1/tts.sse"  | python3 clients/py_play_ffplay.py  | ffplay -nodisp -autoexit -f s16le -ar 24000 -ac 1 -

7) Notes

  • Output sample rate is 24 kHz (mono), PCM16 LE.
  • If you need a WebSocket endpoint (binary frames) or an OpenAI-compatible API shim, you can extend app.py easily.