---
title: Whisper Live Kit
emoji: 🐳
colorFrom: purple
colorTo: gray
sdk: docker
app_port: 7860
---

<h1 align="center">WhisperLiveKit</h1>
<p align="center">
<img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/demo.png" alt="WhisperLiveKit Demo" width="730">
</p>
<p align="center"><b>Real-time, Fully Local Speech-to-Text with Speaker Diarization</b></p>
<p align="center">
<a href="https://pypi.org/project/whisperlivekit/"><img alt="PyPI Version" src="https://img.shields.io/pypi/v/whisperlivekit?color=g"></a>
<a href="https://pepy.tech/project/whisperlivekit"><img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/whisperlivekit?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=downloads"></a>
<a href="https://pypi.org/project/whisperlivekit/"><img alt="Python Versions" src="https://img.shields.io/badge/python-3.9--3.13-dark_green"></a>
<a href="https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-MIT/Dual Licensed-dark_green"></a>
</p>

WhisperLiveKit brings real-time speech transcription directly to your browser, with a ready-to-use backend+server and a simple frontend. ✨

Built on [SimulStreaming](https://github.com/ufal/SimulStreaming) (SOTA 2025) and [WhisperStreaming](https://github.com/ufal/whisper_streaming) (SOTA 2023) for transcription, plus [Streaming Sortformer](https://arxiv.org/abs/2507.18446) (SOTA 2025) and [Diart](https://github.com/juanmc2005/diart) (SOTA 2021) for diarization.

### Key Features

- **Real-time Transcription** - Convert speech to text instantly as you speak, locally or on-prem
- **Speaker Diarization** - Identify different speakers in real time (⚠️ the Streaming Sortformer backend is still in development)
- **Multi-User Support** - Handle multiple users simultaneously with a single backend/server
- **Automatic Silence Chunking** - Automatically chunks when no audio is detected to limit buffer size
- **Confidence Validation** - Immediately validate high-confidence tokens for faster inference (WhisperStreaming only)
- **Buffering Preview** - Displays unvalidated transcription segments (not yet compatible with SimulStreaming)
- **Punctuation-Based Speaker Splitting [BETA]** - Align speaker changes with natural sentence boundaries for more readable transcripts
- **SimulStreaming Backend** - [Dual-licensed](https://github.com/ufal/SimulStreaming#-licence-and-contributions) - Ultra-low-latency transcription using the SOTA AlignAtt policy

### Architecture

<img alt="Architecture" src="architecture.png" />

## Quick Start

```bash
# Install the package
pip install whisperlivekit

# Start the transcription server
whisperlivekit-server --model tiny.en

# Open your browser at http://localhost:8000 to see the interface.
# Add --ssl-certfile public.crt --ssl-keyfile private.key to serve over SSL.
```

That's it! Start speaking and watch your words appear on screen.

## Installation

```bash
# Install from PyPI (recommended)
pip install whisperlivekit

# Install from source
git clone https://github.com/QuentinFuxa/WhisperLiveKit
cd WhisperLiveKit
pip install -e .
```

### FFmpeg Dependency

```bash
# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html and add to PATH
```

### Optional Dependencies

```bash
# Voice Activity Controller (prevents hallucinations)
pip install torch

# Sentence-based buffer trimming
pip install mosestokenizer wtpsplit
pip install tokenize_uk  # If you work with Ukrainian text

# Speaker diarization
pip install diart

# Alternative Whisper backends (default is faster-whisper)
pip install whisperlivekit[whisper]              # Original Whisper
pip install whisperlivekit[whisper-timestamped]  # Improved timestamps
pip install whisperlivekit[mlx-whisper]          # Apple Silicon optimization
pip install whisperlivekit[openai]               # OpenAI API
pip install whisperlivekit[simulstreaming]       # SimulStreaming backend (ultra-low latency)
```

### 🎹 Pyannote Models Setup

For diarization, you need access to the pyannote.audio models:

1. [Accept user conditions](https://huggingface.co/pyannote/segmentation) for the `pyannote/segmentation` model
2. [Accept user conditions](https://huggingface.co/pyannote/segmentation-3.0) for the `pyannote/segmentation-3.0` model
3. [Accept user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model
4. Log in with Hugging Face:

```bash
pip install huggingface_hub
huggingface-cli login
```

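In non-interactive environments (CI pipelines, containers), huggingface_hub also reads the `HF_TOKEN` environment variable, so you can skip the interactive login; the token value below is a placeholder:

```bash
# Non-interactive alternative to huggingface-cli login (hf_xxx is a placeholder)
export HF_TOKEN=hf_xxx

# Verify that authentication works
huggingface-cli whoami
```
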
## 💻 Usage Examples

### Command-line Interface

Start the transcription server with various options:

```bash
# Basic server with English model
whisperlivekit-server --model tiny.en

# Advanced configuration with diarization
whisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language auto

# SimulStreaming backend for ultra-low latency
whisperlivekit-server --backend simulstreaming --model large-v3 --frame-threshold 20
```

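To expose the server over HTTPS so browsers can connect via `wss://`, the SSL flags from the Quick Start apply here as well; a sketch, assuming a certificate and key already exist at the paths shown:

```bash
# Serve over HTTPS/WSS (certificate and key paths are examples)
whisperlivekit-server --model base --host 0.0.0.0 --port 8000 \
  --ssl-certfile public.crt --ssl-keyfile private.key
```
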
### Python API Integration (Backend)

Check [basic_server.py](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/basic_server.py) for a complete example.

```python
from whisperlivekit import TranscriptionEngine, AudioProcessor, parse_args
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import HTMLResponse
from contextlib import asynccontextmanager
import asyncio

transcription_engine = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global transcription_engine
    transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
    # You can also load from command-line arguments using parse_args()
    # args = parse_args()
    # transcription_engine = TranscriptionEngine(**vars(args))
    yield

app = FastAPI(lifespan=lifespan)

# Forward transcription results to the client as they are produced
async def handle_websocket_results(websocket: WebSocket, results_generator):
    async for response in results_generator:
        await websocket.send_json(response)
    await websocket.send_json({"type": "ready_to_stop"})

@app.websocket("/asr")
async def websocket_endpoint(websocket: WebSocket):
    global transcription_engine
    # Create a new AudioProcessor for each connection, passing the shared engine
    audio_processor = AudioProcessor(transcription_engine=transcription_engine)
    results_generator = await audio_processor.create_tasks()
    results_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))
    await websocket.accept()
    try:
        while True:
            message = await websocket.receive_bytes()
            await audio_processor.process_audio(message)
    except WebSocketDisconnect:
        # Client closed the connection; stop forwarding results
        results_task.cancel()
```

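Assuming you save this snippet as, for example, `my_asr_server.py` (any module name works), you can serve it with an ASGI server such as uvicorn:

```bash
# Module name is just an example; point uvicorn at wherever you saved the app
uvicorn my_asr_server:app --host 0.0.0.0 --port 8000
```
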
### Frontend Implementation

The package includes a simple HTML/JavaScript implementation that you can adapt for your project. You can find it [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html), or load its content using `get_web_interface_html()`:

```python
from whisperlivekit import get_web_interface_html

html_content = get_web_interface_html()
```

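If you prefer to start from a local copy of that page rather than the packaged one, you can fetch the file directly from the repository (raw URL assumed from the linked path above):

```bash
# Download the example frontend to customize it locally
curl -o live_transcription.html \
  https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/whisperlivekit/web/live_transcription.html
```
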
## ⚙️ Configuration Reference

WhisperLiveKit offers extensive configuration options:

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--host` | Server host address | `localhost` |
| `--port` | Server port | `8000` |
| `--model` | Whisper model size. Caution: `.en` models do not work with SimulStreaming | `tiny` |
| `--language` | Source language code or `auto` | `en` |
| `--task` | `transcribe` or `translate` | `transcribe` |
| `--backend` | Processing backend | `faster-whisper` |
| `--diarization` | Enable speaker identification | `False` |
| `--punctuation-split` | Use punctuation to improve speaker boundaries | `True` |
| `--confidence-validation` | Use confidence scores for faster validation | `False` |
| `--min-chunk-size` | Minimum audio chunk size (seconds) | `1.0` |
| `--vac` | Use Voice Activity Controller | `False` |
| `--no-vad` | Disable Voice Activity Detection | `False` |
| `--buffer_trimming` | Buffer trimming strategy (`sentence` or `segment`) | `segment` |
| `--warmup-file` | Audio file path for model warmup | `jfk.wav` |
| `--ssl-certfile` | Path to the SSL certificate file (for HTTPS support) | `None` |
| `--ssl-keyfile` | Path to the SSL private key file (for HTTPS support) | `None` |
| `--segmentation-model` | Hugging Face model ID for the pyannote.audio segmentation model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `pyannote/segmentation-3.0` |
| `--embedding-model` | Hugging Face model ID for the pyannote.audio embedding model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `speechbrain/spkrec-ecapa-voxceleb` |

**SimulStreaming-specific Options:**

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--frame-threshold` | AlignAtt frame threshold (lower = faster, higher = more accurate) | `25` |
| `--beams` | Number of beams for beam search (1 = greedy decoding) | `1` |
| `--decoder` | Force decoder type (`beam` or `greedy`) | `auto` |
| `--audio-max-len` | Maximum audio buffer length (seconds) | `30.0` |
| `--audio-min-len` | Minimum audio length to process (seconds) | `0.0` |
| `--cif-ckpt-path` | Path to CIF model for word boundary detection | `None` |
| `--never-fire` | Never truncate incomplete words | `False` |
| `--init-prompt` | Initial prompt for the model | `None` |
| `--static-init-prompt` | Static prompt that doesn't scroll | `None` |
| `--max-context-tokens` | Maximum context tokens | `None` |
| `--model-path` | Direct path to a `.pt` model file; downloaded if not found | `./base.pt` |

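As an illustration of how these options combine, here is a hypothetical invocation; the values are arbitrary examples, not recommendations:

```bash
# Example only: SimulStreaming with beam search, a larger frame threshold, and an initial prompt
whisperlivekit-server --backend simulstreaming --model large-v3 \
  --frame-threshold 30 --beams 5 --audio-max-len 25.0 \
  --init-prompt "Vocabulary: WhisperLiveKit, diarization, AlignAtt"
```
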
## 🔧 How It Works

1. **Audio Capture**: The browser's MediaRecorder API captures audio in webm/opus format
2. **Streaming**: Audio chunks are sent to the server via WebSocket
3. **Processing**: The server decodes the audio with FFmpeg and streams it into the model for transcription
4. **Real-time Output**: Partial transcriptions appear immediately in light gray (the 'aperçu'), and finalized text appears in normal color

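Conceptually, step 3 converts the incoming webm/opus stream into the 16 kHz mono PCM that Whisper models expect. A rough standalone equivalent for a saved chunk (illustrative only, not the exact command the server runs):

```bash
# Illustrative only: decode a webm/opus recording to 16 kHz mono 16-bit PCM
ffmpeg -i chunk.webm -f s16le -acodec pcm_s16le -ac 1 -ar 16000 chunk.pcm
```
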
## 🚀 Deployment Guide

To deploy WhisperLiveKit in production:

1. **Server Setup** (Backend):

   ```bash
   # Install production ASGI server
   pip install uvicorn gunicorn

   # Launch with multiple workers
   gunicorn -k uvicorn.workers.UvicornWorker -w 4 your_app:app
   ```

2. **Frontend Integration**:
   - Host your customized version of the example HTML/JS in your web application
   - Ensure the WebSocket connection points to your server's address

3. **Nginx Configuration** (recommended for production):

   ```nginx
   server {
       listen 80;
       server_name your-domain.com;

       location / {
           proxy_pass http://localhost:8000;
           proxy_http_version 1.1;
           proxy_set_header Upgrade $http_upgrade;
           proxy_set_header Connection "upgrade";
           proxy_set_header Host $host;
       }
   }
   ```

4. **HTTPS Support**: For secure deployments, use `wss://` instead of `ws://` in the WebSocket URL

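For local testing of a secure setup, you can generate a self-signed certificate matching the `--ssl-certfile` / `--ssl-keyfile` file names used in the Quick Start (browsers will warn about self-signed certificates; use a CA-issued certificate in production):

```bash
# Self-signed certificate for local HTTPS/WSS testing only
openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
  -keyout private.key -out public.crt -subj "/CN=localhost"
```
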
### 🐋 Docker

A basic Dockerfile is provided which allows re-use of the Python package installation options. ⚠️ For **large** models, ensure that your **docker runtime** has enough **memory** available. See the usage examples below:

#### All defaults

- Create a reusable image with only the basics and then run it as a named container:

  ```bash
  docker build -t whisperlivekit-defaults .
  docker create --gpus all --name whisperlivekit -p 8000:8000 whisperlivekit-defaults
  docker start -i whisperlivekit
  ```

> **Note**: If you're running on a system without NVIDIA GPU support (such as a Mac with Apple Silicon or any system without CUDA capabilities), you need to **remove the `--gpus all` flag** from the `docker create` command. Without GPU acceleration, transcription will use the CPU only, which may be significantly slower. Consider using small models for better performance on CPU-only systems.

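On a CPU-only machine, the same workflow simply drops the GPU flag:

```bash
# CPU-only: same as above, but without --gpus all
docker create --name whisperlivekit -p 8000:8000 whisperlivekit-defaults
docker start -i whisperlivekit
```
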
#### Customization

- Customize the container options:

  ```bash
  docker build -t whisperlivekit-defaults .
  docker create --gpus all --name whisperlivekit-base -p 8000:8000 whisperlivekit-defaults --model base
  docker start -i whisperlivekit-base
  ```

- `--build-arg` options (a combined example follows this list):
  - `EXTRAS="whisper-timestamped"` - Add extras to the image's installation (no spaces). Remember to set the necessary container options!
  - `HF_PRECACHE_DIR="./.cache/"` - Pre-load a model cache for faster first-time start
  - `HF_TKN_FILE="./token"` - Add your Hugging Face Hub access token to download gated models

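A sketch of a build command combining these arguments; the extras, cache directory, token file, and image tag are placeholders to adapt to your setup:

```bash
# Example build using the documented build arguments (values are placeholders)
docker build \
  --build-arg EXTRAS="whisper-timestamped" \
  --build-arg HF_PRECACHE_DIR="./.cache/" \
  --build-arg HF_TKN_FILE="./token" \
  -t whisperlivekit-custom .
```
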
## 🔮 Use Cases

Capture discussions in real time for meeting transcription, help hearing-impaired users follow conversations through accessibility tools, automatically transcribe podcasts or videos for content creation, transcribe support calls with speaker identification for customer service...

## 🙏 Acknowledgments

We extend our gratitude to the original authors of:

| [Whisper Streaming](https://github.com/ufal/whisper_streaming) | [SimulStreaming](https://github.com/ufal/SimulStreaming) | [Diart](https://github.com/juanmc2005/diart) | [OpenAI Whisper](https://github.com/openai/whisper) |
| -------- | ------- | -------- | ------- |