Silent Lip Reader: VSR weights
The visual-speech-recognition (lip-reading) model weights used by the Silent Lip Reader Space. Re-hosted here so that the open-source Space is self-contained and does not break if upstream repos move.
- Architecture: Auto-AVSR (ResNet-3D + Conformer encoder, Transformer decoder, joint CTC/attention). Input: 88×88 grayscale mouth crops at 25 fps. Output: text via a 5000-unit SentencePiece (unigram5000) vocabulary. Video-only (no audio path).
- Files: pytorch_model.pt (state dict), unigram5000.model, unigram5000_units.txt.
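A minimal sketch of loading these files once they are downloaded into a local folder. The helper names (`missing_files`, `load_weights`) and the folder layout are assumptions made for illustration, not part of the Space's code. It also assumes the checkpoint is a plain state dict, as the file list says. The state dict alone is not a runnable model: the network itself still has to be built from the Auto-AVSR model code before the weights can be loaded into it.

```python
from pathlib import Path

# File names as listed in this model card.
EXPECTED_FILES = ("pytorch_model.pt", "unigram5000.model", "unigram5000_units.txt")


def missing_files(weights_dir):
    """Return the expected file names that are absent from weights_dir."""
    root = Path(weights_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


def load_weights(weights_dir):
    """Load the state dict and the SentencePiece tokenizer (hypothetical helper).

    torch and sentencepiece are imported lazily so that missing_files()
    can be used without either package installed.
    """
    import torch
    import sentencepiece as spm

    root = Path(weights_dir)
    missing = missing_files(root)
    if missing:
        raise FileNotFoundError(f"missing files in {root}: {missing}")

    # Assumption: the checkpoint holds only tensors, so weights_only=True
    # is sufficient and avoids unpickling arbitrary objects.
    state_dict = torch.load(
        root / "pytorch_model.pt", map_location="cpu", weights_only=True
    )
    tokenizer = spm.SentencePieceProcessor(
        model_file=str(root / "unigram5000.model")
    )
    return state_dict, tokenizer
```

Calling `missing_files("path/to/weights")` first gives a quick check that the download is complete before the heavier load step.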
Credits / provenance (please read)
This checkpoint was not trained by the re-host. Honest attribution:
- Model architecture & training: Auto-AVSR (Pingchuan Ma et al., "Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels"). All model credit to the original authors.
- Checkpoint source: mirrored from AD1TEYA/lip-reading-model on the Hub.
- Re-host + the surrounding system, demo, visual-VAD pipeline, evaluation and research: Ahmet Dedeler (aaahmet).
Intended use
Research and demos of silent visual speech recognition. The weights were trained on LRS3-derived data; treat them as research use. Works best on clear, frontal, well-articulated English. Roughly 25-30% WER on clean speech, and higher on casual speech (lip reading is inherently ambiguous: many phonemes look identical on the lips).
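For readers unfamiliar with the metric, WER (word error rate) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a generic implementation of that definition, not the evaluation script used for the figures above, and the sentences are made up for illustration. The example swaps "mat" for "bat", a typical lip-reading confusion, since /m/ and /b/ are both made with closed lips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("pass the mat please", "pass the bat please"))  # 0.25
```

At 25% WER, about one word in four is wrong on average, so output on clean speech is usually readable but not reliable word for word.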
Usage
Used by the Silent Lip Reader Space: you record a (silent) video, and the Space crops your mouth, chunks utterances by lip motion, and decodes text. See the Space for the full pipeline and research log.
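To make "chunks utterances by lip motion" concrete, here is one simple way such chunking could work. It is an illustrative sketch only, not the Space's actual visual-VAD pipeline. The function name, the per-frame motion scores, and the `threshold`, `min_len` and `max_gap` parameters are all hypothetical. The only value taken from this card is the 25 fps frame rate.

```python
def chunk_by_motion(motion, threshold=0.5, min_len=3, max_gap=2):
    """Group frames with lip motion into (start, end) segments, end exclusive.

    motion    : per-frame motion scores (hypothetical input)
    threshold : score at or above which a frame counts as "speaking"
    min_len   : segments shorter than this many frames are dropped
    max_gap   : pauses of up to this many still frames do not split a segment
    """
    segments = []
    start = None        # first frame of the segment currently being built
    last_active = None  # most recent frame at or above the threshold

    for i, score in enumerate(motion):
        if score >= threshold:
            if start is None:
                start = i
            last_active = i
        elif start is not None and i - last_active > max_gap:
            # The pause is too long: close the segment at the last active frame.
            if last_active + 1 - start >= min_len:
                segments.append((start, last_active + 1))
            start = None

    # Close a segment that is still open when the clip ends.
    if start is not None and last_active + 1 - start >= min_len:
        segments.append((start, last_active + 1))
    return segments


def frames_to_seconds(segments, fps=25):
    """Convert frame-index segments to (start, end) times in seconds."""
    return [(start / fps, end / fps) for start, end in segments]


scores = [0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1]
print(chunk_by_motion(scores))  # [(2, 8), (16, 20)]
```

In the example, the one-frame pause at index 5 is bridged, the isolated active frame at index 12 is discarded as too short, and each remaining segment would be decoded as a separate utterance.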
Built / curated by Ahmet Dedeler (https://ahmetdedeler.com). A credit/link back is appreciated if you use this. License: MIT (follows the upstream Space).