Silent Lip Reader: VSR weights

The visual-speech-recognition (lip-reading) model weights used by the Silent Lip Reader Space. Re-hosted here so that the open-source Space is self-contained and does not break if upstream repos move.

  • Architecture: Auto-AVSR (ResNet-3D + Conformer encoder, Transformer decoder, joint CTC/attention). Input: 88×88 grayscale mouth crops at 25 fps. Output: text via a 5000-unit SentencePiece (unigram5000) vocabulary. Video-only (no audio path).
  • Files: pytorch_model.pt (state dict), unigram5000.model, unigram5000_units.txt.
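
A minimal sketch of fetching and opening the files listed above. It assumes the repo id is aaahmet/silent-lip-reader-model and that the huggingface_hub, torch and sentencepiece packages are installed; the helper names fetch_files and load_weights_and_tokenizer are hypothetical, not part of the Space. It only loads the raw state dict and the tokenizer. Building the Auto-AVSR network that the state dict fits into requires the upstream model code and is not shown.

```python
# Assumption: the Hub repo id for this re-host. Change it if the files live elsewhere.
REPO_ID = "aaahmet/silent-lip-reader-model"

# File names as listed on this card.
FILES = ("pytorch_model.pt", "unigram5000.model", "unigram5000_units.txt")


def fetch_files(repo_id=REPO_ID):
    """Download (or reuse from the local cache) every file; return {name: local_path}."""
    # Imported lazily so this module can be imported without the optional dependency.
    from huggingface_hub import hf_hub_download

    return {name: hf_hub_download(repo_id=repo_id, filename=name) for name in FILES}


def load_weights_and_tokenizer(paths):
    """Open the state dict on CPU and the SentencePiece model (hypothetical helper)."""
    import sentencepiece as spm
    import torch

    # weights_only=True restricts unpickling to tensors and plain containers.
    # If the checkpoint stores other objects this call may fail; that is untested here.
    state_dict = torch.load(
        paths["pytorch_model.pt"], map_location="cpu", weights_only=True
    )
    tokenizer = spm.SentencePieceProcessor(model_file=paths["unigram5000.model"])
    return state_dict, tokenizer


if __name__ == "__main__":
    paths = fetch_files()
    state_dict, tokenizer = load_weights_and_tokenizer(paths)
    print(len(state_dict), "entries in the state dict")
    print(tokenizer.get_piece_size(), "SentencePiece units")
```

If the tokenizer matches the description on this card, the reported unit count should be close to 5000; treat any mismatch as a sign that the wrong file was loaded.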

Credits / provenance (please read)

This checkpoint was not trained by the re-host. Attribution:

  • Model architecture & training: Auto-AVSR (Pingchuan Ma et al., "Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels"). All model credit to the original authors.
  • Checkpoint source: mirrored from AD1TEYA/lip-reading-model on the Hub.
  • Re-host + the surrounding system, demo, visual-VAD pipeline, evaluation and research: Ahmet Dedeler (🤗 aaahmet).

Intended use

Research and demos of silent visual speech recognition. The weights were trained on LRS3-derived data; treat them as research use. Works best on clear, frontal, well-articulated English. Expect roughly 25–30% WER (word error rate) on clean speech and higher on casual speech, because lip reading is inherently ambiguous: many phonemes look identical on the lips.
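
For readers unfamiliar with the metric, WER is the word-level edit distance between the reference and the decoded text, divided by the number of reference words. The sketch below is a generic implementation for illustration, not the evaluation script behind the figures above; the function name and the lowercase/whitespace normalization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# "p" and "b" look the same on the lips, so this kind of confusion is typical.
print(word_error_rate("please pass the salt", "please bass the salt"))  # 0.25
```

One wrong word in a four-word sentence already gives 25% WER, which puts the clean-speech figure in perspective.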

Usage

Used by the Silent Lip Reader Space: you record a (silent) video, and the Space crops your mouth, chunks utterances by lip motion, and decodes text. See the Space for the full pipeline and research log.
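
To illustrate what chunking utterances by lip motion can look like, here is a minimal sketch. It is not the Space's actual visual-VAD code: it assumes a per-frame lip-motion score has already been computed (for example from frame differences in the mouth crop), and the function name, threshold and gap values are all hypothetical.

```python
def chunk_by_motion(motion, threshold=0.5, max_gap=10, min_len=5):
    """Group frames into utterances from per-frame lip-motion scores.

    motion    : sequence of floats, one score per frame (assumed precomputed)
    threshold : score at or above which a frame counts as "lips moving"
    max_gap   : still frames tolerated inside one utterance (10 frames = 0.4 s at 25 fps)
    min_len   : utterances shorter than this many frames are dropped
    Returns a list of (start, end) frame indices, end exclusive.
    """
    segments = []
    start = None        # first active frame of the current utterance
    last_active = None  # most recent active frame seen
    for i, score in enumerate(motion):
        if score >= threshold:
            if start is None:
                start = i
            last_active = i
        elif start is not None and i - last_active > max_gap:
            # The pause is long enough: close the utterance at the last active frame.
            segments.append((start, last_active + 1))
            start = None
    if start is not None:
        segments.append((start, last_active + 1))
    return [(s, e) for s, e in segments if e - s >= min_len]


scores = [0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(chunk_by_motion(scores, max_gap=3, min_len=3))  # [(2, 9), (16, 20)]
```

Each returned range would then be cut from the mouth-crop sequence and decoded as one utterance. The short pause at frames 5 and 6 is bridged, while the longer pause starting at frame 9 splits the clip in two.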


Built / curated by Ahmet Dedeler (https://ahmetdedeler.com). A credit/link back is appreciated if you use this. License: MIT (follows the upstream Space).

