mms-meta
/

mms-zeroshot-300m

+---
+tags:
+- mms
+- xlsr
+license: cc-by-nc-4.0
+datasets:
+- google/fleurs
+- mozilla-foundation/common_voice_8_0
+metrics:
+- wer
+- cer
+---
+# Massively Multilingual Speech (MMS) - Finetuned ASR - ALL
+This checkpoint is a model fine-tuned for multilingual ASR and is part of Facebook's [Massively Multilingual Speech project](https://research.facebook.com/publications/scaling-speech-technology-to-1000-languages/).
+This checkpoint is based on the [Wav2Vec2 architecture](https://huggingface.co/docs/transformers/model_doc/wav2vec2) and makes use of adapter models to transcribe 1000+ languages.
+The checkpoint consists of **1 billion parameters** and has been fine-tuned from [facebook/mms-1b](https://huggingface.co/facebook/mms-1b) on 1162 languages.
+## Table of Contents
+- [Example](#example)
+- [Model details](#model-details)
+- [Additional Links](#additional-links)
+## Example
+This MMS checkpoint can be used with [Transformers](https://github.com/huggingface/transformers) to transcribe audio in 1107 different
+languages. Let's look at a simple example.
+First, we install `transformers` and some other libraries:
+```
+pip install torch accelerate torchaudio datasets
+pip install --upgrade transformers
+```
+**Note**: MMS requires `transformers >= 4.30`. If version `4.30` is not yet available [on PyPI](https://pypi.org/project/transformers/), install `transformers` from
+source:
+```
+pip install git+https://github.com/huggingface/transformers.git
+```
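As a quick sanity check before loading the model, the installed version can be compared against the `4.30` minimum from the note above. The following is a minimal sketch using only the standard library. The helper names `installed_version` and `meets_minimum` are illustrative and not part of `transformers`, and the comparison deliberately ignores pre-release suffixes such as `.dev0`.

```python
from importlib import metadata
from itertools import takewhile


def installed_version(package="transformers"):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(version, minimum=(4, 30)):
    """Compare the leading numeric components of `version` against `minimum`."""
    parts = []
    for piece in version.split(".")[: len(minimum)]:
        # Keep only the leading digits, so "30rc1" is read as 30.
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    # Pad short versions such as "5" to the same length as `minimum`.
    parts += [0] * (len(minimum) - len(parts))
    return tuple(parts) >= minimum


version = installed_version()
if version is None:
    print("transformers is not installed")
elif meets_minimum(version):
    print(f"transformers {version} is recent enough for MMS")
else:
    print(f"transformers {version} is too old; version 4.30 or later is required")
```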
+Next, we load a couple of audio samples via `datasets`. Make sure that the audio data is sampled at 16,000 Hz (16 kHz).
+```py
+from datasets import load_dataset, Audio
+# English
+stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="test", streaming=True)
+stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
+en_sample = next(iter(stream_data))["audio"]["array"]
+# French
+stream_data = load_dataset("mozilla-foundation/common_voice_13_0", "fr", split="test", streaming=True)
+stream_data = stream_data.cast_column("audio", Audio(sampling_rate=16000))
+fr_sample = next(iter(stream_data))["audio"]["array"]
+```
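The `cast_column` call above lets `datasets` resample the audio on the fly. For audio that does not come from `datasets`, the idea of bringing a clip to 16 kHz can be illustrated with the sketch below. It uses naive linear interpolation in pure Python, which is an assumption made for brevity and not what `datasets` does internally. The function name `resample_linear` is hypothetical, and a real pipeline would use a proper resampler with anti-aliasing (for example from `torchaudio` or `librosa`).

```python
def resample_linear(samples, src_rate, dst_rate=16_000):
    """Resample a mono signal by linear interpolation (no anti-aliasing filter)."""
    samples = list(samples)
    if src_rate == dst_rate or not samples:
        return samples
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        # At the final source sample there is nothing to interpolate towards.
        frac = pos - lo if lo < last else 0.0
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


# One millisecond of a 48 kHz ramp becomes 16 samples at 16 kHz.
clip_48k = [float(i) for i in range(48)]
clip_16k = resample_linear(clip_48k, src_rate=48_000)
print(len(clip_16k), clip_16k[:4])
```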
+Next, we load the model and processor:
+```py
+from transformers import Wav2Vec2ForCTC, AutoProcessor
+import torch
+model_id = "facebook/mms-1b-all"
+processor = AutoProcessor.from_pretrained(model_id)
+model = Wav2Vec2ForCTC.from_pretrained(model_id)
+```
+Now we process the audio data, pass it to the model, and decode the model output, just as we usually do for Wav2Vec2 models such as [facebook/wav2vec2-base-960h](https://huggingface.co/facebook/wav2vec2-base-960h):
+```py
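+# Convert the raw waveform into model inputs
+inputs = processor(en_sample, sampling_rate=16_000, return_tensors="pt")
+# Run inference without tracking gradients
+with torch.no_grad():
+    outputs = model(**inputs).logits
+# Take the most likely token per frame, then decode the ids to text
+ids = torch.argmax(outputs, dim=-1)[0]
+transcription = processor.decode(ids)
+# 'joe keton disapproved of films and buster also had reservations about the media'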
+```
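The `argmax` followed by `processor.decode` amounts to greedy CTC decoding: pick the most likely token per frame, merge consecutive repeats, and drop the blank token. The sketch below illustrates that idea on a toy vocabulary in pure Python. The vocabulary, the blank id `0`, and the `|` word delimiter are assumptions chosen for illustration, and `ctc_greedy_decode` is a hypothetical helper, not the tokenizer's actual implementation.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated frame ids, drop blanks, and join the tokens into text."""
    # Step 1: merge runs of identical ids (one token may span several frames).
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove the blank token, which only separates repeated characters.
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # Step 3: turn the word delimiter into spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Toy vocabulary (assumed for this example only).
toy_vocab = {0: "<pad>", 1: "|", 2: "h", 3: "i", 4: "o"}
frame_ids = [2, 2, 0, 3, 3, 1, 1, 0, 2, 4, 4]
print(ctc_greedy_decode(frame_ids, toy_vocab))  # hi ho
```

A blank between two identical ids keeps both characters, so `[2, 0, 2]` decodes to `"hh"` rather than `"h"`.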
+We can now keep the same model in memory and simply switch out the language adapter by calling `load_adapter()` on the model and `set_target_lang()` on the tokenizer. Both take the target language code as input, here `"fra"` for French.
+```py
+# Switch the tokenizer vocabulary and the model adapter to French
+processor.tokenizer.set_target_lang("fra")
+model.load_adapter("fra")
+# Transcribe the French sample exactly as before
+inputs = processor(fr_sample, sampling_rate=16_000, return_tensors="pt")
+with torch.no_grad():
+    outputs = model(**inputs).logits
+ids = torch.argmax(outputs, dim=-1)[0]
+transcription = processor.decode(ids)
+# "ce dernier est volé tout au long de l'histoire romaine"
+```
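Note that the Common Voice configurations used earlier take two-letter codes (`"en"`, `"fr"`), while the adapter is selected with a three-letter code (`"fra"`). A small lookup table can bridge the two. The following sketch is illustrative only: the `TWO_TO_THREE_LETTER` table covers just a handful of languages, and `to_adapter_code` is a hypothetical helper, not part of `transformers`.

```python
# Partial mapping from two-letter to three-letter language codes (extend as needed).
TWO_TO_THREE_LETTER = {
    "en": "eng",
    "fr": "fra",
    "de": "deu",
    "es": "spa",
    "it": "ita",
    "pt": "por",
}


def to_adapter_code(code):
    """Return a three-letter language code for a two- or three-letter input."""
    code = code.strip().lower()
    if len(code) == 3:
        # Already in the three-letter form; pass it through unchanged.
        return code
    try:
        return TWO_TO_THREE_LETTER[code]
    except KeyError:
        raise ValueError(
            f"No three-letter code known for {code!r}; extend the table"
        ) from None


print(to_adapter_code("fr"))  # fra
```

The result can then be passed to `processor.tokenizer.set_target_lang(...)` and `model.load_adapter(...)` as in the French example.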
+The language can be switched in the same way for all other supported languages. To see the supported language codes, inspect the tokenizer vocabulary:
+```py
+processor.tokenizer.vocab.keys()
+```
+For more details, please have a look at [the official docs](https://huggingface.co/docs/transformers/main/en/model_doc/mms).
+## Model details
+- **Developed by:** Jinming Zhao et al.
+- **Model type:** Zero-shot multilingual speech recognition model, introduced in *Scaling A Simple Approach to Zero-Shot Speech Recognition*
+- **License:** CC-BY-NC 4.0
+- **Number of parameters:** 300 million
+- **Cite as:**
+      @article{zhao2024scaling,
+        title={Scaling A Simple Approach to Zero-Shot Speech Recognition},
+        author={Zhao, Jinming and Pratap, Vineel and Auli, Michael},
+        journal={arXiv preprint arXiv:2407.17852},
+        year={2024}
+      }
+## Additional Links
+- [Paper](https://arxiv.org/abs/2407.17852)
+- [GitHub Repository](https://github.com/facebookresearch/fairseq/tree/main/examples/mms/zero_shot)
+- [Official Space](https://huggingface.co/spaces/mms-meta/mms-zeroshot)