Audio transcription examples added and model names changed

Files changed (7)

.gitattributes +4 -0
README.md +76 -9
audio_samples/example1.wav +3 -0
audio_samples/example2.wav +3 -0
audio_samples/example3.wav +3 -0
images/cer.png +0 -0
images/wer.png +0 -0

.gitattributes CHANGED

@@ -37,3 +37,7 @@ unigrams.txt filter=lfs diff=lfs merge=lfs -text
 language_model/3gram.bin filter=lfs diff=lfs merge=lfs -text
 language_model/attrs.json filter=lfs diff=lfs merge=lfs -text
 language_model/unigrams.txt filter=lfs diff=lfs merge=lfs -text
+audio_samples/example1.wav filter=lfs diff=lfs merge=lfs -text
+audio_samples/example2.wav filter=lfs diff=lfs merge=lfs -text
+audio_samples/example3.wav filter=lfs diff=lfs merge=lfs -text
+audio_samples/example4.wav filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -53,13 +53,80 @@ Next you can use the model using the `transformers` Python package as follows:
 {'text': 'your transcription'}
 ```
 ## Model Details
 Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained [Wav2Vec2-XLS-R-300M](facebook/wav2vec2-xls-r-300m) has been fine-tuned for automatic speech recognition with the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main) dataset to enhance its performance in recognizing Danish speech with consideration to different dialects. The model was trained for 30K steps using the training setup in the [CoRaL repository](https://github.com/alexandrainst/coral/tree) by running:
 ```
 python src/scripts/finetune_asr_model.py model=wav2vec2-small max_steps=30000 datasets.coral_conversation_internal.id=CoRal-dataset/coral-v2 datasets.coral_readaloud_internal.id=CoRal-dataset/coral-v2
 ```
-The model is evaluated using a Language Model (LM) as post-processing. The utilized LM is the one trained and used by [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m).
 ## Dataset
 ### [CoRal-v2](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main)
@@ -84,8 +151,8 @@ The model was evaluated using the following metrics:
 | :----------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
 | [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-whisper-large) |                 315M | Read-aloud and conversation |                                                                             6.5% ± 0.2% |                                                                            16.3% ± 0.4% |
 | [CoRal-dataset/roest-whisper-large-v2](https://huggingface.co/CoRal-dataset/roest-whisper-large) |                1540M | Read-aloud and conversation |                                                                             5.3%  ± 0.2%            |                                                                               12.0% ± 0.4%          |
-| [Alvenir/coral-1-whisper-large](https://huggingface.co/Alvenir/coral-1-whisper-large)            |                1540M |                  Read-aloud |                                                                         **4.3% ± 0.2%** |                                                                        **10.4% ± 0.3%** |
-| [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m)                      |                 315M |                  Read-aloud |                                                                             6.6% ± 0.2% |                                                                            17.0% ± 0.4% |
 | [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2)                                  |                1540M |                  Read-aloud |                                                                             4.7% ± 0.2% |                                                                            11.8% ± 0.3% |
 | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3)                                 |                1540M |                           - |                                                                            11.4% ± 0.3% |                                                                            28.3% ± 0.6% |
@@ -97,7 +164,7 @@ The model was evaluated using the following metrics:
 <img src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/images/cer.png">
 ### Table CER scores in % of evaluation across demographics on the CoRal test data
-| Category | roest-wav2vec2-315m-v2 | roest-315m | roest-whisper-large-v2 | coral-1-whisper-large |
 |:---:|:---:|:---:|:---:|:---:|
 | female | 7.2 | 7.4 | 6.9 | 5.1 |
 | male | 5.7 | 5.8 | 3.7 | 3.6 |
@@ -117,7 +184,7 @@ The model was evaluated using the following metrics:
 | Overall | 6.5 | 6.6 | 5.3 | 4.3 |
 ### Table WER scores in % of evaluation across demographics on the CoRal test data
-| Category | roest-wav2vec2-315m-v2 | roest-315m | roest-whisper-large-v2 | coral-1-whisper-large |
 |:---:|:---:|:---:|:---:|:---:|
 | female | 17.7 | 18.5 | 14.2 | 11.5 |
 | male | 14.9 | 15.5 | 9.9 | 9.4 |
@@ -138,19 +205,19 @@ The model was evaluated using the following metrics:
 ### Roest-wav2vec2-315M with and without language model
-The inclusion of a post-processing language model can affect the performance significantly. The Roest-v1 and Roest-v2 models are using the same Language Model (LM). The utilized LM is the one trained and used by [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m).
 | Model                                                                                         | Number of parameters |   Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
 | :-------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
 | [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                               Yes |                                                                         **6.5% ± 0.2%** |                                                                        **16.3% ± 0.4%** |
 | [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                                No |                                                                             8.2% ± 0.2% |                                                                            25.1% ± 0.4% |
-| [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m)                   |                 315M |                  Read-aloud |                               Yes |                                                                             6.6% ± 0.2% |                                                                            17.0% ± 0.4% |
-| [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m)                   |                 315M |                  Read-aloud |                                No |                                                                             8.6% ± 0.2% |                                                                            26.3% ± 0.5% |
 ### Detailed Roest-wav2vec2-315M with and without language model on different dialects
 Here are the results of the model on different danish dialects in the test set:
-|             | Roest-1 |         | Roest-1 |         | Roest-2 |         | Roest-2 |         |
 |-------------|---------|---------|---------|---------|---------|---------|---------|---------|
 | LM          | No      |         | Yes     |         | No      |         | Yes     |         |
 |-------------|---------|---------|---------|---------|---------|---------|---------|---------|

 {'text': 'your transcription'}
 ```
+## Transcription examples
+### Example 1
+<audio controls>
+  <source src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example1.wav" type="audio/wav">
+  Your browser does not support the audio tag.
+</audio>
+**Dialect:** Vestjysk
+**Transcription:** det blev til yderlig ti mål i den første sæson på trods af en position som back
+**Target transcription:** det blev til yderligere ti mål i den første sæson på trods af en position som back
+**CER:** 3.7%
+**WER:** 5.9%
+### Example 2
+<audio controls>
+  <source src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example2.wav" type="audio/wav">
+  Your browser does not support the audio tag.
+</audio>
+**Dialect:** Sønderjysk
+**Transcription:** en arkitektoniske udformning af pladser forslagene iver benzen
+**Target transcription:** den arkitektoniske udformning af pladsen er forestået af ivar bentsen
+**CER:** 20.3%
+**WER:** 60.0%
+### Example 3
+<audio controls>
+  <source src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example3.wav" type="audio/wav">
+  Your browser does not support the audio tag.
+</audio>
+**Dialect:** Nordsjællandsk
+**Transcription:** østrig og ungarn samarbejder om søen gennem den østrigske og ungarske vandkommission
+**Target transcription:** østrig og ungarn samarbejder om søen gennem den østrigske og ungarske vandkommission
+**CER:** 0.0%
+**WER:** 0.0%
+### Example 4
+<audio controls>
+  <source src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/audio_samples/example4.wav" type="audio/wav">
+  Your browser does not support the audio tag.
+</audio>
+**Dialect:** Lollandsk
+**Transcription:** det er produceret af thomas helme og indspillede i easy sound recording studio i københavn
+**Target transcription:** det er produceret af thomas helmig og indspillet i easy sound recording studio i københavn
+**CER:** 4.4%
+**WER:** 13.3%
 ## Model Details
 Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained [Wav2Vec2-XLS-R-300M](facebook/wav2vec2-xls-r-300m) has been fine-tuned for automatic speech recognition with the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main) dataset to enhance its performance in recognizing Danish speech with consideration to different dialects. The model was trained for 30K steps using the training setup in the [CoRaL repository](https://github.com/alexandrainst/coral/tree) by running:
 ```
 python src/scripts/finetune_asr_model.py model=wav2vec2-small max_steps=30000 datasets.coral_conversation_internal.id=CoRal-dataset/coral-v2 datasets.coral_readaloud_internal.id=CoRal-dataset/coral-v2
 ```
+The model is evaluated using a Language Model (LM) as post-processing. The utilized LM is the one trained and used by [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/alexandrainst/roest-315m).
 ## Dataset
 ### [CoRal-v2](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main)
 | :----------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
 | [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-whisper-large) |                 315M | Read-aloud and conversation |                                                                             6.5% ± 0.2% |                                                                            16.3% ± 0.4% |
 | [CoRal-dataset/roest-whisper-large-v2](https://huggingface.co/CoRal-dataset/roest-whisper-large) |                1540M | Read-aloud and conversation |                                                                             5.3%  ± 0.2%            |                                                                               12.0% ± 0.4%          |
+| [Alvenir/roest-whisper-large-v1](https://huggingface.co/Alvenir/coral-1-whisper-large)            |                1540M |                  Read-aloud |                                                                         **4.3% ± 0.2%** |                                                                        **10.4% ± 0.3%** |
+| [alexandrainst/roest-wav2vec2-315M-v1](https://huggingface.co/alexandrainst/roest-315m)                      |                 315M |                  Read-aloud |                                                                             6.6% ± 0.2% |                                                                            17.0% ± 0.4% |
 | [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2)                                  |                1540M |                  Read-aloud |                                                                             4.7% ± 0.2% |                                                                            11.8% ± 0.3% |
 | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3)                                 |                1540M |                           - |                                                                            11.4% ± 0.3% |                                                                            28.3% ± 0.6% |
 <img src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/images/cer.png">
 ### Table CER scores in % of evaluation across demographics on the CoRal test data
+| Category | roest-wav2vec2-315m-v2 | roest-wav2vec2-315m-v1 | roest-whisper-large-v2 | roest-whisper-large-v1 |
 |:---:|:---:|:---:|:---:|:---:|
 | female | 7.2 | 7.4 | 6.9 | 5.1 |
 | male | 5.7 | 5.8 | 3.7 | 3.6 |
 | Overall | 6.5 | 6.6 | 5.3 | 4.3 |
 ### Table WER scores in % of evaluation across demographics on the CoRal test data
+| Category | roest-wav2vec2-315m-v2 | roest-wav2vec2-315m-v1 | roest-whisper-large-v2 | roest-whisper-large-v1 |
 |:---:|:---:|:---:|:---:|:---:|
 | female | 17.7 | 18.5 | 14.2 | 11.5 |
 | male | 14.9 | 15.5 | 9.9 | 9.4 |
 ### Roest-wav2vec2-315M with and without language model
+The inclusion of a post-processing language model can affect the performance significantly. The Roest-v1 and Roest-v2 models are using the same Language Model (LM). The utilized LM is the one trained and used by [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/alexandrainst/roest-315m).
 | Model                                                                                         | Number of parameters |   Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
 | :-------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
 | [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                               Yes |                                                                         **6.5% ± 0.2%** |                                                                        **16.3% ± 0.4%** |
 | [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                                No |                                                                             8.2% ± 0.2% |                                                                            25.1% ± 0.4% |
+| [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/alexandrainst/roest-315m)                   |                 315M |                  Read-aloud |                               Yes |                                                                             6.6% ± 0.2% |                                                                            17.0% ± 0.4% |
+| [alexandrainst/roest-wav2vec2-315m-v1](https://huggingface.co/alexandrainst/roest-315m)                   |                 315M |                  Read-aloud |                                No |                                                                             8.6% ± 0.2% |                                                                            26.3% ± 0.5% |
 ### Detailed Roest-wav2vec2-315M with and without language model on different dialects
 Here are the results of the model on different danish dialects in the test set:
+|             | Roest-v1 |         | Roest-v1 |         | Roest-v2 |         | Roest-v2 |         |
 |-------------|---------|---------|---------|---------|---------|---------|---------|---------|
 | LM          | No      |         | Yes     |         | No      |         | Yes     |         |
 |-------------|---------|---------|---------|---------|---------|---------|---------|---------|

audio_samples/example1.wav ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:97be8c695d4c6debdd4096cea9400468992ecf27743f313ff8e988271c9b6aae
+size 529978

audio_samples/example2.wav ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e97d6c5d5999f8f2c6eed1f2847f4dae0006e7025148a17503b3f836c5f4a57a
+size 249658

audio_samples/example3.wav ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5e8c8870082c39d13d1f2800cefb971bd9d56667d1e4437d05feee8e3900e18a
+size 361018

images/cer.png CHANGED

images/wer.png CHANGED
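
The CER and WER values listed for the transcription examples added to README.md can be sanity-checked with a short script. The sketch below is not taken from the CoRal repository: it assumes the metrics are plain Levenshtein distances over characters (CER) and whitespace-separated words (WER), divided by the length of the target transcription, with no text normalization. The helper names `edit_distance`, `cer`, and `wer` are hypothetical; the strings are the ones from Example 1.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(target, transcription):
    """Character error rate: character edits divided by target length."""
    return edit_distance(target, transcription) / len(target)


def wer(target, transcription):
    """Word error rate: word edits divided by number of target words."""
    ref_words = target.split()
    return edit_distance(ref_words, transcription.split()) / len(ref_words)


# Strings copied from Example 1 (Vestjysk) in the README diff.
target = "det blev til yderligere ti mål i den første sæson på trods af en position som back"
transcription = "det blev til yderlig ti mål i den første sæson på trods af en position som back"

print(f"CER: {cer(target, transcription):.1%}")  # CER: 3.7%
print(f"WER: {wer(target, transcription):.1%}")  # WER: 5.9%
```

Under these assumptions the script reproduces the 3.7% CER and 5.9% WER reported for Example 1. The evaluation code in the actual training setup may normalize text differently, so treat this only as a consistency check on the README values.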