---

language: fr
license: mit
tags:
- bert
- language-model
- flaubert 
- french
- flaubert-base
- uncased
- asr
- speech
- oral
- natural language understanding
- NLU
- spoken language understanding
- SLU
- understanding
---


# FlauBERT-Oral models: Using ASR-Generated Text for Spoken Language Modeling

**FlauBERT-Oral** models are French BERT models trained on a very large amount of automatically transcribed speech from 350,000 hours of diverse French TV shows. They were trained with the [**FlauBERT software**](https://github.com/getalp/Flaubert) using the same parameters as the [flaubert-base-uncased](https://huggingface.co/flaubert/flaubert_base_uncased) model (12 layers, 12 attention heads, hidden size 768, 137M parameters, uncased).
 
## Available FlauBERT-Oral models

- `flaubert-oral-asr`: trained from scratch on ASR data, keeping the BPE tokenizer and vocabulary of flaubert-base-uncased
- `flaubert-oral-asr_nb`: trained from scratch on ASR data, with a BPE tokenizer also trained on the same corpus
- `flaubert-oral-mixed`: trained from scratch on a mixed corpus of ASR and text data, with a BPE tokenizer also trained on the same corpus
- `flaubert-oral-ft`: flaubert-base-uncased fine-tuned for a few epochs on ASR data

## Usage for sequence classification
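```python
from transformers import FlaubertForSequenceClassification, FlaubertTokenizer

# Load the tokenizer that matches the pretrained checkpoint
flaubert_tokenizer = FlaubertTokenizer.from_pretrained("nherve/flaubert-oral-asr")

# Load the encoder with a classification head; num_labels should match
# the number of classes in your task (14 here)
flaubert_classif = FlaubertForSequenceClassification.from_pretrained("nherve/flaubert-oral-asr", num_labels=14)

# Summarize the sequence by averaging the hidden states of all tokens
flaubert_classif.sequence_summary.summary_type = 'mean'

# Then, train your model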
```
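
Setting `summary_type` to `'mean'` asks the classification head to summarize a sequence by averaging its token representations rather than reading a single position. The following is a minimal, dependency-free sketch of that pooling step on made-up numbers. The function name `mean_pool` and the toy values are hypothetical and only illustrate the idea; this is not the library's implementation.

```python
from typing import List


def mean_pool(hidden_states: List[List[float]]) -> List[float]:
    """Average a (seq_len x hidden_dim) matrix over the sequence axis.

    Hypothetical helper for illustration: it averages over every position
    it is given, so any padding handling is left to the caller.
    """
    if not hidden_states:
        raise ValueError("need at least one token vector")
    seq_len = len(hidden_states)
    hidden_dim = len(hidden_states[0])
    return [
        sum(token[d] for token in hidden_states) / seq_len
        for d in range(hidden_dim)
    ]


# Toy hidden states: 3 tokens, 2 dimensions (made-up values).
toy_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

pooled = mean_pool(toy_states)   # one vector summarizing the whole sequence
first_token = toy_states[0]      # what a first-position summary would read instead
```

In this toy case the pooled vector mixes information from all three tokens, whereas a first-position summary would only see the first row.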

## References
If you use FlauBERT-Oral models in a scientific publication, or if you find the resources in this repository useful, please cite the following paper:
```
@InProceedings{herve2022flaubertoral,
  author    = {Herv\'{e}, Nicolas and Pelloin, Valentin and Favre, Benoit and Dary, Franck and Laurent, Antoine and Meignier, Sylvain and Besacier, Laurent},
  title     = {Using ASR-Generated Text for Spoken Language Modeling},
  booktitle = {Proceedings of "Challenges & Perspectives in Creating Large Language Models" ACL 2022 Workshop},
  month     = {May},
  year      = {2022}
}
```