---
license: apache-2.0
task_categories:
- audio-classification
language:
- it
tags:
- intent
- intent-classification
- audio-classification
- audio
pretty_name: ITALIC
size_categories:
- 10K<n<100K
base_model:
- jonatasgrosman/wav2vec2-large-xlsr-53-italian
model-index:
- name: xls-r-53-it-italic-speaker
  results: []
datasets:
- RiTA-nlp/ITALIC
library_name: transformers
---

# wav2vec 2.0 XLS-R 53-IT (300m) fine-tuned on ITALIC - "Hard Speaker"

ITALIC is the first intent classification dataset for the Italian language.
It includes spoken and written utterances annotated with 60 intents.
The dataset is available on [Zenodo](https://zenodo.org/record/8040649), and connectors are available for the [HuggingFace Hub](https://huggingface.co/datasets/RiTA-nlp/ITALIC).

This is the [jonatasgrosman/wav2vec2-large-xlsr-53-italian](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-italian) model fine-tuned on the "Hard Speaker" split of ITALIC.

It achieves the following results on the test set:

- Accuracy: 0.837
- F1: 0.778
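
As a reminder of what these two numbers measure, here is a minimal sketch in plain Python. It assumes the F1 score is macro-averaged over intents, which this card does not state, and the intent names and predictions are hypothetical placeholders, not data from ITALIC.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of utterances whose predicted intent matches the reference."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-intent F1 scores (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted intent p, but it was wrong
            fn[t] += 1  # reference intent t was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical reference and predicted intents for four utterances
y_true = ["alarm_set", "alarm_set", "play_music", "weather_query"]
y_pred = ["alarm_set", "play_music", "play_music", "weather_query"]

print(f"Accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"Macro F1: {macro_f1(y_true, y_pred):.3f}")
```

With many intents of uneven frequency, a macro-averaged F1 can sit noticeably below accuracy, because rare intents count as much as frequent ones.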

## Usage

You can use the model directly in the following manner:

```python
import torch
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load an audio file, resampled to the 16 kHz rate the model expects
audio_array, sr = librosa.load("path_to_audio.wav", sr=16000)

# Load model and feature extractor
model = AutoModelForAudioClassification.from_pretrained("alkiskoudounas/xls-r-53-it-italic-speaker")
feature_extractor = AutoFeatureExtractor.from_pretrained("jonatasgrosman/wav2vec2-large-xlsr-53-italian")

# Extract features
inputs = feature_extractor(audio_array.squeeze(), sampling_rate=feature_extractor.sampling_rate, padding=True, return_tensors="pt")

# Compute logits (no gradients are needed for inference)
with torch.no_grad():
    logits = model(**inputs).logits
```
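
The logits are unnormalized scores, one per intent. A minimal sketch of turning them into a predicted intent and a confidence value follows. It uses plain Python lists so it runs without downloading the model. The `predict_intent` helper, the example logits, and the three intent names are hypothetical and only illustrate the softmax and argmax steps.

```python
import math


def predict_intent(logits, id2label):
    """Return (label, probability) for the highest-scoring class.

    `logits` is a flat list of floats, one per class; `id2label` maps
    class indices to intent names.
    """
    # Numerically stable softmax: subtract the maximum before exponentiating
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Argmax over the class probabilities
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Hypothetical logits and label mapping, for illustration only
example_logits = [0.2, 2.5, -1.0]
example_id2label = {0: "alarm_set", 1: "weather_query", 2: "play_music"}

label, prob = predict_intent(example_logits, example_id2label)
print(label, round(prob, 3))
```

With the model loaded as in the snippet above, `predict_intent(logits[0].detach().tolist(), model.config.id2label)` would apply the same steps, assuming the checkpoint's configuration carries the intent names in `id2label`.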

For more information about the dataset, please refer to the [paper](https://arxiv.org/abs/2306.08502).

## Citation

If you use this model in your research, please cite the following papers:

```bibtex
@inproceedings{koudounas2023italic,
  title={ITALIC: An Italian Intent Classification Dataset},
  author={Koudounas, Alkis and La Quatra, Moreno and Vaiani, Lorenzo and Colomba, Luca and Attanasio, Giuseppe and Pastor, Eliana and Cagliero, Luca and Baralis, Elena},
  booktitle={Proc. Interspeech 2023},
  pages={2153--2157},
  year={2023}
}

@inproceedings{koudounas2025unlearning,
  title={"Alexa, can you forget me?" Machine Unlearning Benchmark in Spoken Language Understanding},
  author={Koudounas, Alkis and Savelli, Claudio and Giobergia, Flavio and Baralis, Elena},
  booktitle={Proc. Interspeech 2025},
  year={2025}
}
```