---
library_name: transformers
pipeline_tag: audio-classification
tags:
- audio
- audio-classification
- keyword-spotting
- kws
- wav2vec2
- pytorch
- onnx
- sagemaker
- streaming-inference
- realtime
datasets:
- google/speech_commands
base_model:
- facebook/wav2vec2-base
license: other
language: en
---

# Model Card for hf-kws (Wav2Vec2 Keyword Spotting)

A compact, end-to-end pipeline for training, evaluating, and deploying a **Wav2Vec2-based keyword spotting (KWS)** model on **Google Speech Commands v2**. The repository includes offline and **real-time streaming inference**, **ONNX export**, and **AWS SageMaker** deployment scripts.

## Model Details

### Model Description

This project fine-tunes a Wav2Vec2 audio classifier (e.g., `facebook/wav2vec2-base`) for keyword spotting on **Speech Commands v2** using Hugging Face `transformers`/`datasets`. It supports microphone streaming with sliding-window smoothing, file-based inference, saved JSON metrics/plots, and a minimal **SageMaker** stack (train, realtime/serverless deploy, batch transform).

- **Developed by:** Amirhossein Yousefiramandi (GitHub: [@amirhossein-yousefi](https://github.com/amirhossein-yousefi))
- **Model type:** Audio Classification (Keyword Spotting), Wav2Vec2 backbone with a classification head
- **Language(s):** English
- **License:** No explicit repository license file found (verify before redistribution)
- **Finetuned from model:** `facebook/wav2vec2-base` (16 kHz)

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/keyword-spotting
- **Paper:** Warden, P. (2018). *Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition*. arXiv:1804.03209.

## Uses

### Direct Use

- On-device or edge keyword detection for small command sets (e.g., "yes/no/up/down/stop/go").
- Real-time wake-word / trigger prototypes via the included streaming inference script.
- Batch scoring of short audio clips for command presence via the CLI or SageMaker Batch Transform.

### Downstream Use

- Fine-tune on custom keyword lists or languages (swap the dataset, keep the pipeline).
- Distillation/quantization for mobile deployment (the roadmap mentions TFLite/CoreML).

### Out-of-Scope Use

- Open-vocabulary ASR or general transcription.
- Long-form audio or multi-speaker diarization.
- Safety-critical activation (e.g., medical/industrial controls) without rigorous evaluation and fail-safes.
- Always-on surveillance scenarios without clear user consent and privacy controls.

## Bias, Risks, and Limitations

- **Language & domain bias:** Trained on **English**, one-second command words; limited transfer to other languages, accents, far-field microphones, or noisy environments without adaptation.
- **Vocabulary constraints:** Detects only a fixed label set; out-of-vocabulary words may map to "unknown" or be misclassified.
- **Data licensing:** Ensure **CC-BY-4.0** attribution when redistributing models trained on Speech Commands.

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. Evaluate on target devices/microphones, add noise augmentation, and tune detection thresholds for the deployment context.
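To illustrate how sliding-window smoothing and a detection threshold interact at streaming time, the sketch below averages per-window class probabilities and only fires when the smoothed top score clears a threshold. The function name, window length, threshold, and "unknown" label index are illustrative assumptions for this card; they are not the repository's actual streaming implementation.

```python
# Illustrative only: sliding-window smoothing of per-window KWS probabilities.
# Window size, threshold, and the `unknown_id` convention below are assumptions,
# not values taken from src/stream_infer.py.
from collections import deque

import numpy as np


def smooth_and_detect(prob_stream, window=5, threshold=0.85, unknown_id=0):
    """Average the last `window` probability vectors and yield (label_id, score)
    whenever the smoothed top class (other than `unknown_id`) exceeds `threshold`."""
    history = deque(maxlen=window)
    for probs in prob_stream:                      # probs: np.ndarray, shape (num_labels,)
        history.append(np.asarray(probs, dtype=np.float64))
        avg = np.mean(np.stack(list(history)), axis=0)  # smoothed scores over the window
        top = int(avg.argmax())
        if top != unknown_id and avg[top] >= threshold:
            yield top, float(avg[top])


# Example with random scores standing in for real per-window model outputs:
rng = np.random.default_rng(0)
fake_stream = (rng.dirichlet(np.ones(12)) for _ in range(20))
for label_id, score in smooth_and_detect(fake_stream):
    print(label_id, round(score, 3))
```

In practice, longer windows suppress spurious triggers at the cost of detection latency, so both the window length and the threshold should be tuned on target-device recordings.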
## Usage in Hugging Face (Recommended)

```python
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification, pipeline

model_id = "Amirhossein75/Keyword-Spotting"

# Option A: simple, via the audio-classification pipeline
clf = pipeline("audio-classification", model=model_id)
print(clf("path/to/1sec_16kHz.wav"))

# Option B: manual pre/post-processing
import soundfile as sf
import torch

fe = AutoFeatureExtractor.from_pretrained(model_id)
model = AutoModelForAudioClassification.from_pretrained(model_id)

wave, sr = sf.read("path/to/1sec_16kHz.wav")
inputs = fe(wave, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred_id = int(logits.argmax(-1))
print(model.config.id2label[pred_id])
```

**Note:** It is better not to use `AutoProcessor` with this checkpoint; load the feature extractor via `AutoFeatureExtractor` as shown above.

## How to Get Started with the Model

Use the code below to get started with the model.

```bash
# clone and install
git clone https://github.com/amirhossein-yousefi/keyword-spotting
cd keyword-spotting
python -m venv .venv && source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt

# train (example)
python -m src.train \
  --checkpoint facebook/wav2vec2-base \
  --output_dir ./checkpoints/kws_w2v2 \
  --num_train_epochs 8 \
  --per_device_train_batch_size 16 \
  --per_device_eval_batch_size 16

# single-file inference
python -m src.infer --model_dir ./checkpoints/kws_w2v2 --wav_path /path/to/your.wav --top_k 5

# streaming (microphone)
python -m src.stream_infer --model_dir ./checkpoints/kws_w2v2

# evaluate
python -m src.evaluate_fn --model_dir ./checkpoints/kws_w2v2
```

## Training Details

### Training Data

- **Dataset:** Google **Speech Commands v2** (1-second WAVs, 16 kHz; English; CC-BY-4.0). The typical label set includes "yes/no, up/down, left/right, on/off, stop/go," plus auxiliary words and silence/unknown classes.

### Training Procedure

#### Preprocessing

- Resampled/processed at **16 kHz**.
- Augmentations: **time-shift, noise, random gain**.

#### Training Hyperparameters

- **Training regime:** fp32 (example; adjust as needed)
- **Backbone:** `facebook/wav2vec2-base` (audio classification head)
- **Epochs (example):** 8
- **Batch size (example):** 16 train / 16 eval
- **Framework:** PyTorch + Hugging Face `transformers`

#### Speeds, Sizes, Times

- **Example environment:** Single **NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB)**, PyTorch **2.8.0+cu129**, CUDA **12.9**.
- **Reported training runtime:** ~**3,446.3 s** for the default run (see repository logs/README).

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- Speech Commands v2 **test split**.

#### Factors

- Evaluate by **keyword**, **speaker**, **noise type/level**, and **device/microphone** to assess robustness.

#### Metrics

- **Accuracy**, **F1**, **Precision**, **Recall**, and **cross-entropy loss**, plus runtime and throughput.

### Results

Below are the aggregated metrics at **epoch 10**.

| Split | Accuracy | F1 (weighted) | Precision (weighted) | Recall (weighted) | Loss | Runtime (s) | Samples/s | Steps/s |
|------:|:--------:|:-------------:|:--------------------:|:-----------------:|:----:|:-----------:|:---------:|:-------:|
| **Validation** | 97.13% | 97.14% | 97.17% | 97.13% | 0.123 | 9.29 | 1074.9 | 33.60 |
| **Test** | 96.79% | 96.79% | 96.81% | 96.79% | 0.137 | 9.99 | 1101.97 | 34.446 |

#### Summary

The pipeline reproduces standard Wav2Vec2 KWS performance on Speech Commands; tailor thresholds and augmentations for deployment.
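As a starting point for tailoring the augmentations named in the Preprocessing section (time-shift, additive noise, random gain), here is a minimal sketch operating on a raw 16 kHz waveform. The function name and all parameter ranges are assumptions for illustration, not the values used in the repository's training code.

```python
# Illustrative sketch of waveform-level augmentations (time-shift, noise, random gain).
# Ranges below are assumptions, not the repository's actual hyperparameters.
import numpy as np


def augment(wave, sr=16_000, rng=None):
    rng = rng or np.random.default_rng()

    # Random time-shift of up to +/-100 ms, zero-padding the wrapped region.
    shift = int(rng.integers(-sr // 10, sr // 10 + 1))
    wave = np.roll(wave, shift)
    if shift > 0:
        wave[:shift] = 0.0
    elif shift < 0:
        wave[shift:] = 0.0

    # Additive white noise at a random (small) level.
    wave = wave + rng.uniform(0.0, 0.005) * rng.standard_normal(len(wave))

    # Random gain between -6 dB and +6 dB.
    gain_db = rng.uniform(-6.0, 6.0)
    wave = wave * (10.0 ** (gain_db / 20.0))

    return np.clip(wave, -1.0, 1.0).astype(np.float32)


# Example: augment a 1-second clip (here a silent placeholder waveform).
augmented = augment(np.zeros(16_000, dtype=np.float32))
```

Applying such augmentations on the fly during fine-tuning (rather than precomputing them) keeps the dataset small while exposing the model to a wider range of recording conditions.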
## Model Examination

- Inspect per-class confusion matrices and score distributions from the saved metrics to identify false-positive/false-negative patterns.

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** Single NVIDIA GeForce RTX 3080 Ti Laptop GPU
- **Hours used:** ~0.96 h (example run)

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** Wav2Vec2 (self-supervised acoustic encoder) + classification head for KWS.

### Compute Infrastructure

#### Hardware

- Example: **NVIDIA RTX 3080 Ti Laptop GPU**, 16 GB VRAM.

#### Software

- **PyTorch 2.8.0+cu129**, CUDA driver **12.9**; Hugging Face `transformers`/`datasets`.

## Citation

**BibTeX (Dataset):**

```bibtex
@article{warden2018speechcommands,
  author  = {Warden, Pete},
  title   = {Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition},
  journal = {arXiv e-prints},
  eprint  = {1804.03209},
  year    = {2018},
  month   = apr,
  url     = {https://arxiv.org/abs/1804.03209}
}
```

**APA (Dataset):**

Warden, P. (2018). *Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition*. arXiv:1804.03209.

## Glossary

- **KWS (Keyword Spotting):** Detecting a small set of pre-registered words in short audio clips.
- **Streaming inference:** Frame-by-frame scoring with smoothing over a sliding window.

## More Information

- Speech Commands dataset card: https://huggingface.co/datasets/google/speech_commands
- Wav2Vec2 model docs: https://huggingface.co/docs/transformers/en/model_doc/wav2vec2

## Model Card Authors

- Amirhossein Yousefiramandi

## Model Card Contact

Please open a GitHub Issue in this repository with questions or requests.