Bram Vanroy PRO

BramVanroy

https://bramvanroy.github.io/

AI & ML interests

Artificial intelligence, natural language processing, computational linguistics

Recent Activity

liked a dataset about 6 hours ago

GPT-NL/GPT-NL_Public_Corpus

Viewer • Updated 4 days ago • 302M • 452 • 6

liked a dataset about 14 hours ago

openeurollm/propella-annotations

Viewer • Updated 19 days ago • 3.17B • 1.74k • 17

New activity in openeurollm/propella-annotations about 14 hours ago

Dutch FineWeb 2 and HPLT3

➕ 3

#5 opened about 14 hours ago by BramVanroy

liked a model about 17 hours ago

RedHatAI/gemma-4-31B-it-FP8-Dynamic

33B • Updated 1 day ago • 408 • 4

liked a Space 5 days ago

Evaluation Guidebook

📝

300

Explore LLM benchmark trends over time

liked a dataset 19 days ago

GPT-NL/DuidelijkeTaal-v1.0-split

Viewer • Updated Dec 23, 2025 • 1.07k • 43 • 2

liked a dataset 22 days ago

nvidia/Nemotron-Personas-France

Viewer • Updated 23 days ago • 1M • 7.06k • 78

reacted to yuriyvnv's post with 🚀 27 days ago

Post

429

🎯 WAVe-1B-Multimodal-NL: Word-Level Speech Quality Assessment for Dutch

Following the release of the Portuguese model, we're releasing the Dutch variant of WAVe — a 1B multimodal embedding model that assesses synthetic speech quality at the word level, thereby improving the quality of synthetically augmented datasets for training ASR models.

Trained on CommonVoice 16.1 Dutch with 5 corruption strategies, this model catches mispronunciations, timing errors, and prosody issues in synthetic data that sentence-level embeddings miss entirely.

Resources

- Dutch model: yuriyvnv/WAVe-1B-Multimodal-NL
- Portuguese model: yuriyvnv/WAVe-1B-Multimodal-PT
- Code: https://github.com/yuriyvnv/WAVe

This model builds on CommonVoice Dutch data — thanks to @mozilla and the CommonVoice community for making multilingual speech data accessible.

Would be great to hear from the Dutch NLP community — @BramVanroy @GroNLP — especially if you're working on Dutch ASR or TTS pipelines where quality filtering could help. Also tagging @hf-audio as this sits at the intersection of speech processing and data curation.