Model Returns 7 Classes
According to the README, the model should return two classes.
When I run the model, I see 7 logits.
When I try to map them to a [Human, NotHuman] array, I get an index-out-of-bounds error.
What do the extra classes represent? I am new to this, so any help would be appreciated.
Thank you for raising this — I understand the confusion.
You’re right that the final prediction aims to classify between [Human, NotHuman], but let me clarify the model design and why you’re seeing 7 logits in the output.
The model is trained with 7 classes, where:
1 class represents real human voice
6 classes represent AI-generated voices from different TTS or voice cloning models (e.g., melgan, wavegan, difgan, etc.)
This multi-class setup was chosen intentionally for two reasons:
Improved generalization: By training the model to recognize different types of synthetic voices individually, it learns more nuanced differences between real and AI-generated audio.
Limited real data: Since real human voice samples were limited, grouping all AI-generated data into a single “NotHuman” class would risk the model being biased or underperforming. Treating AI sources separately ensures better balance and feature learning.
To collapse the 7 outputs back into your binary labels:
class 0 → Human
classes 1–6 → NotHuman
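As a minimal sketch (not the official inference code), assuming class 0 is Human and classes 1–6 are the synthetic-voice classes, you could collapse the 7 logits into a binary decision like this:

```python
import torch

def to_binary_label(logits: torch.Tensor) -> str:
    # logits: tensor of shape (7,) for a single audio clip
    predicted_class = int(torch.argmax(logits))
    return "Human" if predicted_class == 0 else "NotHuman"

def not_human_probability(logits: torch.Tensor) -> float:
    # Alternative: sum the probabilities of the six synthetic classes
    # to get a single NotHuman confidence score.
    probs = torch.softmax(logits, dim=-1)
    return float(probs[1:].sum())
```

This avoids the index-out-of-bounds error, because you never index a 2-element array with the raw 7-class prediction.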
Thank you for the quick response!
I found this list of TTS classes on your GitHub: gt, wavegrad, diffwave, parallel wave gan, wavernn, wavenet, melgan
Do they apply to the published HuggingFace model as well?
Source: https://github.com/Mrkomiljon/voiceguard/blob/main/eval.py
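If so, is the index order below what the HuggingFace checkpoint uses? This is just my guess based on the list order in eval.py, with gt as the human class, so please correct it if I have it wrong:

```python
# My guess at the label mapping, assuming the HuggingFace checkpoint uses the
# same class order as the list in eval.py (gt = real human voice).
id2label = {
    0: "gt",                 # real human voice
    1: "wavegrad",
    2: "diffwave",
    3: "parallel wave gan",
    4: "wavernn",
    5: "wavenet",
    6: "melgan",
}
# With this mapping, Human == index 0 and NotHuman == indices 1-6,
# matching the binary rule described above.
```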