Model Returns 7 Classes

#1
by ibliskavka - opened

According to the README, the model should return 2 classifications.

When I run the model, I am seeing 7 logits.

When I try to map to a [Human, NotHuman] array I get an index out of bounds error.

What do the 7 classes represent? I am new at this, so any help would be appreciated.

Thank you for raising this — I understand the confusion.

You’re right that the final prediction aims to classify between [Human, NotHuman], but let me clarify the model design and why you’re seeing 7 logits in the output.

The model is trained with 7 classes, where:

1 class represents real human voice

6 classes represent AI-generated voices from different TTS or voice cloning models (e.g., melgan, wavegan, difgan, etc.)

This multi-class setup was chosen intentionally for two reasons:

Improved generalization: By training the model to recognize different types of synthetic voices individually, it learns more nuanced differences between real and AI-generated audio.

Limited real data: Since real human voice samples were limited, grouping all AI-generated data into a single “NotHuman” class would risk the model being biased or underperforming. Treating AI sources separately ensures better balance and feature learning.

To get the final [Human, NotHuman] prediction, map the 7 outputs as follows:

class 0 → Human

classes 1–6 → NotHuman
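
As a rough illustration of that mapping, here is a minimal sketch that collapses a 7-element logit vector into the two labels. It assumes the logits arrive as a plain list of floats with class 0 first; the names `LABELS`, `softmax`, and `to_binary` are hypothetical helpers for this example, not part of the published model's API. The label follows the argmax rule (class 0 is Human, anything else is NotHuman), and the summed probability of classes 1–6 is returned alongside it as one possible NotHuman score.

```python
import math

# Hypothetical label names for the collapsed, two-way prediction.
LABELS = ["Human", "NotHuman"]
NUM_CLASSES = 7  # 1 human class + 6 synthetic-voice classes, per the reply above


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_binary(logits):
    """Collapse 7 logits into (label, p_human, p_not_human).

    Assumption: index 0 is the human class and indices 1-6 are the
    AI-generated classes, as described in this thread.
    """
    if len(logits) != NUM_CLASSES:
        raise ValueError(f"expected {NUM_CLASSES} logits, got {len(logits)}")
    probs = softmax(logits)
    p_human = probs[0]
    p_not_human = sum(probs[1:])
    # Argmax over all 7 classes, then map the winning index to a binary label.
    predicted = max(range(NUM_CLASSES), key=lambda i: probs[i])
    label = LABELS[0] if predicted == 0 else LABELS[1]
    return label, p_human, p_not_human


if __name__ == "__main__":
    # Made-up logits for demonstration only.
    print(to_binary([4.0, 0.1, 0.2, 0.0, 0.3, 0.1, 0.0]))
```

Indexing a two-element array directly with the argmax of 7 logits is what produces the out-of-bounds error, since any index from 2 to 6 has no entry. Note that the argmax rule and a threshold on the summed NotHuman probability can disagree when the synthetic classes split the probability mass, so which one to use is a design choice.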

Thank you for the quick response!

I found this list of TTS classes on your GitHub:
gt, wavegrad, diffwave, parallel wave gan, wavernn, wavenet, melgan

Do they apply to the published HuggingFace model as well?

Source: https://github.com/Mrkomiljon/voiceguard/blob/main/eval.py
