Struggling to make it work - Potential problem in the fine-tuning dataset

#2
by Jettsy - opened

I am also willing to help btw.

I have been struggling to implement it for days now.
I can't even run it on your audio files.

@Jettsy Hey, I'm on vacation till 29th October. I'll check it once I'm back.
Meanwhile, could you please elaborate on what issues you're facing and what you've tried so far?

I have seen your other comment too, and I will look into ONNX support.

I am building a dataset to test this more broadly. In the meantime, I tried the humaware model on a regular YouTube conference recording and, with the same settings, it produced 2 to 4 times fewer speech timestamps than the original model.
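To make that comparison reproducible, here is a minimal sketch of how the two outputs could be compared. It assumes each model returns a list of `{'start', 'end'}` dicts in samples at 16 kHz (the format I understand silero_vad's `get_speech_timestamps` to use by default); the function names and the example values are hypothetical, not measurements.

```python
def summarize(timestamps, sampling_rate=16000):
    """Return (segment count, total speech seconds) for a list of
    {'start': int, 'end': int} dicts expressed in samples (assumed format)."""
    total_samples = sum(ts["end"] - ts["start"] for ts in timestamps)
    return len(timestamps), total_samples / sampling_rate


def compare(original, finetuned, sampling_rate=16000):
    """Compare two timestamp lists by segment count and by total speech duration."""
    n_orig, dur_orig = summarize(original, sampling_rate)
    n_ft, dur_ft = summarize(finetuned, sampling_rate)
    return {
        "count_ratio": n_orig / n_ft if n_ft else float("inf"),
        "duration_ratio": dur_orig / dur_ft if dur_ft else float("inf"),
    }


# Hypothetical values for illustration only.
original = [
    {"start": 0, "end": 16000},
    {"start": 32000, "end": 48000},
    {"start": 64000, "end": 80000},
    {"start": 96000, "end": 112000},
]
finetuned = [
    {"start": 0, "end": 48000},
    {"start": 64000, "end": 112000},
]
result = compare(original, finetuned)
print(result)
```

Looking at total duration as well as count matters here: fewer timestamps can mean either missed speech or neighbouring segments merged across short silences, and only the duration ratio tells the two apart.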

I looked at the dataset used to fine-tune the model. It seems that the silences within the regular speech audio files may have been long enough for the original model to treat them as silence, but they were lost because the training timestamps chosen for the fine-tuning dataset did not account for them.
I am worried that timestamping only the humming, and not the silences contained in the original speech data, may have significantly degraded the performance of the original model.
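To illustrate the concern, here is a minimal sketch of how the silences inside a speech file could be recovered from the original model's output so that they can be labelled explicitly. It assumes timestamps are `{'start', 'end'}` dicts in samples; `internal_silences` and the `min_gap` threshold are hypothetical names and values, not part of the existing dataset pipeline.

```python
def internal_silences(speech_timestamps, min_gap=1600):
    """Return the gaps between consecutive speech segments that are at least
    `min_gap` samples long (1600 samples = 100 ms at 16 kHz, an assumed value)."""
    ordered = sorted(speech_timestamps, key=lambda ts: ts["start"])
    gaps = []
    for prev, nxt in zip(ordered, ordered[1:]):
        gap = nxt["start"] - prev["end"]
        if gap >= min_gap:
            gaps.append({"start": prev["end"], "end": nxt["start"]})
    return gaps


# Hypothetical speech segments for one audio file.
speech = [
    {"start": 0, "end": 16000},
    {"start": 20000, "end": 30000},
    {"start": 30500, "end": 40000},
]
silences = internal_silences(speech)
print(silences)
```

If a whole speech file is labelled as one speech region, the gaps found this way end up being trained as speech, which is the effect I suspect.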
Furthermore, I think the silero_vad model fails to recognize humming for two main reasons, neither of which is reflected in the fine-tuning dataset:

  1. humming often occurs right before a sentence or a word and blends into it (not the case with the method used to build the dataset)
  2. humming is often as loud as the speech itself (not the case with the method used to build the dataset)

Anyway, let's talk about this when you return. In the meantime, I will build another dataset from open-source datasets.

Jettsy changed discussion title from Struggling to make it work to Struggling to make it work - Potential problem in the fine-tuning dataset
