License For Pretrained and Pretrained Models Code #174

EthanEpp · 2024-05-27T18:32:23Z

Wasn't sure exactly where to put this, but I reached out to the creator of the impulse response dataset used for the pretrained models and for the background noise regarding the license, and they confirmed it was MIT 4.0 for the Hugging Face hosted data and for the reference in the Colab notebook. I can show the email as well if you would like to confirm.

Related to this, I would just like to confirm that the non-commercially licensed component of the pretrained models is only the validation features, correct? If possible, would you be able to provide the code that creates the validation features? I assume it is just what is in the Colab notebook, but it is mentioned that specific subsets of some of the data are selected, so I would like to keep that consistent, then just remove the non-commercially licensed datasets and try ones that are commercially licensed.

Thanks!

dscripka · 2024-06-13T00:24:08Z

Maintainer

That is very interesting! Thanks for contacting the creator of that IR dataset. If you wouldn't mind sharing the email, that would be useful.

As for the non-commercial components, it's actually not just the validation features. For the pre-trained models, much of the training data itself either has non-commercial licensing or is composed of data with mixed/unknown licensing, and thus should conservatively be considered non-commercial.

It should be possible to train models without this data, but you'd have to find a good (and large) combination of speech, background noise, and music, all with permissive licensing. I'd be happy to discuss some options here if you already have some datasets you are considering.

2 replies

EthanEpp Jun 13, 2024
Author

Yea sure, just let me know where to forward it.

And yes, I see my original question was a little misstated. I meant the feature sets that are hosted on Hugging Face, not the pretrained models. I have some data for training a new word, and it seems to be performing pretty well, but I do have some ideas for improving my specific use case that I'm messing around with.

Right now for positives I'm using just Piper with some augmentations, but I plan on adding some other generator as well as some non-synthetic samples, and I am using the precomputed ACAV100M features hosted on Hugging Face.
For negative and false positive validation: AudioSet, FMA, DipCo, VoxCeleb, SB Corpus, Commonvoice, and a TBD subset of Freesound.

I confirmed these are all commercial use permissible, but I was wondering if you had any thoughts on the datasets selected. A lot of these are already being used in some of the provided notebooks, I just confirmed the licensing.

dscripka Jun 15, 2024
Maintainer

You can forward it to [email protected], thanks!

Those are certainly all good datasets to use, but I'm not sure they are all commercially licensed. For example, ACAV100M ultimately sources video from YouTube (where each video can have its own license), as does AudioSet. Similarly, for FMA only the metadata is permissively licensed, while each audio sample is licensed individually (with many having non-commercial licensing).

However, Commonvoice and certain subsets of Freesound are excellent datasets with clear licensing, so that's definitely a good place to start. Beyond this, People's Speech is another potentially useful dataset.
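
To make the per-sample licensing point concrete, here is a minimal sketch of filtering dataset metadata down to samples with a permissive license. The record layout, the `license` field name, and the allowlist are all assumptions for illustration and are not taken from any of the datasets mentioned; each dataset has its own metadata format and license labels that would need to be checked.

```python
from typing import Iterable

# Assumption: license labels treated as commercially usable in this sketch.
# Versioned labels (for example "CC BY 4.0") would need to be added explicitly.
# Which licenses are actually acceptable is a legal decision, not a technical one.
PERMISSIVE_LICENSES = {"cc0", "cc-by"}


def normalize_license(label: str) -> str:
    """Lowercase a label and join its parts with hyphens ("CC BY-NC" -> "cc-by-nc")."""
    return "-".join(label.strip().lower().replace("_", " ").replace("-", " ").split())


def filter_permissive(records: Iterable[dict], license_key: str = "license") -> list[dict]:
    """Keep only records whose license is in the allowlist.

    Records with a missing or unrecognized license are dropped, which mirrors
    treating unknown licensing conservatively as non-commercial.
    """
    kept = []
    for record in records:
        label = record.get(license_key)
        if label and normalize_license(label) in PERMISSIVE_LICENSES:
            kept.append(record)
    return kept


# Hypothetical metadata rows; real datasets use their own field names and labels.
samples = [
    {"id": "a", "license": "CC0"},
    {"id": "b", "license": "CC BY-NC"},
    {"id": "c", "license": "CC BY"},
    {"id": "d", "license": None},
]
print([r["id"] for r in filter_permissive(samples)])
```

Using an allowlist rather than a blocklist means anything unlabeled or unfamiliar is excluded by default, which fits the conservative approach described earlier in the thread.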
