Source of the dataset

#3
by Nilllya - opened

What exactly is the nature of the used dataset? Freepik seems to be an AI-Service. Do you have an internal non-AI stock-photo database which you used and have the rights to, or how exactly is your database copyright-safe?

Freepik org

Freepik was originally a stock (non-AI) company, but as you mentioned, it now has a powerful AI suite. However, the stock catalog is still part of the company. I hope this information helps! Here is the link to our stock images: https://www.freepik.com/search?format=search.

Thank you for this answer, and for the open model. These comments raise questions about the origin of the database, but they do not challenge the idea of copyright-safe material. Instead, they suggest that some of the database may be synthetic. This is not necessarily good or bad, just interesting.

Do you have advice on how to search the stock images on Freepik to narrow the images to the cutoff point in time of the model's training dataset? Will you allow training of new models from the dataset? I look forward to your reply.

Freepik org

We have been actively working on improving the quality of the Freepik database since that news. More specifically, our filters for removing AI-generated content from the catalog are now much better, and the overall quality has improved. We have been excluding AI-generated images from the training data. Some may have leaked through, since 100% filtering accuracy is impossible, but the amount of AI-generated data in the training database is residual.

Could you clarify whether the SHA-256 list of the 80 M training images will be released? It would help external auditing.
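
To illustrate what such an audit could look like, here is a minimal sketch in Python. It assumes a hypothetical plain-text file with one hex SHA-256 digest per line; no such list has been published in this thread, and the function names below are made up for the example.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_hash_list(path):
    """Load a hypothetical hash list: one hex digest per line, blanks ignored."""
    with open(path, "r", encoding="utf-8") as fh:
        return {line.strip().lower() for line in fh if line.strip()}


def find_matches(image_paths, known_hashes):
    """Return the paths whose file digest appears in the known hash set."""
    return [str(p) for p in image_paths if sha256_of_file(p) in known_hashes]
```

Note that a byte-level hash only matches exact copies of a file. A resized or re-encoded version of the same image would produce a different digest, so a check like this can confirm inclusion but cannot rule it out.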

Does this dataset contain public domain art? How much of it is not public domain?