Are the datasets for foundational model pre-training publicly accessible?

#10
by JayceCeleste - opened

Hello, thanks for the great paper, and publishing the model here!

I noticed you mentioned in the paper that "To investigate the impact of species diversity on genome foundational models, we’ve compiled and made publicly available two datasets for foundational model pre-training: the human genome and the multi-species genome. "

I tried to find them but could only find the GUE dataset.

Could you please provide a link to them? Thanks : )

Not sure if this is still relevant, but a link to the pretraining dataset is currently available in the DNABERT-2 repository: https://github.com/MAGICS-LAB/DNABERT_2
