Large web-mined general corpus based on CommonCrawl.
Amir Hossein Kargaran
kargaranamir
AI & ML interests
#NLP, check out https://huggingface.co/cis-lmu
Recent Activity
upvoted a paper about 6 hours ago
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
authored a paper about 19 hours ago
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
liked a dataset 2 days ago
microsoft/Taskbench