Common Pile v0.1 Filtered Data

updated 22 days ago

An LLM pre-training dataset produced by filtering and deduplicating the raw text collected in Common Pile v0.1.

common-pile/arxiv_abstracts_filtered

Viewer • Updated 22 days ago • 2.5M • 464 • 5
common-pile/arxiv_papers_filtered

Viewer • Updated 22 days ago • 309k • 882
common-pile/biodiversity_heritage_library_filtered

Viewer • Updated 22 days ago • 16.5M • 521
common-pile/caselaw_access_project_filtered

Viewer • Updated 22 days ago • 5.5M • 598 • 1
common-pile/cccc_filtered

Viewer • Updated 22 days ago • 10.8M • 878 • 1
common-pile/data_provenance_initiative_filtered

Viewer • Updated 22 days ago • 3.51M • 261
common-pile/doab_filtered

Viewer • Updated 22 days ago • 404k • 625 • 1
common-pile/foodista_filtered

Preview • Updated 22 days ago • 220 • 1
common-pile/github_archive_filtered

Viewer • Updated 22 days ago • 23.3M • 485
common-pile/library_of_congress_filtered

Viewer • Updated 22 days ago • 128k • 701 • 2
common-pile/libretexts_filtered

Viewer • Updated 22 days ago • 40k • 221
common-pile/news_filtered

Viewer • Updated 22 days ago • 127k • 284 • 1
common-pile/oercommons_filtered

Viewer • Updated 22 days ago • 5.25k • 212 • 1
common-pile/peS2o_filtered

Viewer • Updated 22 days ago • 6.09M • 1.05k
common-pile/pre_1929_books_filtered

Viewer • Updated 22 days ago • 122k • 720
common-pile/pressbooks_filtered

Viewer • Updated 22 days ago • 54.5k • 211
common-pile/project_gutenberg_filtered

Viewer • Updated 22 days ago • 57.1k • 675
common-pile/public_domain_review_filtered

Viewer • Updated 22 days ago • 1.41k • 195
common-pile/pubmed_filtered

Viewer • Updated 22 days ago • 4.77M • 821 • 2
common-pile/python_enhancement_proposals_filtered

Viewer • Updated 22 days ago • 655 • 200 • 1
common-pile/regulations_filtered

Viewer • Updated 22 days ago • 192k • 374
common-pile/stackexchange_filtered

Viewer • Updated 22 days ago • 27.5M • 1.02k • 2
common-pile/stackv2_edu_filtered

Viewer • Updated 22 days ago • 57M • 1.15k • 1
common-pile/stackv2_html_filtered

Viewer • Updated May 23 • 1.67M • 228
common-pile/ubuntu_irc_filtered

Viewer • Updated 22 days ago • 216k • 481
common-pile/uk_hansard_filtered

Viewer • Updated 22 days ago • 47.9k • 708 • 1
common-pile/usgpo_filtered

Viewer • Updated 22 days ago • 2.34M • 312
common-pile/uspto_filtered

Viewer • Updated 22 days ago • 14.4M • 1.67k • 1
common-pile/wikimedia_filtered

Viewer • Updated 22 days ago • 12.9M • 812 • 3
common-pile/wikiteam_filtered

Viewer • Updated 22 days ago • 10.2M • 338
common-pile/youtube_filtered

Viewer • Updated 22 days ago • 986k • 467 • 3
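
Every entry above follows the same naming pattern, `common-pile/<source>_filtered`, so a single source can be inspected without downloading the whole collection. The following is a minimal sketch, not an official loader: the helper names `filtered_repo_id` and `peek` are hypothetical, and it assumes that the Hugging Face `datasets` package is installed and that each repository exposes a `train` split (check the dataset card to confirm).

```python
from itertools import islice

ORG = "common-pile"


def filtered_repo_id(source: str) -> str:
    """Build the Hub repository ID for a filtered source, e.g. 'foodista'."""
    return f"{ORG}/{source}_filtered"


def peek(source: str, n: int = 3) -> list:
    """Stream the first n records of a filtered source (needs network access)."""
    # Imported lazily so filtered_repo_id() works without the `datasets` package.
    from datasets import load_dataset

    # Assumption: the repository has a "train" split; see the dataset card.
    stream = load_dataset(filtered_repo_id(source), split="train", streaming=True)
    return list(islice(stream, n))
```

For example, `peek("python_enhancement_proposals")` would stream a few records from one of the smaller sources. Streaming avoids a full download, which matters for the larger entries such as `stackv2_edu_filtered`.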
