BramVanroy's Collections

CommonCrawl-Creative Commons (C5)

updated 12 days ago

Raw CommonCrawl crawls, annotated with Creative Commons license information

  • BramVanroy/CommonCrawl-CreativeCommons

    Updated 2 days ago • 739M rows • 923 downloads • 31 likes

  • BramVanroy/CommonCrawl-CreativeCommons-fine

    Updated about 5 hours ago • 358 downloads • 1 like

    Note: Retains only samples that are also present in FineWeb or FineWeb-2.


  • BramVanroy/CommonCrawl-CreativeCommons-recommended

    Updated about 5 hours ago • 32.8M rows • 397 downloads • 1 like

    Note: Applies strong filters: retains only data that is also in FineWeb, and removes non-commercial-licensed data and Wiki data.
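
The filtering described in the notes above could look roughly like the sketch below. This is an illustration only, not the author's pipeline: the field names `in_fineweb`, `license_abbr`, and `url`, the function `recommended_filter`, and the sample records are all hypothetical, and the real datasets may use different columns and stricter criteria.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical record layout; the real datasets' column names may differ.
Record = Dict[str, object]


def recommended_filter(records: Iterable[Record]) -> Iterator[Record]:
    """Yield records that pass three illustrative checks:
    present in FineWeb, not non-commercial, and not Wiki data."""
    for rec in records:
        # Keep only samples flagged as also present in FineWeb (assumed flag).
        if not rec.get("in_fineweb", False):
            continue
        # Drop non-commercial licenses, e.g. "cc-by-nc" or "cc-by-nc-sa".
        license_parts = str(rec.get("license_abbr", "")).lower().split("-")
        if "nc" in license_parts:
            continue
        # Drop Wiki data, approximated here by a substring check on the URL.
        if "wiki" in str(rec.get("url", "")).lower():
            continue
        yield rec


# Made-up sample records for demonstration.
sample = [
    {"url": "https://example.com/a", "license_abbr": "cc-by", "in_fineweb": True},
    {"url": "https://example.com/b", "license_abbr": "cc-by-nc-sa", "in_fineweb": True},
    {"url": "https://en.wikipedia.org/wiki/X", "license_abbr": "cc-by-sa", "in_fineweb": True},
    {"url": "https://example.com/c", "license_abbr": "cc-by", "in_fineweb": False},
]

kept = list(recommended_filter(sample))
print([rec["url"] for rec in kept])
```

Only the first sample record survives all three checks; the others are dropped for being non-commercial, Wiki-hosted, or absent from FineWeb, respectively.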
