Tong Zhu
Spico
14 followers · 37 following
https://Spico197.github.io
TongZhu197
Spico197
AI & ML interests
Information Extraction, Mixture-of-Experts, LLM
Recent Activity
updated a dataset 8 days ago
Spico/Mirror_ACE
reacted to BramVanroy's post with ❤️ 8 days ago
Introducing the Common Crawl Creative Commons Corpus (C5)!

C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, down to only those documents that are Creative Commons-licensed, such as cc-by-4.0 or public domain cc0. At this stage 150 billion tokens have been collected.

---
data: https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons
software: https://github.com/BramVanroy/CommonCrawl-CreativeCommons
---

To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both in regular hyperlinks and in metadata. Additional data fields are included, such as "was the license found in the `head`?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.

In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frisian, Italian, Dutch and Spanish). The language set was limited for two reasons: computational and storage limitations, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. This data was neither filtered for quality nor deduplicated, so that you can decide for yourself how much data to keep. To give some indication of quality, a dataset field describes whether a document is included in the FineWeb(-2) datasets, which are of high quality.

More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging because it means there is a lot more Creative Commons data to be collected! But to get there I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer but more is needed. If you have the compute available and wish to collaborate in an open and transparent manner, please get in touch!
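The license-detection step described in the post can be illustrated with a minimal sketch. This is not the C5 implementation (see the linked repository for that); it only uses Python's standard library, and the names `CCLinkCollector`, `parse_cc_url` and `find_cc_licenses`, the output field names, and the rule that "more than one distinct license" counts as a possible contradiction are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Assumed URL shape for CC licenses, e.g. /licenses/by-nc-sa/4.0/
LICENSE_PATH = re.compile(r"^/licenses/([a-z\-]+)/(\d+\.\d+)")
CC_HOSTS = {"creativecommons.org", "www.creativecommons.org"}


def parse_cc_url(url):
    """Return a short license name such as 'cc-by-4.0' or 'cc0', else None."""
    if not url:
        return None
    try:
        parsed = urlparse(url.strip())
    except ValueError:
        return None
    if parsed.netloc.lower() not in CC_HOSTS:
        return None
    path = parsed.path.lower()
    if path.startswith("/publicdomain/zero/"):
        return "cc0"
    match = LICENSE_PATH.match(path)
    if match is None:
        return None
    abbreviation, version = match.groups()
    return f"cc-{abbreviation}-{version}"


class CCLinkCollector(HTMLParser):
    """Collects CC license URLs from <a>, <link> and <meta> tags."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
            return
        attributes = dict(attrs)
        if tag in ("a", "link"):
            url = attributes.get("href")    # regular hyperlinks and <link rel="license">
        elif tag == "meta":
            url = attributes.get("content")  # metadata that carries a license URL
        else:
            return
        name = parse_cc_url(url)
        if name is not None:
            self.hits.append({"license": name, "url": url, "in_head": self.in_head})

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_cc_licenses(html):
    collector = CCLinkCollector()
    collector.feed(html)
    collector.close()
    distinct = {hit["license"] for hit in collector.hits}
    # Assumption: several distinct licenses on one page may contradict each other.
    return {"hits": collector.hits, "multiple_distinct": len(distinct) > 1}
```

Calling `find_cc_licenses(page_html)` returns every detected license link together with a flag for whether it appeared inside `head`, which is the kind of field the post mentions as useful for later filtering.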
upvoted a paper about 1 month ago
A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond
Organizations
Spico's activity
liked 3 datasets about 2 months ago
kkChimmy/REALM • Viewer • Updated Apr 12 • 93.3k • 123 • 13
RadiCat/SimpleToolQuestions • Preview • Updated Mar 25 • 66 • 1
nvidia/HelpSteer3 • Viewer • Updated Mar 18 • 99k • 1.95k • 41
liked a dataset 5 months ago
zhliu/RETURN • Viewer • Updated Jul 17, 2024 • 2.49k • 28 • 1
liked 2 models 7 months ago
microsoft/OmniParser • Image-Text-to-Text • Updated Dec 2, 2024 • 815 • 1.66k
genmo/mochi-1-preview • Text-to-Video • Updated Dec 18, 2024 • 36.1k • 1.22k
liked a dataset 7 months ago
neuralwork/arxiver • Viewer • Updated Nov 1, 2024 • 63.4k • 314 • 362
liked a dataset 8 months ago
facebook/emu_edit_test_set • Viewer • Updated Nov 19, 2023 • 5.61k • 1.24k • 41
liked 2 datasets 11 months ago
nvidia/HelpSteer2 • Viewer • Updated Dec 18, 2024 • 21.4k • 4k • 413
HuggingFaceM4/the_cauldron • Viewer • Updated May 6, 2024 • 1.88M • 509k • 427
liked a model about 1 year ago
AI4Chem/ChemLLM-20B-Chat-DPO • Text Generation • Updated Sep 17, 2024 • 73 • 8
liked a dataset about 1 year ago
m-a-p/COIG-CQIA • Viewer • Updated Apr 18, 2024 • 44.7k • 3.81k • 626
liked a Space about 1 year ago
Qwen1.5 110B Chat Demo • Running • 279
Chat with Qwen1.5-110B-Chat Bot
liked a dataset about 1 year ago
BLINK-Benchmark/BLINK • Viewer • Updated Aug 13, 2024 • 3.81k • 3.82k • 25
liked a Space about 1 year ago
TTS Arena V2 • Running on CPU Upgrade • 736
Vote on the latest TTS models!
liked a model about 1 year ago
Vezora/Mistral-22B-v0.1 • Text Generation • Updated Apr 12, 2024 • 9 • 149
liked a dataset about 1 year ago
unalignment/toxic-dpo-v0.2 • Viewer • Updated Jan 9, 2024 • 541 • 214 • 126
liked a model about 1 year ago
Qwen/Qwen-Audio-Chat • Text Generation • Updated Jan 12 • 4.52k • 88
liked a dataset over 1 year ago
liwu/MNBVC • Updated 1 day ago • 36.6k • 542
liked a model over 1 year ago
OrionZheng/openmoe-8b • Text Generation • Updated Jan 19, 2024 • 24 • 3