kaizuberbuehler's Collections
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Paper • 2404.01197 • Published • 32
CosmicMan: A Text-to-Image Foundation Model for Humans
Paper • 2404.01294 • Published • 16
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Paper • 2406.08707 • Published • 17
DataComp-LM: In search of the next generation of training sets for language models
Paper • 2406.11794 • Published • 53
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning
Paper • 2406.08973 • Published • 90
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Paper • 2406.08418 • Published • 31
GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices
Paper • 2406.08451 • Published • 26
argilla/magpie-ultra-v0.1
Viewer • Updated • 50k • 662 • 221
Viewer • Updated • 25B • 210k • 2.21k
Viewer • Updated • 61.6M • 73.6k • 849
Viewer • Updated • 31.1M • 5.39k • 623
Viewer • Updated • 546M • 13.9k • 832
Viewer • Updated • 1M • 3.22k • 735
Viewer • Updated • 2.14M • 25k • 687
Viewer • Updated • 55.1k • 105 • 97
HuggingFaceFW/fineweb-edu
Viewer • Updated • 3.3B • 119k • 702
Viewer • Updated • 1.75M • 297 • 98
Viewer • Updated • 100k • 18.9k • 214
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning
Paper • 2409.12568 • Published • 51
RedPajama: an Open Dataset for Training Large Language Models
Paper • 2411.12372 • Published • 56
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions
Paper • 2411.07461 • Published • 24
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models
Paper • 2411.04905 • Published • 126
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Paper • 2501.04686 • Published • 53
Viewer • Updated • 450k • 26.9k • 595
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
Paper • 2501.18511 • Published • 20
MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion
Paper • 2502.04235 • Published • 22
Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training
Paper • 2502.06589 • Published • 19
CoSER: Coordinating LLM-Based Persona Simulation of Established Roles
Paper • 2502.09082 • Published • 30
EgoLife: Towards Egocentric Life Assistant
Paper • 2503.03803 • Published • 45
KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
Paper • 2503.02951 • Published • 32
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
Paper • 2503.10582 • Published • 23