Data and other things - a Stalin16 Collection

updated 9 days ago

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

Paper • 2412.14475 • Published Dec 19, 2024 • 55
How to Synthesize Text Data without Model Collapse?

Paper • 2412.14689 • Published Dec 19, 2024 • 53
Token-Budget-Aware LLM Reasoning

Paper • 2412.18547 • Published Dec 24, 2024 • 47
WavePulse: Real-time Content Analytics of Radio Livestreams

Paper • 2412.17998 • Published Dec 23, 2024 • 11
Bridging the Data Provenance Gap Across Text, Speech and Video

Paper • 2412.17847 • Published Dec 19, 2024 • 10
No More Adam: Learning Rate Scaling at Initialization is All You Need

Paper • 2412.11768 • Published Dec 16, 2024 • 44
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Paper • 2501.00958 • Published Jan 1 • 107
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics

Paper • 2501.04686 • Published Jan 8 • 54
MLLM-as-a-Judge for Image Safety without Human Labeling

Paper • 2501.00192 • Published Dec 31, 2024 • 32
OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking

Paper • 2501.09751 • Published Jan 16 • 49
WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training

Paper • 2501.18511 • Published Jan 30 • 20
LIMO: Less is More for Reasoning

Paper • 2502.03387 • Published Feb 5 • 61
Scaling Pre-training to One Hundred Billion Data for Vision Language Models

Paper • 2502.07617 • Published Feb 11 • 29
QuEST: Stable Training of LLMs with 1-Bit Weights and Activations

Paper • 2502.05003 • Published Feb 7 • 44
TextAtlas5M: A Large-scale Dataset for Dense Text Image Generation

Paper • 2502.07870 • Published Feb 11 • 45
Jailbreaking to Jailbreak

Paper • 2502.09638 • Published Feb 9 • 5
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

Paper • 2502.14846 • Published Feb 20 • 14
Referring to Any Person

Paper • 2503.08507 • Published Mar 11 • 7
"Principal Components" Enable A New Language of Images

Paper • 2503.08685 • Published Mar 11 • 12
YuE: Scaling Open Foundation Models for Long-Form Music Generation

Paper • 2503.08638 • Published Mar 11 • 69
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

Paper • 2503.07920 • Published Mar 10 • 100
Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation

Paper • 2503.24379 • Published Mar 31 • 77
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs

Paper • 2504.00072 • Published Mar 31 • 7
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Paper • 2504.01990 • Published Mar 31 • 301
URECA: Unique Region Caption Anything

Paper • 2504.05305 • Published Apr 7 • 36
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning

Paper • 2504.09081 • Published Apr 12 • 17
BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation

Paper • 2504.14538 • Published Apr 20 • 29
Towards Understanding Camera Motions in Any Video

Paper • 2504.15376 • Published Apr 21 • 159
Alchemist: Turning Public Text-to-Image Data into Generative Gold

Paper • 2505.19297 • Published May 25 • 83
PrismLayers: Open Data for High-Quality Multi-Layer Transparent Image Generative Models

Paper • 2505.22523 • Published May 28 • 7
Large Language Models for Data Synthesis

Paper • 2505.14752 • Published May 20 • 50
HardTests: Synthesizing High-Quality Test Cases for LLM Coding

Paper • 2505.24098 • Published May 30 • 44
OpenThoughts: Data Recipes for Reasoning Models

Paper • 2506.04178 • Published Jun 4 • 44
SynthRL: Scaling Visual Reasoning with Verifiable Data Synthesis

Paper • 2506.02096 • Published Jun 2 • 51
One Missing Piece for Open-Source Reasoning Models: A Dataset to Mitigate Cold-Starting Short CoT LLMs in RL

Paper • 2506.02338 • Published Jun 3 • 4
Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery

Paper • 2506.05673 • Published Jun 6 • 10
Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability

Paper • 2506.08300 • Published Jun 10 • 8
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Paper • 2506.10857 • Published Jun 12 • 31
Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

Paper • 2506.10952 • Published Jun 12 • 23
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation

Paper • 2506.18095 • Published Jun 22 • 65
Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs

Paper • 2506.19290 • Published Jun 24 • 50
NoHumansRequired: Autonomous High-Quality Image Editing Triplet Mining

Paper • 2507.14119 • Published Jul 18 • 55
MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

Paper • 2507.16812 • Published Jul 22 • 61
PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation

Paper • 2507.16116 • Published Jul 22 • 10
GPT-IMAGE-EDIT-1.5M: A Million-Scale, GPT-Generated Image Dataset

Paper • 2507.21033 • Published 25 days ago • 20
HPSv3: Towards Wide-Spectrum Human Preference Score

Paper • 2508.03789 • Published 17 days ago • 18
Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation

Paper • 2508.09987 • Published 9 days ago • 24