A collection of pre-training dataset samples of sizes 10M, 100M, and 1B tokens. Ideal for quick experimentation and ablations.
Asankhaya Sharma
codelion
AI & ML interests
Creator of OptiLLM, OpenEvolve, Adaptive Classifier, and Ellora. Pioneering a new category in AI infrastructure: inference-time compute for LLMs.
Recent Activity
liked a Space 3 days ago: ibm-granite/Granite-4.0-WebGPU
upvoted a paper 4 days ago: Less is More: Recursive Reasoning with Tiny Networks
upvoted an article 4 days ago: mem-agent: Equipping LLM Agents with Memory Using RL