AI & ML interests

A one-year-long research workshop on large language models: the Summer of Language Models 21 🌸

Recent Activity

monsoon-nlp posted an update about 1 month ago
christopher in bigscience/petals-api about 2 months ago

Bloom

#2 opened about 2 months ago by Raz-Test
monsoon-nlp posted an update 3 months ago
Bio LLMs train on many genomes, but can we encode differences within a species? TomatoTomato adds pangenome tokens to represent a domestic tomato and a wild tomato in one sequence 🍅 🧬
monsoon-nlp/tomatotomato-gLM2-150M-v0.1
yjernite posted an update 4 months ago
Tremendous quality of life upgrade on the Hugging Face Hub - we now have auto-complete emojis 🤗 🥳 👏 🙌 🎉

Get ready for lots more very serious analysis on a whole range of topics from yours truly now that we have unlocked this full range of expression 😄 🤔 🗣 🙊
christopher in bigscience/bloom 5 months ago
yjernite posted an update 5 months ago
๐—™๐—ถ๐—ฟ๐˜€๐˜ ๐—š๐—ฃ๐—”๐—œ ๐— ๐—ผ๐—ฑ๐—ฒ๐—น ๐˜„๐—ถ๐˜๐—ต ๐—˜๐—จ ๐——๐—ฎ๐˜๐—ฎ ๐—ง๐—ฟ๐—ฎ๐—ป๐˜€๐—ฝ๐—ฎ๐—ฟ๐—ฒ๐—ป๐—ฐ๐˜† ๐—ง๐—ฒ๐—บ๐—ฝ๐—น๐—ฎ๐˜๐—ฒ? ๐Ÿ‡ช๐Ÿ‡บ

With the release of the EU data transparency template this week, we finally got to see one of the most meaningful artifacts to come out of the AI Act implementation so far (haven't you heard? AI's all about the data! 📊📚)

The impact of the template will depend on how effectively it establishes a minimum meaningful transparency standard for companies that don't otherwise offer any transparency into their handling of e.g. personal data or (anti?-)competitive practices in commercial licensing - we'll see how those play out as new models are released after August 2nd 👀

In the meantime, I wanted to see how the template works for a fully open-source + commercially viable model, so I filled it out for SmolLM3, which my colleagues at Hugging Face released earlier this month 🤗 ICYMI, it's fully open-source with 3B parameters and performance matching the best similar-size models (I've switched all my local apps from Qwen3 to it, you should too 💡)

Verdict: congrats to the European Commission AI Office for making it so straightforward! Fully open and transparent models remain a cornerstone of informed regulation and governance, but the different organizational needs of their developers aren't always properly accounted for in new regulation. In this case, it took me all of two hours to fill out and publish the template (including reading the guidelines) - so kudos for making it feasible for smaller and distributed organizations 🙌 Definitely a step forward for transparency 🔍

To learn more, have a look at:

- The SmolLM3 model: HuggingFaceTB/SmolLM3-3B
- Its filled-out Public Summary of Training Content: hfmlsoc/smollm3-eu-data-transparency
- And if you're interested, some previous remarks on regulatory minimum meaningful standards for data disclosure: https://huggingface.co/blog/yjernite/naiac-data-transparency