BigScience Workshop

non-profit

https://bigscience.huggingface.co

bigscienceW

bigscience-workshop

Activity Feed

AI & ML interests

A year-long research workshop on large language models: the Summer of Language Models 21 🌸

Recent Activity

craffel

authored a paper 10 days ago

TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

Paper • 2512.20757 • Published 12 days ago • 16

christopher

in bigscience/bloomz-560m 27 days ago

Fails to load with transformers v4.57+

#14 opened 27 days ago by

qgallouedec

nihalnayak

authored a paper about 1 month ago

Revisiting Generalization Across Difficulty Levels: It's Not So Easy

Paper • 2511.21692 • Published Nov 26, 2025 • 15

monsoon-nlp

posted an update about 1 month ago

Post

391

PatchDNA, a DNA foundation model based on Meta's BLT tokenization strategy https://www.biorxiv.org/content/10.1101/2025.11.28.691095v1

christopher

in bigscience/petals-api about 2 months ago

Bloom

#2 opened about 2 months ago by

Raz-Test

rabiulawal

authored a paper about 2 months ago

Grounding Computer Use Agents on Human Demonstrations

Paper • 2511.07332 • Published Nov 10, 2025 • 105

thomwolf

authored a paper 3 months ago

Robot Learning: A Tutorial

Paper • 2510.12403 • Published Oct 14, 2025 • 119

nihalnayak

authored a paper 3 months ago

Boomerang Distillation Enables Zero-Shot Model Size Interpolation

Paper • 2510.05064 • Published Oct 6, 2025 • 1

sasha

authored 3 papers 3 months ago

Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model

Paper • 2211.02001 • Published Nov 3, 2022

Hype, Sustainability, and the Price of the Bigger-is-Better Paradigm in AI

Paper • 2409.14160 • Published Sep 21, 2024 • 3

From Efficiency Gains to Rebound Effects: The Problem of Jevons' Paradox in AI's Polarized Environmental Debate

Paper • 2501.16548 • Published Jan 27, 2025

monsoon-nlp

posted an update 3 months ago

Post

458

Bio LLMs train on many genomes, but can we encode differences within a species? TomatoTomato adds pangenome tokens to represent a domestic tomato and a wild tomato in one sequence 🍅 🧬
monsoon-nlp/tomatotomato-gLM2-150M-v0.1

yjernite

posted an update 4 months ago

Post

2627

Tremendous quality of life upgrade on the Hugging Face Hub - we now have auto-complete emojis 🤗 🥳 👏 🙌 🎉

Get ready for lots more very serious analysis on a whole range of topics from yours truly now that we have unlocked this full range of expression 😄 🤔 🗣 🙊

christopher

in bigscience/bloom 5 months ago

Let's talk about the model

#284 opened 5 months ago by

kalashshah19

yjernite

authored 3 papers 5 months ago

The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

Paper • 2406.16746 • Published Jun 24, 2024

In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI

Paper • 2503.16861 • Published Mar 21, 2025 • 1

A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety

Paper • 2506.22183 • Published Jun 27, 2025 • 1

w11wo

authored a paper 5 months ago

Multi-Stage Verification-Centric Framework for Mitigating Hallucination in Multi-Modal RAG

Paper • 2507.20136 • Published Jul 27, 2025

yjernite

posted an update 5 months ago

Post

4211

First GPAI Model with EU Data Transparency Template? 🇪🇺

With the release of the EU data transparency template this week, we finally got to see one of the most meaningful artifacts to come out of the AI Act implementation so far (haven't you heard? AI's all about the data! 📊📚)

The impact of the template will depend on how effectively it establishes a minimum meaningful transparency standard for companies that don't otherwise offer any transparency into their handling of e.g. personal data or (anti?-)competitive practices in commercial licensing - we'll see how those play out as new models are released after August 2nd 👀

In the meantime, I wanted to see how the template works for a fully open-source + commercially viable model, so I filled it out for SmolLM3, which my colleagues at Hugging Face released earlier this month 🤗 ICYMI, it's fully open-source with 3B parameters and performance matching the best similar-size models (I've switched all my local apps from Qwen3 to it, you should too 💡)

Verdict: congrats to the European Commission AI Office for making it so straightforward! Fully open and transparent models remain a cornerstone of informed regulation and governance, but the different organizational needs of their developers aren't always properly accounted for in new regulation. In this case, it took me all of two hours to fill out and publish the template (including reading the guidelines) - so kudos for making it feasible for smaller and distributed organizations 🙌 Definitely a step forward for transparency 🔍

To learn more have a look at:

- The SmolLM3 model: HuggingFaceTB/SmolLM3-3B
- Its filled-out Public Summary of Training Content: hfmlsoc/smollm3-eu-data-transparency
- And if you're interested, some previous remarks on regulatory minimum meaningful standards for data disclosure: https://huggingface.co/blog/yjernite/naiac-data-transparency

NohTow

authored a paper 6 months ago

Seq vs Seq: An Open Suite of Paired Encoders and Decoders

Paper • 2507.11412 • Published Jul 15, 2025 • 30

Team members 328
