ONNXConfig for all

non-profit

AI & ML interests

Make all Hub models available for conversion to the ONNX format.

Recent Activity

OWG's activity

lewtun
posted an update 5 days ago
We are reproducing the full DeepSeek R1 data and training pipeline so everybody can use their recipe. Instead of doing it in secret we can do it together in the open!

🧪 Step 1: replicate the R1-Distill models by distilling a high-quality reasoning corpus from DeepSeek-R1.

🧠 Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will involve curating new, large-scale datasets for math, reasoning, and code.

🔥 Step 3: show we can go from base model -> SFT -> RL via multi-stage training.

Follow along: https://github.com/huggingface/open-r1
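Step 1's output ultimately has to be packaged as SFT data. A minimal sketch of that packaging, assuming a chat-style schema with the reasoning wrapped in `<think>` tags (the field names and tag convention are illustrative assumptions, not the open-r1 spec):

```python
# Sketch of Step 1's data packaging: wrapping reasoning traces distilled
# from a teacher (e.g. DeepSeek-R1) as chat-format SFT examples. The
# field names and <think> tag convention are illustrative assumptions,
# not the open-r1 schema.

def to_sft_example(problem: str, reasoning: str, answer: str) -> dict:
    """Turn one distilled trace into a messages-format SFT record."""
    return {
        "messages": [
            {"role": "user", "content": problem},
            {
                "role": "assistant",
                # Keep the chain of thought separable from the final
                # answer so evaluation can strip the <think> block.
                "content": f"<think>\n{reasoning}\n</think>\n{answer}",
            },
        ]
    }

example = to_sft_example(
    problem="What is 17 * 24?",
    reasoning="17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    answer="408",
)
```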
lewtun
posted an update 24 days ago
I was initially pretty sceptical about Meta's Coconut paper [1] because the largest perf gains were reported on toy linguistic problems. However, these results on machine translation are pretty impressive!

https://x.com/casper_hansen_/status/1875872309996855343

Together with the recent PRIME method [2] for scaling RL, reasoning for open models is looking pretty exciting for 2025!

[1] Training Large Language Models to Reason in a Continuous Latent Space (2412.06769)
[2] https://huggingface.co/blog/ganqu/prime
lewtunΒ 
posted an update about 1 month ago
view post
Post
2253
This paper (HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs (2412.18925)) has a really interesting recipe for inducing o1-like behaviour in Llama models:

* Iteratively sample CoTs from the model, using a mix of different search strategies. This gives you something like Stream of Search via prompting.
* Verify the correctness of each CoT using GPT-4o (needed because exact match doesn't work well in medicine, where there are lots of aliases).
* Use GPT-4o to reformat the concatenated CoTs into a single stream that includes smooth transitions like "hmm, wait" etc. that one sees in o1.
* Use the resulting data for SFT & RL.
* Use sparse rewards from GPT-4o to guide RL training. They find SFT on this data already gives a strong improvement, and RL adds an average ~3-point boost across medical benchmarks.

Applying this strategy to other domains could be quite promising, provided the training data can be formulated with verifiable problems!
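The sampling-and-verification loop above can be sketched as follows; `sample_cot` and `verify` are stubs standing in for the policy model and the GPT-4o judge, and the strategy names are illustrative, not the paper's:

```python
import random

# Sketch of the iterative CoT-sampling loop described above. sample_cot
# and verify are stubs standing in for the policy model and the GPT-4o
# verifier; the strategy names are illustrative.

STRATEGIES = ["explore_new_path", "backtrack", "verify_step", "correct_error"]

def sample_cot(question: str, strategy: str) -> str:
    # Stub: a real implementation would prompt the model with the
    # chosen search strategy.
    return f"[{strategy}] reasoning about: {question}"

def verify(cot: str, reference_answer: str) -> bool:
    # Stub: the paper uses GPT-4o here because exact match fails on
    # medical aliases; this toy version accepts everything.
    return True

def search_cots(question: str, reference_answer: str, max_tries: int = 4, seed: int = 0):
    rng = random.Random(seed)
    trace = []
    for _ in range(max_tries):
        strategy = rng.choice(STRATEGIES)
        cot = sample_cot(question, strategy)
        trace.append(cot)
        if verify(cot, reference_answer):
            # The concatenated trace would then be rewritten by GPT-4o
            # into one smooth stream ("hmm, wait, ...") for SFT.
            return trace
    return trace

trace = search_cots("Which enzyme is deficient in PKU?", "PAH")
```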
lewtun
posted an update about 1 month ago
We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute 🔥

How? By combining step-wise reward models with tree search algorithms :)

We show that smol models can match or exceed the performance of their much larger siblings when given enough "time to think"

We're open sourcing the full recipe and sharing a detailed blog post.

In our blog post we cover:

📈 Compute-optimal scaling: How we implemented DeepMind's recipe to boost the mathematical capabilities of open models at test-time.

🎄 Diverse Verifier Tree Search (DVTS): An unpublished extension we developed to the verifier-guided tree search technique. This simple yet effective method improves diversity and delivers better performance, particularly at large test-time compute budgets.

🧭 Search and Learn: A lightweight toolkit for implementing search strategies with LLMs, built for speed with vLLM.
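The verifier-guided selection at the heart of this recipe can be illustrated with weighted best-of-N: score each sampled solution with the reward model, sum the scores per distinct final answer, and return the argmax. The candidates and scores below are made up for illustration:

```python
from collections import defaultdict

# Toy illustration of weighted best-of-N selection: candidate answers
# sampled from the small model are grouped by final answer, and each
# group accumulates its (stand-in) reward-model scores; the answer with
# the highest total wins. Scores here are made up for illustration.

candidates = [
    ("x = 3", 0.91),
    ("x = 3", 0.84),
    ("x = 7", 0.55),
    ("x = 3", 0.78),
    ("x = 7", 0.95),
]

def weighted_best_of_n(scored_candidates):
    totals = defaultdict(float)
    for answer, score in scored_candidates:
        totals[answer] += score
    return max(totals, key=totals.get)

best = weighted_best_of_n(candidates)  # "x = 3" (2.53) beats "x = 7" (1.50)
```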

Here are the links:

- Blog post: HuggingFaceH4/blogpost-scaling-test-time-compute

- Code: https://github.com/huggingface/search-and-learn

Enjoy!
mkluczek
posted an update about 2 months ago
First Global and Dense Open Embedding Dataset of Earth! 🌍 🤗

Introducing the Major TOM embeddings dataset, created in collaboration with CloudFerro S.A. 🔶 and Φ-lab at the European Space Agency (ESA) 🛰️. Together with @mikonvergence and Jędrzej S. Bojanowski, we present the first open-access dataset of Copernicus embeddings, offering dense, global coverage across the full acquisition areas of Sentinel-1 and Sentinel-2 sensors.

💡 Highlights:
📊 Data: Over 8 million Sentinel-1 & Sentinel-2 images processed, distilling insights from 9.368 trillion pixels of raw data.
🧠 Models: Foundation models include SigLIP, DINOv2, and SSL4EO.
📦 Scale: 62 TB of raw satellite data processed into 170M+ embeddings.

This project delivers open and free vectorized expansions of the Major-TOM/README datasets, setting a new standard for embedding releases and enabling lightweight, scalable ingestion of Earth Observation (EO) data for countless applications.

🤗 Explore the datasets:
Major-TOM/Core-S2L1C-SSL4EO
Major-TOM/Core-S1RTC-SSL4EO
Major-TOM/Core-S2RGB-DINOv2
Major-TOM/Core-S2RGB-SigLIP

📖 Check paper: Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space (2412.05600)
💻 Code notebook: https://github.com/ESA-PhiLab/Major-TOM/blob/main/05-Generate-Major-TOM-Embeddings.ipynb
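Once downloaded, dense embeddings like these support simple vector retrieval. A toy cosine-similarity lookup, with placeholder 4-dimensional vectors standing in for the real, much higher-dimensional Major TOM embeddings:

```python
import numpy as np

# Toy nearest-neighbour lookup over embedding vectors, the kind of
# lightweight EO retrieval these datasets enable. The 4-dim vectors are
# placeholders; real Major TOM embeddings are much higher dimensional.

def top_k_cosine(query, embeddings, k=2):
    # Normalize, score by dot product, return indices of the k best.
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q
    return np.argsort(scores)[::-1][:k]

embeddings = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
])
query = np.array([1.0, 0.05, 0.0, 0.0])
idx = top_k_cosine(query, embeddings)  # indices of the nearest patches
```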
louisbrulenaudet
posted an update 3 months ago
I’ve published a new dataset to simplify model merging 🤗

This dataset facilitates the search for compatible architectures for model merging with @arcee_ai’s mergekit, streamlining the automation of high-performance merge searches 📖

Dataset: louisbrulenaudet/mergekit-configs
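For context, a typical mergekit configuration of the kind indexed by such a dataset looks like this; the model names and parameter values here are illustrative examples, not entries from the dataset:

```yaml
# Illustrative mergekit SLERP config; models and parameters are examples.
slices:
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [0, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 32]
merge_method: slerp
base_model: mistralai/Mistral-7B-Instruct-v0.2
parameters:
  t: 0.5  # interpolation factor between the two models
dtype: bfloat16
```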
louisbrulenaudet
posted an update 3 months ago
Introducing Lemone-router, a series of classification models designed to produce an optimal multi-agent system for different branches of tax law.

Trained on a base of 49k lines, comprising synthetic questions generated by GPT-4 Turbo and Llama 3.1 70B (further refined through evol-instruct tuning and manual curation) alongside authority documents, these models are based on an 8-category decomposition of the classification scheme derived from the Bulletin officiel des finances publiques - impôts:

label2id = {
    "Bénéfices professionnels": 0,
    "Contrôle et contentieux": 1,
    "Dispositifs transversaux": 2,
    "Fiscalité des entreprises": 3,
    "Patrimoine et enregistrement": 4,
    "Revenus particuliers": 5,
    "Revenus patrimoniaux": 6,
    "Taxes sur la consommation": 7
}

id2label = {
    0: "Bénéfices professionnels",
    1: "Contrôle et contentieux",
    2: "Dispositifs transversaux",
    3: "Fiscalité des entreprises",
    4: "Patrimoine et enregistrement",
    5: "Revenus particuliers",
    6: "Revenus patrimoniaux",
    7: "Taxes sur la consommation"
}

It achieves the following results on the evaluation set:
- Loss: 0.4734
- Accuracy: 0.9191

Link to the collection: louisbrulenaudet/lemone-router-671cce21d6410f3570514762
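A minimal post-processing sketch for the router: take the argmax of the classifier's logits and map it back to a branch of tax law via `id2label`. The logits below are made up; in practice they would come from the fine-tuned 8-way classifier:

```python
# Minimal routing sketch: map the argmax of the classifier's logits back
# to a branch of tax law via id2label. The logits are made up for
# illustration; in practice they come from the fine-tuned model.

id2label = {
    0: "Bénéfices professionnels",
    1: "Contrôle et contentieux",
    2: "Dispositifs transversaux",
    3: "Fiscalité des entreprises",
    4: "Patrimoine et enregistrement",
    5: "Revenus particuliers",
    6: "Revenus patrimoniaux",
    7: "Taxes sur la consommation",
}

def route(logits):
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best]

branch = route([0.1, 0.2, 0.0, 2.7, 0.3, 0.1, 0.4, 0.2])
```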
louisbrulenaudet
posted an update 4 months ago
🚨 I have $3,500 in Azure credits, including access to an H100 (96 GB), expiring on November 12, 2024.

I won’t be able to use it all myself, so I’m reaching out to the @huggingface community: are there any open-source projects with data ready for some compute power?

Let’s collaborate and make the most of it together 🔗
louisbrulenaudet
posted an update 4 months ago
My biggest release of the year: a series of 7 specialized embedding models for information retrieval within tax documents is now available for free on Hugging Face 🤗

These new models aim to offer an open-source alternative for in-domain semantic search over large text corpora, and will improve RAG systems and context addition for large language models.

Trained on more than 43 million tax tokens derived from semi-synthetic and raw-synthetic data, enriched by various methods (in particular MSFT's evol-instruct by @intfloat ), and corrected by humans, this project is the fruit of hundreds of hours of work and is the culmination of a global effort to open up legal technologies that has only just begun.

A big thank you to Microsoft for Startups for giving me access to state-of-the-art infrastructure to train these models, and to @julien-c , @clem 🤗, @thomwolf and the whole HF team for the inference endpoint API and the generous provision of Meta Llama-3.1-70B. Special thanks also to @tomaarsen for his invaluable advice on training embedding models and loss functions ❤️

Models are available on my personal HF page, in the Lemone-embed collection: louisbrulenaudet/lemone-embed-66fdc24000df732b395df29b
louisbrulenaudet
posted an update 5 months ago
The Romulus model series has been released on Hugging Face, continually pre-trained on 34,864,949 tokens of French laws and intended to serve as a foundation for fine-tuning on labeled data 🤗

The training code, dataset, and model weights are open and freely available on HF, and training ran on an H100 provided by Microsoft for Startups, using Unsloth AI by @danielhanchen and @shimmyshimmer 🦥

Link to the base model: louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1

Link to the instruct model: louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1-Instruct

Link to the dataset: louisbrulenaudet/Romulus-cpt-fr

Please note that these models have not been aligned to produce usable text as they stand, and will certainly need to be fine-tuned for the desired tasks in order to produce satisfactory results.
louisbrulenaudet
posted an update 5 months ago
An example of the application of LegalKit is the production of knowledge graphs; here is a demo Space 🔗

With the update of the French legal code data model uploaded to 🤗 and the introduction of a column dedicated to HTML text, it's now easy to extract links between different articles and produce complex graphs with just a few lines of Python.

This simplified demo highlights the ease of implementation and the creative potential, and enables the generation of complete datasets, although display requires a powerful graphics card. The framework used for the moment is D3.js, but other solutions may be possible. I'd be delighted to hear your suggestions, and I look forward to hearing from the community.

Link to the 🤗 Space: louisbrulenaudet/legalkit-knowledge-graph
louisbrulenaudet
posted an update 5 months ago
Understanding the JSON response format with HF's Serverless Inference API 🤗

As it stands, there seems to be an inconsistency with the OpenAI documentation regarding how to implement the JSON response format with the InferenceClient chat-completion API.

After investigating the InferenceClient source code, I'm sharing the official solution using a JSON Schema. This constrains the structure of the response and simplifies parsing as part of an automated pipeline for extracting metadata and information:
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")

messages = [
    {
        "role": "user",
        "content": "I saw a puppy, a cat, and a raccoon during my bike ride in the park. What did I see and when?",
    },
]

response_format = {
    "type": "json",
    "value": {
        "properties": {
            "location": {"type": "string"},
            "activity": {"type": "string"},
            "animals_seen": {"type": "integer", "minimum": 1, "maximum": 5},
            "animals": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["location", "activity", "animals_seen", "animals"],
    },
}

response = client.chat_completion(
    messages=messages,
    response_format=response_format,
    max_tokens=500,
)

print(response.choices[0].message.content)

As a reminder, JSON mode is activated with the OpenAI client as follows:
response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[...],
    response_format={"type": "json_object"},
)

One question remains open, however, and will perhaps be answered by the community: an incompatibility seems to persist when generating lists of dictionaries, and for now, producing flat dictionaries appears to be the only working option.
louisbrulenaudet
posted an update 6 months ago
🚀 RAGoon is now available on PyPI, GitHub, and as a Space on Hugging Face for batched embeddings generation 🤗

RAGoon is a set of NLP utilities for multi-model embedding production and high-dimensional vector visualization; it aims to improve language model performance by providing contextually relevant information through search-based querying, web scraping, and data augmentation techniques.

At this stage, 5 major classes are available via RAGoon to facilitate:
- the production of chained embeddings for several models, to simplify a continuous deployment process;
- the production of LLM requests for web querying and content retrieval via the Google API;
- recursive chunking by tokens;
- data visualization: loading embeddings from a FAISS index, reducing their dimensionality with PCA and/or t-SNE, and visualizing them in an interactive 3D graph;
- the creation of binary indexes for search with scalar (int8) rescoring.

Link to GitHub: https://github.com/louisbrulenaudet/ragoon
Link to the 🤗 Space: louisbrulenaudet/ragoon
louisbrulenaudet
posted an update 6 months ago
You can now find OBIS, the Ocean Biodiversity Information System, on Hugging Face with 128M rows, streamable via the Datasets package 🤗

The datasets are integrated, allowing seamless search and mapping by species name, higher taxonomic level, geographic area, depth, time, and environmental parameters. OBIS originates from the Census of Marine Life (2000-2010) and was adopted as a project under IOC-UNESCO’s International Oceanographic Data and Information Exchange (IODE) programme in 2009.

Collectively, they have provided over 45 million observations of nearly 120,000 marine species, ranging from bacteria to whales, from the surface to 10,900 meters depth, and from the tropics to the poles.

Link to the dataset: louisbrulenaudet/obis