Data is Better Together - Russian Language Team

community

AI & ML interests

Russian speakers working on prompt translation as a part of the Data is Better Together initiative, building impactful community datasets.

Recent Activity

DIBT-Russian's activity

ZennyKenny 
posted an update 11 days ago
Really pleased with the Bring Your Own Model (BYOM) feature in Brave Browser: https://brave.com/blog/byom-nightly/

Takes about 5 minutes to configure your own locally running LLM as an in-browser assistant. Totally local, totally private, totally yours.
ZennyKenny 
posted an update 13 days ago
On-demand audio transcription is an often-requested service without many good options on the market.

Using Hugging Face Spaces with Gradio SDK and the OpenAI Whisper model, I've put together a simple interface that supports the transcription and summarisation of audio files up to five minutes in length, completely open source and running on CPU upgrade. The cool thing is that it's built without a dedicated inference endpoint, completely on public infrastructure.

Check it out: ZennyKenny/AudioTranscribe

I wrote a short article about the backend mechanics for those who are interested: https://huggingface.co/blog/ZennyKenny/on-demand-public-transcription
dvilasuero 
posted an update about 2 months ago
🌐 Announcing Global-MMLU: an improved, open MMLU dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.

Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior Técnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.

🏷️ 200+ contributors used Argilla to annotate MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!

Thanks to this annotation process, the open dataset contains two subsets:

1. 🗽 Culturally Agnostic: no specific regional, cultural knowledge is required.
2. ⚖️ Culturally Sensitive: requires dialect, cultural knowledge or geographic knowledge to answer correctly.

Moreover, we provide high-quality translations for 25 of the 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.

I hope this will ensure a better understanding of the limitations and challenges for making open AI useful for many languages.

Dataset: CohereForAI/Global-MMLU
ZennyKenny 
posted an update 2 months ago
I've joined the Bluesky community. Interested to see what decentralized social media looks like in action: https://bsky.app/profile/kghamilton.bsky.social

Looking forward to following other AI builders, tech enthusiasts, goth doomscrollers, and ironic meme creators.
ZennyKenny 
posted an update 2 months ago
Using AI to teach English as a Foreign Language? EFL teachers often have busy schedules, variable class sizes, and unexpected cancellations. Introducing VocabSova: ZennyKenny/VocabSova

VocabSova is a simple chatbot interface that helps teachers create topical vocabulary lists, custom worksheets using that vocabulary, and group activities on a defined theme for a specific English-speaking level (according to CEFR international standards).

There is a great use case for AI in nearly every field, and language learning is a particularly apt domain in my opinion. VocabSova is in active development during its Alpha release, and all feedback is welcome.
dvilasuero 
posted an update 3 months ago
Build datasets for AI on the Hugging Face Hub—10x easier than ever!

Today, I'm excited to share our biggest feature since we joined Hugging Face.

Here’s how it works:

1. Pick a dataset—upload your own or choose from 240K open datasets.
2. Paste the Hub dataset ID into Argilla and set up your labeling interface.
3. Share the URL with your team or the whole community!

And the best part? It’s:
- No code – no Python needed
- Integrated – all within the Hub
- Scalable – from solo labeling to 100s of contributors

I am incredibly proud of the team for shipping this after weeks of work and many quick iterations.

Let's make this sentence obsolete: "Everyone wants to do the model work, not the data work."


Read, share, and like the HF blog post:
https://huggingface.co/blog/argilla-ui-hub
ZennyKenny 
updated a Space 4 months ago
dvilasuero 
posted an update 4 months ago
Explore FinePersonas visually with Argilla and black-forest-labs/FLUX.1-schnell


Excited to share this space where the community can explore a tiny subset of FinePersonas

argilla/finepersonas


Dataset built with distilabel and free Serverless endpoints

This is just a first step towards more interesting experiments with FinePersonas. For example, can we use it to assess biases in text2image models?

If you have ideas I'd love to hear them in the comments!

ZennyKenny 
posted an update 5 months ago
Very excited to have made the list and been invited to the London event of OpenAI DevDay 2024 on 30 October! Looking forward to seeing what the future of AI dev holds, connecting with other professionals in the field, and advocating for open source AI!

https://openai.com/devday/
dvilasuero 
posted an update 8 months ago
Today is a huge day in Argilla’s history. We couldn’t be more excited to share this with the community: we’re joining Hugging Face!

We’re embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.

Over the past year, we’ve been collaborating with Hugging Face on countless projects: becoming a launch partner for Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr’s learnings, running the Data is Better Together initiative with hundreds of community contributors, and releasing argilla/OpenHermesPreferences, one of the largest open preference tuning datasets.

After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we’re now the same team.

To those of you who’ve been following us, this won’t be a huge surprise, but it will be a big deal in the coming months. This acquisition means we’ll double down on empowering the community to build and collaborate on high quality datasets, we’ll bring full support for multimodal datasets, and we’ll be in a better place to collaborate with the Open Source AI community. For enterprises, this means that the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.

As a founder, I am proud of the Argilla team. We're now part of something bigger and a larger team but with the same values, culture, and goals. Grateful to have shared this journey with my beloved co-founders Paco and Amélie.

Finally, huge thanks to the Chief Llama Officer @osanseviero for sparking this and being such a great partner during the acquisition process.

Would love to answer any questions you have so feel free to add them below!
ZennyKenny 
posted an update 8 months ago
Thanks to the incredible collaboration of 14 community annotators, @davanstrien of HF, and @dvilasuero et al. of Argilla, DIBT (https://huggingface.co/DIBT) is pleased to release a dataset of 500 of the best curated LLM prompts, translated into Russian and available for use: https://huggingface.co/datasets/DIBT/MPEP_RUSSIAN.

More to come from the MPEP initiative! Interested in annotating or leading a language team? https://github.com/huggingface/data-is-better-together/tree/main/prompt_translation
ZennyKenny 
posted an update 11 months ago
Are you interested in contributing to open source multilingual AI with Hugging Face and Argilla?

The MPEP initiative (https://github.com/huggingface/data-is-better-together/tree/main/prompt_translation) of the Data is Better Together project offers the opportunity to do just that by helping to create multilingual model checkpoints.

If you're interested in contributing to the Russian-language dataset, please get in touch as I am the Russian-language lead. If you're interested in contributing to another language, the MPEP link above has all the information you need to do so. 🤗
dvilasuero 
posted an update 11 months ago
🔥 Community and Data Quality Are More For Alignment

A recipe to replicate SPIN (Self-Play Fine Tuning) with 30x less data:

🗣️ 50K samples vs 1.8K prompts curated by the 350+ amazing DIBT contributors.
⚗️ Distillation of Mistral Large instead of OpenAI
🙌 Open data & code with ⚗️distilabel
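
As a quick check of the "30x less data" figure, here is a minimal sketch of the arithmetic using only the two counts quoted above. The variable names are illustrative, and the numbers are the rounded values from this post rather than exact dataset sizes.

```python
# Rounded counts quoted in the post (not exact dataset sizes).
spin_samples = 50_000   # samples used in the original SPIN setup
dibt_prompts = 1_800    # prompts curated by DIBT contributors

# How many times smaller the DIBT prompt set is.
ratio = spin_samples / dibt_prompts
print(f"{ratio:.1f}x fewer prompts")
```

The ratio comes out a little under 28, which the post rounds to 30x.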

SPIN Paper:
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models (2401.01335)

SPIN DIBT Collection with datasets and models:
argilla/dibt-prompt-collective-spin-65ef59062518776024395fc3

Repo:
https://github.com/argilla-io/distilabel-spin-dibt

Joint work with the amazing DIBT community 👇
@aashish1904 , @flozi00 , @sayhan , @munish0838 , @0-hero , @dvilasuero , @eren23 , @davanstrien , @ahnz , @BlackKakapo , @kitano-o , @mmhamdy , @sdiazlor , @Stopwolf , @gabrielmbmb , @tculler91 , @plaguss , @ignacioct , @Hugi-R , @davidberenstein1957 , @Korla , @alvarobartt , @Hugs4Llamas , @Sumandora , @nataliaElv , @jfcalvo , @Averill , @steventrouble , @vasilis , @aeros93 , @kayyshf , @thomasgauthier , @jeromebas , @Ameeeee , @ayoubelmhamdi , @TuringsSolutions , @efels , @Haleyok , @abrazador , @emessy , @Nindaleth , @burtenshaw , @vicgalle , @CortexPE , @casey-martin , @Leire-aguirre-eguiluz , @mrfakename , @Portias600kNeurons , @nathaliepett , @Filippo
dvilasuero 
posted an update 11 months ago
🚀🧙🏼‍♂️Introducing OpenHermesPreferences: the largest open AI feedback dataset for RLHF & DPO

> Using LLMs to improve other LLMs, at scale!

Built in collaboration with the H4 Hugging Face team, it's a 1M preferences dataset on top of the amazing @teknium 's dataset.

Dataset:
argilla/OpenHermesPreferences

The dataset is another example of open collaboration:

> The H4 team created responses with Mixtral using llm-swarm

> Argilla created responses with NousResearch Hermes-2-Yi-34B using distilabel

> The H4 team ranked these responses + the original response with PairRM from AllenAI, University of Southern California, Zhejiang University ( @yuchenlin @DongfuTingle and colleagues)

We hope this dataset will help the community's research efforts towards understanding the role of AI feedback for LLM alignment.

We're particularly excited about the ability of filtering specific subsets to improve LLM skills like math or reasoning.

Here's how easy it is to filter by subset:

from datasets import load_dataset

# Dataset ID as listed under "Dataset:" in this post
ds = load_dataset("argilla/OpenHermesPreferences", split="train")

# Get the categories of the source dataset
# ['airoboros2.2', 'CamelAI', 'caseus_custom', ...]
sources = ds.unique("source")

# Filter for a subset, using 6 worker processes
ds_filtered = ds.filter(lambda x: x["source"] in ["metamath", "EvolInstruct_70k"], num_proc=6)

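To see what the filter selects without downloading anything, here is a minimal pure-Python sketch of the same membership test. The rows below are hypothetical stand-ins rather than real records, and a list comprehension stands in for `ds.filter`; only the `source` field name and the two subset names come from the snippet above.

```python
# Hypothetical rows standing in for dataset records; only "source" matters here.
rows = [
    {"source": "metamath", "prompt": "example prompt 1"},
    {"source": "CamelAI", "prompt": "example prompt 2"},
    {"source": "EvolInstruct_70k", "prompt": "example prompt 3"},
    {"source": "airoboros2.2", "prompt": "example prompt 4"},
]

# Subsets to keep; a set makes the membership test cheap.
wanted = {"metamath", "EvolInstruct_70k"}

# Equivalent of ds.unique("source"): the distinct source labels.
sources = {row["source"] for row in rows}

# Equivalent of the filter predicate: keep rows whose source is wanted.
filtered = [row for row in rows if row["source"] in wanted]

print(sorted(sources))
print([row["source"] for row in filtered])
```

The same predicate can be passed to `ds.filter` unchanged; on the full dataset, `num_proc` only parallelises the work and does not change which rows are kept.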

As usual, all the scripts to reproduce this work are available and open to the community!

argilla/OpenHermesPreferences

Such a fun collab between @vwxyzjn , @plaguss , @kashif , @philschmid & @lewtun !

Open Source AI FTW!