
R T

dingo-actual

AI & ML interests

None yet

Recent Activity

liked a model about 21 hours ago
open-thoughts/OpenThinker-7B
liked a dataset about 21 hours ago
open-thoughts/OpenThoughts-114k
liked a dataset 1 day ago
Intel/orca_dpo_pairs

Organizations

Hugging Face Discord Community, open/ acc

dingo-actual's activity

reacted to lewtun's post with 🔥 5 days ago
We are reproducing the full DeepSeek R1 data and training pipeline so everybody can use their recipe. Instead of doing it in secret, we can do it together in the open!

🧪 Step 1: replicate the R1-Distill models by distilling a high-quality reasoning corpus from DeepSeek-R1.
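
For concreteness, here is a minimal sketch of what that distillation step could look like, assuming the teacher is served behind an OpenAI-compatible endpoint (e.g. via vLLM); the URL, prompts, and file name are placeholders, not the open-r1 code:

```python
# Sample reasoning traces from a DeepSeek-R1 teacher and store them as
# prompt/completion pairs for SFT. Endpoint URL and prompts are placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # e.g. a local vLLM server

prompts = [
    "Prove that the sum of two odd integers is even.",
    "How many primes are there between 10 and 50?",
]

with open("r1_traces.jsonl", "w") as f:
    for prompt in prompts:
        resp = client.chat.completions.create(
            model="deepseek-ai/DeepSeek-R1",  # teacher model
            messages=[{"role": "user", "content": prompt}],
            temperature=0.6,
        )
        # The reply contains the chain of thought plus the final answer;
        # both are kept as the distillation target for the student model.
        f.write(json.dumps({"prompt": prompt,
                            "completion": resp.choices[0].message.content}) + "\n")
```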

🧠 Step 2: replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will involve curating new, large-scale datasets for math, reasoning, and code.
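
The DeepSeek-R1 report describes this RL stage using GRPO, whose core trick is normalizing rule-based rewards within a group of completions sampled for the same prompt. A toy illustration of that advantage computation (not the open-r1 implementation):

```python
# Group-relative advantages as in GRPO: for each prompt, sample a group of
# completions, score them with a rule-based reward, and normalize rewards
# within the group to obtain per-sample advantages for the policy update.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) rule-based scores per completion."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each; reward 1.0 when the final
# answer is correct, 0.0 otherwise (real rewards also include format checks).
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))
```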

🔥 Step 3: show we can go from base model -> SFT -> RL via multi-stage training.

Follow along: https://github.com/huggingface/open-r1
reacted to singhsidhukuldeep's post with 👍 5 days ago
Exciting breakthrough in Text Embeddings: Introducing LENS (Lexicon-based EmbeddiNgS)!

A team of researchers from the University of Amsterdam, the University of Technology Sydney, and Tencent has developed a groundbreaking approach that outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB).

>> Key Technical Innovations:
- LENS consolidates the vocabulary space by clustering token embeddings, addressing the redundancy inherent in LLM tokenizers
- Applies bidirectional attention and pooling over hidden states to get more out of the underlying LLM
- Each dimension corresponds to a token cluster rather than an individual token, yielding more coherent and compact embeddings
- Achieves competitive performance with just 4,000-8,000-dimensional embeddings, matching the size of dense counterparts

>> Under the Hood:
The framework applies KMeans clustering to token embeddings from the language modeling head, replacing original embeddings with cluster centroids. This reduces dimensionality while preserving semantic relationships.
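
As a rough sketch of that idea (not the authors' code: the model choice, cluster count, and mean pooling are illustrative, and the paper's bidirectional attention is omitted), one could do:

```python
# Cluster the LM-head token embeddings with k-means, then represent a text by
# projecting its pooled hidden state onto the cluster centroids, giving one
# output dimension per token cluster.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; LENS is built on larger LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1. Cluster the output-embedding (LM head) rows: vocab_size -> n_clusters.
#    The paper uses thousands of clusters; 64 keeps this demo fast.
lm_head = model.get_output_embeddings().weight.detach()          # (V, d)
kmeans = KMeans(n_clusters=64, random_state=0).fit(lm_head.numpy())
centroids = torch.tensor(kmeans.cluster_centers_, dtype=torch.float32)  # (64, d)

# 2. Embed a text: mean-pool the last hidden states, then score against each
#    centroid. Each dimension of the result corresponds to one token cluster.
inputs = tok("lexicon-based embeddings", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
pooled = hidden.mean(dim=1).squeeze(0)                           # (d,)
lens_embedding = centroids @ pooled                              # (64,)
```

Because each output dimension is tied to an identifiable cluster of tokens, the resulting embedding can be read lexically, which is where the interpretability claim comes from.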

>> Results:
- Outperforms dense embeddings on MTEB benchmark
- Achieves state-of-the-art performance when combined with dense embeddings on BEIR retrieval tasks
- Demonstrates superior performance across clustering, classification, and retrieval tasks

This work opens new possibilities for more efficient and interpretable text embeddings. The code will be available soon.
reacted to burtenshaw's post with 🤯 6 days ago
AI was built on side projects!