Aymeric Roucher

m-ric

AI & ML interests

MLE at Hugging Face 🤗. LLMs, Agents, RAG, Multimodal.

Posts

🧠 This Stanford paper might hold the key to OpenAI o1’s performance: what makes Chain of Thought so effective? ⇒ It unlocks inherently sequential tasks that transformers otherwise can’t solve!

💭 Reminder: Chain of Thought (CoT) means instructing the model to “think step by step”. Often it’s literally just adding “let’s think step by step” to the prompt.
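
For illustration, here is a minimal sketch of zero-shot CoT prompting with huggingface_hub's InferenceClient; the model name and question are placeholder choices of mine, not taken from the paper:

```python
from huggingface_hub import InferenceClient

# Any instruct model served on the Hub works here; this one is just an example.
client = InferenceClient("meta-llama/Meta-Llama-3-8B-Instruct")

question = "If I square 3, then square the result, then square it again, what do I get?"

# Zero-shot CoT: simply append the magic phrase to elicit step-by-step reasoning.
response = client.chat_completion(
    messages=[{"role": "user", "content": question + "\nLet's think step by step."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```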

🤔 This method has been shown to be unreasonably effective at increasing performance on benchmarks. However, why it works so well remains unclear.

Here's the scoop: Transformers are amazing at parallel processing, but they've always struggled with tasks that require sequential reasoning.

⛔️ For instance, if you ask them the result of 3^2^2^2^… with 20 iterations, they’ll nearly always fail.
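
To see why this task is inherently serial: each squaring needs the previous result, so the 20 steps cannot be spread across parallel computation. A toy illustration (the modulus is an arbitrary choice of mine, just to keep the numbers readable):

```python
# Iterated squaring: x -> x^2, repeated. Each step depends on the previous one,
# so 20 squarings require 20 sequential operations with no parallel shortcut.
p = 1_000_003   # arbitrary prime modulus to keep intermediate values small
x = 3
for step in range(20):
    x = (x * x) % p          # step i cannot start before step i-1 is done
print(x)                     # = 3^(2^20) mod p
```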

💡 Indeed, the researchers prove mathematically, by modeling transformer networks as logical circuits, that a transformer answering directly cannot solve sequential tasks whose required circuit depth exceeds a fixed threshold.

But CoT enables sequential reasoning:

- 🧱 Each step in the CoT corresponds to simulating one operation in a complex circuit (see the sketch after this list).
- 🔄 This allows the transformer to "reset" the depth of intermediate outputs, overcoming previous limitations.
- 🚀 Thus, with CoT, constant-depth transformers can now solve ANY problem computable by polynomial-size circuits! (That's a huge class of problems in computer science.)
- 🔑 Transformers can now handle tricky tasks like iterated squaring (computing 3^2^2^2^2), composing permutations, and evaluating circuits - stuff that requires serial computation.
- 📊 The improvement is especially dramatic for transformers with a limited depth. Empirical tests on four arithmetic problems showed massive accuracy gains with CoT on inherently serial tasks.
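
Here is a hypothetical illustration of what “one operation per CoT step” means for the iterated-squaring task: each intermediate value gets written into the output text, so producing the next step only requires a single squaring over what is already on the page, rather than the whole 20-deep computation inside one forward pass:

```python
# Hypothetical CoT "scratchpad" for the iterated-squaring task: each emitted line
# holds one intermediate result, so the model's working state lives in the text
# itself and each new step is a single, shallow operation.
p, x = 1_000_003, 3
lines = []
for step in range(1, 21):
    x = (x * x) % p
    lines.append(f"Step {step}: square the previous value -> {x}")
print("\n".join(lines))
print(f"Answer: {x}")
```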

Main takeaway: Chain-of-thought isn't just a neat trick - it fundamentally expands what transformer models can do!

Read the paper 👉 Chain of Thought Empowers Transformers to Solve Inherently Serial Problems (2402.12875)

Anthropic just released a chunk-contextualization technique that vastly improves RAG performance! 🔥

Crash reminder: Retrieval Augmented Generation (RAG) is a widely-used technique for improving your LLM chatbot's answers to user questions.

It goes like this: instead of generating an answer straight away, the LLM call is preceded by a Retrieval step that finds relevant documents in your knowledge base through semantic search, then appends the top K documents to the prompt. ➡️ As a result, the LLM answer is grounded in context.
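
As a rough sketch of that retrieval step (sentence-transformers for the embeddings; the model, toy data, and prompt template are illustrative choices of mine, not a reference implementation):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy knowledge base, already split into chunks.
chunks = [
    "ACME's Q2 2023 revenue grew 3% over the previous quarter.",
    "The company was founded in 1999 and is headquartered in Berlin.",
    "ACME's main product line is industrial sensors.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # any embedding model works
chunk_embeddings = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Semantic search: return the top-k chunks most similar to the question."""
    q_emb = embedder.encode([question], normalize_embeddings=True)
    scores = chunk_embeddings @ q_emb[0]   # dot product = cosine sim (vectors are normalized)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

question = "How did ACME's revenue evolve in Q2 2023?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# `prompt` is then sent to the LLM of your choice.
```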

⛔️ The difficulty with this retrieval step is that when you split your documents into chunks to be retrieved, you lose context. So important chunks could be missed.

💡 Anthropic's just-released blog post shows that you can add some context to each chunk with one LLM call: you then embed the original chunk plus that bit of added context, so the embedding is much more representative of the chunk within its document!
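
A minimal sketch of that contextualization pass with the anthropic Python SDK; the model choice, prompt wording, and caching setup are my assumptions, not copied from their post (depending on your SDK version, prompt caching may require a beta header):

```python
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

def contextualize_chunk(document: str, chunk: str) -> str:
    """One LLM call that situates `chunk` within the whole `document`."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",   # assumed model choice
        max_tokens=150,
        system=[
            {
                "type": "text",
                "text": f"<document>\n{document}\n</document>",
                # Prompt caching: the long, identical document prefix is cached,
                # so repeating it for every chunk of the document costs far less.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {
                "role": "user",
                "content": (
                    f"Here is a chunk from the document above:\n<chunk>\n{chunk}\n</chunk>\n"
                    "Write 1-2 sentences situating this chunk within the document, "
                    "to improve search retrieval. Answer with the context only."
                ),
            }
        ],
    )
    context = response.content[0].text
    return f"{context}\n{chunk}"   # this contextualized text is what gets embedded

# Usage (with `chunks` and `embedder` as in the earlier retrieval sketch):
# contextualized = [contextualize_chunk(full_doc_text, c) for c in chunks]
# chunk_embeddings = embedder.encode(contextualized, normalize_embeddings=True)
```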

🤔 Isn't that crazy expensive? It would have been before, but not so much anymore with their new prompt caching feature, which makes sending thousands of requests that share the same long prompt much less expensive. They give an indicative price tag of only $1.02 per million document tokens processed!

✅ And this vastly improves performance on their benchmark!

Read their blog post 👉 https://www.anthropic.com/news/contextual-retrieval