💬🔥Releasing idefics2-8b-chatty, the chat-optimized version of Idefics2!
It is a very efficient (8B parameters) state-of-the-art VLM, has been red-teamed, and comes with a few surprises:
- 📖 Paper dissecting a lot of the experimental insights we learned building Idefics2
- 🏎️ TGI integration for blazing-fast inference (you can already run it locally with < 24GB GPU memory; a quick local-inference sketch follows below)
- 🏆 Ranking 2nd in its category (< 10B, open weights) on the awesome Open VLM Leaderboard, and now appearing in the incredible Vision Arena
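For local experimentation, a minimal sketch of running idefics2-8b-chatty through transformers (rather than TGI) might look like the following; the half-precision setting and the placeholder image are illustrative assumptions, not an official recipe:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load the chat-optimized checkpoint in half precision to fit on a single GPU.
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-chatty")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b-chatty",
    torch_dtype=torch.float16,
).to("cuda")

# Placeholder image for illustration; replace with your own PIL image.
image = Image.new("RGB", (512, 512), color="white")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```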
We released a resource that might come in handy: The Cauldron 🍯
The Cauldron is a massive, manually-curated collection of 50 vision-language datasets for instruction fine-tuning: 3.6M images and 30.3M query/answer pairs.
It covers a large variety of downstream uses: visual question answering on natural images, OCR, document/chart/figure/table understanding, textbook/academic questions, reasoning, captioning, spotting differences between 2 images, and screenshot-to-code. HuggingFaceM4/the_cauldron
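As a rough sketch, each sub-dataset can be loaded as its own configuration with datasets; the "ai2d" config name and the column names below are assumptions to verify against the dataset card:

```python
from datasets import load_dataset

# Each of the 50 sub-datasets is exposed as a separate configuration;
# "ai2d" is used here purely as an illustrative example.
ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d", split="train")

sample = ds[0]
# A sample pairs one or more images with a list of user/assistant turns
# (column names "images" and "texts" assumed from the dataset card).
print(sample["texts"][0]["user"])
print(sample["texts"][0]["assistant"])
```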
💪 Strong 8B-parameter model: often on par with open 30B counterparts.
🔓 Open license: Apache 2.0.
🚀 Strong improvement over Idefics1: +12 points on VQAv2 and +30 points on TextVQA, with 10x fewer parameters.
📚 Better data: boosting OCR capabilities with 6TB of documents to transcribe, and improving QA capabilities on charts/figures/diagrams.
🕵️‍♀️ Transparent training data: inspect and build upon all the data (tens of TB) we trained on.
🔲 More natural image processing: strategies to treat images in their native resolution and native aspect ratio.
📸 High-resolution images: resolutions up to 980 x 980, with strategies to trade computational efficiency for performance (see the sketch below).
😎 Two checkpoints: releasing both the base checkpoint and the instruction fine-tuned checkpoint. Chat version to come.
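As one hedged illustration of that efficiency/performance trade-off, the image resolution can be capped when loading the processor; the size keys below follow my reading of the Idefics2 image processor and should be checked against the model card:

```python
from transformers import AutoProcessor

# Capping the longest image edge below the 980 default trades some accuracy
# for lower memory use and faster inference (size keys assumed from the
# Idefics2 image processor; verify against the model card).
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    size={"longest_edge": 448, "shortest_edge": 378},
)
```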
When Greg Brockman demoed GPT-4 by hand-sketching a joke website on a piece of paper and asking the system to convert that sketch into an HTML webpage, it blew my mind.
Can you build your own screenshot-to-HTML system with far fewer resources?
With this new resource, most likely yes! Current vision-language models can learn this task with the right data (and the right tricks).
We have iterated on WebSight-v0.1 and are releasing its v0.2. WebSight is an open dataset of synthetically generated webpages with their corresponding rendered screenshots.
A few noticeable improvements:
- 💨 From traditional CSS to Tailwind CSS. Tailwind styles are embedded directly in the HTML class attribute, which is much more compact.
- 🚛 2M pairs of synthetic HTML webpages with their associated rendered screenshots, along with the LLM-generated prompt used to create each webpage.
- 🖼️ Much more visually appealing pages thanks to the integration of real images.
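A minimal loading sketch, assuming the v0.2 data is exposed as a configuration named "v0.2" and that each row carries the screenshot, the HTML source, and the generating prompt (config and column names are assumptions to verify on the dataset card):

```python
from datasets import load_dataset

# Stream the dataset to avoid downloading all 2M screenshot/HTML pairs at once.
# The "v0.2" config name and the column names below are assumptions; check the
# HuggingFaceM4/WebSight dataset card for the exact schema.
ds = load_dataset("HuggingFaceM4/WebSight", "v0.2", split="train", streaming=True)

for sample in ds:
    screenshot = sample["image"]         # rendered screenshot (PIL image)
    html = sample["text"]                # Tailwind-based HTML source
    idea = sample["llm_generated_idea"]  # prompt used to generate the page
    print(idea)
    break
```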
But when properly trained, a small ~8B model can be very accurate on these IQ tests, based solely on visual inputs!
Raven's Progressive Matrices are visual intelligence tests invented in the 1930s and designed to measure abstract reasoning and problem-solving ability. The test consists of a series of matrices or patterns with one part missing. The task for the test-taker is to identify the missing piece from a set of options.
Such puzzles can be procedurally generated at scale. HuggingFaceM4/RAVEN is one example. The complexity of the puzzles is then controlled by the complexity of the generation procedure.
We fine-tuned an early checkpoint of our upcoming vision-and-language model idefics2 on that dataset. The resulting checkpoint yields ~91% accuracy! No chain of thought, no pre-processing of the image, no additional inputs or metadata: just the RAVEN problem fed to the model as a standalone image (plus the short instruction “Which figure should complete the logical sequence?”), with standard cross-entropy as the training objective.
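A rough sketch of how each puzzle could become a single training example under that setup; the split and column names ("image", "label") are assumptions to verify against the HuggingFaceM4/RAVEN dataset card:

```python
from datasets import load_dataset

# Each RAVEN puzzle is a single rendered image; the target is the index of the
# panel that completes the sequence. Column names ("image", "label") and the
# split name are assumptions; check the HuggingFaceM4/RAVEN dataset card.
ds = load_dataset("HuggingFaceM4/RAVEN", split="train")

INSTRUCTION = "Which figure should complete the logical sequence?"

def to_training_example(sample):
    # The model sees only the standalone puzzle image and the short instruction;
    # training minimizes standard cross-entropy over the answer tokens.
    return {
        "images": [sample["image"]],
        "prompt": INSTRUCTION,
        "target": str(sample["label"]),
    }

example = to_training_example(ds[0])
print(example["prompt"], "->", example["target"])
```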
Just more evidence that in a lot of cases, for a well-scoped problem, you will be better off paying to collect & annotate data and fine-tuning a model on that data (i.e. building your own AI) than wastefully trying to solve the problem with a gigantic general-purpose model called through a paid API!
An increasing number of engineers and researchers are developing foundation models. Navigating the tools, resources, codebases, and best-practice guides is daunting for new contributors.
Introducing the Foundation Model Development Cheatsheet, a succinct guide with 250+ resources & tools for:
- 📖 sourcing data
- 🔍 documenting & audits
- 🌍 environmental impact
- 🥊 risks & harms eval
- 🎮 release & monitoring