💬🔥Releasing idefics2-8b-chatty, the chat-optimized version of Idefics2!
It is a very efficient (8B parameters) state-of-the-art VLM, has been red-teamed, and comes with a few surprises:
- 📖 Paper dissecting a lot of the experimental insights we learned building Idefics2
- 🏎️ TGI integration for blazing-fast inference (you can already run it locally with < 24GB GPU memory; a quick local-inference sketch follows below)
- 🏆 Ranking 2nd in its category (< 10B, open weights) on the awesome Open VLM Leaderboard, and now appearing in the incredible Vision Arena
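For local experimentation, a minimal sketch of running idefics2-8b-chatty through transformers (rather than TGI) might look like the following; the half-precision setting and the placeholder image are illustrative assumptions, not an official recipe:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load the chat-optimized checkpoint in half precision to fit on a single GPU.
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-chatty")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b-chatty",
    torch_dtype=torch.float16,
).to("cuda")

# Placeholder image for illustration; replace with your own PIL image.
image = Image.new("RGB", (512, 512), color="white")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```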
We released a resource that might come in handy: The Cauldron 🍯
The Cauldron is a massive, manually-curated collection of 50 vision-language datasets for instruction fine-tuning: 3.6M images and 30.3M query/answer pairs.
It covers a large variety of downstream uses: visual question answering on natural images, OCR, document/chart/figure/table understanding, textbook/academic questions, reasoning, captioning, spotting differences between 2 images, and screenshot-to-code. HuggingFaceM4/the_cauldron
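As a rough sketch, each sub-dataset can be loaded as its own configuration with datasets; the "ai2d" config name and the column names below are assumptions to verify against the dataset card:

```python
from datasets import load_dataset

# Each of the 50 sub-datasets is exposed as a separate configuration;
# "ai2d" is used here purely as an illustrative example.
ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d", split="train")

sample = ds[0]
# A sample pairs one or more images with a list of user/assistant turns
# (column names "images" and "texts" assumed from the dataset card).
print(sample["texts"][0]["user"])
print(sample["texts"][0]["assistant"])
```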
💪 Strong 8B-parameter model: often on par with open 30B counterparts.
🔓 Open license: Apache 2.0.
🚀 Strong improvement over Idefics1: +12 points on VQAv2 and +30 points on TextVQA, with 10x fewer parameters.
📚 Better data: boosting OCR capabilities with 6TB of documents to transcribe, and improving QA capabilities on charts/figures/diagrams.
🕵️‍♀️ Transparent training data: inspect and build upon all the data (tens of TB) we trained on.
🔲 More natural image processing: strategies to treat images in their native resolution and native aspect ratio.
📸 High-resolution images: resolutions up to 980 x 980, with strategies to trade computational efficiency for performance (see the sketch below).
😎 Two checkpoints: releasing both the base checkpoint and the instruction fine-tuned checkpoint. Chat version to come.
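As one hedged illustration of that efficiency/performance trade-off, the image resolution can be capped when loading the processor; the size keys below follow my reading of the Idefics2 image processor and should be checked against the model card:

```python
from transformers import AutoProcessor

# Capping the longest image edge below the 980 default trades some accuracy
# for lower memory use and faster inference (size keys assumed from the
# Idefics2 image processor; verify against the model card).
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    size={"longest_edge": 448, "shortest_edge": 378},
)
```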
When Greg Brockman demoed GPT-4 by hand-sketching a joke website on a piece of paper and asking the system to convert that sketch into an HTML webpage, it blew my mind.
Can you build your own screenshot-to-HTML system with far fewer resources?
With this new resource, most likely yes! Current vision-language models can learn this task with the right data (and the right tricks).
We have iterated on WebSight-v0.1 and are releasing its v0.2. WebSight is an open dataset of synthetically generated webpages with their corresponding rendered screenshots.
A few noticeable improvements:
- 💨 From traditional CSS to Tailwind CSS. Tailwind styles are embedded directly in the HTML class attribute, which is much more compact.
- 🚛 2M pairs of synthetic HTML webpages with their associated rendered screenshots, along with the LLM-generated prompt used to create each webpage.
- 🖼️ Much more visually appealing pages thanks to the integration of real images.
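A minimal loading sketch, assuming the v0.2 data is exposed as a configuration named "v0.2" and that each row carries the screenshot, the HTML source, and the generating prompt (config and column names are assumptions to verify on the dataset card):

```python
from datasets import load_dataset

# Stream the dataset to avoid downloading all 2M screenshot/HTML pairs at once.
# The "v0.2" config name and the column names below are assumptions; check the
# HuggingFaceM4/WebSight dataset card for the exact schema.
ds = load_dataset("HuggingFaceM4/WebSight", "v0.2", split="train", streaming=True)

for sample in ds:
    screenshot = sample["image"]         # rendered screenshot (PIL image)
    html = sample["text"]                # Tailwind-based HTML source
    idea = sample["llm_generated_idea"]  # prompt used to generate the page
    print(idea)
    break
```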
But when properly trained, a small ~8B model can be very accurate on these IQ tests, based solely on visual inputs!
Raven's Progressive Matrices are visual intelligence tests invented in the 1930s and designed to measure abstract reasoning and problem-solving ability. The test consists of a series of matrices or patterns with one part missing. The task for the test-taker is to identify the missing piece from a set of options.
Such puzzles can be procedurally generated at scale. HuggingFaceM4/RAVEN is one example. The complexity of the puzzles is then controlled by the complexity of the generation procedure.
We fine-tuned an early checkpoint of our upcoming vision-and-language model idefics2 on that dataset. The resulting checkpoint yields ~91% accuracy! No chain of thought, no pre-processing of the image, no additional inputs or metadata: just the RAVEN problem fed to the model as a standalone image (plus the short instruction “Which figure should complete the logical sequence?”), with standard cross-entropy as the training objective.
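A rough sketch of how each puzzle could become a single training example under that setup; the split and column names ("image", "label") are assumptions to verify against the HuggingFaceM4/RAVEN dataset card:

```python
from datasets import load_dataset

# Each RAVEN puzzle is a single rendered image; the target is the index of the
# panel that completes the sequence. Column names ("image", "label") and the
# split name are assumptions; check the HuggingFaceM4/RAVEN dataset card.
ds = load_dataset("HuggingFaceM4/RAVEN", split="train")

INSTRUCTION = "Which figure should complete the logical sequence?"

def to_training_example(sample):
    # The model sees only the standalone puzzle image and the short instruction;
    # training minimizes standard cross-entropy over the answer tokens.
    return {
        "images": [sample["image"]],
        "prompt": INSTRUCTION,
        "target": str(sample["label"]),
    }

example = to_training_example(ds[0])
print(example["prompt"], "->", example["target"])
```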
Just more evidence that in a lot of cases, for a well-scoped problem, you will be better off paying to collect & annotate data and fine-tuning a model on that data (i.e. building your own AI) than wastefully trying to solve the problem with a gigantic general-purpose model called through a paid API!
An increasing number of engineers and researchers are developing foundation models. Navigating the tools, resources, codebases, and best-practice guides is daunting for new contributors.
Introducing the Foundation Model Development Cheatsheet, a succinct guide with 250+ resources & tools for:
- 📖 sourcing data
- 🔍 documenting & audits
- 🌍 environmental impact
- 🥊 risks & harms eval
- 🎮 release & monitoring