Fine-tune Gemma3n on videos with audio tracks on a Colab A100 🔥 Just dropped the notebook where you can learn how to fine-tune Gemma3n on images + audio + text at the same time!
Keep in mind, it's made for educational purposes. We use LoRA, audio resampling & video frame downsampling to fit training under 40 GB of VRAM. Stretch modalities and unfreeze layers as you wish! Notebook: merve/smol-vision
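For the curious, here is roughly what the LoRA + preprocessing setup looks like. This is a minimal sketch, not the actual notebook: it assumes a recent transformers release with Gemma 3n support, and the checkpoint id, target module names, and 16 kHz audio rate are my assumptions.

```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-3n-E2B-it"  # assumed checkpoint name, swap in the one you use

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps the memory footprint low
    device_map="auto",
)

# Attach LoRA adapters to the attention projections; the base weights stay frozen,
# so only a small fraction of parameters is actually trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

def resample_audio(waveform, orig_sr, target_sr=16_000):
    # Resampling audio down to the encoder's expected rate keeps token sequences short.
    return torchaudio.functional.resample(waveform, orig_sr, target_sr)
```

Training only the adapters while the base model stays frozen, plus the audio/video downsampling, is what keeps the run inside the 40 GB budget.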
multimodal
> new moondream (VLM) is out: a 4-bit quantized (with QAT) version of moondream-2b that runs on 2.5 GB VRAM at 184 tps with only a 0.6% drop in accuracy (OS)
> ByteDance released BAGEL-7B, an omni model that understands and generates both image + text. They also released Dolphin, a document parsing VLM (OS)
> Google DeepMind dropped MedGemma at I/O, a VLM that can interpret medical scans, and Gemma 3n, an omni model with competitive LLM performance
> MMaDa is a new 8B diffusion language model that can generate both images and text
Just completed the AI Agents course and wow, that capstone project really makes you understand how to build agents that can handle real-world complexity!
The final project uses the GAIA dataset: your agent has to solve tasks like analyzing Excel files, processing audio recordings, answering questions about YouTube videos, and diving into research papers. These aren't toy examples; it's the messy, multimodal stuff agents need to handle in practice.
Whether you're just getting started with agents or want to go deeper with tools like LangChain, LlamaIndex, and SmolAgents, this course has tons of useful stuff. A few key insights:
- Code agents are incredibly versatile once you get the architecture right (see the sketch below)
- The sweet spot is finding the right balance of guidance vs autonomy for each use case
- Once the logic clicks, the possibilities really are endless - it's like letting LLMs break free from the chatbox
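To make the code-agent point concrete, here is a minimal sketch of the kind of agent the course has you build with smolagents. It is not the course solution: the model id, tool list, and the spreadsheet task are assumptions, and recent smolagents versions name the Hugging Face inference wrapper InferenceClientModel.

```python
from smolagents import CodeAgent, DuckDuckGoSearchTool, InferenceClientModel

# Placeholder model; GAIA-style tasks also need tools for spreadsheets,
# audio transcription, YouTube transcripts, and document parsing.
model = InferenceClientModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],
    model=model,
    additional_authorized_imports=["pandas"],  # lets generated code open Excel/CSV files
)

# Hypothetical task: the agent writes and executes Python to answer the question.
answer = agent.run(
    "Open 'sales_q1.xlsx' and report the total of the 'revenue' column."
)
print(answer)
```

The agent answers by writing and running code rather than emitting a fixed tool-call schema, which is what makes code agents so flexible once the tool set is in place.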
The course is free and the certification deadline is July 1st, 2025.
ByteDance also released Seed1.5-VL, a vision-language model for general-purpose multimodal reasoning. It's not open-source, but the paper and demo are available here.