DmitryRyumin (Dmitry Ryumin)

reacted to merve's post with 🔥 9 days ago

Post

2423

#CVPR2025 Paper Picks #1
VisionZip is a compression technique that reduces number of visual tokens to improve performance AND prefill time for vision language models
demo: Senqiao/VisionZip
paper: VisionZip: Longer is Better but Not Necessary in Vision Language Models (2412.04467)
most of the image tokens are redundant for the LLM, so the authors ask "are all visual tokens necessary?"

the method is simple:
find which tokens have the highest attention score, merge rest of the tokens based on similarity, then merge both

their method is both training-free and for fine-tuning
the authors report 5 point improvement on average of vision language tasks + 8x improvement in prefilling time for Llava-Next 7B and 13B 🤯

removing redundant tokens improve image token quality too 🥹

reacted to merve's post with ❤️ 9 days ago

Post

3574

Releases of the past week are here merve/releases-june-13-6852c3c1eaf1e0c24c958860

Here's our picks 🤓
So many interesting models released past week in open AI! 🤖

🖼️ Computer Vision/VLMs
> nanonets/Nanonets-OCR-s is the new state-of-the-art OCR model that can handle checkboxes, watermarks, tables (OS)
> Meta released facebook/v-jepa-2-6841bad8413014e185b497a6, new sota video embeddings with two new classification models (OS)
> ByteDance-Seed/SeedVR2-3B is a new 3B video restoration model (OS)

Audio
> Stepfun released stepfun-ai/Step-Audio-AQAA, new large (137B 🤯) audio language model that takes in audio and generates audio (OS)

🤖 Robotics
> nvidia released nvidia/GR00T-N1.5-3B, new open foundation vision language action model

3D
> tencent/Hunyuan3D-2.1 is the new version of Hunyuan by Tencent that can generate 3D assets from text and image prompts

reacted to AdinaY's post with 🚀 about 1 month ago

Post

2805

ByteDance is absolutely cooking lately🔥

BAGEL 🥯 7B active parameter open multimodal foundation model by Bytedance Seed team.

ByteDance-Seed/BAGEL-7B-MoT

✨ Apache 2.0
✨ Outperforms top VLMs (Qwen2.5-VL & InternVL-2.5)
✨ Mixture-of-Transformer-Experts + dual encoders
✨ Trained on trillions of interleaved tokens

reacted to merterbak's post with 🔥 about 2 months ago

Post

4876

Qwen 3 models released🔥
It offers 2 MoE and 6 dense models with following parameter sizes: 0.6B, 1.7B, 4B, 8B, 14B, 30B(MoE), 32B, and 235B(MoE).
Models: Qwen/qwen3-67dd247413f0e2e4f653967f
Blog: https://qwenlm.github.io/blog/qwen3/
Demo: Qwen/Qwen3-Demo
GitHub: https://github.com/QwenLM/Qwen3

✅ Pre-trained 119 languages(36 trillion tokens) and dialects with strong translation and instruction following abilities. (Qwen2.5 was pre-trained on 18 trillion tokens.)
✅Qwen3 dense models match the performance of larger Qwen2.5 models. For example, Qwen3-1.7B/4B/8B/14B/32B perform like Qwen2.5-3B/7B/14B/32B/72B.
✅ Three stage done while pretraining:
• Stage 1: General language learning and knowledge building.
• Stage 2: Reasoning boost with STEM, coding, and logic skills.
• Stage 3: Long context training
✅ It supports MCP in the model
✅ Strong agent skills
✅ Supports seamless between thinking mode (for hard tasks like math and coding) and non-thinking mode (for fast chatting) inside chat template.
✅ Better human alignment for creative writing, roleplay, multi-turn conversations, and following detailed instructions.

reacted to openfree's post with 🔥 2 months ago

Post

4580

📊 Papers Impact: Instant AI Grading for Your Research Papers! 🚀

🌟 Introduction
Hello, AI research community! 🎉
Introducing Papers Impact - the revolutionary AI tool that automatically grades and predicts the potential impact of research papers! 🧠💡

VIDraft/PapersImpact

✨ Key Feature: Instant Paper Grading
The core functionality is brilliantly simple: Just enter an arXiv paper ID or URL, and our AI instantly analyzes and grades the paper's potential academic impact! No need to read through the entire paper yourself - our system automatically evaluates the title and abstract to generate a normalized impact score between 0 and 1.
🎯 How It Works

Enter Paper ID or URL: Simply paste an arXiv ID (e.g., "2504.11651") or full URL
Automatic Fetching: The system retrieves the paper's title and abstract
AI Analysis: Our advanced LLaMA-based transformer model analyzes the content
Instant Grading: Receive an impact score and corresponding letter grade in seconds!

💡 Who Can Benefit?

🔬 Researchers: Pre-assess your paper before submission
📚 Students: Quickly gauge the quality of papers for literature reviews
🏫 Educators: Objectively evaluate student research
📊 Research Managers: Prioritize which papers to read in depth
🧩 Journal Editors: Get an AI second opinion on submissions

🚀 Technical Details
Our model is trained on an extensive dataset of published papers in CS.CV, CS.CL, and CS.AI fields, using NDCG optimization with Sigmoid activation and MSE loss. It's been rigorously cross-validated against historical citation data to ensure accurate impact predictions.

2 replies

·

reacted to seawolf2357's post with 🔥 2 months ago

Post

5801

📚 Papers Leaderboard - See the Latest AI Research Trends at a Glance! ✨

Hello, AI research community! Today I'm introducing a new tool for exploring research papers. Papers Leaderboard is an open-source dashboard that makes it easy to find and filter the latest AI research papers.

Heartsync/Papers-Leaderboard

🌟 Key Features

Date Filtering: View only papers published within a specific timeframe (from May 5, 2023 to present)
Title Search: Quickly find papers containing your keywords of interest
Abstract Search: Explore paper content more deeply by searching for keywords within abstracts
Automatic Updates: The database is updated with the latest papers every hour

💡 How to Use It?

Select a start date and end date
Enter keywords you want to find in titles or abstracts
Adjust the maximum number of search results for abstract searches
Results are displayed neatly in table format

reacted to openfree's post with 🔥 2 months ago

Post

5097

🧠 ThinkFlow: The Revolutionary Platform That Gives LLMs the Power to Think 🚀

Hello AI community! We're excited to introduce you to ThinkFlow, an innovative service that transforms how language models solve problems. 🎉
VIDraft/ThinkFlow-llama

✨ What is ThinkFlow?
ThinkFlow is a groundbreaking platform that automatically applies step-by-step reasoning capabilities to existing LLM models without any modifications. It makes complex problem-solving transparent, allowing you to witness the model's thought process in real-time.

🔍 Key Features

Reasoning Without Model Modifications: Add step-by-step reasoning while utilizing existing LLMs as they are ⚙️
Visualized Thinking Process: See exactly how the model analyzes and solves problems 👁️
Before & After Comparison: Compare standard responses with reasoning-enhanced outputs in real-time 📊
Improved Accuracy: Deliver more accurate solutions for complex math and logic problems 📈
Educational Value: Teach students systematic approaches to problem-solving 👨‍🏫
User-Friendly Interface: Intuitive and easy-to-use UI for seamless experience 🖥️

💡 What Problems Can It Solve?
ThinkFlow is particularly effective for various domains including:

Complex mathematical problems 🧮
Logic puzzles 🧩
Questions requiring multi-step reasoning 🤔
Scientific analysis challenges 🔬
Complex decision-making processes 📝

👨‍💻 Technical Details
ThinkFlow is built on the meta-llama/Llama-3.1-8B-Instruct model and uses carefully designed prompt chains to guide the model through step-by-step thinking. Each reasoning step builds upon the results of previous steps, culminating in a comprehensive final answer.

💬 Join Our Community!
If you have questions or suggestions about ThinkFlow, join our Discord community: https://discord.gg/openfreeai
Let's build better AI reasoning experiences together! 💪

#AI #LLM #ReasoningAI #ThinkFlow #HuggingFace #OpenSource #AIEducation

9 replies

·

reacted to seawolf2357's post with 🔥 3 months ago

Post

6695

🔥 AgenticAI: The Ultimate Multimodal AI with 16 MBTI Girlfriend Personas! 🔥

Hello AI community! Today, our team is thrilled to introduce AgenticAI, an innovative open-source AI assistant that combines deep technical capabilities with uniquely personalized interaction. 💘

🛠️ MBTI 16 Types SPACES Collections link
seawolf2357/heartsync-mbti-67f793d752ef1fa542e16560

✨ 16 MBTI Girlfriend Personas

Complete MBTI Implementation: All 16 MBTI female personas modeled after iconic characters (Dana Scully, Lara Croft, etc.)
Persona Depth: Customize age groups and thinking patterns for hyper-personalized AI interactions
Personality Consistency: Each MBTI type demonstrates consistent problem-solving approaches, conversation patterns, and emotional expressions

🚀 Cutting-Edge Multimodal Capabilities

Integrated File Analysis: Deep analysis and cross-referencing of images, videos, CSV, PDF, and TXT files
Advanced Image Understanding: Interprets complex diagrams, mathematical equations, charts, and tables
Video Processing: Extracts key frames from videos and understands contextual meaning
Document RAG: Intelligent analysis and summarization of PDF/CSV/TXT files

💡 Deep Research & Knowledge Enhancement

Real-time Web Search: SerpHouse API integration for latest information retrieval and citation
Deep Reasoning Chains: Step-by-step inference process for solving complex problems
Academic Analysis: In-depth approach to mathematical problems, scientific questions, and data analysis
Structured Knowledge Generation: Systematic code, data analysis, and report creation

🖼️ Creative Generation Engine

FLUX Image Generation: Custom image creation reflecting the selected MBTI persona traits
Data Visualization: Automatic generation of code for visualizing complex datasets
Creative Writing: Story and scenario writing matching the selected persona's style

1 reply

·

reacted to AdinaY's post with 🔥 3 months ago

Post

1968

AReal-Boba 🔥 a fully open RL Frameworks released by AntGroup, an affiliate company of Alibaba.
inclusionAI/areal-boba-67e9f3fa5aeb74b76dcf5f0a
✨ 7B/32B - Apache2.0
✨ Outperform on math reasoning
✨ Replicating QwQ-32B with 200 data under $200
✨ All-in-one: weights, datasets, code & tech report

1 reply

·

reacted to KaiChen1998's post with 👍 3 months ago

Post

4916

📢 Our EMOVA paper has been accepted by CVPR 2025, and we are glad to release all resources, including code (training & inference), datasets (training & evaluation), and checkpoints (EMOVA-3B/7B/72B)!

🤗 EMOVA is a novel end-to-end omni-modal LLM that can see, hear and speak. Given omni-modal (i.e., textual, visual and speech) inputs, EMOVA can generate both textual and speech responses with vivid emotional controls by utilizing the speech decoder and a style controller.

✨ EMOVA Highlights
✅ State-of-the-art omni-modality: EMOVA achieves SoTA comparable results on both vision-language and speech benchmarks simultaneously.
✅ Device adaptation: our codebase supports training/inference on both NVIDIA GPUs (e.g., A800 & H20) and Ascend NPUs (e.g., 910B3)!
✅ Modular design: we integrate multiple implementations of vision encoder, vision projector, and language model, even including the most recent DeepSeekMoE-tiny!

🔥 You are all welcome to try and star!
- Project page: https://emova-ollm.github.io/
- Github: https://github.com/emova-ollm/EMOVA
- Demo: Emova-ollm/EMOVA-demo

reacted to singhsidhukuldeep's post with 🔥 4 months ago

Post

6963

Exciting New Tool for Knowledge Graph Extraction from Plain Text!

I just came across a groundbreaking new tool called KGGen that's solving a major challenge in the AI world - the scarcity of high-quality knowledge graph data.

KGGen is an open-source Python package that leverages language models to extract knowledge graphs (KGs) from plain text. What makes it special is its innovative approach to clustering related entities, which significantly reduces sparsity in the extracted KGs.

The technical approach is fascinating:

1. KGGen uses a multi-stage process involving an LLM (GPT-4o in their implementation) to extract entities and relations from source text
2. It aggregates graphs across sources to reduce redundancy
3. Most importantly, it applies iterative LM-based clustering to refine the raw graph

The clustering stage is particularly innovative - it identifies which nodes and edges refer to the same underlying entities or concepts. This normalizes variations in tense, plurality, stemming, and capitalization (e.g., "labors" clustered with "labor").

The researchers from Stanford and University of Toronto also introduced MINE (Measure of Information in Nodes and Edges), the first benchmark for evaluating KG extractors. When tested against existing methods like OpenIE and GraphRAG, KGGen outperformed them by up to 18%.

For anyone working with knowledge graphs, RAG systems, or KG embeddings, this tool addresses the fundamental challenge of data scarcity that's been holding back progress in graph-based foundation models.

The package is available via pip install kg-gen, making it accessible to everyone. This could be a game-changer for knowledge graph applications!

reacted to m-ric's post with 🚀 4 months ago

Post

4880

We now have a Deep Research for academia: SurveyX automatically writes academic surveys nearly indistinguishable from human-written ones 🔥

Researchers from Beijing and Shanghai just published the first application of a deep research system to academia: their algorithm, given a question, can give you a survey of all papers on the subject.

To make a research survey, you generally follow two steps, preparation (collect and organize papers) and writing (outline creation, writing, polishing). Researchers followed the same two steps and automated them.

🎯 For the preparation part, a key part is find all the important references on the given subject.
Researchers first cast a wide net of all relevant papers. But then finding the really important ones is like distilling knowledge from a haystack of information. To solve this challenge, they built an “AttributeTree” object that structures key information from citations. Ablating these AttributeTrees significantly decreased structure and synthesis scores, so they were really useful!

📝 For the writing part, key was to get a synthesis that's both short and true. This is not easy to get with LLMs! So they used methods like LLM-based deduplication to shorten the too verbose listings made by LLMs, and RAG to grab original quotes instead of made-up ones.

As a result, their system outperforms previous approaches by far!

As assessed by LLM-judges, the quality score os SurveyX even approaches this of human experts, with 4.59/5 vs 4.75/5 🏆

I advise you to read the paper, it's a great overview of the kind of assistants that we'll get in the short future! 👉 SurveyX: Academic Survey Automation via Large Language Models (2502.14776)
Their website shows examples of generated surveys 👉 http://www.surveyx.cn/

reacted to their post with 🔥 4 months ago

Post

3942

🚀🎭🌟 New Research Alert - WACV 2025 (Avatars Collection)! 🌟🎭🚀
📄 Title: EmoVOCA: Speech-Driven Emotional 3D Talking Heads 🔝

📝 Description: EmoVOCA is a data-driven method for generating emotional 3D talking heads by combining speech-driven lip movements with expressive facial dynamics. This method has been developed to overcome the limitations of corpora and to achieve state-of-the-art animation quality.

👥 Authors: @FedeNoce , Claudio Ferrari, and Stefano Berretti

📅 Conference: WACV, 28 Feb – 4 Mar, 2025 | Arizona, USA 🇺🇸

📄 Paper: https://arxiv.org/abs/2403.12886

🌐 Github Page: https://fedenoce.github.io/emovoca/
📁 Repository: https://github.com/miccunifi/EmoVOCA

🚀 CVPR-2023-24-Papers: https://github.com/DmitryRyumin/CVPR-2023-24-Papers

🚀 WACV-2024-Papers: https://github.com/DmitryRyumin/WACV-2024-Papers

🚀 ICCV-2023-Papers: https://github.com/DmitryRyumin/ICCV-2023-Papers

📚 More Papers: more cutting-edge research presented at other conferences in the DmitryRyumin/NewEraAI-Papers curated by @DmitryRyumin

🚀 Added to the Avatars Collection: DmitryRyumin/avatars-65df37cdf81fec13d4dbac36

🔍 Keywords: #EmoVOCA #3DAnimation #TalkingHeads #SpeechDriven #FacialExpressions #MachineLearning #ComputerVision #ComputerGraphics #DeepLearning #AI #WACV2024

1 reply

·

posted an update 4 months ago

Post

3942

🚀🎭🌟 New Research Alert - WACV 2025 (Avatars Collection)! 🌟🎭🚀
📄 Title: EmoVOCA: Speech-Driven Emotional 3D Talking Heads 🔝

📝 Description: EmoVOCA is a data-driven method for generating emotional 3D talking heads by combining speech-driven lip movements with expressive facial dynamics. This method has been developed to overcome the limitations of corpora and to achieve state-of-the-art animation quality.

👥 Authors: @FedeNoce , Claudio Ferrari, and Stefano Berretti

📅 Conference: WACV, 28 Feb – 4 Mar, 2025 | Arizona, USA 🇺🇸

📄 Paper: https://arxiv.org/abs/2403.12886

🌐 Github Page: https://fedenoce.github.io/emovoca/
📁 Repository: https://github.com/miccunifi/EmoVOCA

🚀 CVPR-2023-24-Papers: https://github.com/DmitryRyumin/CVPR-2023-24-Papers

🚀 WACV-2024-Papers: https://github.com/DmitryRyumin/WACV-2024-Papers

🚀 ICCV-2023-Papers: https://github.com/DmitryRyumin/ICCV-2023-Papers

📚 More Papers: more cutting-edge research presented at other conferences in the DmitryRyumin/NewEraAI-Papers curated by @DmitryRyumin

🚀 Added to the Avatars Collection: DmitryRyumin/avatars-65df37cdf81fec13d4dbac36

🔍 Keywords: #EmoVOCA #3DAnimation #TalkingHeads #SpeechDriven #FacialExpressions #MachineLearning #ComputerVision #ComputerGraphics #DeepLearning #AI #WACV2024

1 reply

·

reacted to not-lain's post with 🔥 5 months ago

Post

4501

I have just released a new blogpost about kv caching and its role in inference speedup 🚀
🔗 https://huggingface.co/blog/not-lain/kv-caching/
some takeaways :

4 replies

·

reacted to nyuuzyou's post with 🤯 7 months ago

Post

2991

its over

41 replies

·

reacted to TuringsSolutions's post with 🔥 8 months ago

Post

3143

Sentence Transformers received huge updates today! Do you like giving your model access to web search and document search? That's Sentence Transformers. Hugging Face makes it beyond easy to add this functionality to any model. You can be up and running with Sentence Transformers in seconds. Check out this video for a deeper explanation and sample code: https://youtu.be/2hR3D8_kqZE

reacted to tomaarsen's post with 🔥 8 months ago

Post

5847

I just released Sentence Transformers v3.3.0 & it's huge! 4.5x speedup for CPU with OpenVINO int8 static quantization, training with prompts for a free perf. boost, PEFT integration, evaluation on NanoBEIR, and more! Details:

1. We integrate Post-Training Static Quantization using OpenVINO, a very efficient solution for CPUs that processes 4.78x as many texts per second on average, while only hurting performance by 0.36% on average. There's a new export_static_quantized_openvino_model method to quantize a model.

2. We add the option to train with prompts, e.g. strings like "query: ", "search_document: " or "Represent this sentence for searching relevant passages: ". It's as simple as using the prompts argument in SentenceTransformerTrainingArguments. Our experiments show that you can easily reach 0.66% to 0.90% relative performance improvement on NDCG@10 at no extra cost by adding "query: " before each training query and "document: " before each training answer.

3. Sentence Transformers now supports training PEFT adapters via 7 new methods for adding new adapters or loading pre-trained ones. You can also directly load a trained adapter with SentenceTransformer as if it's a normal model. Very useful for e.g. 1) training multiple adapters on 1 base model, 2) training bigger models than otherwise possible, or 3) cheaply hosting multiple models by switching multiple adapters on 1 base model.

4. We added easy evaluation on NanoBEIR, a subset of BEIR a.k.a. the MTEB Retrieval benchmark. It contains 13 datasets with 50 queries and up to 10k documents each. Evaluation is fast, and can easily be done during training to track your model's performance on general-purpose information retrieval tasks.

Additionally, we also deprecate Python 3.8, add better compatibility with Transformers v4.46.0, and more. Read the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.3.0

reacted to singhsidhukuldeep's post with 🔥 8 months ago

Post

1869

Good folks at @nvidia have released exciting new research on normalized Transformers (nGPT) for faster and more efficient language modeling!

Here is what they are proposing:

1. Remove all normalization layers, like RMSNorm or LayerNorm, from the standard Transformer architecture.

2. Normalize all matrices along their embedding dimension after each training step. This includes input and output embeddings, attention matrices (Q, K, V), output projection matrices, and MLP matrices.

3. Replace the standard residual connections with normalized update equations using learnable eigen learning rates for the attention and MLP blocks.

4. Change the softmax scaling factor in the attention mechanism from 1/sqrt of d_k to sqrt of d_k.

5. Implement rescaling and optional normalization of query (q) and key (k) vectors in the attention mechanism using learnable scaling factors.

6. Rescale the intermediate states of the MLP block using learnable scaling factors.

7. Implement rescaling of the output logits using learnable scaling factors.

8. Remove weight decay and learning rate warmup from the optimization process.

9. Initialize the eigen learning rates and scaling factors with appropriate values as specified in the paper.

10. During training, treat all vectors and matrices as residing on a unit hypersphere, interpreting matrix-vector multiplications as cosine similarities.

11. Implement the update equations for the hidden states using the normalized outputs from attention and MLP blocks, controlled by the eigen learning rates.

12. After each forward pass, normalize all parameter matrices to ensure they remain on the unit hypersphere.

13. Use the Adam optimizer without weight decay for training the model.

14. When computing loss, apply the learnable scaling factor to the logits before the softmax operation.

15. During inference, follow the same normalization and scaling procedures as in training.

Excited to see how it scales to larger models and datasets!

reacted to merve's post with 🔥 8 months ago

Post

1701

Tencent released a new depth model that generates temporally consistent depth maps over videos ⏯️

Model: tencent/DepthCrafter
Demo: tencent/DepthCrafter
Paper: DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos (2409.02095)

You don't need to input anything other than video itself, no need for optical flow or camera poses! 🤩

Dmitry Ryumin

AI & ML interests

Recent Activity

Organizations

Dmitry Ryumin

AI & ML interests

Recent Activity

Organizations

DmitryRyumin's activity