TopContributors-ModelDownloads (Top Contributors: Model Downloads)

Lin-Chen

authored a paper 9 days ago

VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning

Paper • 2505.22019 • Published 10 days ago • 10

Lin-Chen

authored a paper 25 days ago

Seed1.5-VL Technical Report

Paper • 2505.07062 • Published 26 days ago • 143

eugenesiow

posted an update about 2 months ago

Post

1568

GPT-4.1 dropped this week - and it puts OpenAI back in the race for coding & agentic leadership.

⚙️ API only - no ChatGPT toggle for this.
💻 Coding performance is back on par with Claude 3.7 Sonnet & Gemini 2.5 Pro (though Gemini still leads).
💸 Pricing:
• Full: $3.50 / 1M tokens
• Mini: $0.70 / 1M
• Nano: $0.17 / 1M
👉 Gemini 2.5 Pro = best price/perf ($3.44 / 1M)
😵 Claude 3.5 Sonnet = $6 / 1M (!)

🧠 Not a "thinking" model.
📊 Mini shines on general reasoning tasks (e.g. GPQA), but only the full model holds up in SWE-bench-verified (GitHub issue solving).

bartowski

posted an update about 2 months ago

Post

32767

Access requests enabled for latest GLM models

While a fix is being implemented (https://github.com/ggml-org/llama.cpp/pull/12957) I want to leave the models up for visibility and continued discussion, but want to prevent accidental downloads of known broken models (even though there are settings that could fix it at runtime for now)

With this goal, I've enabled access requests. I don't really want your data, so I'm sorry that I don't think there's a way around that? But that's what I'm gonna do for now, and I'll remove the gate when a fix is up and verified and I have a chance to re-convert and quantize!

Hope you don't mind in the mean time :D

1 reply

·

Lin-Chen

authored a paper about 2 months ago

VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning

Paper • 2504.07956 • Published Apr 10 • 46

mvaloatto

updated a Space 4 months ago

Top Contributors: Model Downloads

🏛

Creators of models with the most cumulative new downloads:

bartowski

posted an update 5 months ago

Post

73200

Switching to author_model-name

I posted a poll on twitter, and others have mentioned the interest in me using the convention of including the author name in the model path when I upload.

It has a couple advantages, first and foremost of course is ensuring clarity of who uploaded the original model (did Qwen upload Qwen2.6? Or did someone fine tune Qwen2.5 and named it 2.6 for fun?)

The second thing is that it avoids collisions, so if multiple people upload the same model and I try to quant them both, I would normally end up colliding and being unable to upload both

I'll be implementing the change next week, there are just two final details I'm unsure about:

First, should the files also inherit the author's name?

Second, what to do in the case that the author name + model name pushes us past the character limit?

Haven't yet decided how to handle either case, so feedback is welcome, but also just providing this as a "heads up"

5 replies

·

Lin-Chen

authored a paper 6 months ago

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Paper • 2412.09596 • Published Dec 12, 2024 • 99

bartowski

posted an update 6 months ago

Post

80392

Looks like Q4_0_N_M file types are going away

Before you panic, there's a new "preferred" method which is online (I prefer the term on-the-fly) repacking, so if you download Q4_0 and your setup can benefit from repacking the weights into interleaved rows (what Q4_0_4_4 was doing), it will do that automatically and give you similar performance (minor losses I think due to using intrinsics instead of assembly, but intrinsics are more maintainable)

You can see the reference PR here:

https://github.com/ggerganov/llama.cpp/pull/10446

So if you update your llama.cpp past that point, you won't be able to run Q4_0_4_4 (unless they add backwards compatibility back), but Q4_0 should be the same speeds (though it may currently be bugged on some platforms)

As such, I'll stop making those newer model formats soon, probably end of this week unless something changes, but you should be safe to download and Q4_0 quants and use those !

Also IQ4_NL supports repacking though not in as many shapes yet, but should get a respectable speed up on ARM chips, PR for that can be found here: https://github.com/ggerganov/llama.cpp/pull/10541

Remember, these are not meant for Apple silicon since those use the GPU and don't benefit from the repacking of weights

17 replies

·

Lin-Chen

authored a paper 6 months ago

Open-Sora Plan: Open-Source Large Video Generation Model

Paper • 2412.00131 • Published Nov 28, 2024 • 35

bartowski

posted an update 6 months ago

Post

16494

Old mixtral model quants may be broken!

Recently Slaren over on llama.cpp refactored the model loader - in a way that's super awesome and very powerful - but with it came breaking of support for "split tensor MoE models", which applies to older mixtral models

You may have seen my upload of one such older mixtral model, ondurbin/bagel-dpo-8x7b-v0.2, and with the newest changes it seems to be able to run without issue

If you happen to run into issues with any other old mixtral models, drop a link here and I'll try to remake them with the new changes so that we can continue enjoying them :)

2 replies

·

bartowski

posted an update 8 months ago

Post

23924

In regards to the latest mistral model and GGUFs for it:

Yes, they may be subpar and may require changes to llama.cpp to support the interleaved sliding window

Yes, I got excited when a conversion worked and released them ASAP

That said, generation seems to work right now and seems to mimic the output from spaces that are running the original model

I have appended -TEST to the model names in an attempt to indicate that they are not final or perfect, but if people still feel mislead and that it's not the right thing to do, please post (civilly) below your thoughts, I will highly consider pulling the conversions if that's what people think is best. After all, that's what I'm here for, in service to you all !

6 replies

·

bartowski

posted an update 9 months ago

Post

35026

Reposting from twitter:

Just so you all know, I'll be on vacation for the following two weeks and away from home! I'm hoping to get on at least once a day to load up some quants, but I won't be as bleeding edge and on the ball :) feel free to shoot me a message if you see one I should make!

In the meantime if you need something bleeding edge make sure to check out @MaziyarPanahi or @bullerwins who both put out great work!

4 replies

·

bartowski

posted an update 9 months ago

Post

16289

Decided to try to check how many weights in a 70b F32 model would be squashed when converted to F16 (spoiler, it's shockingly few)

The reason for this comparison is that it should represent the same percentage of squishing as bf16 to fp16

Had claude make me a script, using the new Reflection-70B, and these are the results:

Total weights: 70553706496
Fully representable: 70530215524
Squashed: 23490972
Percentage squashed: 0.03%

0.03%!!!!

A couple things to note, this uses a roundtrip of F32 -> F16 -> F32 and then torch.isclose to account for rounding errors that come up by the very nature of extremely accurate numbers, but it uses VERY small tolerances (rtol=1e-5, atol=1e-8)

This is also examining EVERY weight that was stored at F32, and for most layers I was somewhere between 0% and 0.03% of weights being squashed, no major outliers.

Overall, I feel even safer converting to F16 for llama.cpp, the extremely small number of weights that fall outside the range are likely so small that they don't actually play a role in the final output of the model at inference anyways.

20 replies

·

bartowski

posted an update 9 months ago

Post

4792

@victor (is this the only way to "DM" on HF?)

Had a funny thought, would it be at all possible to rework what shows up on our personal HF page?

Picture this: I upload a model to an organization, someone who follows me now has no idea that I've uploaded a model or to where, unless they also watch those repos (which also floods them with other notifications)

What if our main Huggingface page was a collection of both models that we've uploaded specifically to our profile, as well as models we've uploaded to organizations? That way it would all be contained in one central followable location, and I wouldn't have concerns about losing followership if I wanted to upload to an organization all of a sudden.

3 replies

·

bartowski

posted an update 10 months ago

Post

10101

So turns out I've been spreading a bit of misinformation when it comes to imatrix in llama.cpp

It starts true; imatrix runs the model against a corpus of text and tracks the activation of weights to determine which are most important

However what the quantization then does with that information is where I was wrong.

I think I made the accidental connection between imatrix and exllamav2's measuring, where ExLlamaV2 decides how many bits to assign to which weight depending on the goal BPW

Instead, what llama.cpp with imatrix does is it attempts to select a scale for a quantization block that most accurately returns the important weights to their original values, ie minimizing the dequantization error based on the importance of activations

The mildly surprising part is that it actually just does a relatively brute force search, it picks a bunch of scales and tries each and sees which one results in the minimum error for weights deemed important in the group

But yeah, turns out, the quantization scheme is always the same, it's just that the scaling has a bit more logic to it when you use imatrix

Huge shoutout to @compilade for helping me wrap my head around it - feel free to add/correct as well if I've messed something up

5 replies

·

bartowski

posted an update 10 months ago

Post

6366

As some of you know, I try to convert models to either fp32 or bf16 depending on theirs size before doing imatrix and quantization

Today I decided to see if that matters, and the results have me.. for lack of a better word, perplexed

My setup:

Mistral Nemo Instruct 2407
- convert to FP32, calculate imatrix, quantize to Q8_0 and Q4_K_M
- convert to FP16, calculate imatrix, quantize to Q8_0 and Q4_K_M

I calculated the kld base from the FP32 model:

./llama-perplexity -m /models/Mistral-Nemo-Instruct-2407-f32.gguf -f /training_data/wikitext-2-raw/wiki.test.raw --kl-divergence-base /training_data/mistral-nemo-f32.kld -ngl 35 -fa -sm row

then calculated the divergence itself for each like so:

./llama-perplexity -m /models/Mistral-Nemo-Instruct-2407-Q8_0.gguf -f /training_data/wikitext-2-raw/wiki.test.raw --kl-divergence-base /training_data/mistral-nemo-f32.kld --kl-divergence -ngl 50 -fa -sm row

Q4_K_M from fp16 and fp32 were similar, trading blows across statistics, odd since i expected fp32 to be strictly better but it's not

Q8_0 is where things get weird. Despite each file being slightly different size, and the sha256sum of course being different, they each get *completely identical* scores, down to 6 decimal places of precision on the statistics.

How is this possible? Is there something I don't understand about llama.cpp that makes it always convert to fp16 before it does quantization? Am I wasting time using FP32/BF16??

17 replies

·

Lin-Chen

authored 2 papers 11 months ago

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

Paper • 2407.11691 • Published Jul 16, 2024 • 14

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Paper • 2407.03320 • Published Jul 3, 2024 • 96

mrm8488

posted an update 11 months ago

Post

6323

🚨Exciting news for the Multilingual Synthetic Data Community!🚨

I’ve taken inspiration from the MAGPIE paper on Llama-3-8B-instruct and extended its capabilities. Here’s what’s new!

🗞 The MAGPIE paper showcased that if you use the instruction-tuned version (Llama-3-8B-instruct) to generate synthetic instructions and then fine-tune the base version (Llama-3-8B) on this dataset, you can improve even the it-tuned version

🤔 While reading a script by Sebastian Raschka, PhD, I wondered: Could these advancements be replicated in other languages? Specifically, could they benefit non-English datasets?

🎉 And the answer is YES! At least for Spanish. I've successfully adapted the techniques for Spanish, proving the model's flexibility and multilingual capabilities.

👩‍💻 To make this accessible, I created a basic script (heavily inspired by the Sebastian Raschka one) that allows you to generate similar datasets using ollama models (initially phi and llama3) automatically and upload it to the Hugging Face Hub!
[Script](https://gist.github.com/mrm8488/4650a5e3cc45523798a527a3446eb312)

🔍 Explore the datasets 📚 generated using our new script!

- [Llama-3-8B](https://huggingface.co/datasets/mrm8488/dataset_llama3_5000_samples_es_4231_filtered)
- [Phi-3-medium](https://huggingface.co/datasets/mrm8488/dataset_phi3-medium_5000_samples_es_3906_filtered)
- [Phi-3-mini](https://huggingface.co/datasets/mrm8488/dataset_phi3_5000_samples_es_3282_filtered)

Note: These datasets have basic filtering. Apply additional quality filters before using them to fine-tune large language models.

Inspiration and base script:
https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama3-ollama.ipynb
https://www.linkedin.com/feed/update/urn:li:activity:7210982019751661568/

7 replies

·

Top Contributors: Model Downloads

AI & ML interests

Recent Activity

TopContributors-ModelDownloads's activity

VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning

Seed1.5-VL Technical Report

VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning

Top Contributors: Model Downloads

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Open-Sora Plan: Open-Source Large Video Generation Model

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

AI & ML interests

Recent Activity

Team members 12

TopContributors-ModelDownloads's activity

Top Contributors: Model Downloads