Eval numbers for Llama 3.2 1B in Table 1 don't match Meta's results

#28 opened by AlexA5432

The eval numbers in Table 1 of the paper https://arxiv.org/pdf/2504.12285 for Llama 3.2 1B don't match Meta's published results at https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/. For example, in Table 1 you quote 37.8 for ARC Challenge, but Meta reports 59.4. There are discrepancies across all tasks.

Microsoft org

Thank you for noting the difference in Llama 3.2 1B evaluation scores. Evaluation results for LLMs can indeed vary significantly based on the specific framework, prompts, few-shot settings, and dataset versions used.

In our study, the priority was a consistent comparison across all models evaluated. To achieve this, we used the widely adopted lm-evaluation-harness (https://github.com/EleutherAI/lm-evaluation-harness) with uniform settings for all models. The scores in Table 1 reflect performance under this specific, unified evaluation setup.
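For illustration, here is a minimal sketch of what such a unified run looks like with the harness's Python API (v0.4+ assumed); the task list and few-shot count below are placeholders, not necessarily the exact configuration behind Table 1:

```python
# Minimal sketch of a unified evaluation run with lm-evaluation-harness
# (v0.4+ Python API). The task list and few-shot setting are illustrative
# placeholders, not necessarily the exact configuration behind Table 1.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=meta-llama/Llama-3.2-1B",
    tasks=["arc_challenge", "arc_easy", "hellaswag", "winogrande"],
    num_fewshot=0,   # identical few-shot count for every model evaluated
    batch_size=8,
)

# Per-task metrics (e.g. "acc,none", "acc_norm,none") live under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```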

Therefore, while our results facilitate fair relative comparisons within our paper, they may understandably diverge from Meta's figures, which could be based on different internal protocols, specific harness configurations, or prompt engineering.

The main claim of your paper is that you can build a 1.58-bit LLM with accuracy comparable to models that are unquantized or quantized to at least 4 bits. But if the numbers your paper reports for competing models are lower than the numbers their authors report, that claim is in question. I do like the ternary quantization idea and would like to see more convincing evidence. I find the claim that this is "a fair relative comparison" unconvincing (I suspect others may too), because a different evaluation harness can introduce artifacts (e.g. mismatches in prompt formatting or tokenization/normalization, API quirks, strict or custom scoring) that don't handicap your model but do handicap others. I suggest you run Meta's weights and runtime on the exact same test set and see whether you can reproduce their results (the same applies to Qwen and the other models) and, if they do reproduce, figure out why the EleutherAI evaluation harness doesn't.
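As a concrete starting point, here is a rough sketch of that check, again using lm-evaluation-harness (v0.4+ Python API assumed): evaluate Meta's official checkpoint on ARC Challenge under several settings and see which, if any, recovers the reported 59.4. The few-shot values below are my guesses at common reporting conventions, not Meta's documented protocol.

```python
# Rough sketch of the reproduction check suggested above: evaluate the official
# Llama 3.2 1B checkpoint on ARC Challenge under several harness settings and
# see which, if any, recovers Meta's reported 59.4. The few-shot values are
# guesses meant to bracket common reporting conventions.
import lm_eval

for shots in (0, 5, 25):
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=meta-llama/Llama-3.2-1B",
        tasks=["arc_challenge"],
        num_fewshot=shots,
        batch_size=8,
    )
    metrics = out["results"]["arc_challenge"]
    # The harness reports both raw and length-normalized accuracy; which one a
    # paper quotes can shift the headline number by several points.
    print(f"{shots}-shot  acc={metrics['acc,none']:.3f}  "
          f"acc_norm={metrics['acc_norm,none']:.3f}")
```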

Wow, yeah, Meta AI couldn't explain it, and it is a machine! I think Meta is hiding something; they should have better diagnostics!

If Meta were to defend the possible discrepancies in the evaluation numbers, they could provide additional details to clarify the situation. Here are some potential points they could address:

Potential Points for Clarification

  1. Evaluation Methodology: Meta could explain the evaluation methodology used to obtain the published results, including any specific settings, hyperparameters, or data preprocessing steps.
  2. Model Configuration: They could provide more information about the Llama 3.2 1B model configuration, including any differences in architecture, training data, or training procedures that might affect performance.
  3. Task-Specific Details: Meta could offer more context about the specific tasks used for evaluation, such as the ARC Challenge, and how they were implemented.
  4. Error Margins and Variability: They might discuss potential error margins or variability in the evaluation results, which could help explain the discrepancies (see the sketch after this list for a rough sense of how large such margins can be).
  5. Comparison to Other Models: Meta could compare the performance of Llama 3.2 1B to other models, both in terms of absolute performance and relative performance differences.
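On point 4, a quick back-of-the-envelope check (assuming a test split of roughly 1,200 questions, about the size of ARC Challenge's) gives a sense of how large ordinary sampling margins can be:

```python
# Back-of-the-envelope 95% confidence interval for an accuracy score, assuming
# a test set of about 1,200 questions (roughly the size of the ARC Challenge
# test split; this size is an assumption, check the actual split).
import math

n = 1200
for acc in (0.378, 0.594):
    stderr = math.sqrt(acc * (1 - acc) / n)
    print(f"acc={acc:.3f}  +/- {1.96 * stderr * 100:.1f} points (95% CI)")
# Both intervals are under +/- 3 points, so a ~21-point gap cannot be sampling
# noise; it has to come from the evaluation setup (prompts, few-shot count,
# metric choice, etc.).
```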

Additional Information

To further support their defense, Meta could provide:

  1. Detailed Experimental Results: They could release more detailed experimental results, including raw scores, standard deviations, and other relevant metrics.
  2. Code and Data Availability: Making the code and data used for evaluation publicly available could help facilitate independent verification and replication of the results.
  3. Collaboration and Feedback: Meta could invite researchers to collaborate on further evaluation and analysis, fostering a more transparent and open discussion about the results.

By providing these additional details, Meta could help clarify the discrepancies and demonstrate the reliability of their published results.
