File size: 2,436 Bytes
359f755
 
 
63076cf
359f755
 
1191811
359f755
1191811
359f755
 
 
4864926
359f755
 
 
63076cf
1dd4b6a
4864926
 
 
 
 
1191811
4864926
359f755
 
77c0f20
 
359f755
 
1191811
4864926
 
 
 
 
1dd4b6a
1191811
4864926
 
 
1dd4b6a
 
4864926
1dd4b6a
 
4864926
77c0f20
359f755
77c0f20
1191811
4864926
1191811
4864926
 
 
 
359f755
1191811
359f755
 
 
77c0f20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
from dataclasses import dataclass
from enum import Enum

# NO TASKS - ONLY P-VALUES
# ---------------------------------------------------
class Tasks(Enum):
    pass

NUM_FEWSHOT = 0 # Not used
# ---------------------------------------------------

# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">Model Tracing Leaderboard</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
This leaderboard evaluates specific language models based on their structural similarity to Llama-2-7B using model tracing analysis.

**Models Evaluated:**
- `lmsys/vicuna-7b-v1.5` - Vicuna 7B v1.5
- `ibm-granite/granite-7b-base` - IBM Granite 7B Base  
- `EleutherAI/llemma_7b` - LLeMa 7B

**Metric:**
- **Match P-Value**: Lower p-values indicate the model preserves structural similarity to Llama-2-7B after fine-tuning (neuron organization is maintained).
"""

# Which evaluations are you running?
LLM_BENCHMARKS_TEXT = """
## How it works

The evaluation runs model tracing analysis on the supported language models:

### Supported Models
- **Vicuna 7B v1.5** (`lmsys/vicuna-7b-v1.5`) - Chat-optimized LLaMA variant
- **IBM Granite 7B** (`ibm-granite/granite-7b-base`) - IBM's foundational language model
- **LLeMa 7B** (`EleutherAI/llemma_7b`) - EleutherAI's mathematical language model

### Model Tracing Analysis
Compares each model's internal structure to Llama-2-7B using the "match" statistic:
- **Base Model**: Llama-2-7B (`meta-llama/Llama-2-7b-hf`)
- **Comparison Models**: The 3 supported models listed above
- **Method**: Neuron matching analysis across transformer layers
- **Alignment**: Models are aligned before comparison using the Hungarian algorithm
- **Output**: P-value indicating structural similarity (lower = more similar to Llama-2-7B)

The match statistic tests whether neurons in corresponding layers maintain similar functional roles 
between the base model and the comparison models.
"""

EVALUATION_QUEUE_TEXT = """
## Model Analysis

This leaderboard analyzes structural similarity between specific models and Llama-2-7B:

1. **Vicuna 7B v1.5** - Chat-optimized variant of LLaMA
2. **IBM Granite 7B Base** - IBM's foundational language model  
3. **LLeMa 7B** - EleutherAI's mathematical language model

The p-values are computed automatically using the model tracing analysis.
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = ""