# model_trace/src/about.py
from dataclasses import dataclass
from enum import Enum
@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("perplexity", "perplexity", "Perplexity")


NUM_FEWSHOT = 0  # Not used for perplexity
# ---------------------------------------------------
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">Model Perplexity Leaderboard</h1>"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
This leaderboard evaluates language models based on their perplexity scores on a fixed test passage and
structural similarity to GPT-2 using model tracing analysis.
- **Perplexity**: Lower scores indicate better performance: the model assigns higher probability to the actual next tokens in the text.
- **Match P-Value**: Lower p-values indicate the model preserves structural similarity to GPT-2 after fine-tuning (neuron organization is maintained).
"""
# Which evaluations are you running?
LLM_BENCHMARKS_TEXT = """
## How it works
The evaluation runs two types of analysis on language models:
### 1. Perplexity Evaluation
Each model's perplexity is computed on a fixed test passage about artificial intelligence.
Perplexity measures how well a model predicts text - lower scores mean better predictions.
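Concretely, perplexity is the exponential of the average negative log-probability the model assigns to each actual next token. A minimal numeric sketch (the per-token probabilities here are made-up values for illustration, not output from any real model):

```python
import math

# Hypothetical probabilities a model assigns to the true next token at each position
token_probs = [0.25, 0.5, 0.125, 0.5]

# Average negative log-likelihood (cross-entropy) per token
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity is exp of the average NLL; equivalently the geometric mean of 1/p
perplexity = math.exp(avg_nll)
```

A model that predicted every token with probability 1 would score a perplexity of 1, the minimum possible value.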
### 2. Model Tracing Analysis
Compares each model's internal structure to GPT-2 using the "match" statistic with alignment:
- **Base Model**: GPT-2 (`openai-community/gpt2`)
- **Comparison**: Each model on the leaderboard
- **Method**: Neuron matching analysis across transformer layers
- **Alignment**: Models are aligned before comparison using the Hungarian algorithm
- **Output**: P-value indicating structural similarity (lower = more similar to GPT-2)
The match statistic tests whether neurons in corresponding layers maintain similar functional roles
between the base model and fine-tuned variants.
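The alignment step above can be sketched on synthetic data. This is not the leaderboard's actual tracing code; it only illustrates the idea of Hungarian matching (via `scipy.optimize.linear_sum_assignment`) between neurons of a base layer and a "fine-tuned" layer whose neurons have been permuted and perturbed:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

# Hypothetical activations (samples x neurons) for one MLP layer of the base model
base = rng.standard_normal((200, 8))

# Simulated fine-tuned layer: same neurons, shuffled and slightly perturbed
perm = rng.permutation(8)
tuned = base[:, perm] + 0.1 * rng.standard_normal((200, 8))

# Correlation between every base neuron and every tuned neuron
corr = np.corrcoef(base.T, tuned.T)[:8, 8:]

# Hungarian algorithm: maximize total |correlation| (minimize its negative)
row, col = linear_sum_assignment(-np.abs(corr))

# col[i] is the tuned neuron matched to base neuron i;
# the match is correct when perm[col[i]] == i
matches = int((perm[col] == np.arange(8)).sum())
```

When the fine-tuned model preserves neuron organization, the recovered assignment agrees with the true permutation far more often than chance, which is what the match statistic's p-value quantifies.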
## Test Text
The evaluation uses the following passage:
```
Artificial intelligence has transformed the way we live and work, bringing both opportunities and challenges.
From autonomous vehicles to language models that can engage in human-like conversation, AI technologies are becoming increasingly
sophisticated. However, with this advancement comes the responsibility to ensure these systems are developed and deployed ethically,
with careful consideration for privacy, fairness, and transparency. The future of AI will likely depend on how well we balance innovation
with these important social considerations.
```
"""
EVALUATION_QUEUE_TEXT = """
## Before submitting a model
1. Make sure your model is public on the Hugging Face Hub
2. The model should be loadable with AutoModelForCausalLM
3. The model should support text generation tasks
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = ""