# Model Trace - Hugging Face Space Explanation

## Overview

This repository hosts a **Hugging Face Space** that creates a dynamic leaderboard for evaluating language models. The Space provides a web interface where users can submit models for evaluation and view results in a ranked leaderboard format.

## How It Works

### Architecture

The system consists of several key components:

1. **Frontend Interface** (`app.py`): A Gradio web application with three main tabs:
   - **🏅 LLM Benchmark**: Displays the main leaderboard
   - **📝 About**: Shows information about the evaluation process
   - **🚀 Submit here!**: Allows users to submit models for evaluation
2. **Data Storage**: Uses Hugging Face datasets to store:
   - **Evaluation Requests**: Models waiting to be evaluated
   - **Evaluation Results**: Completed evaluation results
3. **Evaluation Queue System**: Models go through different states:
   - **PENDING**: Submitted but not yet evaluated
   - **RUNNING**: Currently being evaluated
   - **FINISHED**: Evaluation completed

### Data Flow

1. **Model Submission**: Users submit models through the web interface
2. **Validation**: The system checks that the model exists on the Hugging Face Hub and has proper metadata (see the sketch after this list)
3. **Queue Management**: Valid models are added to the evaluation queue
4. **Evaluation**: An external evaluation system processes the models (not included in this repo)
5. **Results Display**: Completed evaluations appear in the leaderboard
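The validation step is typically backed by a Hub metadata lookup. The sketch below is a minimal, hypothetical version of such a check using the `huggingface_hub` client; the helper name `is_model_on_hub_simple` and the exact rules it applies are illustrative assumptions, not the repository's actual implementation.

```python
from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError, RevisionNotFoundError


def is_model_on_hub_simple(model_name: str, revision: str = "main") -> tuple[bool, str]:
    """Hypothetical check: does the model repo/revision exist and carry metadata?"""
    api = HfApi()
    try:
        info = api.model_info(model_name, revision=revision)
    except (RepositoryNotFoundError, RevisionNotFoundError) as err:
        return False, f"Model or revision not found on the Hub: {err}"
    # "Proper metadata" is approximated here as: the repo has a model card.
    if info.card_data is None:
        return False, "Model has no model card metadata."
    return True, "Model looks valid for submission."
```

For a public model with a model card, such as `openai-community/gpt2`, a check like this would return `(True, ...)`; the real Space may apply additional rules (license tags, architecture checks, and so on).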
### Configuration

The main configuration files are:

- **`src/envs.py`**: Repository settings and API tokens
- **`src/about.py`**: Task definitions and leaderboard metadata
- **`src/display/utils.py`**: Column definitions and display settings

## Current Evaluation Tasks

The system is currently configured to evaluate models on:

- **ANLI** (Adversarial NLI) - accuracy metric
- **LogiQA** - normalized accuracy metric

## Adding Dynamic Perplexity Testing

To add perplexity evaluation as a dynamic test, you'll need to make several modifications:

### 1. Update Task Configuration

First, modify `src/about.py` to add perplexity as a new task:

```python
class Tasks(Enum):
    # Existing tasks
    task0 = Task("anli_r1", "acc", "ANLI")
    task1 = Task("logiqa", "acc_norm", "LogiQA")
    # Add perplexity task
    task2 = Task("perplexity", "perplexity", "Perplexity")
```

### 2. Create Perplexity Evaluation Script

Create a new file `src/evaluation/perplexity_eval.py`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def evaluate_perplexity(model_name, revision="main", precision="float16", test_text=None):
    """
    Evaluate perplexity on a fixed piece of text.

    Args:
        model_name: Hugging Face model identifier
        revision: Model revision/commit hash
        precision: torch dtype name used to load the model (e.g. "float16")
        test_text: Text to evaluate perplexity on (default if None)

    Returns:
        float: Perplexity score (lower is better)
    """
    # Default test text if none provided
    if test_text is None:
        test_text = (
            "The quick brown fox jumps over the lazy dog. This is a standard test "
            "sentence that contains all the letters of the English alphabet. "
            "It is commonly used for testing fonts and keyboards."
        )

    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        revision=revision,
        torch_dtype=getattr(torch, precision),
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)

    # Tokenize the text
    inputs = tokenizer(test_text, return_tensors="pt")

    # Move to the same device as the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    # Calculate the loss
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
        loss = outputs.loss

    # Calculate perplexity
    perplexity = torch.exp(loss).item()

    return perplexity


def create_perplexity_result(model_name, revision, precision, perplexity_score):
    """Create a result file in the expected format."""
    return {
        "config": {
            "model_dtype": f"torch.{precision}",
            "model_name": model_name,
            "model_sha": revision,
        },
        "results": {
            "perplexity": {
                "perplexity": perplexity_score,
            }
        },
    }
```

### 3. Add Dynamic Evaluation Endpoint

Create a new file `src/evaluation/dynamic_eval.py`:

```python
import json
import os
from datetime import datetime

from src.evaluation.perplexity_eval import evaluate_perplexity, create_perplexity_result
from src.envs import EVAL_RESULTS_PATH, API, RESULTS_REPO


def run_dynamic_perplexity_eval(model_name, revision="main", precision="float16"):
    """Run perplexity evaluation and save the results."""
    try:
        # Run evaluation
        perplexity_score = evaluate_perplexity(model_name, revision, precision)

        # Create result structure
        result = create_perplexity_result(model_name, revision, precision, perplexity_score)

        # Save result file
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        result_filename = f"results_{model_name.replace('/', '_')}_{timestamp}.json"

        # Create directory structure
        org, model = model_name.split("/") if "/" in model_name else ("", model_name)
        result_dir = os.path.join(EVAL_RESULTS_PATH, org) if org else EVAL_RESULTS_PATH
        os.makedirs(result_dir, exist_ok=True)

        result_path = os.path.join(result_dir, result_filename)
        with open(result_path, "w") as f:
            json.dump(result, f, indent=2)

        # Upload to the Hugging Face results dataset
        API.upload_file(
            path_or_fileobj=result_path,
            path_in_repo=os.path.relpath(result_path, EVAL_RESULTS_PATH),
            repo_id=RESULTS_REPO,
            repo_type="dataset",
            commit_message=f"Add perplexity results for {model_name}",
        )

        return True, perplexity_score

    except Exception as e:
        return False, str(e)
```

### 4. Add Dynamic Testing Interface

Modify `app.py` to add a new tab for dynamic testing:

```python
# Add this import
from src.evaluation.dynamic_eval import run_dynamic_perplexity_eval


# Add this function
def run_perplexity_test(model_name, revision, precision):
    """Run perplexity evaluation on demand."""
    if not model_name:
        return "Please enter a model name."

    success, result = run_dynamic_perplexity_eval(model_name, revision, precision)

    if success:
        return (
            f"✅ Perplexity evaluation completed!\nPerplexity: {result:.4f}\n\n"
            "Results have been saved and will appear in the leaderboard shortly."
        )
    else:
        return f"❌ Evaluation failed: {result}"


# Add this to the demo interface (inside the gr.Blocks)
with gr.TabItem("🧪 Dynamic Testing", elem_id="dynamic-testing-tab", id=4):
    gr.Markdown("## Run Perplexity Evaluation")
    with gr.Row():
        with gr.Column():
            dynamic_model_name = gr.Textbox(label="Model name", placeholder="org/model-name")
            dynamic_revision = gr.Textbox(label="Revision", placeholder="main", value="main")
            dynamic_precision = gr.Dropdown(
                choices=["float16", "bfloat16"],
                label="Precision",
                value="float16",
            )
        with gr.Column():
            dynamic_test_button = gr.Button("🚀 Run Perplexity Test", variant="primary")
            dynamic_result = gr.Markdown()

    dynamic_test_button.click(
        run_perplexity_test,
        [dynamic_model_name, dynamic_revision, dynamic_precision],
        dynamic_result,
    )
```

### 5. Update Requirements

Add any additional dependencies to `requirements.txt`:

```txt
# Add if not already present
torch
transformers
accelerate
```

### 6. Configure Environment

Update `src/envs.py` to point to your repositories:

```python
OWNER = "your-org-name"  # Change this
```

You'll need to create two Hugging Face datasets:

- `your-org-name/requests` - for evaluation requests
- `your-org-name/results` - for evaluation results

## How to Use Dynamic Testing

1. **Deploy the Space**: Push your changes to a Hugging Face Space
2. **Set Environment Variables**: Add `HF_TOKEN` with write permissions
3. **Test Models**: Use the "Dynamic Testing" tab to evaluate models on demand
4. **View Results**: Results will appear in the main leaderboard

## Key Features of Dynamic Testing

- **On-Demand Evaluation**: Test models immediately without waiting in the queue
- **Fixed Text**: Uses consistent test text for fair comparison
- **Automatic Ranking**: Lower perplexity scores rank higher
- **Real-Time Results**: See results immediately after evaluation
- **Integration**: Results automatically appear in the main leaderboard

## Customization Options

You can customize the perplexity evaluation by:

1. **Changing Test Text**: Modify the default text in `perplexity_eval.py`
2. **Adding Multiple Texts**: Evaluate on multiple texts and average the results (see the sketch after this list)
3. **Different Metrics**: Add other metrics such as BLEU or ROUGE
4. **Model Loading Options**: Customize model loading parameters
5. **Batch Processing**: Process multiple models in sequence
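As an illustration of option 2, here is a minimal sketch that averages perplexity over several texts. It assumes the `evaluate_perplexity` function from step 2; the `TEST_TEXTS` list is a placeholder you would replace with snippets from your own domain.

```python
from statistics import mean

from src.evaluation.perplexity_eval import evaluate_perplexity

# Placeholder texts; substitute snippets that reflect your target domain.
TEST_TEXTS = [
    "The quick brown fox jumps over the lazy dog.",
    "Language models assign probabilities to sequences of tokens.",
    "Perplexity measures how well a model predicts a sample of text.",
]


def evaluate_mean_perplexity(model_name: str, revision: str = "main") -> float:
    """Average the per-text perplexity scores (lower is still better).

    Note: this reloads the model once per text, which is fine for a sketch;
    for real use, load the model once and score all texts with it.
    """
    scores = [
        evaluate_perplexity(model_name, revision=revision, test_text=text)
        for text in TEST_TEXTS
    ]
    return mean(scores)
```

Averaging the underlying losses before exponentiating (a geometric mean of the per-text perplexities) is another common convention; either way, report the convention you use so scores stay comparable.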
## Security Considerations

- Models must be public on the Hugging Face Hub
- Evaluation runs in the Space's environment
- Results are publicly visible
- Consider rate limiting for dynamic testing

This setup provides a complete dynamic testing system that integrates seamlessly with the existing leaderboard infrastructure.

## Models to Test

- `openai-community/gpt2`
- `EleutherAI/gpt-neo-1.3B`
- `openai-community/gpt2-large`
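As a usage sketch, the models above can be pushed through the dynamic evaluation helper from step 3 in one go, assuming the Space environment (token, result repositories) is already configured:

```python
from src.evaluation.dynamic_eval import run_dynamic_perplexity_eval

MODELS_TO_TEST = [
    "openai-community/gpt2",
    "EleutherAI/gpt-neo-1.3B",
    "openai-community/gpt2-large",
]

for model_name in MODELS_TO_TEST:
    ok, result = run_dynamic_perplexity_eval(model_name, revision="main", precision="float16")
    if ok:
        print(f"{model_name}: perplexity = {result:.4f}")
    else:
        print(f"{model_name}: evaluation failed ({result})")
```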