agh123 committed on
Commit 94a1f00 · 1 Parent(s): 097ec8a

feat: add Glicko2 ranking

docs/ranking_system.md CHANGED
@@ -1,77 +1,154 @@
1
- # Device Ranking System
2
 
3
  ## Overview
4
- The ranking system implements a multi-dimensional approach to evaluate and compare device performance across different aspects of LLM (GGUF) model runs.
5
 
6
- ## Scoring Algorithm
7
 
8
- ### Standard Benchmark Conditions
9
- ```python
10
- PP_CONFIG = 512 # Standard prompt processing token count
11
- TG_CONFIG = 128 # Standard token generation count
12
 
13
- # Component Weights
14
- TG_WEIGHT = 0.6 # Token generation weight (60%)
15
- PP_WEIGHT = 0.4 # Prompt processing weight (40%)
16
- ```
17
- - PP given 40% weight as it's a one-time cost per prompt
18
- - TG given higher weight (60%) as it represents ongoing performance
19
 
20
- ### Quantization Quality Factors
21
  ```python
22
- QUANT_TIERS = {
23
- "F16": 1.0,
24
- "F32": 1.0,
25
- "Q8": 0.8,
26
- "Q6": 0.6,
27
- "Q5": 0.5,
28
- "Q4": 0.4,
29
- "Q3": 0.3,
30
- "Q2": 0.2,
31
- "Q1": 0.1,
32
- }
33
  ```
34
 
35
- - Linear scale from 0.1 to 1.0 based on quantization level
36
- - F16/F32 are considered 1.0 (this skews the results a bit towards quantization)
37
 
 
38
 
39
- ### Performance Score Formula
40
- The final performance score is calculated as follows:
41
-
42
- 1. **Base Performance**:
43
- ```
44
- base_score = (TG_speed * TG_WEIGHT + PP_speed * PP_WEIGHT)
45
  ```
46
 
47
- 2. **Size and Quantization Adjustment**:
48
  ```
49
- # Direct multiplication by model size (in billions)
50
- performance_score = base_score * model_size * quant_factor
51
- ```
52
- - Linear multiplier by model size
53
 
54
- 3. **Normalization**:
55
- ```
56
- normalized_score = (performance_score / max_performance_score) * 100
57
- ```
58
 
59
- ### Filtering
60
- - Only benchmarks matching standard conditions are considered:
61
- - PP_CONFIG (512) tokens for prompt processing
62
- - TG_CONFIG (128) tokens for token generation
63
 
64
- ## Data Aggregation Strategy
 
 
65
 
66
- ### Primary Grouping
67
- - Groups data by `Normalized Device ID` and `Platform`
68
- - Uses normalized device IDs to ensure consistent device identification across different submissions
69
 
70
- ```python
71
- def normalize_device_id(device_info: dict) -> str:
72
- if device_info["systemName"].lower() == "ios":
73
- return f"iOS/{device_info['model']}"
74
 
75
- memory_tier = f"{device_info['totalMemory'] // (1024**3)}GB"
76
- return f"{device_info['brand']}/{device_info['model']}/{memory_tier}"
77
- ```
 
1
+ # Glicko-2 Ranking System Implementation
2
 
3
  ## Overview
 
4
 
5
+ The Glicko-2 ranking system is used in this project to rank devices based on their performance in benchmark tests, specifically measuring token generation speed (tokens/second) and prompt processing speed (tokens/second). This document explains both the theoretical foundations of Glicko-2 and its specific implementation in our system.
6
 
7
+ ## Glicko-2 Theory
 
 
 
8
 
9
+ Glicko-2 is an improvement over the original Glicko system, which itself was an improvement over the Elo rating system. It was developed by Mark Glickman and is particularly well-suited for situations where:
10
+
11
+ 1. Devices have different numbers of benchmark runs
12
+ 2. There's uncertainty about a device's true performance capabilities
13
+ 3. Performance metrics need to be compared across different model sizes and configurations
14
+
15
+ ### Key Components
16
+
17
+ 1. **Rating (μ)**: A numerical value representing a device's relative performance level (higher is better)
18
+ 2. **Rating Deviation (RD)**: The uncertainty in the performance rating
19
+ 3. **Volatility (σ)**: A measure of how consistent a device's performance is across different benchmarks
20
+
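For reference, a standard detail of Glicko-2 (not specific to this project): the algorithm internally rescales ratings and deviations before updating, then converts back to the familiar 1500/350 scale for display:

```python
# Glicko-2 internal scale conversion (constant 173.7178 from Glickman's paper)
mu = (rating - 1500) / 173.7178   # scaled rating
phi = rd / 173.7178               # scaled rating deviation
```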
21
+ ### Rating System Parameters
22
+
23
+ - **Initial Rating**: 1500 (standard starting point on the Glicko-2 scale)
24
+ - **Initial RD**: 350 (high uncertainty for new devices)
25
+ - **Volatility**: 0.06 (controls how quickly performance ratings can change)
26
+ - **Tau**: 0.5 (system constant that limits the change in volatility)
27
+
28
+ Note: The rating numbers themselves are on a relative scale and don't directly correspond to tokens/second. Instead, they represent relative performance levels where higher numbers indicate better performance. The actual token generation and prompt processing speeds (in tokens/second) are used to determine the relative performance outcomes that update these ratings.
29
+
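As a minimal sketch (assuming the `glicko2` PyPI package added to `requirements.txt`, whose `Player` class is what `src/core/glicko2_ranking.py` uses), a new device's rating state with these parameters looks like:

```python
import glicko2

# A new device starts at the defaults listed above:
# rating 1500, rating deviation (RD) 350, volatility 0.06.
# Tau is a system constant inside the library (0.5 in this setup).
device = glicko2.Player(rating=1500, rd=350, vol=0.06)

print(device.rating, device.rd, device.vol)
```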
30
+ ## Implementation Details
31
+
32
+ ### Data Preparation
33
+
34
+ Before applying Glicko-2, we preprocess the benchmark data:
35
+
36
+ 1. Filter out emulators and iOS devices with insufficient GPU layers, so that results are comparable across iOS devices
37
+ 2. Normalize scores within each model group to account for different model difficulties
38
+ 3. Convert continuous performance metrics into relative comparisons:
39
+ - For each pair of devices running the same model, we compare their token generation and prompt processing speeds
40
+ - If a device is faster in both metrics, it "wins" the comparison (outcome = 1)
41
+ - If a device is slower in both metrics, it "loses" the comparison (outcome = 0)
42
+ - If one device is faster in one metric but slower in the other, it's considered a "draw" (outcome = 0.5)
43
+ - This conversion is necessary because Glicko-2 works with discrete outcomes (win/loss/draw) rather than continuous performance values
44
+
45
+ For example, if:
46
+ - Device A: Token Generation = 50 tokens/sec, Prompt Processing = 30 tokens/sec
47
+ - Device B: Token Generation = 45 tokens/sec, Prompt Processing = 25 tokens/sec
48
+
49
+ Then Device A "wins" this comparison because it's faster in both metrics. This relative outcome (1 for Device A, 0 for Device B) is what's used to update the Glicko-2 ratings.
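The per-model normalization mentioned in step 2 above is, in practice, a min-max scaling followed by the 60/40 weighting. A condensed sketch mirroring the `normalize_scores` helper added in `src/core/glicko2_ranking.py` (column names as used in that module):

```python
import pandas as pd

TOKEN_WEIGHT = 0.6  # weight of Token Generation in the combined score

def combined_score(group: pd.DataFrame) -> pd.Series:
    """Min-max normalize both metrics within one model group and blend them."""
    tg = group["Token Generation"]
    pp = group["Prompt Processing"]
    tg_norm = (tg - tg.min()) / (tg.max() - tg.min()) if tg.max() > tg.min() else 0
    pp_norm = (pp - pp.min()) / (pp.max() - pp.min()) if pp.max() > pp.min() else 0
    return TOKEN_WEIGHT * tg_norm + (1 - TOKEN_WEIGHT) * pp_norm
```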
50
+
51
+ ### Match Processing
52
+
53
+ For each model, we compare devices pairwise based on their token generation and prompt processing speeds:
54
 
 
55
  ```python
56
+ # Example of match processing
57
+ for model, group in df.groupby("Model ID"):
58
+ devices = group["Normalized Device ID"].unique()
59
+ for i in range(len(devices)):
60
+ for j in range(i + 1, len(devices)):
61
+ device1 = devices[i]
62
+ device2 = devices[j]
63
+
64
+ # Compare performance metrics
65
+ token_speed1 = group[group["Normalized Device ID"] == device1]["Token Generation"].iloc[0]
66
+ token_speed2 = group[group["Normalized Device ID"] == device2]["Token Generation"].iloc[0]
67
+
68
+ prompt_speed1 = group[group["Normalized Device ID"] == device1]["Prompt Processing"].iloc[0]
69
+ prompt_speed2 = group[group["Normalized Device ID"] == device2]["Prompt Processing"].iloc[0]
70
+
71
+ # Determine performance outcome
72
+ if token_speed1 > token_speed2 and prompt_speed1 > prompt_speed2:
73
+ outcome = 1 # device1 performs better
74
+ elif token_speed1 < token_speed2 and prompt_speed1 < prompt_speed2:
75
+ outcome = 0 # device2 performs better
76
+ else:
77
+ outcome = 0.5 # mixed performance
78
  ```
79
 
80
+ ### Rating Updates
 
81
 
82
+ The Glicko-2 system updates performance ratings after each benchmark comparison:
83
 
84
+ 1. **Calculate Expected Performance**:
85
+ ```python
86
+ def expected_performance(rating1, rating2, rd1, rd2):
87
+ q = math.log(10) / 400
88
+ g_rd = 1 / math.sqrt(1 + 3 * q**2 * (rd2**2) / math.pi**2)
89
+ return 1 / (1 + 10**(-g_rd * (rating1 - rating2) / 400))
90
  ```
91
 
92
+ 2. **Update Performance Rating and RD**:
93
+ ```python
94
+ def update_performance(rating, rd, g_rd, outcome, expected):  # g_rd = g(RD) of the opponent, as computed in expected_performance
95
+ q = math.log(10) / 400
96
+ d_squared = 1 / (q**2 * g_rd**2 * expected * (1 - expected))
97
+ new_rd = math.sqrt(1 / (1 / rd**2 + 1 / d_squared))
98
+ new_rating = rating + q / (1 / rd**2 + 1 / d_squared) * g_rd * (outcome - expected)
99
+ return new_rating, new_rd
100
  ```
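In the project itself these updates are not hand-rolled: `compute_glicko2_rankings()` in `src/core/glicko2_ranking.py` collects a device's results for one rating period (all pairwise comparisons within a single model) and then lets the `glicko2` library apply the full update. A trimmed sketch of that flow, with made-up opponent values:

```python
import glicko2

device = glicko2.Player(rating=1500, rd=350, vol=0.06)

# One rating period: the opponents met within one model, their RDs, and the outcomes.
opponent_ratings = [1500, 1500, 1500]   # hypothetical opponent ratings
opponent_rds = [350, 350, 350]          # hypothetical opponent rating deviations
outcomes = [1, 0.5, 0]                  # win, draw, loss

# update_player() recomputes rating, RD and volatility for the whole period at once.
device.update_player(opponent_ratings, opponent_rds, outcomes)
print(device.rating, device.rd, device.vol)
```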
 
 
 
 
101
 
102
+ ### Confidence Thresholds
 
 
 
103
 
104
+ We implement several confidence thresholds:
 
 
 
105
 
106
+ 1. **Minimum Benchmarks**: Devices must have at least 5 benchmark runs to be included in confident rankings
107
+ 2. **Performance Deviation**: Devices with a rating deviation (RD) above 100 (on the rating scale) are considered less reliable
108
+ 3. **Performance Consistency**: High volatility indicates inconsistent performance across benchmarks
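These thresholds can be applied directly to the DataFrames returned by `analyze_glicko2_rankings()`. A sketch only: the module itself enforces just the minimum-match filter, so the RD and volatility checks below are illustrative, and `df` is assumed to be the benchmark DataFrame used elsewhere in the app:

```python
from src.core.glicko2_ranking import analyze_glicko2_rankings

rankings, confident = analyze_glicko2_rankings(df, min_matches=5)

MAX_RD = 100  # threshold 2, on the rating scale
reliable = confident[confident["combined_rd"] <= MAX_RD]

# Threshold 3: inspect volatility to spot devices with inconsistent results.
print(reliable.sort_values("combined_vol", ascending=False)
              [["combined_rating", "combined_rd", "combined_vol", "matches"]]
              .head())
```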
109
 
110
+ ## Practical Considerations
 
 
111
 
112
+ ### Handling Sparse Data
113
+
114
+ The system is designed to handle sparse benchmark data by:
115
+ 1. Using conservative initial performance ratings for new devices
116
+ 2. Increasing RD for devices with few benchmark runs
117
+ 3. Implementing a minimum benchmark threshold
118
+
119
+ ### Performance Metrics
120
+
121
+ We track several performance metrics:
122
+ - Combined performance rating (on the Glicko-2 scale, weighting token generation and prompt processing)
123
+ - Token generation rating (Glicko-2 scale)
124
+ - Prompt processing rating (Glicko-2 scale)
125
+ - Rating deviation (uncertainty of each rating, on the same scale)
126
+ - Number of benchmark runs
127
+ - Performance comparison statistics
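In the DataFrames returned by `analyze_glicko2_rankings()` these metrics surface as columns (names as defined in `src/core/glicko2_ranking.py`); for example, continuing from the call sketched earlier:

```python
cols = [
    "combined_rating", "combined_rd",               # combined rating and its uncertainty
    "token_rating", "prompt_rating",                # per-metric ratings
    "matches", "combined_wins", "combined_losses",  # number of comparisons and results
    "combined_win_rate",                            # share of comparisons won
]
print(confident[cols].head())
```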
128
+
129
+ ### Visualization
130
+
131
+ The system provides:
132
+ 1. Overall performance rankings with confidence intervals
133
+ 2. Platform-specific performance statistics
134
+ 3. Head-to-head performance comparison tools
135
+ 4. Performance trend analysis across different model sizes
136
+
137
+ ## Advantages Over Other Systems
138
+
139
+ 1. **Better Handling of Performance Uncertainty**: Explicit modeling of performance measurement uncertainty
140
+ 2. **More Accurate with Fewer Benchmarks**: Can provide meaningful performance ratings with limited data
141
+ 3. **Dynamic Performance Updates**: Volatility parameter allows for appropriate rating changes
142
+ 4. **Transparent Confidence**: Performance deviations provide clear confidence measures
143
+
144
+ ## Limitations
145
+
146
+ 1. **Computational Complexity**: More complex than Elo, requiring more calculations
147
+ 2. **Parameter Sensitivity**: Results can be sensitive to system parameters
148
+ 3. **Continuous Metrics**: Requires conversion of continuous performance metrics (tokens/second) to relative comparisons
149
+
150
+ ## References
151
 
152
+ 1. Glickman, M. E. (2001). "The Glicko-2 Rating System"
153
+ 2. Glickman, M. E. (1999). "Parameter estimation in large dynamic paired comparison experiments"
154
+ 3. Glickman, M. E. (2001). "Dynamic paired comparison models with stochastic variances"
requirements.txt CHANGED
@@ -7,3 +7,6 @@ httpx>=0.25.1
7
  pydantic-settings>=2.0.3
8
  firebase-admin==6.6.0
9
  statsmodels>=0.14.1
 
 
 
 
7
  pydantic-settings>=2.0.3
8
  firebase-admin==6.6.0
9
  statsmodels>=0.14.1
10
+ matplotlib>=3.7.0
11
+ arviz>=0.17.0
12
+ glicko2
requirements/base.txt CHANGED
@@ -5,4 +5,5 @@ pandas>=2.1.3
5
  plotly>=5.18.0
6
  httpx>=0.25.1
7
  pydantic-settings>=2.0.3
8
- firebase-admin==6.6.0
 
 
5
  plotly>=5.18.0
6
  httpx>=0.25.1
7
  pydantic-settings>=2.0.3
8
+ firebase-admin==6.6.0
9
+ glicko2
src/app.py CHANGED
@@ -10,6 +10,8 @@ from .components.visualizations import (
10
  render_device_rankings,
11
  )
12
  from .components.header import render_header, render_contribution_guide
 
 
13
  from .services.firebase import fetch_leaderboard_data
14
  from .core.styles import CUSTOM_CSS
15
  from .core.scoring import (
@@ -128,7 +130,13 @@ async def main():
128
 
129
  with main_col:
130
  # Create tabs for different views
131
- tab1, tab2 = st.tabs(["Device Rankings", "Benchmark Results"])
 
133
  with tab1:
134
  # Device rankings view
@@ -139,11 +147,11 @@ async def main():
139
  st.info(
140
  f"📊 Rankings are based on benchmarks with standard conditions: "
141
  f"PP={std.PP_CONFIG} tokens, TG={std.TG_CONFIG} tokens. "
142
- f"Scores factor in model size and quantization."
143
  )
144
 
145
  # Render performance metrics
146
- render_performance_metrics(metrics)
147
 
148
  # Render device rankings
149
  render_device_rankings(df)
@@ -172,6 +180,12 @@ async def main():
172
  # Render performance plots with table filters
173
  render_performance_plots(df, table_filters)
174
 
  with guide_col:
176
  render_contribution_guide()
177
 
 
10
  render_device_rankings,
11
  )
12
  from .components.header import render_header, render_contribution_guide
13
+ from .components.rankings import render_algorithm_rankings
14
+ from .components.device_comparison import render_device_comparison
15
  from .services.firebase import fetch_leaderboard_data
16
  from .core.styles import CUSTOM_CSS
17
  from .core.scoring import (
 
130
 
131
  with main_col:
132
  # Create tabs for different views
133
+ tab1, tab2, tab3 = st.tabs(
134
+ [
135
+ "Device Rankings",
136
+ "Benchmark Results",
137
+ "⚔️ Device Duel",
138
+ ]
139
+ )
140
 
141
  with tab1:
142
  # Device rankings view
 
147
  st.info(
148
  f"📊 Rankings are based on benchmarks with standard conditions: "
149
  f"PP={std.PP_CONFIG} tokens, TG={std.TG_CONFIG} tokens. "
150
+ f"The rankings are based on the Glicko-2 algorithm."
151
  )
152
 
153
  # Render performance metrics
154
+ # render_performance_metrics(metrics)
155
 
156
  # Render device rankings
157
  render_device_rankings(df)
 
180
  # Render performance plots with table filters
181
  render_performance_plots(df, table_filters)
182
 
183
+ with tab3:
184
+ # Device comparison view
185
+ # Get list of normalized device IDs for the device comparison
186
+ normalized_device_ids = sorted(df["Normalized Device ID"].unique().tolist())
187
+ render_device_comparison(df, normalized_device_ids)
188
+
189
  with guide_col:
190
  render_contribution_guide()
191
 
src/components/device_comparison.py ADDED
@@ -0,0 +1,186 @@
1
+ import streamlit as st
2
+ import pandas as pd
3
+ from typing import List, Optional
4
+
5
+ from ..core.elo_ranking import analyze_device_matches
6
+ from ..core.trueskill_ranking import analyze_device_trueskill_matches
7
+ from ..core.glicko2_ranking import analyze_device_glicko2_matches
8
+ from ..components.visualizations import clean_device_id
9
+
10
+
11
+ def render_device_comparison(df: pd.DataFrame, normalized_device_ids: List[str]):
12
+ """
13
+ Render a component for comparing two devices and analyzing their matches.
14
+
15
+ Args:
16
+ df: DataFrame containing benchmark data
17
+ normalized_device_ids: List of normalized device IDs to select from
18
+ """
19
+ st.title("⚔️ Device Duel Arena")
20
+
21
+ # Create mapping of normalized IDs to display names
22
+ device_display_names = {
23
+ device_id: clean_device_id(device_id) for device_id in normalized_device_ids
24
+ }
25
+
26
+ # Create two columns for device selection
27
+ col1, col2 = st.columns(2)
28
+
29
+ with col1:
30
+ device1 = st.selectbox(
31
+ "Select First Device",
32
+ options=normalized_device_ids,
33
+ format_func=lambda x: device_display_names[x],
34
+ key="device_compare_1",
35
+ )
36
+
37
+ with col2:
38
+ # Filter second device dropdown to exclude the first selected device
39
+ remaining_devices = [d for d in normalized_device_ids if d != device1]
40
+ device2 = st.selectbox(
41
+ "Select Second Device",
42
+ options=remaining_devices,
43
+ format_func=lambda x: device_display_names[x],
44
+ key="device_compare_2",
45
+ )
46
+
47
+ # Button to analyze matches
48
+ if st.button("Start Duel", key="analyze_matches_btn"):
49
+ st.markdown("### Match Analysis Results")
50
+
51
+ # Ensure we have both devices
52
+ if device1 and device2:
53
+ with st.spinner(
54
+ f"Analyzing matches between {device_display_names[device1]} and {device_display_names[device2]}..."
55
+ ):
56
+ try:
57
+ # Analyze matches using Glicko-2
58
+ matches_df = analyze_device_glicko2_matches(df, device1, device2)
59
+
60
+ if not matches_df.empty:
61
+ # Show summary statistics
62
+ total_matches = len(matches_df)
63
+
64
+ # Set up metrics
65
+ col1, col2, col3 = st.columns(3)
66
+
67
+ with col1:
68
+ st.metric("Total Matches", total_matches)
69
+
70
+ # Check for required columns before calculating metrics
71
+ if (
72
+ "Token Winner" in matches_df.columns
73
+ and "Prompt Winner" in matches_df.columns
74
+ ):
75
+ token_wins_1 = sum(matches_df["Token Winner"] == device1)
76
+ prompt_wins_1 = sum(matches_df["Prompt Winner"] == device1)
77
+
78
+ with col2:
79
+ st.metric(
80
+ f"{device_display_names[device1]}'s Token Wins",
81
+ f"{token_wins_1} ({token_wins_1/total_matches*100:.1f}%)",
82
+ )
83
+ with col3:
84
+ st.metric(
85
+ f"{device_display_names[device1]}'s Prompt Wins",
86
+ f"{prompt_wins_1} ({prompt_wins_1/total_matches*100:.1f}%)",
87
+ )
88
+
89
+ # Add Combined Winner metric if available
90
+ if "Combined Winner" in matches_df.columns:
91
+ combined_wins_1 = sum(
92
+ matches_df["Combined Winner"] == device1
93
+ )
94
+ st.metric(
95
+ f"{device_display_names[device1]}'s Combined Wins",
96
+ f"{combined_wins_1} ({combined_wins_1/total_matches*100:.1f}%)",
97
+ )
98
+ else:
99
+ st.warning(
100
+ "Winner information is missing from the match data."
101
+ )
102
+
103
+ # Show the detailed match table
104
+ st.markdown("#### Detailed Match Results")
105
+
106
+ # Define display columns for Glicko-2
107
+ display_cols = [
108
+ "Model",
109
+ "Token Generation 1",
110
+ "Token Generation 2",
111
+ "Token Winner",
112
+ "Token Win Prob",
113
+ "Prompt Processing 1",
114
+ "Prompt Processing 2",
115
+ "Prompt Winner",
116
+ "Prompt Win Prob",
117
+ "Combined Winner",
118
+ "Combined Win Prob",
119
+ "Platform 1",
120
+ "Platform 2",
121
+ ]
122
+
123
+ # Ensure all columns exist in the dataframe
124
+ valid_cols = [
125
+ col for col in display_cols if col in matches_df.columns
126
+ ]
127
+
128
+ if valid_cols:
129
+ # Rename some columns for better display
130
+ matches_display = matches_df[valid_cols].copy()
131
+
132
+ # Define a rename mapping but only apply for columns that exist
133
+ rename_mapping = {
134
+ "Token Generation 1": f"{device_display_names[device1]} Token Gen",
135
+ "Token Generation 2": f"{device_display_names[device2]} Token Gen",
136
+ "Prompt Processing 1": f"{device_display_names[device1]} Prompt Proc",
137
+ "Prompt Processing 2": f"{device_display_names[device2]} Prompt Proc",
138
+ "Platform 1": f"{device_display_names[device1]} Platform",
139
+ "Platform 2": f"{device_display_names[device2]} Platform",
140
+ "Token Win Prob": "Device 1 Token Win Prob",
141
+ "Prompt Win Prob": "Device 1 Prompt Win Prob",
142
+ "Combined Win Prob": "Device 1 Combined Win Prob",
143
+ }
144
+
145
+ # Only rename columns that exist in the dataframe
146
+ rename_filtered = {
147
+ k: v
148
+ for k, v in rename_mapping.items()
149
+ if k in matches_display.columns
150
+ }
151
+ matches_display = matches_display.rename(
152
+ columns=rename_filtered
153
+ )
154
+
155
+ # Round any numeric columns for better display
156
+ for col in matches_display.columns:
157
+ if matches_display[col].dtype in ["float64", "float32"]:
158
+ matches_display[col] = matches_display[col].round(2)
159
+
160
+ st.dataframe(
161
+ matches_display,
162
+ use_container_width=True,
163
+ height=400,
164
+ )
165
+ else:
166
+ st.warning(
167
+ "No valid columns found for display in the match data."
168
+ )
169
+
170
+ # Platform breakdown if available
171
+ if "Platform 2" in matches_df.columns:
172
+ st.markdown("#### Platform Distribution")
173
+ platform_counts = matches_df["Platform 2"].value_counts()
174
+ st.bar_chart(platform_counts)
175
+ else:
176
+ st.warning(
177
+ f"No matches found between {device_display_names[device1]} and {device_display_names[device2]}."
178
+ )
179
+ st.info(
180
+ "Try selecting different devices or checking if they both have benchmark data for the same models."
181
+ )
182
+ except Exception as e:
183
+ st.error(f"An error occurred during match analysis: {str(e)}")
184
+ st.info("Please try with different devices.")
185
+ else:
186
+ st.error("Please select two different devices to compare.")
src/components/visualizations.py CHANGED
@@ -8,6 +8,7 @@ import pandas as pd
8
  from typing import Optional, Dict, List, Set
9
  import plotly.graph_objects as go
10
  from ..core.scoring import get_quantization_tier
 
11
 
12
 
13
  def clean_device_id(device_id: str) -> str:
@@ -576,318 +577,158 @@ def render_leaderboard_table(df: pd.DataFrame, filters: Dict):
576
 
577
 
578
  def render_device_rankings(df: pd.DataFrame):
579
- """Render device rankings with detailed performance metrics."""
580
  if df.empty:
581
  st.warning("No data available for device rankings.")
582
  return
583
 
584
- # Create device summary
585
- device_summary = (
586
- df.groupby(["Normalized Device ID", "Platform"])
587
- .agg(
588
- {
589
- "performance_score": "max", # Best score achieved
590
- "Model Size": ["min", "max"], # Size range
591
- "tg_score": "max", # Use normalized TG score
592
- "pp_score": "max", # Use normalized PP score
593
- "Model ID": lambda x: ", ".join(sorted(set(x))), # All models tested
594
- "quant_factor": lambda x: sorted(set(x)), # Quantization levels tested
595
- }
596
- )
597
- .reset_index()
598
- )
599
-
600
- # Flatten column names
601
- device_summary.columns = [
602
- "Device ID", # Normalized Device ID for grouping
603
- "Platform",
604
- "Best Score",
605
- "Min Model Size",
606
- "Max Model Size",
607
- "TG Score",
608
- "PP Score",
609
- "Tested Models",
610
- "Tested Quantizations",
611
- ]
612
-
613
- # Add clean device name
614
- device_summary["Device"] = device_summary["Device ID"].apply(clean_device_id)
615
-
616
- # Create three tabs for different ranking views
617
- rank_tab1, rank_tab2, rank_tab3 = st.tabs(
618
- ["Overall Rankings", "Rankings by Model Size", "Rankings by Quantization"]
619
- )
620
-
621
- with rank_tab1:
622
- st.subheader("📱 Overall Device Rankings")
623
-
624
- # Sort by best score
625
- overall_rankings = device_summary.sort_values("Best Score", ascending=False)
626
- # Add ranking column
627
- overall_rankings = overall_rankings.reset_index(drop=True)
628
- overall_rankings.index = overall_rankings.index + 1
629
- overall_rankings = overall_rankings.rename_axis("Rank")
630
-
631
- # Format the display columns
632
- display_df = overall_rankings.copy()
633
- display_df["Best Score"] = display_df["Best Score"].round(2)
634
- display_df["TG Score"] = display_df["TG Score"].round(2)
635
- display_df["PP Score"] = display_df["PP Score"].round(2)
636
-
637
- display_df["Model Size Range"] = display_df.apply(
638
- lambda x: f"{x['Min Model Size']:.1f}B - {x['Max Model Size']:.1f}B", axis=1
639
- )
640
-
641
- # Select and reorder columns for display
642
- display_cols = [
643
- "Device", # Use clean device name for display
644
- "Platform",
645
- "Best Score",
646
- "TG Score",
647
- "PP Score",
648
- "Model Size Range",
649
- ]
650
-
651
- st.dataframe(
652
- display_df[display_cols],
653
- use_container_width=True,
654
- height=min(
655
- 800, (len(display_df) + 1) * 35 + 40
656
- ), # Dynamic height based on content
657
- hide_index=False,
658
- column_config={
659
- "Rank": st.column_config.NumberColumn(
660
- "Rank",
661
- help="Device ranking based on performance score",
662
- ),
663
- "Device": st.column_config.TextColumn(
664
- "Device",
665
- help="Device brand and model",
666
- ),
667
- "Best Score": st.column_config.NumberColumn(
668
- "Score", help="Overall performance score (0-100)", format="%.2f"
669
- ),
670
- "TG Score": st.column_config.NumberColumn(
671
- "TG Score",
672
- help="Normalized Token Generation score (0-100)",
673
- format="%.2f",
674
- ),
675
- "PP Score": st.column_config.NumberColumn(
676
- "PP Score",
677
- help="Normalized Prompt Processing score (0-100)",
678
- format="%.2f",
679
- ),
680
- },
681
- )
682
-
683
- with rank_tab2:
684
- st.subheader("📊 Rankings by Model Size")
685
-
686
- # Define model size categories
687
- def get_size_category(size):
688
- if size < 1:
689
- return "Tiny (<1B)"
690
- elif size < 2:
691
- return "Small (1-2B)"
692
- elif size < 4:
693
- return "Medium (2-4B)"
694
- elif size < 8:
695
- return "Large (4-8B)"
696
- else:
697
- return "Extra Large (>8B)"
698
-
699
- # Create size-based rankings
700
- size_rankings = df.copy()
701
- size_rankings["Size Category"] = size_rankings["Model Size"].apply(
702
- get_size_category
703
- )
704
-
705
- size_summary = (
706
- size_rankings.groupby(["Normalized Device ID", "Platform", "Size Category"])
707
- .agg(
708
- {
709
- "performance_score": ["max", "mean"],
710
- "tg_score": "max", # Use normalized scores
711
- "pp_score": "max", # Use normalized scores
712
- "Model ID": lambda x: ", ".join(sorted(set(x))),
713
- }
714
  )
715
- .reset_index()
716
- )
717
 
718
- # Flatten and rename columns
719
- size_summary.columns = [
720
- "Device ID",
721
- "Platform",
722
- "Size Category",
723
- "Best Score",
724
- "Avg Score",
725
- "TG Score",
726
- "PP Score",
727
- "Models",
728
- ]
729
 
730
- # Add clean device name
731
- size_summary["Device"] = size_summary["Device ID"].apply(clean_device_id)
732
-
733
- # Format and display each category
734
- for size_cat in sorted(size_summary["Size Category"].unique()):
735
- st.markdown(f"##### {size_cat}")
736
- cat_data = size_summary[size_summary["Size Category"] == size_cat].copy()
737
- cat_data = cat_data.sort_values("Best Score", ascending=False)
738
-
739
- # Add ranking column
740
- cat_data = cat_data.reset_index(drop=True)
741
- cat_data.index = cat_data.index + 1
742
- cat_data = cat_data.rename_axis("Rank")
743
-
744
- # Format scores
745
- cat_data["Best Score"] = cat_data["Best Score"].round(2)
746
- cat_data["Avg Score"] = cat_data["Avg Score"].round(2)
747
- cat_data["TG Score"] = cat_data["TG Score"].round(2)
748
- cat_data["PP Score"] = cat_data["PP Score"].round(2)
749
-
750
- display_cols = [
751
- "Device", # Use clean device name for display
752
- "Platform",
753
- "Best Score",
754
- "Avg Score",
755
- "TG Score",
756
- "PP Score",
757
- ]
758
-
759
- st.dataframe(
760
- cat_data[display_cols],
761
- use_container_width=True,
762
- height=min(
763
- 300, (len(cat_data) + 1) * 35 + 40
764
- ), # Slightly smaller for category tables
765
- hide_index=False,
766
- column_config={
767
- "Rank": st.column_config.NumberColumn(
768
- "Rank",
769
- help="Device ranking within this size category",
770
- ),
771
- "Device": st.column_config.TextColumn(
772
- "Device",
773
- help="Device brand and model",
774
- ),
775
- "Best Score": st.column_config.NumberColumn(
776
- "Best Score",
777
- help="Best performance score achieved",
778
- format="%.2f",
779
- ),
780
- "Avg Score": st.column_config.NumberColumn(
781
- "Avg Score", help="Average performance score", format="%.2f"
782
- ),
783
- "TG Score": st.column_config.NumberColumn(
784
- "TG Score",
785
- help="Normalized Token Generation score (0-100)",
786
- format="%.2f",
787
- ),
788
- "PP Score": st.column_config.NumberColumn(
789
- "PP Score",
790
- help="Normalized Prompt Processing score (0-100)",
791
- format="%.2f",
792
- ),
793
- },
794
  )
795
 
796
- with rank_tab3:
797
- st.subheader("🔍 Rankings by Quantization")
798
-
799
- # Group by device and quantization level
800
- quant_rankings = df.copy()
801
- quant_summary = (
802
- quant_rankings.groupby(["Normalized Device ID", "Platform", "quant_factor"])
803
- .agg(
804
- {
805
- "performance_score": ["max", "mean"],
806
- "tg_score": "max",
807
- "pp_score": "max",
808
- "Model ID": lambda x: ", ".join(sorted(set(x))),
809
  }
810
- )
811
- .reset_index()
812
- )
813
 
814
- # Flatten and rename columns
815
- quant_summary.columns = [
816
- "Device ID",
817
- "Platform",
818
- "Quant Factor",
819
- "Best Score",
820
- "Avg Score",
821
- "TG Score",
822
- "PP Score",
823
- "Models",
824
- ]
825
 
826
- # Add clean device name
827
- quant_summary["Device"] = quant_summary["Device ID"].apply(clean_device_id)
828
-
829
- # Format and display for each quantization tier
830
- for quant_level in sorted(quant_summary["Quant Factor"].unique(), reverse=True):
831
- quant_name = get_quant_name(quant_level)
832
- st.markdown(f"##### Quantization Level: {quant_name}")
833
- quant_data = quant_summary[
834
- quant_summary["Quant Factor"] == quant_level
835
- ].copy()
836
- quant_data = quant_data.sort_values("Best Score", ascending=False)
837
-
838
- # Add ranking column
839
- quant_data = quant_data.reset_index(drop=True)
840
- quant_data.index = quant_data.index + 1
841
- quant_data = quant_data.rename_axis("Rank")
842
-
843
- # Format scores
844
- quant_data["Best Score"] = quant_data["Best Score"].round(2)
845
- quant_data["Avg Score"] = quant_data["Avg Score"].round(2)
846
- quant_data["TG Score"] = quant_data["TG Score"].round(2)
847
- quant_data["PP Score"] = quant_data["PP Score"].round(2)
848
-
849
- display_cols = [
850
- "Device",
851
- "Platform",
852
- "Best Score",
853
- "Avg Score",
854
- "TG Score",
855
- "PP Score",
856
- ]
857
-
858
- st.dataframe(
859
- quant_data[display_cols],
860
- use_container_width=True,
861
- height=min(
862
- 300, (len(quant_data) + 1) * 35 + 40
863
- ), # Slightly smaller for quantization tables
864
- hide_index=False,
865
- column_config={
866
- "Rank": st.column_config.NumberColumn(
867
- "Rank",
868
- help="Device ranking within this quantization level",
869
- ),
870
- "Device": st.column_config.TextColumn(
871
- "Device",
872
- help="Device brand and model",
873
- ),
874
- "Best Score": st.column_config.NumberColumn(
875
- "Best Score",
876
- help="Best performance score achieved",
877
- format="%.2f",
878
- ),
879
- "Avg Score": st.column_config.NumberColumn(
880
- "Avg Score", help="Average performance score", format="%.2f"
881
- ),
882
- "TG Score": st.column_config.NumberColumn(
883
- "TG Score",
884
- help="Normalized Token Generation score (0-100)",
885
- format="%.2f",
886
- ),
887
- "PP Score": st.column_config.NumberColumn(
888
- "PP Score",
889
- help="Normalized Prompt Processing score (0-100)",
890
- format="%.2f",
891
- ),
892
- },
893
- )
 
8
  from typing import Optional, Dict, List, Set
9
  import plotly.graph_objects as go
10
  from ..core.scoring import get_quantization_tier
11
+ from ..core.glicko2_ranking import analyze_glicko2_rankings
12
 
13
 
14
  def clean_device_id(device_id: str) -> str:
 
577
 
578
 
579
  def render_device_rankings(df: pd.DataFrame):
580
+ """Render device rankings using Glicko-2 algorithm."""
581
  if df.empty:
582
  st.warning("No data available for device rankings.")
583
  return
584
 
585
+ # Calculate Glicko-2 rankings automatically
586
+ with st.spinner("Calculating Glicko-2 rankings..."):
587
+ try:
588
+ g2_all, g2_confident = analyze_glicko2_rankings(
589
+ df,
590
+ min_matches=5, # Default minimum matches
591
+ min_gpu_layers=20, # Default minimum GPU layers
592
  )
 
 
593
 
594
+ # Display performance overview
595
+ st.subheader("🏆 Performance Overview")
596
 
597
+ # Get top device from Glicko-2 rankings
598
+ top_device = g2_confident.index[0] if not g2_confident.empty else "N/A"
599
+ top_device_clean = (
600
+ clean_device_id(top_device) if top_device != "N/A" else "N/A"
601
  )
602
 
603
+ # Calculate total unique devices and models
604
+ total_devices = df["Normalized Device ID"].nunique()
605
+ total_models = df["Model ID"].nunique()
606
+
607
+ # Display metrics in columns
608
+ col1, col2, col3 = st.columns([3, 1, 1])
609
+ with col1:
610
+ st.metric("Top Device", top_device_clean)
611
+ with col2:
612
+ st.metric("Total Devices", total_devices)
613
+ with col3:
614
+ st.metric("Total Models", total_models)
615
+
616
+ st.markdown("---")
617
+
618
+ # Display confident rankings
619
+ if not g2_confident.empty:
620
+ st.subheader("📱 Device Rankings")
621
+
622
+ # Create a copy and handle the index
623
+ g2_confident_display = g2_confident.copy()
624
+
625
+ # Get the device ID column name
626
+ device_id_col = g2_confident_display.index.name or "device"
627
+ g2_confident_display = g2_confident_display.reset_index()
628
+
629
+ # Get platform information from the original dataframe
630
+ platform_map = (
631
+ df.groupby("Normalized Device ID")["Platform"].first().to_dict()
632
+ )
633
+ g2_confident_display["Platform"] = g2_confident_display[
634
+ device_id_col
635
+ ].map(platform_map)
636
+
637
+ # Get model size range from the original dataframe
638
+ model_sizes = df.groupby("Normalized Device ID")["Model Size"].agg(
639
+ ["min", "max"]
640
+ )
641
+ g2_confident_display["Model Size Range"] = g2_confident_display[
642
+ device_id_col
643
+ ].apply(
644
+ lambda x: f"{model_sizes.loc[x, 'min']:.1f}B - {model_sizes.loc[x, 'max']:.1f}B"
645
+ )
646
+
647
+ # Add clean device name
648
+ g2_confident_display["Device"] = g2_confident_display[
649
+ device_id_col
650
+ ].apply(clean_device_id)
651
+
652
+ # Round numeric columns to whole numbers
653
+ numeric_cols = [
654
+ "combined_rating",
655
+ "combined_rd",
656
+ "token_rating",
657
+ "prompt_rating",
658
+ ]
659
+ for col in numeric_cols:
660
+ if col in g2_confident_display.columns:
661
+ g2_confident_display[col] = (
662
+ g2_confident_display[col].round(0).astype(int)
663
+ )
664
+
665
+ # Select and order columns for display
666
+ display_cols = [
667
+ "Device",
668
+ "Platform",
669
+ "combined_rating",
670
+ "combined_rd",
671
+ "token_rating",
672
+ "prompt_rating",
673
+ "Model Size Range",
674
+ ]
675
+
676
+ # Rename columns for better display
677
+ rename_map = {
678
+ "combined_rating": "Rating",
679
+ "combined_rd": "Rating Deviation",
680
+ "token_rating": "Token Rating",
681
+ "prompt_rating": "Prompt Rating",
682
  }
 
 
 
683
 
684
+ g2_confident_display = g2_confident_display.rename(columns=rename_map)
685
+
686
+ # Sort by Rating
687
+ g2_confident_display = g2_confident_display.sort_values(
688
+ "Rating", ascending=False
689
+ )
690
+
691
+ # Add rank column
692
+ g2_confident_display = g2_confident_display.reset_index(drop=True)
693
+ g2_confident_display.index = g2_confident_display.index + 1
694
+ g2_confident_display = g2_confident_display.rename_axis("Rank")
695
+
696
+ # Display the table
697
+ st.dataframe(
698
+ g2_confident_display[
699
+ [
700
+ "Device",
701
+ "Platform",
702
+ "Rating",
703
+ "Rating Deviation",
704
+ "Token Rating",
705
+ "Prompt Rating",
706
+ "Model Size Range",
707
+ ]
708
+ ],
709
+ use_container_width=True,
710
+ height=min(600, (len(g2_confident_display) + 1) * 35 + 40),
711
+ hide_index=False,
712
+ )
713
+
714
+ # Platform statistics
715
+ st.markdown("#### Platform Statistics")
716
+ platform_stats = (
717
+ g2_confident_display.groupby("Platform")
718
+ .agg(
719
+ {
720
+ "Rating": ["mean", "std"],
721
+ }
722
+ )
723
+ .round(0)
724
+ .astype(int)
725
+ )
726
+ st.dataframe(platform_stats, use_container_width=True)
727
 
728
+ else:
729
+ st.warning(
730
+ "No confident rankings available. Try adjusting the minimum matches threshold."
731
+ )
732
+
733
+ except Exception as e:
734
+ st.error(f"Error calculating Glicko-2 rankings: {str(e)}")
 
src/core/glicko2_ranking.py ADDED
@@ -0,0 +1,618 @@
1
+ """
2
+ Glicko-2 Ranking System for Device Performance Comparison
3
+
4
+ This module implements a Glicko-2 based ranking system for comparing device performance
5
+ in benchmark tests. Glicko-2 is an improvement over the original Glicko system and Elo,
6
+ providing better handling of rating uncertainty and volatility.
7
+
8
+ The system:
9
+ 1. Filters out emulators and iOS devices with insufficient GPU layers
10
+ 2. Normalizes scores within each model group
11
+ 3. Computes Glicko-2 ratings for devices based on their performance
12
+ 4. Provides uncertainty metrics alongside ratings
13
+ 5. Supports both combined and separate analysis of Token Generation and Prompt Processing
14
+ """
15
+
16
+ import numpy as np
17
+ import pandas as pd
18
+ from collections import defaultdict
19
+ from typing import Tuple, Dict, List, Optional
20
+ import glicko2
21
+
22
+
23
+ def preprocess_benchmark_data(
24
+ df: pd.DataFrame,
25
+ min_gpu_layers: int = 20,
26
+ pp_config: int = 512,
27
+ tg_config: int = 128,
28
+ ) -> pd.DataFrame:
29
+ """
30
+ Preprocess benchmark data by filtering out invalid entries.
31
+
32
+ Args:
33
+ df: DataFrame containing benchmark data
34
+ min_gpu_layers: Minimum number of GPU layers required for iOS devices
35
+ pp_config: Prompt Processing configuration to filter for
36
+ tg_config: Token Generation configuration to filter for
37
+
38
+ Returns:
39
+ Filtered DataFrame containing only valid benchmark entries
40
+ """
41
+ # Create a mask for devices to keep
42
+ keep_device = (
43
+ # Keep non-iOS devices
44
+ (
45
+ (df["Platform"] != "iOS")
46
+ |
47
+ # Keep iOS devices with sufficient GPU layers
48
+ ((df["Platform"] == "iOS") & (df["n_gpu_layers"] >= min_gpu_layers))
49
+ )
50
+ &
51
+ # Remove emulators
52
+ (~df["Normalized Device ID"].str.contains("Emulator", case=False, na=False))
53
+ &
54
+ # Filter by configuration
55
+ (df["PP Config"] == pp_config)
56
+ & (df["TG Config"] == tg_config)
57
+ )
58
+
59
+ filtered_df = df[keep_device].copy()
60
+
61
+ # Print filtering statistics
62
+ total_devices = df["Normalized Device ID"].nunique()
63
+ filtered_devices = filtered_df["Normalized Device ID"].nunique()
64
+ emulator_devices = df[
65
+ df["Normalized Device ID"].str.contains("Emulator", case=False, na=False)
66
+ ]["Normalized Device ID"].nunique()
67
+
68
+ print("Filtering Statistics:")
69
+ print(f"Original devices: {total_devices}")
70
+ print(f"Emulator devices removed: {emulator_devices}")
71
+ print(
72
+ f"iOS devices with insufficient GPU layers removed: "
73
+ f"{total_devices - filtered_devices - emulator_devices}"
74
+ )
75
+ print(f"Final device count: {filtered_devices}")
76
+
77
+ # Print removed devices for verification
78
+ print(
79
+ f"Removed {set(df['Normalized Device ID'].unique()) - set(filtered_df['Normalized Device ID'].unique())} "
80
+ )
81
+
82
+ return filtered_df
83
+
84
+
85
+ def compute_glicko2_rankings(
86
+ df: pd.DataFrame, token_weight: float = 0.6
87
+ ) -> pd.DataFrame:
88
+ """
89
+ Compute device rankings using Glicko-2 rating system.
90
+
91
+ Args:
92
+ df: DataFrame containing benchmark data
93
+ token_weight: Weight for Token Generation in combined score (0.0 to 1.0)
94
+
95
+ Returns:
96
+ DataFrame containing device rankings and statistics
97
+ """
98
+ # Initialize Glicko-2 ratings for all devices
99
+ ratings = {}
100
+ match_counts = defaultdict(int)
101
+ win_counts = defaultdict(int)
102
+ loss_counts = defaultdict(int)
103
+
104
+ # Default Glicko-2 settings
105
+ # Rating = 1500, RD (rating deviation) = 350, Volatility = 0.06
106
+ def create_glicko2_rating():
107
+ return glicko2.Player(rating=1500, rd=350, vol=0.06)
108
+
109
+ def normalize_scores(group: pd.DataFrame) -> pd.Series:
110
+ """Normalize and combine scores within a model group"""
111
+ # Normalize Token Generation (higher is better)
112
+ token_min = group["Token Generation"].min()
113
+ token_max = group["Token Generation"].max()
114
+ token_norm = (
115
+ (group["Token Generation"] - token_min) / (token_max - token_min)
116
+ if token_max > token_min
117
+ else 0
118
+ )
119
+
120
+ # Normalize Prompt Processing (higher is better)
121
+ prompt_min = group["Prompt Processing"].min()
122
+ prompt_max = group["Prompt Processing"].max()
123
+ prompt_norm = (
124
+ (group["Prompt Processing"] - prompt_min) / (prompt_max - prompt_min)
125
+ if prompt_max > prompt_min
126
+ else 0
127
+ )
128
+
129
+ # Combine scores
130
+ return token_weight * token_norm + (1 - token_weight) * prompt_norm
131
+
132
+ # Get all unique devices
133
+ all_devices = df["Normalized Device ID"].unique()
134
+
135
+ # Initialize ratings for all devices
136
+ for device in all_devices:
137
+ ratings[device] = create_glicko2_rating()
138
+
139
+ # Process each model separately
140
+ for model, group in df.groupby("Model ID"):
141
+ # Add normalized combined score
142
+ group.loc[:, "combined_score"] = normalize_scores(group)
143
+
144
+ devices = group["Normalized Device ID"].unique()
145
+
146
+ # In Glicko-2, we need to collect all results for a rating period before updating
147
+ # A rating period could be all matches for a specific model
148
+ device_matches = defaultdict(
149
+ lambda: {"opponent_ratings": [], "opponent_rds": [], "outcomes": []}
150
+ )
151
+
152
+ for i in range(len(devices)):
153
+ for j in range(i + 1, len(devices)):
154
+ device1 = devices[i]
155
+ device2 = devices[j]
156
+
157
+ score1 = group[group["Normalized Device ID"] == device1][
158
+ "combined_score"
159
+ ].iloc[0]
160
+ score2 = group[group["Normalized Device ID"] == device2][
161
+ "combined_score"
162
+ ].iloc[0]
163
+
164
+ # Update match counts
165
+ match_counts[device1] += 1
166
+ match_counts[device2] += 1
167
+
168
+ # Determine outcome (0 = loss, 1 = win, 0.5 = draw)
169
+ if score1 > score2:
170
+ # Device 1 wins
171
+ outcome = 1
172
+ win_counts[device1] += 1
173
+ loss_counts[device2] += 1
174
+ # For device 1
175
+ device_matches[device1]["opponent_ratings"].append(
176
+ ratings[device2].rating
177
+ )
178
+ device_matches[device1]["opponent_rds"].append(ratings[device2].rd)
179
+ device_matches[device1]["outcomes"].append(outcome)
180
+ # For device 2
181
+ device_matches[device2]["opponent_ratings"].append(
182
+ ratings[device1].rating
183
+ )
184
+ device_matches[device2]["opponent_rds"].append(ratings[device1].rd)
185
+ device_matches[device2]["outcomes"].append(0) # Loss
186
+ elif score1 < score2:
187
+ # Device 2 wins
188
+ outcome = 0
189
+ win_counts[device2] += 1
190
+ loss_counts[device1] += 1
191
+ # For device 1
192
+ device_matches[device1]["opponent_ratings"].append(
193
+ ratings[device2].rating
194
+ )
195
+ device_matches[device1]["opponent_rds"].append(ratings[device2].rd)
196
+ device_matches[device1]["outcomes"].append(outcome)
197
+ # For device 2
198
+ device_matches[device2]["opponent_ratings"].append(
199
+ ratings[device1].rating
200
+ )
201
+ device_matches[device2]["opponent_rds"].append(ratings[device1].rd)
202
+ device_matches[device2]["outcomes"].append(1) # Win
203
+ else:
204
+ # It's a draw
205
+ outcome = 0.5
206
+ # For device 1
207
+ device_matches[device1]["opponent_ratings"].append(
208
+ ratings[device2].rating
209
+ )
210
+ device_matches[device1]["opponent_rds"].append(ratings[device2].rd)
211
+ device_matches[device1]["outcomes"].append(outcome)
212
+ # For device 2
213
+ device_matches[device2]["opponent_ratings"].append(
214
+ ratings[device1].rating
215
+ )
216
+ device_matches[device2]["opponent_rds"].append(ratings[device1].rd)
217
+ device_matches[device2]["outcomes"].append(outcome)
218
+
219
+ # Update ratings after the model rating period
220
+ for device, matches in device_matches.items():
221
+ if matches[
222
+ "opponent_ratings"
223
+ ]: # Only update if the device had matches in this period
224
+ # Update the rating with the three separate lists that the API requires
225
+ ratings[device].update_player(
226
+ matches["opponent_ratings"], # List of opponent ratings
227
+ matches["opponent_rds"], # List of opponent rating deviations
228
+ matches["outcomes"], # List of outcomes
229
+ )
230
+
231
+ # Convert to DataFrame
232
+ ranking_data = []
233
+ for device, rating in ratings.items():
234
+ if match_counts[device] > 0: # Only include devices with matches
235
+ ranking_data.append(
236
+ {
237
+ "device": device,
238
+ "rating": rating.rating,
239
+ "rd": rating.rd, # rating deviation (uncertainty)
240
+ "volatility": rating.vol,
241
+ "matches": match_counts[device],
242
+ "wins": win_counts[device],
243
+ "losses": loss_counts[device],
244
+ # Conservative rating (95% confidence lower bound)
245
+ "conserv_rating": rating.rating - (2 * rating.rd),
246
+ }
247
+ )
248
+
249
+ # Create DataFrame
250
+ ranking_df = pd.DataFrame(ranking_data)
251
+
252
+ if len(ranking_df) > 0:
253
+ # Add win rate
254
+ ranking_df["win_rate"] = ranking_df["wins"] / ranking_df["matches"]
255
+
256
+ # Add platform information
257
+ ranking_df["Platform"] = pd.Series(
258
+ {
259
+ row["device"]: df[df["Normalized Device ID"] == row["device"]][
260
+ "Platform"
261
+ ].iloc[0]
262
+ for _, row in ranking_df.iterrows()
263
+ }
264
+ )
265
+
266
+ # Set device as index
267
+ ranking_df = ranking_df.set_index("device")
268
+
269
+ return ranking_df
270
+
271
+
272
+ def analyze_glicko2_rankings(
273
+ df: pd.DataFrame,
274
+ min_matches: int = 5,
275
+ min_gpu_layers: int = 20,
276
+ pp_config: int = 512,
277
+ tg_config: int = 128,
278
+ ) -> Tuple[pd.DataFrame, pd.DataFrame]:
279
+ """
280
+ Analyze and display ranking results with Glicko-2 ratings.
281
+
282
+ Args:
283
+ df: DataFrame containing benchmark data
284
+ min_matches: Minimum number of matches required for confident rankings
285
+ min_gpu_layers: Minimum number of GPU layers required for iOS devices
286
+ pp_config: Prompt Processing configuration to filter for
287
+ tg_config: Token Generation configuration to filter for
288
+
289
+ Returns:
290
+ Tuple of (all rankings DataFrame, confident rankings DataFrame)
291
+ """
292
+ # First filter the data
293
+ filtered_df = preprocess_benchmark_data(df, min_gpu_layers, pp_config, tg_config)
294
+ print(
295
+ f'Filtered number of devices: {filtered_df["Normalized Device ID"].nunique()}'
296
+ )
297
+ print(f"Filtered number of rows: {filtered_df.shape}")
298
+ print(f"Original number of rows: {df.shape}")
299
+
300
+ # Compute rankings for all three scenarios
301
+ combined_rankings = compute_glicko2_rankings(filtered_df, token_weight=0.6)
302
+ token_rankings = compute_glicko2_rankings(filtered_df, token_weight=1.0)
303
+ prompt_rankings = compute_glicko2_rankings(filtered_df, token_weight=0.0)
304
+
305
+ # Rename columns to avoid confusion
306
+ combined_rankings = combined_rankings.rename(
307
+ columns={
308
+ "rating": "combined_rating",
309
+ "rd": "combined_rd",
310
+ "volatility": "combined_vol",
311
+ "conserv_rating": "combined_conserv",
312
+ "wins": "combined_wins",
313
+ "losses": "combined_losses",
314
+ "win_rate": "combined_win_rate",
315
+ }
316
+ )
317
+
318
+ token_rankings = token_rankings.rename(
319
+ columns={
320
+ "rating": "token_rating",
321
+ "rd": "token_rd",
322
+ "volatility": "token_vol",
323
+ "conserv_rating": "token_conserv",
324
+ "wins": "token_wins",
325
+ "losses": "token_losses",
326
+ "win_rate": "token_win_rate",
327
+ }
328
+ )
329
+
330
+ prompt_rankings = prompt_rankings.rename(
331
+ columns={
332
+ "rating": "prompt_rating",
333
+ "rd": "prompt_rd",
334
+ "volatility": "prompt_vol",
335
+ "conserv_rating": "prompt_conserv",
336
+ "wins": "prompt_wins",
337
+ "losses": "prompt_losses",
338
+ "win_rate": "prompt_win_rate",
339
+ }
340
+ )
341
+
342
+ # Combine all rankings into one DataFrame
343
+ # We'll keep one set of match counts as they should be the same
344
+ rankings = combined_rankings.copy()
345
+
346
+ # Add token generation rankings
347
+ for col in [
348
+ "token_rating",
349
+ "token_rd",
350
+ "token_vol",
351
+ "token_conserv",
352
+ "token_wins",
353
+ "token_losses",
354
+ "token_win_rate",
355
+ ]:
356
+ rankings[col] = token_rankings[col]
357
+
358
+ # Add prompt processing rankings
359
+ for col in [
360
+ "prompt_rating",
361
+ "prompt_rd",
362
+ "prompt_vol",
363
+ "prompt_conserv",
364
+ "prompt_wins",
365
+ "prompt_losses",
366
+ "prompt_win_rate",
367
+ ]:
368
+ rankings[col] = prompt_rankings[col]
369
+
370
+ # Filter for minimum matches
371
+ confident_rankings = rankings[rankings["matches"] >= min_matches].sort_values(
372
+ "combined_rating", ascending=False
373
+ )
374
+
375
+ # Print statistics
376
+ print("\nRanking Statistics:")
377
+ print(f"Total devices ranked: {len(rankings)}")
378
+ print(f"Devices with {min_matches}+ matches: {len(confident_rankings)}")
379
+
380
+ print("\nTop 10 Devices:")
381
+ columns_to_show = [
382
+ "combined_rating",
383
+ "combined_rd",
384
+ "token_rating",
385
+ "prompt_rating",
386
+ "matches",
387
+ "Platform",
388
+ ]
389
+ print(confident_rankings[columns_to_show].head(10))
390
+
391
+ print("\nPlatform Statistics:")
392
+ platform_stats = confident_rankings.groupby("Platform").agg(
393
+ {
394
+ "combined_rating": ["count", "mean", "std"],
395
+ "token_rating": ["mean", "std"],
396
+ "prompt_rating": ["mean", "std"],
397
+ "matches": "mean",
398
+ "combined_win_rate": "mean",
399
+ }
400
+ )
401
+ print(platform_stats)
402
+
403
+ # Calculate correlations between different ratings
404
+ correlations = confident_rankings[
405
+ ["combined_rating", "token_rating", "prompt_rating"]
406
+ ].corr()
407
+ print("\nRating Correlations:")
408
+ print(correlations)
409
+
410
+ return rankings, confident_rankings
411
+
412
+
413
+ def analyze_device_glicko2_matches(
414
+ df: pd.DataFrame,
415
+ device_id1: str,
416
+ device_id2: Optional[str] = None,
417
+ token_weight: float = 0.6,
418
+ ) -> pd.DataFrame:
419
+ """
420
+ Analyze all matches for one or two specific devices using the Glicko-2 methodology.
421
+
422
+ Args:
423
+ df: DataFrame containing benchmark data
424
+ device_id1: First device ID to analyze
425
+ device_id2: Optional second device ID to compare against
426
+ token_weight: Weight for Token Generation in combined score (0.0 to 1.0)
427
+
428
+ Returns:
429
+ DataFrame containing detailed match information with win probabilities
430
+ """
431
+ matches = []
432
+
433
+ def normalize_scores(group: pd.DataFrame) -> Dict[str, Dict]:
434
+ """Normalize scores within a model group and return as dict"""
435
+ # Normalize Token Generation (higher is better)
436
+ token_min = group["Token Generation"].min()
437
+ token_max = group["Token Generation"].max()
438
+ token_range = token_max - token_min
439
+
440
+ # Normalize Prompt Processing (higher is better)
441
+ prompt_min = group["Prompt Processing"].min()
442
+ prompt_max = group["Prompt Processing"].max()
443
+ prompt_range = prompt_max - prompt_min
444
+
445
+ # Calculate normalized scores for each device
446
+ result = {}
447
+ for _, row in group.iterrows():
448
+ device_id = row["Normalized Device ID"]
449
+ if token_range > 0 and prompt_range > 0:
450
+ token_norm = (row["Token Generation"] - token_min) / token_range
451
+ prompt_norm = (row["Prompt Processing"] - prompt_min) / prompt_range
452
+ combined = token_weight * token_norm + (1 - token_weight) * prompt_norm
453
+ result[device_id] = {
454
+ "token_norm": token_norm,
455
+ "prompt_norm": prompt_norm,
456
+ "combined": combined,
457
+ }
458
+ return result
459
+
460
+ # Group by Model ID to compare within same models
461
+ for model, group in df.groupby("Model ID"):
462
+ if device_id1 not in group["Normalized Device ID"].values:
463
+ continue
464
+
465
+ device1_data = group[group["Normalized Device ID"] == device_id1].iloc[0]
466
+
467
+ # If device2 specified, only compare those two
468
+ if device_id2 is not None:
469
+ if device_id2 not in group["Normalized Device ID"].values:
470
+ continue
471
+ devices_to_compare = [device_id2]
472
+ else:
473
+ devices_to_compare = [
474
+ d for d in group["Normalized Device ID"].unique() if d != device_id1
475
+ ]
476
+
477
+ # Get normalized scores
478
+ norm_scores = normalize_scores(group)
479
+
480
+ # Compare with other devices
481
+ for other_device in devices_to_compare:
482
+ device2_data = group[group["Normalized Device ID"] == other_device].iloc[0]
483
+
484
+ # Skip if normalization failed
485
+ if device_id1 not in norm_scores or other_device not in norm_scores:
486
+ continue
487
+
488
+ # Get normalized scores
489
+ scores1 = norm_scores[device_id1]
490
+ scores2 = norm_scores[other_device]
491
+
492
+ # Initialize Glicko-2 players for demonstration purposes
493
+ p1 = glicko2.Player() # Default rating (1500, 350, 0.06)
494
+ p2 = glicko2.Player()
495
+
496
+ # Calculate win probability using Glicko-2 formulas
497
+ # We need to use the expect_score method, which takes a single player as input
498
+ token_prob = p1.expect_score(p2.rating, p2.rd) # Properly use the method
499
+ prompt_prob = p1.expect_score(p2.rating, p2.rd)
500
+ combined_prob = p1.expect_score(p2.rating, p2.rd)
501
+
502
+ # Determine winners
503
+ token_winner = (
504
+ device_id1
505
+ if device1_data["Token Generation"] > device2_data["Token Generation"]
506
+ else (
507
+ other_device
508
+ if device2_data["Token Generation"]
509
+ > device1_data["Token Generation"]
510
+ else "Tie"
511
+ )
512
+ )
513
+ prompt_winner = (
514
+ device_id1
515
+ if device1_data["Prompt Processing"] > device2_data["Prompt Processing"]
516
+ else (
517
+ other_device
518
+ if device2_data["Prompt Processing"]
519
+ > device1_data["Prompt Processing"]
520
+ else "Tie"
521
+ )
522
+ )
523
+ combined_winner = (
524
+ device_id1
525
+ if scores1["combined"] > scores2["combined"]
526
+ else (
527
+ other_device if scores2["combined"] > scores1["combined"] else "Tie"
528
+ )
529
+ )
530
+
531
+ matches.append(
532
+ {
533
+ "Model": model,
534
+ "Device 1": device_id1,
535
+ "Device 2": other_device,
536
+ "n_gpu_layers 1": device1_data["n_gpu_layers"],
537
+ "n_gpu_layers 2": device2_data["n_gpu_layers"],
538
+ "Token Generation 1": device1_data["Token Generation"],
539
+ "Token Generation 2": device2_data["Token Generation"],
540
+ "Token Winner": token_winner,
541
+ "Token Win Prob": token_prob,
542
+ "Prompt Processing 1": device1_data["Prompt Processing"],
543
+ "Prompt Processing 2": device2_data["Prompt Processing"],
544
+ "Prompt Winner": prompt_winner,
545
+ "Prompt Win Prob": prompt_prob,
546
+ "Combined Winner": combined_winner,
547
+ "Combined Win Prob": combined_prob,
548
+ "Platform 1": device1_data["Platform"],
549
+ "Platform 2": device2_data["Platform"],
550
+ }
551
+ )
552
+
553
+ matches_df = pd.DataFrame(matches)
554
+
555
+ if len(matches_df) > 0:
556
+ # Add summary statistics
557
+ print(f"\nMatch Summary for {device_id1}:")
558
+ print(f"n_gpu_layers for Device 1: {matches_df['n_gpu_layers 1'].iloc[0]}")
559
+ if device_id2:
560
+ print(f"Total matches against {device_id2}: {len(matches_df)}")
561
+ print(f"n_gpu_layers for Device 2: {matches_df['n_gpu_layers 2'].iloc[0]}")
562
+ else:
563
+ print(f"Total matches: {len(matches_df)}")
564
+ print("\nOpponent n_gpu_layers distribution:")
565
+ print(matches_df["n_gpu_layers 2"].value_counts().sort_index())
566
+
567
+ token_wins = sum(matches_df["Token Winner"] == device_id1)
568
+ prompt_wins = sum(matches_df["Prompt Winner"] == device_id1)
569
+ combined_wins = sum(matches_df["Combined Winner"] == device_id1)
570
+
571
+ print(
572
+ f"\nToken Generation Wins: {token_wins} ({token_wins/len(matches_df)*100:.1f}%)"
573
+ )
574
+ print(
575
+ f"Prompt Processing Wins: {prompt_wins} ({prompt_wins/len(matches_df)*100:.1f}%)"
576
+ )
577
+ print(
578
+ f"Combined Wins: {combined_wins} ({combined_wins/len(matches_df)*100:.1f}%)"
579
+ )
580
+
581
+ # Platform breakdown
582
+ print("\nMatches by Platform:")
583
+ platform_counts = matches_df["Platform 2"].value_counts()
584
+ print(platform_counts)
585
+
586
+ # Show detailed matches
587
+ print("\nDetailed Matches:")
588
+ display_cols = [
589
+ "Model",
590
+ "Device 2",
591
+ "Platform 2",
592
+ "n_gpu_layers 1",
593
+ "n_gpu_layers 2",
594
+ "Token Generation 1",
595
+ "Token Generation 2",
596
+ "Token Winner",
597
+ "Prompt Processing 1",
598
+ "Prompt Processing 2",
599
+ "Prompt Winner",
600
+ ]
601
+ print(matches_df[display_cols])
602
+
603
+ return matches_df
604
+ else:
605
+ print(
606
+ f"No matches found for device {device_id1}"
607
+ + (f" against {device_id2}" if device_id2 else "")
608
+ )
609
+ return pd.DataFrame()
610
+
611
+
612
+ if __name__ == "__main__":
613
+ # Example usage
614
+ print("This module provides Glicko-2 ranking for device performance.")
615
+ print("Import and use the functions in your own code.")
616
+ print("Example:")
617
+ print(" from glicko2_ranking import analyze_glicko2_rankings")
618
+ print(" rankings, confident_rankings = analyze_glicko2_rankings(df)")