---
title: AusCyberBench Evaluation Dashboard
emoji: πŸ›‘οΈ
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

πŸ‡¦πŸ‡Ί AusCyberBench Evaluation Dashboard

Australia's First LLM Cybersecurity Benchmark

An interactive dashboard for evaluating language models on Australian cybersecurity knowledge, regulations, and threat intelligence.

πŸ†• What's New (October 2025)

  • 32 Tested Models - Focus on proven, stable models for reliable evaluation
  • βœ… Recommended Category - 7 models with verified performance (DeepSeek 55%+, TinyLlama 33%+)
  • Enhanced Visuals - Australian color-coded charts with gold/green ranking system
  • Better Stability - Removed experimental models causing compatibility issues
  • Improved UI - Quick selection presets for recommended, security, and size-based filtering
  • Memory Optimized - Better GPU management for HuggingFace Spaces

About AusCyberBench

AusCyberBench is a comprehensive benchmark dataset containing 13,449 tasks across six critical categories:

πŸ“‹ Categories

  • πŸ›‘οΈ Regulatory: Essential Eight (2,558 tasks)

    • ACSC's baseline cybersecurity mitigation strategies
    • Maturity levels 1-3 across 8 mitigation strategies
    • Application whitelisting, patching, MFA, backups, etc.
  • πŸ“œ Regulatory: ISM Controls (7,200 tasks)

    • Information Security Manual control requirements
    • Commonwealth entity security obligations
    • Control effectiveness, implementation, and compliance
  • πŸ”’ Regulatory: Privacy Act (204 tasks)

    • Australian Privacy Principles (APPs)
    • Data protection and privacy obligations
    • Notifiable Data Breaches (NDB) scheme
  • ⚑ Regulatory: SOCI Act (240 tasks)

    • Security of Critical Infrastructure Act 2018
    • Critical infrastructure risk management
    • Sector-specific obligations
  • 🎯 Knowledge: Threat Intelligence (2,520 tasks)

    • ACSC threat reports and advisories
    • Australian threat landscape
    • Cyber incident response
  • πŸ“š Knowledge: Terminology (727 tasks)

    • Australian cybersecurity terminology
    • ACSC glossary and definitions
    • Industry-specific language

Features

πŸ€– 32 Pre-Configured Models (Tested & Stable)

Evaluate across diverse model categories with proven, reliable models:

βœ… Recommended (Tested) - 7 models

Models with verified performance on AusCyberBench:

  • Phi-3 & Phi-3.5 - Microsoft's efficient models (proven stable)
  • Gemma-2-2b - Google's compact model (tested)
  • Qwen2.5 (3B, 7B) - Alibaba's reliable models (good performance)
  • DeepSeek LLM-7B - Previously achieved 55.6% accuracy ⭐
  • TinyLlama-1.1B - Previously achieved 33.3% accuracy

πŸ›‘οΈ Cybersecurity-Focused - 5 models

  • DeepSeek Coder - Code-focused with security awareness
  • WizardCoder-Python - Advanced code understanding
  • StarCoder2 - BigCode's latest model
  • CodeLlama - Meta's code specialist
  • CodeGen25 - Salesforce's code model

Small Models (1-4B) - 7 models

Phi-3 series, Gemma-2, Qwen2.5, Llama 3.2, StableLM, TinyLlama

Medium Models (7-12B) - 6 models

Mistral, Qwen2.5, Llama 3.1, Gemma-2-9b, Mistral-Nemo, Yi

Reasoning & Analysis - 4 models

DeepSeek LLM, SOLAR, Hermes-3, Qwen2.5-14B

Multilingual & Diverse - 3 models

Falcon, OpenChat, OpenHermes

⚑ Quick Selection Presets

  • βœ… Recommended (7) - Tested models with verified performance
  • πŸ›‘οΈ Security Focus (5) - Code and cybersecurity specialists
  • Small/Medium - Size-based selection (7/6 models)
  • Select All (32) - Comprehensive evaluation
  • Clear All - Reset selection

🎯 Customisable Evaluation

  • Sample size: 10-500 tasks (default: 10 for testing multiple models)
  • ⚠️ GPU Limits: Free tier has 60s timeout - test 1-2 models at a time for best results
  • 4-bit quantisation: Reduce memory usage for larger models
  • Temperature: Control response randomness (0.1-1.0)
  • Max tokens: Limit response length (32-256)

πŸ“Š Real-Time Results

  • Live leaderboard with rankings (πŸ₯‡πŸ₯ˆπŸ₯‰)
  • Model comparison visualisation in Australian colours
  • Per-category performance breakdown
  • Downloadable results (JSON format)

Usage

πŸ’Ύ Persistent Leaderboard Feature

NEW: Results now persist across sessions! This solves the GPU timeout issue:

  • Run models one at a time to avoid timeouts
  • Each run merges with previous results
  • Best score per model is automatically kept
  • Build a comprehensive leaderboard incrementally
  • Perfect for the 60-second free tier limit

Workflow:

  1. Select 1-2 models and run evaluation
  2. Results automatically save and merge with leaderboard
  3. Select different models and run again
  4. Leaderboard updates with all results
  5. Use "Clear All Results" button to start fresh
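The merge-and-keep-best behaviour above can be sketched as follows. This is a minimal sketch: the `leaderboard.json` filename and the flat `{model: accuracy}` shape are assumptions for illustration, not the app's actual storage format.

```python
import json
from pathlib import Path

# Assumed location for the persistent leaderboard file.
LEADERBOARD_PATH = Path("leaderboard.json")

def merge_results(new_results: dict, path: Path = LEADERBOARD_PATH) -> dict:
    """Merge a run's {model: accuracy} scores into the persistent
    leaderboard, keeping the best score seen so far for each model."""
    board = {}
    if path.exists():
        board = json.loads(path.read_text())
    for model, acc in new_results.items():
        # Keep whichever score is higher: the stored one or this run's.
        board[model] = max(acc, board.get(model, 0.0))
    path.write_text(json.dumps(board, indent=2))
    return board
```

Because each call re-reads and re-writes the file, runs of 1-2 models at a time accumulate into one leaderboard across sessions.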

Standard Usage

  1. Select Models: Use checkboxes or quick selection buttons
  2. Configure Settings: Adjust sample size, quantisation, temperature
  3. Run Evaluation: Click "πŸš€ Run Evaluation"
  4. Monitor Progress: Watch real-time progress and intermediate results
  5. Analyse Results: Review persistent leaderboard, charts, and category breakdowns
  6. Download: Export results for further analysis

Dataset

The benchmark is available on HuggingFace:

πŸ”— Zen0/AusCyberBench

Dataset Splits

  • Full: All 13,449 tasks across all categories
  • Australian: 4,899 Australia-specific tasks

Evaluation Methodology

Prompt Formatting

Model-specific chat templates are applied so each model receives the prompt format it was trained on:

  • Phi-3/Phi-3.5: <|user|>...<|end|>\n<|assistant|>
  • Gemma-2: <start_of_turn>user\n...<end_of_turn>\n<start_of_turn>model
  • Generic (Llama, Mistral, Qwen, etc.): [INST] ... [/INST]
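The template dispatch above can be sketched like this. It is a simplified sketch that matches on the model name; production code might instead use the tokeniser's own `apply_chat_template`.

```python
def format_prompt(model_name: str, question: str) -> str:
    """Wrap a benchmark question in the chat template the model expects."""
    name = model_name.lower()
    if "phi-3" in name:  # also matches Phi-3.5
        return f"<|user|>{question}<|end|>\n<|assistant|>"
    if "gemma-2" in name:
        return (f"<start_of_turn>user\n{question}<end_of_turn>\n"
                f"<start_of_turn>model\n")
    # Generic fallback for Llama, Mistral, Qwen, etc.
    return f"[INST] {question} [/INST]"
```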

Answer Extraction

Robust extraction for multiple-choice tasks:

  • Primary: Regex pattern \b([A-D])\b matching
  • Fallback: First character validation
  • Handles various response formats
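The two-stage extraction can be sketched as below: the regex from the list is the primary path, with a first-character fallback when no standalone letter is found.

```python
import re

def extract_answer(response: str):
    """Pull the chosen multiple-choice letter (A-D) from a model response.
    Primary: first standalone uppercase A-D; fallback: first character."""
    match = re.search(r"\b([A-D])\b", response)
    if match:
        return match.group(1)
    first = response.strip()[:1].upper()
    return first if first and first in "ABCD" else None
```

This handles verbose answers ("The correct answer is B."), bare letters ("d"), and option-style replies ("C) ...") alike, returning `None` when nothing recoverable is present.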

Memory Management

Automatic cleanup between models:

  • Model and tokeniser deletion
  • CUDA cache clearing
  • Garbage collection
  • Prevents OOM errors on GPU instances
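The cleanup steps above amount to a helper like the following. This is a sketch under assumed names (`state`, `free_model`); the torch import is kept optional so it also runs on CPU-only hosts.

```python
import gc

def free_model(state: dict) -> None:
    """Drop model and tokeniser references, then force memory reclamation
    so the next model can load without hitting GPU OOM."""
    state.pop("model", None)
    state.pop("tokenizer", None)
    gc.collect()  # release Python-side objects first
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached CUDA blocks to the driver
    except ImportError:
        pass  # CPU-only environment: garbage collection alone suffices
```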

Performance Expectations

Based on verified benchmarking with tested models:

  • βœ… Recommended Models: 30-56% accuracy (DeepSeek LLM: 55.6%, TinyLlama: 33.3%)
  • Cybersecurity-Focused: 20-40% accuracy (code models show domain understanding)
  • Small Models (1-4B): 10-40% accuracy (Phi-3, Qwen2.5 perform well)
  • Medium Models (7-12B): 25-45% accuracy (Mistral, Llama 3.1 strong performers)
  • Reasoning Models: 30-50% accuracy (DeepSeek, SOLAR excel at complex tasks)

Performance varies significantly by category:

  • Essential Eight: Higher scores (25-50%) - well-documented standards
  • ISM Controls: Moderate scores (15-35%) - detailed technical requirements
  • Terminology: Good scores (20-40%) - definition-based tasks
  • Threat Intelligence: Variable (15-45%) - requires current knowledge
  • Privacy Act / SOCI Act: Challenging (15-35%) - complex regulatory understanding

Technical Requirements

This Space requires GPU hardware for model inference.

⚑ ZeroGPU Free Tier Limitations

60-Second Timeout: Free tier has a strict 60-second limit per evaluation session.

Best Practices:

  • βœ… Test 1-2 models at a time with 10 tasks each (~30-40 seconds total)
  • ⚠️ Avoid selecting 5+ models in one run (will timeout midway)
  • βœ… Use 4-bit quantisation for 7B+ models to speed up inference
  • βœ… Run separate evaluations for thorough testing across many models

Example Timing:

  • 1 model Γ— 10 tasks: ~15-25 seconds βœ…
  • 2 models Γ— 10 tasks: ~30-50 seconds βœ…
  • 5 models Γ— 10 tasks: ~75-125 seconds ❌ Will timeout

For comprehensive multi-model benchmarking, run evaluations sequentially rather than all at once.
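The timing arithmetic above can be checked with a small estimator. The ~2 seconds-per-task figure is an assumption consistent with the example timings listed, not a measured constant; calibrate it against your own runs.

```python
def estimated_seconds(n_models: int, n_tasks: int,
                      secs_per_task: float = 2.0) -> float:
    """Rough wall-clock estimate for an evaluation run.
    secs_per_task (~2 s) is an assumed average, not a measured value."""
    return n_models * n_tasks * secs_per_task

def fits_free_tier(n_models: int, n_tasks: int, limit_s: float = 60.0) -> bool:
    """True if the estimated run should finish inside the 60 s ZeroGPU limit."""
    return estimated_seconds(n_models, n_tasks) <= limit_s
```

Under these assumptions, 1-2 models Γ— 10 tasks fit the budget, while 5 models Γ— 10 tasks (~100 s) do not, matching the examples above.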

Citation

If you use AusCyberBench in your research, please cite:

@dataset{auscyberbench2025,
  title={AusCyberBench: Australia's First LLM Cybersecurity Benchmark},
  author={Zen0},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Zen0/AusCyberBench}
}

License

MIT License - See LICENSE file for details

Acknowledgements

  • Australian Cyber Security Centre (ACSC) for Essential Eight, ISM, and threat intelligence
  • Office of the Australian Information Commissioner (OAIC) for Privacy Act guidance
  • Department of Home Affairs for SOCI Act resources
  • HuggingFace for infrastructure and model hosting

Built with Australian orthography πŸ‡¦πŸ‡Ί

Visualise β€’ Analyse β€’ Optimise β€’ Quantisation