---
title: AusCyberBench Evaluation Dashboard
emoji: 🛡️
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---
# 🇦🇺 AusCyberBench Evaluation Dashboard
**Australia's First LLM Cybersecurity Benchmark**
An interactive dashboard for evaluating language models on Australian cybersecurity knowledge, regulations, and threat intelligence.
## What's New (October 2025)
- 32 Tested Models - Focus on proven, stable models for reliable evaluation
- ✅ Recommended Category - 7 models with verified performance (DeepSeek 55%+, TinyLlama 33%+)
- Enhanced Visuals - Australian color-coded charts with gold/green ranking system
- Better Stability - Removed experimental models causing compatibility issues
- Improved UI - Quick selection presets for recommended, security, and size-based filtering
- Memory Optimized - Better GPU management for HuggingFace Spaces
## About AusCyberBench
AusCyberBench is a comprehensive benchmark dataset containing 13,449 tasks across six critical categories:
### Categories
#### 🛡️ Regulatory: Essential Eight (2,558 tasks)
- ACSC's baseline cybersecurity mitigation strategies
- Maturity levels 1-3 across 8 mitigation strategies
- Application whitelisting, patching, MFA, backups, etc.
#### Regulatory: ISM Controls (7,200 tasks)
- Information Security Manual control requirements
- Commonwealth entity security obligations
- Control effectiveness, implementation, and compliance
#### Regulatory: Privacy Act (204 tasks)
- Australian Privacy Principles (APPs)
- Data protection and privacy obligations
- Notifiable Data Breaches (NDB) scheme
#### ⚡ Regulatory: SOCI Act (240 tasks)
- Security of Critical Infrastructure Act 2018
- Critical infrastructure risk management
- Sector-specific obligations
#### 🎯 Knowledge: Threat Intelligence (2,520 tasks)
- ACSC threat reports and advisories
- Australian threat landscape
- Cyber incident response
#### Knowledge: Terminology (727 tasks)
- Australian cybersecurity terminology
- ACSC glossary and definitions
- Industry-specific language
## Features
### 🤖 32 Pre-Configured Models (Tested & Stable)
Evaluate across diverse model categories with proven, reliable models:
#### ✅ Recommended (Tested) - 7 models
Models with verified performance on AusCyberBench:
- Phi-3 & Phi-3.5 - Microsoft's efficient models (proven stable)
- Gemma-2-2b - Google's compact model (tested)
- Qwen2.5 (3B, 7B) - Alibaba's reliable models (good performance)
- DeepSeek LLM-7B - Previously achieved 55.6% accuracy
- TinyLlama-1.1B - Previously achieved 33.3% accuracy
#### 🛡️ Cybersecurity-Focused - 5 models
- DeepSeek Coder - Code-focused with security awareness
- WizardCoder-Python - Advanced code understanding
- StarCoder2 - BigCode's latest model
- CodeLlama - Meta's code specialist
- CodeGen25 - Salesforce's code model
#### Small Models (1-4B) - 7 models
Phi-3 series, Gemma-2, Qwen2.5, Llama 3.2, StableLM, TinyLlama
#### Medium Models (7-12B) - 6 models
Mistral, Qwen2.5, Llama 3.1, Gemma-2-9b, Mistral-Nemo, Yi
#### Reasoning & Analysis - 4 models
DeepSeek LLM, SOLAR, Hermes-3, Qwen2.5-14B
#### Multilingual & Diverse - 3 models
Falcon, OpenChat, OpenHermes
### ⚡ Quick Selection Presets
- ✅ Recommended (7) - Tested models with verified performance
- 🛡️ Security Focus (5) - Code and cybersecurity specialists
- Small/Medium - Size-based selection (7/6 models)
- Select All (32) - Comprehensive evaluation
- Clear All - Reset selection
### 🎯 Customisable Evaluation
- Sample size: 10-500 tasks (default: 10 for testing multiple models)
- ⚠️ GPU Limits: Free tier has a 60-second timeout - test 1-2 models at a time for best results
- 4-bit quantisation: Reduce memory usage for larger models (see the sketch after this list)
- Temperature: Control response randomness (0.1-1.0)
- Max tokens: Limit response length (32-256)
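Below is a minimal sketch of how these settings could map onto a `transformers` model load and generation call, assuming a 4-bit `BitsAndBytesConfig`; the model ID and helper function are illustrative examples, not the dashboard's actual code.

```python
# Sketch only: illustrative model ID and helper, not the dashboard's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # example 7B model from the list above

# 4-bit quantisation keeps 7B+ models within Space GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

def generate_answer(prompt: str, temperature: float = 0.1, max_new_tokens: int = 64) -> str:
    """Generate a short answer using the dashboard-style settings."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,        # UI range 0.1-1.0
        max_new_tokens=max_new_tokens,  # UI range 32-256
    )
    # Strip the prompt tokens and decode only the generated continuation
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```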
### Real-Time Results
- Live leaderboard with rankings (🥇🥈🥉)
- Model comparison visualisation in Australian colours
- Per-category performance breakdown
- Downloadable results (JSON format; a breakdown and export sketch follows this list)
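A small sketch of the per-category breakdown and JSON export, assuming each per-task result is a dict with `category` and `correct` fields; the actual result schema used by the app may differ.

```python
import json
from collections import defaultdict

def category_breakdown(results: list[dict]) -> dict[str, float]:
    """Compute accuracy per benchmark category from per-task results."""
    totals, correct = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        correct[r["category"]] += int(r["correct"])
    return {cat: correct[cat] / totals[cat] for cat in totals}

def export_results(results: list[dict], path: str = "results.json") -> None:
    """Write raw results plus the category breakdown to a downloadable JSON file."""
    payload = {"results": results, "by_category": category_breakdown(results)}
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
```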
## Usage
### 💾 Persistent Leaderboard Feature
NEW: Results now persist across sessions! This solves the GPU timeout issue:
- Run models one at a time to avoid timeouts
- Each run merges with previous results
- Best score per model is automatically kept
- Build a comprehensive leaderboard incrementally
- Perfect for the 60-second free tier limit
Workflow:
1. Select 1-2 models and run evaluation
2. Results automatically save and merge with the leaderboard (a minimal merge sketch follows this list)
3. Select different models and run again
4. Leaderboard updates with all results
5. Use the "Clear All Results" button to start fresh
### Standard Usage
1. Select Models: Use checkboxes or quick selection buttons
2. Configure Settings: Adjust sample size, quantisation, temperature
3. Run Evaluation: Click the "Run Evaluation" button
4. Monitor Progress: Watch real-time progress and intermediate results
5. Analyse Results: Review the persistent leaderboard, charts, and category breakdowns
6. Download: Export results for further analysis
## Dataset
The benchmark is available on HuggingFace (a loading sketch follows the split list):
[Zen0/AusCyberBench](https://huggingface.co/datasets/Zen0/AusCyberBench)
### Dataset Splits
- Full: All 13,449 tasks across all categories
- Australian: 4,899 Australia-specific tasks
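A hedged loading sketch using the `datasets` library; how the Full and Australian subsets are exposed (as splits, configs, or a filterable column) is an assumption here, so check the dataset card for the exact layout.

```python
from datasets import load_dataset

# Load the benchmark; split/config names are assumptions - consult the dataset card.
dataset = load_dataset("Zen0/AusCyberBench")
print(dataset)  # inspect available splits, columns, and task counts

# Draw a small random sample, mirroring the dashboard's default of 10 tasks
first_split = next(iter(dataset.values()))
sample = first_split.shuffle(seed=42).select(range(10))
```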
## Evaluation Methodology
### Prompt Formatting
Model-specific chat templates ensure optimal performance (a dispatch sketch follows the list):
- Phi-3/Phi-3.5: `<|user|>...<|end|>\n<|assistant|>`
- Gemma-2: `<start_of_turn>user\n...<end_of_turn>\n<start_of_turn>model`
- Generic (Llama, Mistral, Qwen, etc.): `[INST] ... [/INST]`
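A small dispatch sketch over the templates above; the string matching on model names is illustrative, and the app may instead use each tokenizer's built-in chat template.

```python
def format_prompt(question: str, model_id: str) -> str:
    """Wrap a benchmark question in the chat template for the model family."""
    name = model_id.lower()
    if "phi-3" in name:
        return f"<|user|>{question}<|end|>\n<|assistant|>"
    if "gemma-2" in name:
        return f"<start_of_turn>user\n{question}<end_of_turn>\n<start_of_turn>model\n"
    # Generic instruction format (Llama, Mistral, Qwen, etc.)
    return f"[INST] {question} [/INST]"
```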
### Answer Extraction
Robust extraction for multiple-choice tasks (a sketch follows the list):
- Primary: Regex match on the pattern `\b([A-D])\b`
- Fallback: First-character validation
- Handles various response formats
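A sketch of that extraction order, with a hypothetical `extract_answer` helper: the regex for an isolated A-D letter runs first, then the first-character fallback.

```python
import re

def extract_answer(response: str) -> str | None:
    """Pull a multiple-choice letter (A-D) out of a free-form model response."""
    match = re.search(r"\b([A-D])\b", response.strip().upper())
    if match:
        return match.group(1)
    # Fallback: accept the first character if it is a valid option letter
    first = response.strip()[:1].upper()
    return first if first in {"A", "B", "C", "D"} else None
```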
### Memory Management
Automatic cleanup between models (a sketch follows the list):
- Model and tokeniser deletion
- CUDA cache clearing
- Garbage collection
- Prevents OOM errors on GPU instances
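A sketch of the cleanup step between models; it assumes the previous run's model and tokeniser objects are handed in and that the caller drops its own references as well.

```python
import gc
import torch

def release_model(model, tokenizer) -> None:
    """Free GPU memory before loading the next model."""
    del model
    del tokenizer
    gc.collect()                  # collect the now-unreferenced objects
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached CUDA blocks to the driver
```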
## Performance Expectations
Based on verified benchmarking with tested models:
- ✅ Recommended Models: 30-56% accuracy (DeepSeek LLM: 55.6%, TinyLlama: 33.3%)
- Cybersecurity-Focused: 20-40% accuracy (code models show domain understanding)
- Small Models (1-4B): 10-40% accuracy (Phi-3, Qwen2.5 perform well)
- Medium Models (7-12B): 25-45% accuracy (Mistral, Llama 3.1 strong performers)
- Reasoning Models: 30-50% accuracy (DeepSeek, SOLAR excel at complex tasks)
Performance varies significantly by category:
- Essential Eight: Higher scores (25-50%) - well-documented standards
- ISM Controls: Moderate scores (15-35%) - detailed technical requirements
- Terminology: Good scores (20-40%) - definition-based tasks
- Threat Intelligence: Variable (15-45%) - requires current knowledge
- Privacy Act / SOCI Act: Challenging (15-35%) - complex regulatory understanding
## Technical Requirements
This Space requires GPU hardware for model inference.
### ⚡ ZeroGPU Free Tier Limitations
60-Second Timeout: The free tier has a strict 60-second limit per evaluation session.
Best Practices:
- ✅ Test 1-2 models at a time with 10 tasks each (~30-40 seconds total)
- ⚠️ Avoid selecting 5+ models in one run (it will time out midway)
- ✅ Use 4-bit quantisation for 7B+ models to speed up inference
- ✅ Run separate evaluations for thorough testing across many models
Example Timing:
- 1 model × 10 tasks: ~15-25 seconds ✅
- 2 models × 10 tasks: ~30-50 seconds ✅
- 5 models × 10 tasks: ~75-125 seconds ❌ (will time out)
For comprehensive multi-model benchmarking, run evaluations sequentially rather than all at once.
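As a back-of-the-envelope check derived from the timings above (roughly 1.5-2.5 seconds per task), the sketch below estimates whether a planned run fits the 60-second window; the per-task figure is an approximation, not a guarantee.

```python
def fits_free_tier(n_models: int, n_tasks: int,
                   secs_per_task: float = 2.5, budget_secs: float = 60.0) -> bool:
    """Estimate whether a run stays inside the ZeroGPU free-tier time budget."""
    return n_models * n_tasks * secs_per_task <= budget_secs

print(fits_free_tier(2, 10))  # True  (~50 s at the pessimistic per-task estimate)
print(fits_free_tier(5, 10))  # False (~125 s, will time out)
```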
## Citation
If you use AusCyberBench in your research, please cite:
```bibtex
@dataset{auscyberbench2025,
  title={AusCyberBench: Australia's First LLM Cybersecurity Benchmark},
  author={Zen0},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Zen0/AusCyberBench}
}
```
## License
MIT License - See LICENSE file for details
## Acknowledgements
- Australian Cyber Security Centre (ACSC) for Essential Eight, ISM, and threat intelligence
- Office of the Australian Information Commissioner (OAIC) for Privacy Act guidance
- Department of Home Affairs for SOCI Act resources
- HuggingFace for infrastructure and model hosting
Built with Australian orthography 🇦🇺
Visualise • Analyse • Optimise • Quantisation