---
title: AusCyberBench Evaluation Dashboard
emoji: 🛡️
colorFrom: green
colorTo: yellow
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---
# 🇦🇺 AusCyberBench Evaluation Dashboard
**Australia's First LLM Cybersecurity Benchmark**
An interactive dashboard for evaluating language models on Australian cybersecurity knowledge, regulations, and threat intelligence.
## What's New (October 2025)
- 32 Tested Models - Focus on proven, stable models for reliable evaluation
- ✅ Recommended Category - 7 models with verified performance (DeepSeek 55%+, TinyLlama 33%+)
- Enhanced Visuals - Australian color-coded charts with gold/green ranking system
- Better Stability - Removed experimental models causing compatibility issues
- Improved UI - Quick selection presets for recommended, security, and size-based filtering
- Memory Optimized - Better GPU management for HuggingFace Spaces
## About AusCyberBench
AusCyberBench is a comprehensive benchmark dataset containing 13,449 tasks across six critical categories:
### Categories
#### 🛡️ Regulatory: Essential Eight (2,558 tasks)
- ACSC's baseline cybersecurity mitigation strategies
- Maturity levels 1-3 across 8 mitigation strategies
- Application whitelisting, patching, MFA, backups, etc.
#### Regulatory: ISM Controls (7,200 tasks)
- Information Security Manual control requirements
- Commonwealth entity security obligations
- Control effectiveness, implementation, and compliance
#### Regulatory: Privacy Act (204 tasks)
- Australian Privacy Principles (APPs)
- Data protection and privacy obligations
- Notifiable Data Breaches (NDB) scheme
#### ⚡ Regulatory: SOCI Act (240 tasks)
- Security of Critical Infrastructure Act 2018
- Critical infrastructure risk management
- Sector-specific obligations
#### 🎯 Knowledge: Threat Intelligence (2,520 tasks)
- ACSC threat reports and advisories
- Australian threat landscape
- Cyber incident response
#### Knowledge: Terminology (727 tasks)
- Australian cybersecurity terminology
- ACSC glossary and definitions
- Industry-specific language
## Features
### 🤖 32 Pre-Configured Models (Tested & Stable)
Evaluate across diverse model categories with proven, reliable models:
#### ✅ Recommended (Tested) - 7 models
Models with verified performance on AusCyberBench:
- Phi-3 & Phi-3.5 - Microsoft's efficient models (proven stable)
- Gemma-2-2b - Google's compact model (tested)
- Qwen2.5 (3B, 7B) - Alibaba's reliable models (good performance)
- DeepSeek LLM-7B - Previously achieved 55.6% accuracy
- TinyLlama-1.1B - Previously achieved 33.3% accuracy
#### 🛡️ Cybersecurity-Focused - 5 models
- DeepSeek Coder - Code-focused with security awareness
- WizardCoder-Python - Advanced code understanding
- StarCoder2 - BigCode's latest model
- CodeLlama - Meta's code specialist
- CodeGen25 - Salesforce's code model
#### Small Models (1-4B) - 7 models
Phi-3 series, Gemma-2, Qwen2.5, Llama 3.2, StableLM, TinyLlama
#### Medium Models (7-12B) - 6 models
Mistral, Qwen2.5, Llama 3.1, Gemma-2-9b, Mistral-Nemo, Yi
#### Reasoning & Analysis - 4 models
DeepSeek LLM, SOLAR, Hermes-3, Qwen2.5-14B
#### Multilingual & Diverse - 3 models
Falcon, OpenChat, OpenHermes
### ⚡ Quick Selection Presets
- ✅ Recommended (7) - Tested models with verified performance
- 🛡️ Security Focus (5) - Code and cybersecurity specialists
- Small/Medium - Size-based selection (7/6 models)
- Select All (32) - Comprehensive evaluation
- Clear All - Reset selection
### 🎯 Customisable Evaluation
- Sample size: 10-500 tasks (default: 10 for testing multiple models)
- ⚠️ GPU Limits: Free tier has a 60-second timeout - test 1-2 models at a time for best results
- 4-bit quantisation: Reduce memory usage for larger models (see the sketch after this list)
- Temperature: Control response randomness (0.1-1.0)
- Max tokens: Limit response length (32-256)
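Below is a minimal sketch of how these settings could map onto a `transformers` model load and generation call, assuming a 4-bit `BitsAndBytesConfig`; the model ID and helper function are illustrative examples, not the dashboard's actual code.

```python
# Sketch only: illustrative model ID and helper, not the dashboard's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # example 7B model from the list above

# 4-bit quantisation keeps 7B+ models within Space GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
)

def generate_answer(prompt: str, temperature: float = 0.1, max_new_tokens: int = 64) -> str:
    """Generate a short answer using the dashboard-style settings."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,        # UI range 0.1-1.0
        max_new_tokens=max_new_tokens,  # UI range 32-256
    )
    # Strip the prompt tokens and decode only the generated continuation
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```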
### Real-Time Results
- Live leaderboard with rankings (🥇🥈🥉)
- Model comparison visualisation in Australian colours
- Per-category performance breakdown
- Downloadable results (JSON format; a breakdown and export sketch follows this list)
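A small sketch of the per-category breakdown and JSON export, assuming each per-task result is a dict with `category` and `correct` fields; the actual result schema used by the app may differ.

```python
import json
from collections import defaultdict

def category_breakdown(results: list[dict]) -> dict[str, float]:
    """Compute accuracy per benchmark category from per-task results."""
    totals, correct = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        correct[r["category"]] += int(r["correct"])
    return {cat: correct[cat] / totals[cat] for cat in totals}

def export_results(results: list[dict], path: str = "results.json") -> None:
    """Write raw results plus the category breakdown to a downloadable JSON file."""
    payload = {"results": results, "by_category": category_breakdown(results)}
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
```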
## Usage
### 💾 Persistent Leaderboard Feature
NEW: Results now persist across sessions! This solves the GPU timeout issue:
- Run models one at a time to avoid timeouts
- Each run merges with previous results
- Best score per model is automatically kept
- Build a comprehensive leaderboard incrementally
- Perfect for the 60-second free tier limit
Workflow:
1. Select 1-2 models and run evaluation
2. Results automatically save and merge with the leaderboard (a minimal merge sketch follows this list)
3. Select different models and run again
4. Leaderboard updates with all results
5. Use the "Clear All Results" button to start fresh
### Standard Usage
1. Select Models: Use checkboxes or quick selection buttons
2. Configure Settings: Adjust sample size, quantisation, temperature
3. Run Evaluation: Click the "Run Evaluation" button
4. Monitor Progress: Watch real-time progress and intermediate results
5. Analyse Results: Review the persistent leaderboard, charts, and category breakdowns
6. Download: Export results for further analysis
## Dataset
The benchmark is available on HuggingFace (a loading sketch follows the split list):
[Zen0/AusCyberBench](https://huggingface.co/datasets/Zen0/AusCyberBench)
### Dataset Splits
- Full: All 13,449 tasks across all categories
- Australian: 4,899 Australia-specific tasks
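A hedged loading sketch using the `datasets` library; how the Full and Australian subsets are exposed (as splits, configs, or a filterable column) is an assumption here, so check the dataset card for the exact layout.

```python
from datasets import load_dataset

# Load the benchmark; split/config names are assumptions - consult the dataset card.
dataset = load_dataset("Zen0/AusCyberBench")
print(dataset)  # inspect available splits, columns, and task counts

# Draw a small random sample, mirroring the dashboard's default of 10 tasks
first_split = next(iter(dataset.values()))
sample = first_split.shuffle(seed=42).select(range(10))
```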
## Evaluation Methodology
### Prompt Formatting
Model-specific chat templates ensure optimal performance (a dispatch sketch follows the list):
- Phi-3/Phi-3.5: `<|user|>...<|end|>\n<|assistant|>`
- Gemma-2: `<start_of_turn>user\n...<end_of_turn>\n<start_of_turn>model`
- Generic (Llama, Mistral, Qwen, etc.): `[INST] ... [/INST]`
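A small dispatch sketch over the templates above; the string matching on model names is illustrative, and the app may instead use each tokenizer's built-in chat template.

```python
def format_prompt(question: str, model_id: str) -> str:
    """Wrap a benchmark question in the chat template for the model family."""
    name = model_id.lower()
    if "phi-3" in name:
        return f"<|user|>{question}<|end|>\n<|assistant|>"
    if "gemma-2" in name:
        return f"<start_of_turn>user\n{question}<end_of_turn>\n<start_of_turn>model\n"
    # Generic instruction format (Llama, Mistral, Qwen, etc.)
    return f"[INST] {question} [/INST]"
```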
### Answer Extraction
Robust extraction for multiple-choice tasks (a sketch follows the list):
- Primary: Regex match on the pattern `\b([A-D])\b`
- Fallback: First-character validation
- Handles various response formats
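A sketch of that extraction order, with a hypothetical `extract_answer` helper: the regex for an isolated A-D letter runs first, then the first-character fallback.

```python
import re

def extract_answer(response: str) -> str | None:
    """Pull a multiple-choice letter (A-D) out of a free-form model response."""
    match = re.search(r"\b([A-D])\b", response.strip().upper())
    if match:
        return match.group(1)
    # Fallback: accept the first character if it is a valid option letter
    first = response.strip()[:1].upper()
    return first if first in {"A", "B", "C", "D"} else None
```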
### Memory Management
Automatic cleanup between models (a sketch follows the list):
- Model and tokeniser deletion
- CUDA cache clearing
- Garbage collection
- Prevents OOM errors on GPU instances
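A sketch of the cleanup step between models; it assumes the previous run's model and tokeniser objects are handed in and that the caller drops its own references as well.

```python
import gc
import torch

def release_model(model, tokenizer) -> None:
    """Free GPU memory before loading the next model."""
    del model
    del tokenizer
    gc.collect()                  # collect the now-unreferenced objects
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached CUDA blocks to the driver
```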
## Performance Expectations
Based on verified benchmarking with tested models:
- ✅ Recommended Models: 30-56% accuracy (DeepSeek LLM: 55.6%, TinyLlama: 33.3%)
- Cybersecurity-Focused: 20-40% accuracy (code models show domain understanding)
- Small Models (1-4B): 10-40% accuracy (Phi-3, Qwen2.5 perform well)
- Medium Models (7-12B): 25-45% accuracy (Mistral, Llama 3.1 strong performers)
- Reasoning Models: 30-50% accuracy (DeepSeek, SOLAR excel at complex tasks)
Performance varies significantly by category:
- Essential Eight: Higher scores (25-50%) - well-documented standards
- ISM Controls: Moderate scores (15-35%) - detailed technical requirements
- Terminology: Good scores (20-40%) - definition-based tasks
- Threat Intelligence: Variable (15-45%) - requires current knowledge
- Privacy Act / SOCI Act: Challenging (15-35%) - complex regulatory understanding
## Technical Requirements
This Space requires GPU hardware for model inference.
### ⚡ ZeroGPU Free Tier Limitations
60-Second Timeout: The free tier has a strict 60-second limit per evaluation session.
Best Practices:
- ✅ Test 1-2 models at a time with 10 tasks each (~30-40 seconds total)
- ⚠️ Avoid selecting 5+ models in one run (it will time out midway)
- ✅ Use 4-bit quantisation for 7B+ models to speed up inference
- ✅ Run separate evaluations for thorough testing across many models
Example Timing:
- 1 model × 10 tasks: ~15-25 seconds ✅
- 2 models × 10 tasks: ~30-50 seconds ✅
- 5 models × 10 tasks: ~75-125 seconds ❌ (will time out)
For comprehensive multi-model benchmarking, run evaluations sequentially rather than all at once.
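As a back-of-the-envelope check derived from the timings above (roughly 1.5-2.5 seconds per task), the sketch below estimates whether a planned run fits the 60-second window; the per-task figure is an approximation, not a guarantee.

```python
def fits_free_tier(n_models: int, n_tasks: int,
                   secs_per_task: float = 2.5, budget_secs: float = 60.0) -> bool:
    """Estimate whether a run stays inside the ZeroGPU free-tier time budget."""
    return n_models * n_tasks * secs_per_task <= budget_secs

print(fits_free_tier(2, 10))  # True  (~50 s at the pessimistic per-task estimate)
print(fits_free_tier(5, 10))  # False (~125 s, will time out)
```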
## Citation
If you use AusCyberBench in your research, please cite:
```bibtex
@dataset{auscyberbench2025,
  title={AusCyberBench: Australia's First LLM Cybersecurity Benchmark},
  author={Zen0},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Zen0/AusCyberBench}
}
```
## License
MIT License - See LICENSE file for details
## Acknowledgements
- Australian Cyber Security Centre (ACSC) for Essential Eight, ISM, and threat intelligence
- Office of the Australian Information Commissioner (OAIC) for Privacy Act guidance
- Department of Home Affairs for SOCI Act resources
- HuggingFace for infrastructure and model hosting
Built with Australian orthography 🇦🇺
Visualise • Analyse • Optimise • Quantisation