---
license: apache-2.0
datasets:
- honicky/hdfs-logs-encoded-blocks
- Kingslayer5437/BGL
language:
- en
metrics:
- f1
- precision
- recall
- roc_auc
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
library_name: transformers
tags:
- log-analysis
- anomaly-detection
- bert
- huggingface
model-index:
- name: CloudOpsBERT (distributed-storage)
  results:
  - task:
      type: text-classification
      name: Anomaly Detection
    dataset:
      name: HDFS
      type: honicky/hdfs-logs-encoded-blocks
      split: test
    metrics:
    - type: f1
      value: 0.571
    - type: precision
      value: 0.992
    - type: recall
      value: 0.401
    - type: auroc
      value: 0.73
    - type: threshold
      value: 0.5
- name: CloudOpsBERT (HPC)
  results:
  - task:
      type: text-classification
      name: Anomaly Detection
    dataset:
      name: BGL
      type: Kingslayer5437/BGL
      split: test
    metrics:
    - type: f1
      value: 1.00
    - type: precision
      value: 1.00
    - type: recall
      value: 1.00
    - type: auroc
      value: 1.00
    - type: threshold
      value: 0.05
---

# CloudOpsBERT: Domain-Specific Language Models for Cloud Operations

CloudOpsBERT is an open-source project exploring **domain-adapted transformer models** for **cloud operations log analysis**, specifically anomaly detection, reliability monitoring, and cost optimization. The project fine-tunes lightweight BERT variants (e.g., DistilBERT) on large-scale system log datasets (HDFS, BGL) and provides ready-to-use models for the research and practitioner community.

---

## 🚀 Motivation

Modern cloud platforms generate massive amounts of logs. Detecting anomalies in these logs is crucial for:

- Ensuring **reliability** (catching failures early),
- Improving **cost efficiency** (identifying waste or misconfigurations),
- Supporting **autonomous operations** (AIOps).

Generic LLMs and BERT models are not optimized for this domain. CloudOpsBERT bridges that gap by:

- Training on **real log datasets** (HDFS, BGL),
- Addressing **imbalanced anomaly detection** with class weighting (see the training sketch below),
- Publishing **open-source checkpoints** for reproducibility.

---

## 🔍 Inference (Pretrained)

Predict the anomaly probability for a single log line:

```
python src/predict.py \
  --model_dir vaibhav2507/cloudops-bert \
  --subfolder distributed-storage \
  --text "ERROR dfs.DataNode: Lost connection to namenode"
```

Batch inference (file with one log line per row):

```
python src/predict.py \
  --model_dir vaibhav2507/cloudops-bert \
  --subfolder distributed-storage \
  --file samples/sample_logs.txt \
  --threshold 0.5 \
  --jsonl_out predictions.jsonl
```
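The checkpoints can also be loaded directly in Python. Below is a minimal sketch, assuming the export is a standard `transformers` sequence-classification model and that its config maps the positive class to an `anomaly` label (per the label mappings noted under Models below):

```python
# Minimal sketch: score one log line with the HDFS-trained submodel.
# Assumes the config's id2label contains an "anomaly" entry.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo, subfolder = "vaibhav2507/cloudops-bert", "distributed-storage"
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder=subfolder)
model = AutoModelForSequenceClassification.from_pretrained(repo, subfolder=subfolder)
model.eval()

inputs = tokenizer(
    "ERROR dfs.DataNode: Lost connection to namenode",
    return_tensors="pt",
    truncation=True,
)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]

# Look up the anomaly class index from the config's label mapping.
anomaly_idx = {v.lower(): k for k, v in model.config.id2label.items()}["anomaly"]
print(f"p(anomaly) = {probs[anomaly_idx]:.3f}")
```

The same pattern should work for the BGL submodel by switching `subfolder` to `"hpc"` once that export is published.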
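The `--threshold 0.5` used above is only a default; with rare anomalies, the operating point is worth tuning on a labeled validation split. A minimal sketch using scikit-learn (`best_f1_threshold` is an illustrative helper, not part of this repo):

```python
# Sketch: choose the p(anomaly) cutoff that maximizes F1 on labeled data.
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(scores: np.ndarray, labels: np.ndarray) -> float:
    precision, recall, thresholds = precision_recall_curve(labels, scores)
    # precision/recall have one more entry than thresholds; drop the last point.
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return float(thresholds[np.argmax(f1[:-1])])

# Toy stand-ins for real validation predictions and ground truth:
scores = np.array([0.05, 0.20, 0.45, 0.70, 0.90, 0.95])
labels = np.array([0, 0, 0, 1, 0, 1])
print(best_f1_threshold(scores, labels))
```

The tuned cutoff can then be passed back to `src/predict.py` via `--threshold`.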
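For completeness, the class weighting mentioned under Motivation follows the standard Hugging Face pattern of overriding `Trainer.compute_loss`. This is an illustrative sketch of the general technique, not the repo's training code; the weight values are placeholders:

```python
# Sketch: class-weighted cross-entropy for imbalanced log labels.
import torch
from transformers import Trainer

class WeightedTrainer(Trainer):
    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = class_weights  # e.g. torch.tensor([1.0, 30.0])

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=self.class_weights.to(logits.device)
        )
        loss = loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```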
## 📊 Results

- HDFS (in-domain, test set)
  - F1: 0.571
  - Precision: 0.992
  - Recall: 0.401
  - AUROC: 0.730
  - Threshold: 0.50 (tuneable)
- Cross-domain (HDFS → BGL)
  - Performance degrades significantly due to dataset/domain shift (see paper).
- BGL (training in progress)
  - Will be released as cloudops-bert (subfolder bgl) once full training is complete.

## 📦 Models

- vaibhav2507/cloudops-bert (Hugging Face Hub)
  - subfolder="distributed-storage": HDFS-trained CloudOpsBERT
  - subfolder="hpc": BGL-trained CloudOpsBERT (coming soon)
- Each export includes:
  - Model weights (pytorch_model.bin)
  - Config with label mappings (normal, anomaly)
  - Tokenizer files

## 🚀 Quickstart (Scripts)

1) Setup folders

```
bash scripts/setup_dirs.sh
```

2) (Optional) Download a local copy of a submodel from Hugging Face

```
bash scripts/fetch_pretrained.sh                # downloads 'hdfs' by default
SUBFOLDER=bgl bash scripts/fetch_pretrained.sh  # downloads 'bgl'
```

3) Single-line prediction (directly from HF)

```
bash scripts/predict_line.sh "ERROR dfs.DataNode: Lost connection to namenode" hdfs
```

4) Batch prediction (using local model folder)

```
bash scripts/make_sample_logs.sh
bash scripts/predict_file.sh samples/sample_logs.txt hdfs models/cloudops-bert-hdfs preds/preds_hdfs.jsonl
```

## 📚 Related Work

Several prior works have explored using BERT for log anomaly detection:

- **Leveraging BERT and Hugging Face Transformers for Log Anomaly Detection**: tutorial-style blog post demonstrating how to fine-tune BERT on log data with Hugging Face. Useful as an introduction, but not intended as a reproducible research artifact.
- **LogBERT** (HelenGuohx/logbert): academic prototype from ~2019–2020 focusing on modeling log sequences with BERT. Demonstrates feasibility but is limited to in-domain experiments and lacks integration with modern Hugging Face tooling.
- **AnomalyBERT** (Jhryu30/AnomalyBERT): another exploratory repository showing BERT-based anomaly detection on logs, with dataset-specific preprocessing. Similar limitations in generalization and reproducibility.

## 🔑 How CloudOpsBERT is different

- **Domain-specific adaptation**: explicitly trained for cloud operations logs (HDFS, BGL) with class-weighted loss.
- **Cross-domain evaluation**: includes in-domain and cross-domain benchmarks, highlighting generalization challenges.
- **Reproducibility & usability**: clean repo, scripts, and ready-to-use Hugging Face exports.
- **Future directions**: introduces MicroLM, compressed micro-language models for efficient edge/cloud hybrid inference.

In short: previous work showed that "BERT can work for logs." CloudOpsBERT operationalizes this idea into reproducible benchmarks, public models, and deployable tools for both researchers and practitioners.

## 📜 Citation

If you use CloudOpsBERT in your research or tools, please cite:

```
@misc{pandey2025cloudopsbert,
  title={CloudOpsBERT: Domain-Specific Transformer Models for Cloud Operations Anomaly Detection},
  author={Pandey, Vaibhav},
  year={2025},
  howpublished={GitHub, Hugging Face},
  url={https://github.com/vaibhav-research/cloudops-bert}
}
```