Update README.md
README.md
CHANGED
|
@@ -1,3 +1,96 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: mit
|
| 3 |
-
|
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
language:
|
| 4 |
+
- en
|
| 5 |
+
pipeline_tag: text-classification
|
| 6 |
+
library_name: scikit-learn
|
| 7 |
+
tags:
|
| 8 |
+
- password-strength
|
| 9 |
+
- cybersecurity
|
| 10 |
+
- random-forest
|
| 11 |
+
- scikit-learn
|
| 12 |
+
- password-classification
|
| 13 |
+
- password-security
|
| 14 |
+
- sklearn
|
| 15 |
+
---
|
| 16 |
+
# PasswordHealthModel
|
| 17 |
+
|
| 18 |
+
**Model Type**: Random Forest Classifier
|
| 19 |
+
**Framework**: scikit-learn
|
| 20 |
+
**Task**: Password Strength Classification (Weak / Medium / Strong)
|
| 21 |
+
|
| 22 |
+
## Overview
|
| 23 |
+
|
| 24 |
+
PasswordHealthModel is a machine learning model that classifies passwords into three strength levels:
|
| 25 |
+
|
| 26 |
+
- **Weak (0)**
|
| 27 |
+
- **Medium (1)**
|
| 28 |
+
- **Strong (2)**
|
| 29 |
+
|
| 30 |
+
The model is a Random Forest classifier trained on 300,000 labeled passwords (100,000 per class). It is designed for integration into password management systems, where it provides real-time strength evaluation and guidance.
|
| 31 |
+
|
| 32 |
+
## Intended Uses
|
| 33 |
+
|
| 34 |
+
- Integration into password managers (e.g., [Password Utility](https://github.com/naail-khokhar/password_utility)) for evaluating password health.
|
| 35 |
+
- Providing real-time feedback on password strength and generating recommendations for stronger passwords.
|
| 36 |
+
- Enforcing password strength policies in security-focused applications.
|
| 37 |
+
|
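The policy-enforcement use case can be sketched as follows, assuming the classifier returns the integer labels listed in the Overview (0, 1, 2). The names `LABELS`, `meets_policy`, and `feedback`, as well as the default threshold, are hypothetical and not part of the released model.

```python
# Hypothetical helpers; only the 0/1/2 label encoding comes from the model card.
LABELS = {0: "Weak", 1: "Medium", 2: "Strong"}


def meets_policy(predicted_label: int, minimum_label: int = 1) -> bool:
    """Return True when the predicted strength is at least the required level."""
    if predicted_label not in LABELS or minimum_label not in LABELS:
        raise ValueError("labels must be 0, 1, or 2")
    return predicted_label >= minimum_label


def feedback(predicted_label: int, minimum_label: int = 1) -> str:
    """Build a short user-facing message from a predicted label."""
    accepted = meets_policy(predicted_label, minimum_label)
    name = LABELS[predicted_label]
    if accepted:
        return f"{name}: accepted"
    return f"{name}: rejected, at least {LABELS[minimum_label]} is required"
```

A caller would pass the model's prediction for a candidate password as `predicted_label` and choose `minimum_label` according to its own policy.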
| 38 |
+
## Training Data
|
| 39 |
+
|
| 40 |
+
- **Weak**: 100,000 passwords sourced from the [SecLists dataset](https://github.com/danielmiessler/SecLists).
|
| 41 |
+
- **Medium**: 100,000 synthetically generated passwords (8–12 characters, alphanumeric, 20% with symbols).
|
| 42 |
+
- **Strong**: 100,000 synthetically generated passwords (12–16 characters, alphanumeric + symbols).
|
| 43 |
+
|
| 44 |
+
All passwords were stripped of whitespace prior to feature extraction.
|
| 45 |
+
|
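The card does not describe the generator used for the synthetic classes, so the following is only a sketch of one way to produce passwords matching the stated constraints. The symbol set (`string.punctuation`), the single-symbol placement for Medium, and the guarantee of at least one symbol for Strong are all assumptions.

```python
import random
import string

ALNUM = string.ascii_letters + string.digits
SYMBOLS = string.punctuation  # assumption: the card does not list the symbol set


def make_medium(rng: random.Random) -> str:
    """8-12 alphanumeric characters; roughly 20% of outputs contain one symbol."""
    length = rng.randint(8, 12)
    chars = [rng.choice(ALNUM) for _ in range(length)]
    if rng.random() < 0.2:
        # Overwrite one position so the length stays within 8-12.
        chars[rng.randrange(length)] = rng.choice(SYMBOLS)
    return "".join(chars)


def make_strong(rng: random.Random) -> str:
    """12-16 characters drawn from letters, digits, and symbols."""
    length = rng.randint(12, 16)
    chars = [rng.choice(ALNUM + SYMBOLS) for _ in range(length)]
    # Assumption: every strong password contains at least one symbol.
    chars[rng.randrange(length)] = rng.choice(SYMBOLS)
    return "".join(chars)


rng = random.Random(42)  # seeded for reproducible training data
medium = [make_medium(rng) for _ in range(1000)]
strong = [make_strong(rng) for _ in range(1000)]
```

The `random` module is adequate for generating labeled training data; passwords meant for real use should come from the `secrets` module instead.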
| 46 |
+
## Features (10 Total)
|
| 47 |
+
|
| 48 |
+
- **length**: Number of characters.
|
| 49 |
+
- **entropy**: Shannon entropy of characters.
|
| 50 |
+
- **has_upper**: Binary flag indicating presence of uppercase characters.
|
| 51 |
+
- **has_symbol**: Binary flag indicating presence of special characters.
|
| 52 |
+
- **has_leet**: Binary flag for leet-speak characters (e.g., @, 3, !, 0).
|
| 53 |
+
- **repetition**: Binary flag set when the same character appears three or more times in a row.
|
| 54 |
+
- **digit_ratio**: Ratio of digits to total length.
|
| 55 |
+
- **unique_ratio**: Ratio of unique characters to total length.
|
| 56 |
+
- **bigram_entropy**: Entropy of character pairs (bigrams).
|
| 57 |
+
- **compression_ratio**: Ratio of compressed length to original length using zlib compression.
|
| 58 |
+
|
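The ten features above could be computed roughly as in the sketch below. It is not the author's extraction code: the leet character set is limited to the examples listed, "special character" is taken to mean any non-alphanumeric character, whitespace stripping is assumed to be leading/trailing only, and entropy is measured in bits. The function names are hypothetical.

```python
import math
import re
import zlib
from collections import Counter

LEET_CHARS = set("@3!0")  # only the examples the card lists; the full set is unknown


def shannon_entropy(items) -> float:
    """Shannon entropy (bits) of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def extract_features(password: str) -> dict:
    pw = password.strip()  # assumption: leading/trailing whitespace only
    n = len(pw)
    if n == 0:
        raise ValueError("empty password")
    raw = pw.encode("utf-8")
    bigrams = [pw[i:i + 2] for i in range(n - 1)]
    return {
        "length": n,
        "entropy": shannon_entropy(pw),
        "has_upper": int(any(c.isupper() for c in pw)),
        # assumption: a symbol is any character that is not a letter or digit
        "has_symbol": int(any(not c.isalnum() for c in pw)),
        "has_leet": int(any(c in LEET_CHARS for c in pw)),
        # same character three or more times in a row
        "repetition": int(re.search(r"(.)\1{2,}", pw) is not None),
        "digit_ratio": sum(c.isdigit() for c in pw) / n,
        "unique_ratio": len(set(pw)) / n,
        "bigram_entropy": shannon_entropy(bigrams),
        # compressed bytes / original bytes; above 1 for short inputs
        "compression_ratio": len(zlib.compress(raw)) / len(raw),
    }
```

For short strings zlib's fixed overhead makes `compression_ratio` larger than 1, so the feature mostly separates long, repetitive passwords from the rest. The dictionary keys follow the order of the list above; whether the trained model expects that column order is not stated in the card.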
| 59 |
+
## Model Architecture
|
| 60 |
+
|
| 61 |
+
- **Algorithm**: Random Forest Classifier (scikit-learn)
|
| 62 |
+
- **Hyperparameters**:
|
| 63 |
+
- `n_estimators`: 200
|
| 64 |
+
- `max_depth`: 20
|
| 65 |
+
- `min_samples_split`: 5
|
| 66 |
+
- `random_state`: 42
|
| 67 |
+
|
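In scikit-learn, the configuration above corresponds to a constructor call like the one below. Only the four listed hyperparameters come from the card; leaving every other parameter at its library default is an assumption.

```python
from sklearn.ensemble import RandomForestClassifier

# Hyperparameters as listed in the model card; all other settings are
# scikit-learn defaults (an assumption, since the card does not mention them).
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=20,
    min_samples_split=5,
    random_state=42,
)
```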
| 68 |
+
## Performance
|
| 69 |
+
|
| 70 |
+
- **Evaluation Setup**: 80/20 train-test split (240,000 training samples, 60,000 test samples).
|
| 71 |
+
- **Accuracy**: ~96.7% (±0.6% standard deviation)
|
| 72 |
+
|
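An 80/20 evaluation of this kind can be sketched as follows. The real feature matrix is not part of this card, so placeholder data from `make_classification` stands in for it; stratified splitting and the seeds are assumptions, and the card does not say how the reported standard deviation was obtained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data: 10 features and 3 classes, standing in for the real
# password features, which are not distributed with this card.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=6, n_classes=3, random_state=0
)

# 80/20 split; stratification and the seed are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(
    n_estimators=200, max_depth=20, min_samples_split=5, random_state=42
)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.3f}")
```

The accuracy printed here reflects the placeholder data only and says nothing about the model's performance on passwords.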
| 73 |
+
## Limitations
|
| 74 |
+
|
| 75 |
+
- Feature engineering is heuristic-based and may not fully capture all password patterns across different contexts.
|
| 76 |
+
- Primarily trained on English-like and synthetic passwords.
|
| 77 |
+
- Potential overfitting to synthetic strong password patterns.
|
| 78 |
+
|
| 79 |
+
## Ethical Considerations
|
| 80 |
+
|
| 81 |
+
The weak-password examples come from publicly available breach data (see Training Data) and were handled with care. The model does not store user passwords and is intended only for classification tasks.
|
| 82 |
+
|
| 83 |
+
## Dependencies
|
| 84 |
+
|
| 85 |
+
This project relies on the following open-source libraries and datasets:
|
| 86 |
+
|
| 87 |
+
- **[pandas](https://github.com/pandas-dev/pandas)**: Data manipulation and analysis (BSD-3-Clause License).
|
| 88 |
+
- **[scikit-learn](https://github.com/scikit-learn/scikit-learn)**: Machine learning framework for the Random Forest Classifier (BSD-3-Clause License).
|
| 89 |
+
- **[joblib](https://github.com/joblib/joblib)**: Model persistence and parallel computation (MIT License).
|
| 90 |
+
- **[SecLists](https://github.com/danielmiessler/SecLists)**: Dataset for weak passwords (MIT License).
|
| 91 |
+
|
| 92 |
+
If redistributing this project, please include the respective license texts for these dependencies.
|
| 93 |
+
|
| 94 |
+
## Citation
|
| 95 |
+
|
| 96 |
+
Khokhar, Naa'il Ahmad. (2025). *PasswordHealthModel: A Random Forest Model for Password Strength Classification*. Hugging Face Model Hub.
|