---
license: mit
language:
- en
base_model:
- microsoft/deberta-v3-large
pipeline_tag: text-classification
metrics:
- accuracy
---
# AI Text Detector v1.0 (DeBERTa-v3-large)
## Model Details
| Field | Description |
|:--|:--|
| **Model Name** | `ai-text-detector-v-n4.0` |
| **Base Model** | [microsoft/deberta-v3-large](https://huggingface.co/microsoft/deberta-v3-large) |
| **Task** | Text Classification (Human-written vs AI-generated) |
| **Language** | English |
| **Framework** | PyTorch, Transformers |
| **Trained by** | [Abhinav](https://huggingface.co/abhi099k) |
| **Fine-tuned using** | Hugging Face `Trainer` API with early stopping, mixed precision (fp16), and F1 optimization. |
---
## Model Description
This model fine-tunes **DeBERTa-v3-large** for detecting whether a given text is written by a **Human** or generated by an **AI**.
It was trained on a custom dataset containing **10,000+ samples** of diverse text across multiple topics, labeled as:
- `0` → Human-written text
- `1` → AI-generated text
The goal is to identify subtle linguistic differences and stylistic cues between natural human writing and machine-generated content.
---
## Training Configuration
| Parameter | Value |
|:--|:--|
| **Epochs** | 4 |
| **Batch size** | 8 |
| **Learning Rate** | 2e-5 |
| **Max Sequence Length** | 256 |
| **Optimizer** | AdamW |
| **Scheduler** | Linear decay |
| **Weight Decay** | 0.01 |
| **Seed** | 42 |
| **Mixed Precision** | Yes (fp16) |
| **Gradient Accumulation Steps** | 2 |
| **Frameworks** | PyTorch, Transformers, Datasets |
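The hyperparameters above map directly onto the `Trainer` API. A minimal sketch of the configuration is shown below; the output directory, evaluation/save cadence, and early-stopping patience are assumptions (the card does not state them), and argument names follow recent `transformers` versions:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Hyperparameters mirror the table above; paths and strategies are assumptions.
training_args = TrainingArguments(
    output_dir="ai-text-detector",      # hypothetical output path
    num_train_epochs=4,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    seed=42,
    fp16=True,                          # mixed precision
    gradient_accumulation_steps=2,
    lr_scheduler_type="linear",         # linear decay (AdamW is the default optimizer)
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # required for early stopping
    metric_for_best_model="f1",         # F1 optimization
)

# Early-stopping patience of 2 is an assumed value.
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)
```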
---
## Dataset
| Field | Value |
|:--|:--|
| **Source** | Custom dataset (`gpt_5_with_10k.csv`) |
| **Columns** | `text`, `label` |
| **Labels** | 0 = Human, 1 = AI |
| **Split** | 90% Train / 10% Test |
| **Cleaning** | Removed special characters, normalized whitespace, and standardized punctuation. |
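The cleaning steps listed above could look roughly like the following. This is a sketch, not the exact pipeline used for training: the precise character set kept and the quote/punctuation rules are assumptions.

```python
import re

def clean_text(text: str) -> str:
    """Approximate the card's cleaning steps (exact rules are assumptions)."""
    text = re.sub(r"[\u201c\u201d]", '"', text)   # standardize curly double quotes
    text = re.sub(r"[\u2018\u2019]", "'", text)   # standardize curly single quotes
    text = re.sub(r"[^\w\s.,!?'\"-]", " ", text)  # drop other special characters
    text = re.sub(r"\s+", " ", text)              # normalize whitespace
    return text.strip()

print(clean_text("\u201cHi\u201d \u2014 ok"))  # → "Hi" ok
```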
---
## Evaluation Results
| Metric | Score |
|:--|:--|
| **Accuracy** | ~0.97 |
| **F1 Score** | ~0.97 |
| **Precision / Recall** | Balanced |
**Confusion Matrix Example:**
```
[[4800   90]  ← True Human
 [ 110 5000]] ← True AI
```
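The headline metrics can be derived directly from the example matrix above (which yields roughly 0.98, consistent with the reported ~0.97):

```python
# Counts from the example confusion matrix (positive class = AI).
tn, fp = 4800, 90     # human texts: correctly kept / wrongly flagged as AI
fn, tp = 110, 5000    # AI texts: missed / correctly flagged

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # 9800 / 10000 = 0.980
precision = tp / (tp + fp)                    # 5000 / 5090 ≈ 0.982
recall    = tp / (tp + fn)                    # 5000 / 5110 ≈ 0.978
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.980

print(f"accuracy={accuracy:.3f} f1={f1:.3f}")
```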
---
## Usage Example
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "abhi099k/ai-text-detector-v-n4.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode

text = "This article explores the evolution of large language models..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
pred = torch.argmax(outputs.logits, dim=1).item()
print("Human" if pred == 0 else "AI")
```
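If a confidence score is needed alongside the label, the logits from the usage example can be passed through a softmax. Shown here with an illustrative tensor so the snippet stands alone (in practice, substitute `outputs.logits`):

```python
import torch

# Illustrative logits of shape [1, 2]; in practice use outputs.logits.
logits = torch.tensor([[-2.1, 3.4]])
probs = torch.softmax(logits, dim=1)   # per-class probabilities
pred = probs.argmax(dim=1).item()      # 0 = Human, 1 = AI
confidence = probs[0, pred].item()
print(f"{'Human' if pred == 0 else 'AI'} (confidence {confidence:.2%})")
```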
---
## Intended Use
- Detect AI-generated content for moderation, academic integrity, or authenticity verification.
- Use as a foundation model for fine-tuning on domain-specific datasets (e.g., essays, reviews, research papers).
---
## Limitations
- May misclassify **paraphrased AI text** or **human text with robotic phrasing**.
- Primarily trained on English; performance on other languages is not guaranteed.
- Should not be used for punitive or high-stakes decisions without human review.
---
## Future Improvements
- Multi-language support (Hindi, Spanish, etc.)
- Add stylistic embeddings for cross-model generalization.
- Robustness testing against prompt-engineering and obfuscation.
---
## Technical Summary
| Component | Library |
|:--|:--|
| Tokenization | `AutoTokenizer` |
| Model | `AutoModelForSequenceClassification` |
| Trainer | `transformers.Trainer` |
| Metrics | `evaluate` (accuracy, f1) |
| Visualization | `matplotlib` (confusion matrix) |
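The card lists `evaluate` for metrics; a dependency-light sketch of the usual `compute_metrics` hook passed to `Trainer`, implementing the same accuracy and binary F1 (positive class = AI) with NumPy, is:

```python
import numpy as np

def compute_metrics(eval_pred):
    """Trainer-compatible metrics hook mirroring evaluate's accuracy/f1."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    accuracy = float((preds == labels).mean())
    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    denom = 2 * tp + fp + fn
    f1 = float(2 * tp / denom) if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}

# Sanity check with fake logits for four examples.
logits = np.array([[2.0, -1.0], [0.5, 1.5], [-1.0, 2.0], [3.0, 0.0]])
labels = np.array([0, 1, 1, 1])
metrics = compute_metrics((logits, labels))
print(metrics)  # {'accuracy': 0.75, 'f1': 0.8}
```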
---
## Citation
If you use this model, please cite:
```bibtex
@misc{abhinav_ai_text_detector_v1,
  title  = {AI Text Detector v1.0 (DeBERTa-v3-large Fine-tune)},
  author = {Abhinav},
  year   = {2025},
  url    = {https://huggingface.co/abhi099k/ai-text-detector-v-n4.0}
}
```