---
license: mit
language:
- en
base_model:
- microsoft/deberta-v3-large
pipeline_tag: text-classification
metrics:
- accuracy
---
# 🧠 AI Text Detector v1.0 (DeBERTa-v3-large)

## 🏷️ Model Details
| Field | Description |
|:--|:--|
| **Model Name** | `ai-text-detector-v-n4.0` |
| **Base Model** | [microsoft/deberta-v3-large](https://huggingface.co/microsoft/deberta-v3-large) |
| **Task** | Text Classification (Human-written vs AI-generated) |
| **Language** | English |
| **Framework** | PyTorch, Transformers |
| **Trained by** | [Abhinav](https://huggingface.co/abhi099k) |
| **Fine-tuned using** | Hugging Face `Trainer` API with early stopping, mixed precision (fp16), and F1 optimization. |

---

## 📖 Model Description
This model fine-tunes **DeBERTa-v3-large** to classify whether a given text was written by a **human** or generated by an **AI**.  
It was trained on a custom dataset of **10,000+ samples** spanning diverse topics, labeled as:
- `0` → Human-written text  
- `1` → AI-generated text  

The goal is to identify subtle linguistic differences and stylistic cues between natural human writing and machine-generated content.

---

## βš™οΈ Training Configuration
| Parameter | Value |
|:--|:--|
| **Epochs** | 4 |
| **Batch size** | 8 |
| **Learning Rate** | 2e-5 |
| **Max Sequence Length** | 256 |
| **Optimizer** | AdamW |
| **Scheduler** | Linear decay |
| **Weight Decay** | 0.01 |
| **Seed** | 42 |
| **Mixed Precision** | ✅ Yes (fp16) |
| **Gradient Accumulation Steps** | 2 |
| **Frameworks** | PyTorch, Transformers, Datasets |
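
The table above can be gathered into a plain Python dict (a sketch mirroring what would be passed to `transformers.TrainingArguments`, not the author's original training script). One detail worth making explicit: with gradient accumulation, the effective batch size per optimizer step is the per-device batch size times the accumulation steps.

```python
# Hyperparameters from the table above, collected into a plain dict.
# This is a sketch, not the author's actual training configuration file.
config = {
    "num_train_epochs": 4,
    "per_device_train_batch_size": 8,
    "learning_rate": 2e-5,
    "max_length": 256,               # max sequence length for the tokenizer
    "weight_decay": 0.01,
    "seed": 42,
    "fp16": True,                    # mixed precision
    "gradient_accumulation_steps": 2,
}

# Effective batch size per optimizer step:
effective_batch = (config["per_device_train_batch_size"]
                   * config["gradient_accumulation_steps"])
print(effective_batch)  # 16
```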

---

## 🧾 Dataset
| Field | Value |
|:--|:--|
| **Source** | Custom dataset (`gpt_5_with_10k.csv`) |
| **Columns** | `text`, `label` |
| **Labels** | 0 = Human, 1 = AI |
| **Split** | 90% Train / 10% Test |
| **Cleaning** | Removed special characters, normalized whitespace, and standardized punctuation. |
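
The cleaning steps described in the table can be approximated with a short regex-based helper. This is a hypothetical reconstruction; the exact rules applied to the dataset are not published.

```python
import re

def clean_text(s: str) -> str:
    # Drop special characters, keeping word characters, whitespace,
    # and common punctuation; then collapse runs of whitespace.
    s = re.sub(r"[^\w\s.,!?;:'\"-]", "", s)
    s = re.sub(r"\s+", " ", s)
    return s.strip()

print(clean_text("Hello,\n\n  world!!  ©®"))  # Hello, world!!
```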

---

## 📊 Evaluation Results

| Metric | Score |
|:--|:--|
| **Accuracy** | ~0.97 |
| **F1 Score** | ~0.97 |
| **Precision / Recall** | Balanced |

**Confusion Matrix Example:**
```
[[4800   90]    → True Human
 [ 110 5000]]   → True AI
```
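
For reference, the headline metrics can be recomputed directly from that illustrative matrix (rows are true classes, columns are predicted classes); the resulting scores land close to the reported values.

```python
# Recompute metrics from the illustrative confusion matrix above.
# Rows = true class, columns = predicted class; 0 = Human, 1 = AI.
cm = [[4800, 90],
      [110, 5000]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

precision_ai = cm[1][1] / (cm[0][1] + cm[1][1])  # of predicted-AI, fraction truly AI
recall_ai = cm[1][1] / (cm[1][0] + cm[1][1])     # of true-AI, fraction caught
f1_ai = 2 * precision_ai * recall_ai / (precision_ai + recall_ai)

print(round(accuracy, 2), round(f1_ai, 2))  # 0.98 0.98
```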

---

## πŸ” Usage Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "abhi099k/ai-text-detector-v-n4.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # disable dropout for inference

text = "This article explores the evolution of large language models..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)
with torch.no_grad():  # no gradients needed at inference time
    outputs = model(**inputs)
pred = torch.argmax(outputs.logits, dim=1).item()

print("🧑 Human" if pred == 0 else "🤖 AI")
```
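
The snippet above returns only the arg-max label. To report a confidence score instead, the two logits can be converted to probabilities with a softmax; the helper below is plain Python, and the logit values are hypothetical, for illustration only. (In a real pipeline, `torch.softmax(outputs.logits, dim=1)` does the same thing.)

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one input: index 0 = Human, index 1 = AI.
probs = softmax([-1.2, 2.3])
print(f"P(AI) = {probs[1]:.3f}")  # P(AI) = 0.971
```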

---

## 📈 Intended Use
- Detect AI-generated content for moderation, academic integrity, or authenticity verification.
- Use as a starting point for further fine-tuning on domain-specific datasets (e.g., essays, reviews, research papers).

---

## ⚠️ Limitations
- May misclassify **paraphrased AI text** or **human text with robotic phrasing**.
- Primarily trained on English; performance on other languages is not guaranteed.
- Should not be used for punitive or high-stakes decisions without human review.

---

## πŸ† Future Improvements
- Multi-language support (Hindi, Spanish, etc.)
- Add stylistic embeddings for cross-model generalization.
- Robustness testing against prompt-engineering and obfuscation.

---

## 🧩 Technical Summary
| Component | Library |
|:--|:--|
| Tokenization | `AutoTokenizer` |
| Model | `AutoModelForSequenceClassification` |
| Trainer | `transformers.Trainer` |
| Metrics | `evaluate` (accuracy, f1) |
| Visualization | `matplotlib` (confusion matrix) |

---

## 📬 Citation

If you use this model, please cite:

```
@misc{abhinav_ai_text_detector_v1,
  title  = {AI Text Detector v1.0 (DeBERTa-v3-large fine-tune)},
  author = {Abhinav},
  year   = {2025},
  url    = {https://huggingface.co/abhi099k/ai-text-detector-v-n4.0}
}
```