# 🧠 AI Text Detector v1.0 (DeBERTa-v3-large)

## 🏷️ Model Details
| Field | Description |
|:--|:--|
| **Model Name** | `ai-text-detector-v-k1.0` |
| **Base Model** | [microsoft/deberta-v3-large](https://huggingface.co/microsoft/deberta-v3-large) |
| **Task** | Text Classification (Human-written vs AI-generated) |
| **Language** | English |
| **Framework** | PyTorch, Transformers |
| **Trained by** | [Abhinav](https://huggingface.co/Abhinav) |
| **Fine-tuned using** | Hugging Face `Trainer` API with early stopping, mixed precision (fp16), and F1 optimization |

---

## 📖 Model Description
This model fine-tunes **DeBERTa-v3-large** to detect whether a given text was written by a **human** or generated by an **AI**.
It was trained on a custom dataset of **10,000+ samples** spanning diverse topics, labeled as:
- `0` → Human-written text
- `1` → AI-generated text

The goal is to capture the subtle linguistic and stylistic cues that distinguish natural human writing from machine-generated content.

---

## ⚙️ Training Configuration
| Parameter | Value |
|:--|:--|
| **Epochs** | 4 |
| **Batch size** | 8 |
| **Learning Rate** | 2e-5 |
| **Max Sequence Length** | 256 |
| **Optimizer** | AdamW |
| **Scheduler** | Linear decay |
| **Weight Decay** | 0.01 |
| **Seed** | 42 |
| **Mixed Precision** | ✅ Yes (fp16) |
| **Gradient Accumulation Steps** | 2 |
| **Frameworks** | PyTorch, Transformers, Datasets |

---

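As a quick sanity check on the configuration above, the effective batch size and total number of optimizer steps can be estimated. This is a rough sketch, assuming roughly 9,000 training samples (the 90% split of a ~10,000-sample dataset):

```python
import math

# Values from the training configuration table above
train_samples = 9000          # assumption: 90% of a ~10,000-sample dataset
per_device_batch = 8
grad_accum_steps = 2
epochs = 4

# With gradient accumulation, weights update once per (batch * accumulation) examples
effective_batch = per_device_batch * grad_accum_steps   # 16
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)    # 16 563 2252
```

The linear-decay scheduler would then anneal the learning rate from 2e-5 to 0 over those ~2,250 optimizer steps.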
## 🧾 Dataset
| Field | Value |
|:--|:--|
| **Source** | Custom dataset (`gpt_5_with_10k.csv`) |
| **Columns** | `text`, `label` |
| **Labels** | 0 = Human, 1 = AI |
| **Split** | 90% train / 10% test |
| **Cleaning** | Removed special characters, normalized whitespace, and standardized punctuation |

---

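The exact cleaning rules are not published with the dataset, but the three steps in the table above can be sketched with the standard library. The `clean_text` helper below is hypothetical, not the actual preprocessing code:

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical cleaning step mirroring the table above: standardize
    punctuation, strip special characters, and normalize whitespace."""
    # Standardize curly quotes to their ASCII equivalents
    text = text.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")
    # Drop characters outside a basic word/punctuation set
    text = re.sub(r"[^\w\s.,!?'\"-]", "", text)
    # Collapse runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Hello…  “world”,  it's   fine!"))  # Hello "world", it's fine!
```

Whatever the real rules were, applying the same cleaning at inference time as at training time matters: a mismatch in preprocessing can degrade the classifier's accuracy.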
## 📊 Evaluation Results

| Metric | Score |
|:--|:--|
| **Accuracy** | ~0.97 |
| **F1 Score** | ~0.97 |
| **Precision / Recall** | Balanced across both classes |

**Confusion Matrix (illustrative example):**
```
[[4800   90]   → True Human
 [ 110 5000]]  → True AI
```

---

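The illustrative matrix above implies metrics close to the reported scores (treating rows as the true class and columns as the predicted class; the counts are illustrative rather than the actual test-set size):

```python
# Confusion matrix from the example above: rows = true class, cols = predicted
tn, fp = 4800, 90     # true Human: correctly kept vs. flagged as AI
fn, tp = 110, 5000    # true AI: missed vs. correctly flagged

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)   # of texts flagged AI, how many really were AI
recall = tp / (tp + fn)      # of AI texts, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"acc={accuracy:.3f} p={precision:.3f} r={recall:.3f} f1={f1:.3f}")
# acc=0.980 p=0.982 r=0.978 f1=0.980
```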
## 🔍 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "Abhinav/ai-text-detector-v-k1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

text = "This article explores the evolution of large language models..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)
pred = torch.argmax(probs, dim=-1).item()

print("🧑 Human" if pred == 0 else "🤖 AI", f"(confidence: {probs[0, pred]:.2%})")
```

---

## 📈 Intended Use
- Detect AI-generated content for moderation, academic integrity, or authenticity verification.
- Serve as a starting point for further fine-tuning on domain-specific data (e.g., essays, reviews, research papers).

---

## ⚠️ Limitations
- May misclassify **paraphrased AI text** or **human text with robotic phrasing**.
- Trained primarily on English; performance on other languages is not guaranteed.
- Should not be used for punitive or high-stakes decisions without human review.

---

## 🏆 Future Improvements
- Multi-language support (Hindi, Spanish, etc.).
- Stylistic embeddings for cross-model generalization.
- Robustness testing against prompt engineering and obfuscation.

---

## 🧩 Technical Summary
| Component | Library |
|:--|:--|
| Tokenization | `AutoTokenizer` |
| Model | `AutoModelForSequenceClassification` |
| Trainer | `transformers.Trainer` |
| Metrics | `evaluate` (accuracy, F1) |
| Visualization | `matplotlib` (confusion matrix) |

---

## 📬 Citation

If you use this model, please cite:

```bibtex
@misc{abhinav_ai_text_detector_v1,
  title  = {AI Text Detector v1.0 – DeBERTa-v3-large Fine-tune},
  author = {Abhinav},
  year   = {2025},
  url    = {https://huggingface.co/Abhinav/ai-text-detector-v-k1.0}
}
```