---
license: mit
language:
- en
base_model:
- microsoft/deberta-v3-large
pipeline_tag: text-classification
metrics:
- accuracy
---

# 🧠 AI Text Detector v1.0 (DeBERTa-v3-large)

## 🏷️ Model Details

| Field | Description |
|:--|:--|
| **Model Name** | `ai-text-detector-v-n4.0` |
| **Base Model** | [microsoft/deberta-v3-large](https://huggingface.co/microsoft/deberta-v3-large) |
| **Task** | Text Classification (Human-written vs AI-generated) |
| **Language** | English |
| **Framework** | PyTorch, Transformers |
| **Trained by** | [Abhinav](https://huggingface.co/abhi099k) |
| **Fine-tuned using** | Hugging Face `Trainer` API with early stopping, mixed precision (fp16), and F1 optimization. |

---

## 📖 Model Description

This model fine-tunes **DeBERTa-v3-large** to detect whether a given text was written by a **Human** or generated by an **AI**.

It was trained on a custom dataset containing **10,000+ samples** of diverse text across multiple topics, labeled as:

- `0` → Human-written text
- `1` → AI-generated text

The goal is to identify subtle linguistic differences and stylistic cues between natural human writing and machine-generated content.

---

## ⚙️ Training Configuration

| Parameter | Value |
|:--|:--|
| **Epochs** | 4 |
| **Batch size** | 8 |
| **Learning Rate** | 2e-5 |
| **Max Sequence Length** | 256 |
| **Optimizer** | AdamW |
| **Scheduler** | Linear decay |
| **Weight Decay** | 0.01 |
| **Seed** | 42 |
| **Mixed Precision** | ✅ Yes (fp16) |
| **Gradient Accumulation Steps** | 2 |
| **Frameworks** | PyTorch, Transformers, Datasets |

---

## 🧾 Dataset

| Field | Value |
|:--|:--|
| **Source** | Custom dataset (`gpt_5_with_10k.csv`) |
| **Columns** | `text`, `label` |
| **Labels** | 0 = Human, 1 = AI |
| **Split** | 90% Train / 10% Test |
| **Cleaning** | Removed special characters, normalized whitespace, and standardized punctuation. |

---

## 📊 Evaluation Results

| Metric | Score |
|:--|:--|
| **Accuracy** | ~0.97 |
| **F1 Score** | ~0.97 |
| **Precision / Recall** | Balanced |

**Confusion Matrix Example:**

```
[[4800   90]   → True Human
 [ 110 5000]]  → True AI
```

---

## 🔍 Usage Example

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "abhi099k/ai-text-detector-v-n4.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # disable dropout for inference

text = "This article explores the evolution of large language models..."

# Truncate to the 256-token maximum sequence length used during training
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=256)

with torch.no_grad():  # gradients are not needed for prediction
    outputs = model(**inputs)

# Label 0 = Human, label 1 = AI
pred = torch.argmax(outputs.logits, dim=1).item()
print("🧑 Human" if pred == 0 else "🤖 AI")
```

---

## 📈 Intended Use

- Detect AI-generated content for moderation, academic integrity, or authenticity verification.
- Use as a starting checkpoint for further fine-tuning on domain-specific datasets (e.g., essays, reviews, research papers).

---

## ⚠️ Limitations

- May misclassify **paraphrased AI text** or **human text with robotic phrasing**.
- Trained primarily on English; performance on other languages is not guaranteed.
- Should not be used for punitive or high-stakes decisions without human review.

---

## 🏆 Future Improvements

- Multi-language support (Hindi, Spanish, etc.)
- Add stylistic embeddings for cross-model generalization.
- Robustness testing against prompt-engineering and obfuscation.

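
The robustness testing listed above could start from something like the following sketch. It is not part of the released model or its training code: `HOMOGLYPHS`, `swap_homoglyphs`, `double_spaces`, and `flip_rate` are hypothetical names chosen for illustration, and `predict` stands for any function that maps a text to a label (`0` = Human, `1` = AI). The sketch measures how often a prediction changes when the input is lightly obfuscated.

```python
import random
from typing import Callable, Dict, List

# Hypothetical look-alike substitutions (Latin -> Cyrillic); not from the model card.
HOMOGLYPHS: Dict[str, str] = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}


def swap_homoglyphs(text: str, rate: float = 0.2, seed: int = 0) -> str:
    """Replace a fraction of mapped characters with visually similar ones."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    return "".join(
        HOMOGLYPHS[ch] if ch in HOMOGLYPHS and rng.random() < rate else ch
        for ch in text
    )


def double_spaces(text: str) -> str:
    """Double every space, a trivial whitespace perturbation."""
    return text.replace(" ", "  ")


def flip_rate(
    predict: Callable[[str], int],
    texts: List[str],
    perturb: Callable[[str], str],
) -> float:
    """Fraction of texts whose predicted label changes after perturbation."""
    if not texts:
        return 0.0
    flips = sum(predict(t) != predict(perturb(t)) for t in texts)
    return flips / len(texts)
```

In practice `predict` would wrap the tokenizer and model from the Usage Example. A high flip rate under a perturbation that leaves the text readable to a human would suggest the detector relies on surface features. Because the training data had its whitespace normalized, a whitespace perturbation may only be meaningful if the same cleaning is not applied at inference time.
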

---

## 🧩 Technical Summary

| Component | Library / Class |
|:--|:--|
| Tokenization | `AutoTokenizer` |
| Model | `AutoModelForSequenceClassification` |
| Trainer | `transformers.Trainer` |
| Metrics | `evaluate` (accuracy, f1) |
| Visualization | `matplotlib` (confusion matrix) |

---

## 📬 Citation

If you use this model, please cite:

```
@misc{abhinav_ai_text_detector_v1,
  title  = {AI Text Detector v1.0 – DeBERTa-v3-large Fine-tune},
  author = {Abhinav},
  year   = {2025},
  url    = {https://huggingface.co/abhi099k/ai-text-detector-v-n4.0}
}
```
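
As a cross-check on the Evaluation Results, the sketch below derives accuracy, precision, recall, and F1 from the illustrative confusion matrix. It assumes rows are true labels and columns are predicted labels, both ordered `[Human, AI]`, which matches the row annotations in that example. `binary_metrics` is a hypothetical helper, not the `evaluate`-based code used in training.

```python
from typing import Dict, List


def binary_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Metrics for the positive class (label 1 = AI).

    Assumes rows are true labels and columns are predicted labels,
    both in the order [Human, AI].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Illustrative confusion matrix from the Evaluation Results section.
example_cm = [[4800, 90], [110, 5000]]
for name, value in binary_metrics(example_cm).items():
    print(f"{name}: {value:.4f}")
```

For the example matrix all four values come out at roughly 0.98, close to the approximate scores in the results table, with precision and recall within half a percentage point of each other.
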