---
license: gpl-3.0
datasets:
- toughdata/quora-question-answer-dataset
language:
- en
metrics:
- accuracy
- precision
- recall
- f1
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
library_name: transformers
---
# distilbert-base-q-cat

## Model Description

distilbert-base-q-cat is a lightweight DistilBERT model fine-tuned for text classification, specifically for categorizing questions into three classes: fact, opinion, and hypothetical. It was trained on a dataset of Quora questions labeled through a combination of keyword-based rules and sentiment analysis.

## Features

- **Lightweight:** Built on DistilBERT, offering faster inference and lower computational requirements than standard BERT.
- **Three question categories:**
  - **Fact:** Questions seeking factual or objective information.
  - **Opinion:** Questions that elicit subjective views or opinions.
  - **Hypothetical:** Questions exploring hypothetical scenarios or speculative ideas.
- **Pretrained and fine-tuned:** Starts from DistilBERT's pretrained weights, with additional fine-tuning on labeled question data.

## Dataset

The model was trained on a custom dataset derived from Quora questions.

**Data preparation:**

- Fact and hypothetical questions were labeled using keyword-based rules.

- Sentiment analysis was used to identify opinion-based questions.

**Dataset size:** ~50k samples, split into training, validation, and test sets.
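As a rough illustration, the labeling heuristic described above might look like the sketch below. The cue phrases, the `sentiment_score` input, and the `0.3` threshold are assumptions chosen for illustration, not the actual rules used to build the dataset.

```python
# Illustrative sketch of keyword-plus-sentiment labeling.
# All cue lists and thresholds here are assumptions.
HYPOTHETICAL_CUES = ("what if", "imagine", "suppose", "would you", "hypothetically")
FACT_CUES = ("what is", "who is", "when did", "where is", "how many", "how much")

def label_question(question, sentiment_score=0.0, opinion_threshold=0.3):
    """Assign a coarse label to a question string.

    sentiment_score is assumed to come from a separate sentiment
    model, scaled to [-1, 1]; strongly polarized questions are
    treated as opinion-seeking.
    """
    q = question.lower().strip()
    if any(q.startswith(cue) or f" {cue} " in q for cue in HYPOTHETICAL_CUES):
        return "hypothetical"
    if any(q.startswith(cue) for cue in FACT_CUES):
        return "fact"
    if abs(sentiment_score) >= opinion_threshold:
        return "opinion"
    return "fact"
```

In practice such rules would produce noisy labels, which is why a fine-tuned model can outperform the heuristic that bootstrapped its training data.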

## Performance

The model achieves the following metrics on the validation set:

- **Accuracy:** 93.33%
- **Precision:** 93.41%
- **Recall:** 93.33%
- **F1-Score:** 93.32%
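The precision, recall, and F1 figures above are likely support-weighted averages over the three classes (note that weighted recall coincides with accuracy on a full evaluation set, matching the numbers reported). The helper below is a pure-Python sketch of that computation; the weighted averaging scheme itself is an assumption.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred, labels=(0, 1, 2)):
    """Accuracy plus support-weighted precision/recall/F1.

    Pure-Python sketch of weighted multi-class averaging; the
    label IDs (0, 1, 2) mirror the card's three categories.
    """
    support = Counter(y_true)
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    precision = recall = f1 = 0.0
    for c in labels:
        tp = sum(t == p == c for t, p in zip(y_true, y_pred))
        pred_c = sum(p == c for p in y_pred)   # predicted-positive count
        true_c = support[c]                    # class support
        prec_c = tp / pred_c if pred_c else 0.0
        rec_c = tp / true_c if true_c else 0.0
        f1_c = (2 * prec_c * rec_c / (prec_c + rec_c)) if (prec_c + rec_c) else 0.0
        w = true_c / n                         # weight by class support
        precision += w * prec_c
        recall += w * rec_c
        f1 += w * f1_c
    return accuracy, precision, recall, f1
```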

## Installation

To use this model, install the required dependencies:
```bash
pip install transformers torch
```

## Usage

### Load Model and Tokenizer
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned model and tokenizer; the saved config already
# carries the three-class head, so no extra arguments are needed.
model_name = "distilbert-base-q-cat"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()
```

### Inference Example
```python
import torch

def predict_question(question):
    inputs = tokenizer(question, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():  # no gradients needed for inference
        outputs = model(**inputs)
    predicted_class = outputs.logits.argmax(dim=-1).item()

    label_map = {0: "fact", 1: "opinion", 2: "hypothetical"}
    return label_map[predicted_class]

# Example usage
question = "What is artificial intelligence?"
print(predict_question(question))
```