---
language:
- en
license: mit
library_name: transformers
tags:
- fake news
metrics:
- accuracy
pipeline_tag: text-classification
---

# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->
Predicts whether a news article's title is fake or real.
This is my first published model. If you find it interesting or useful, please like it; that will encourage me to do more research <3

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->
This model classifies whether the information given in a news article's title is true or false. It was trained on two datasets,
combined and preprocessed. 0 (LABEL_0) stands for false and 1 (LABEL_1) stands for true.

- **Developed by:** Ostap Mykhailiv
- **Model type:** Classification
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** google-bert/bert-base-uncased


## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->
Since this is a BERT-based model, it inherits BERT's biases. Be careful when checking recent claims with this model, since
it was trained on pre-2023 data. Additionally, because people's names were not removed from the training data during
preprocessing, the model may be biased for or against certain public figures.
### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
To improve overall results, titles were truncated during training. Although this improved results for both longer and
shorter texts, you should give the model no fewer than 6 and no more than 12 words per prediction, excluding stopwords. For the preprocessing steps, see below.
You can translate news from another language into English, though this may not give the expected results.

## How to Get Started with the Model

Use the code below to get started with the model.
```python
from transformers import pipeline

pipe = pipeline("text-classification", model="omykhailiv/bert-fake-news-recognition")
print(pipe("Some text"))
```
It will return something like this:
```
[{'label': 'LABEL_0', 'score': 0.7248537290096283}]
```
where 'LABEL_0' means false and 'score' is the model's confidence in that label.
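
The raw label can be mapped to a readable verdict with a small helper. This is a convenience sketch, not part of the model itself; it only encodes the label scheme stated above (LABEL_0 = false, LABEL_1 = true):

```python
# Map the pipeline's raw output to a readable verdict.
# Label scheme per the model card: LABEL_0 = false (fake), LABEL_1 = true (real).
LABELS = {"LABEL_0": "fake", "LABEL_1": "real"}

def verdict(result):
    """Convert one pipeline result dict into a (verdict, score) pair."""
    return LABELS[result["label"]], result["score"]

example = {"label": "LABEL_0", "score": 0.7248537290096283}
print(verdict(example))  # ('fake', 0.7248537290096283)
```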

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
https://huggingface.co/datasets/GonzaloA/fake_news
https://github.com/GeorgeMcIntire/fake_real_news_dataset

#### Preprocessing
Preprocessing was done with the function below. Note that the test data evaluated below was not truncated to
`12 >= len(new_filtered_words) >= 6`, but it was still preprocessed.
```python
import re
import string

import spacy
from nltk.corpus import stopwords  # requires: nltk.download('stopwords')

lem = spacy.load('en_core_web_sm')  # requires: python -m spacy download en_core_web_sm


def testing_data_prep(text):
    """
    Args:
        text (str): The input text string.

    Returns:
        str: The preprocessed text string, or a single space if the word
             count after stopword removal falls outside the 6-20 word range.
    """
    # Convert text to lowercase for case-insensitive processing
    text = str(text).lower()

    # Remove HTML tags and their contents (e.g., "<tag>text</tag>")
    text = re.sub(r'<.*?>+\w+<.*?>', '', text)

    # Remove punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)

    # Remove words containing digits (e.g., "model2023", "data10")
    text = re.sub(r'\w*\d\w*', '', text)

    # Remove newline characters
    text = re.sub('\n', '', text)

    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)

    # Lemmatize words (convert them to their base form)
    doc = lem(text)
    words = [token.lemma_ for token in doc]

    # Remove stopwords such as "do", "not", "as", etc.
    # (https://gist.github.com/sebleier/554280)
    new_filtered_words = [
        word for word in words if word not in stopwords.words('english')]

    if 20 >= len(new_filtered_words) >= 6:
        return ' '.join(new_filtered_words)
    return ' '
```
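
The final stopword-removal and length-filter step can be illustrated without the spaCy/NLTK dependencies. This is a minimal sketch with a tiny hardcoded stopword set (an assumption for illustration only; the real function uses NLTK's full English list and lemmatizes first):

```python
# Minimal sketch of the stopword removal + 6-20 word length filter above.
# STOPWORDS is a small hardcoded subset, assumed for illustration only.
STOPWORDS = {"the", "a", "an", "is", "of", "to", "in", "and"}

def length_filter(text):
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    if 20 >= len(words) >= 6:
        return ' '.join(words)
    return ' '

print(length_filter("The president signs a new trade deal with european leaders today"))
print(length_filter("Too short"))  # fewer than 6 content words -> ' '
```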

#### Training Hyperparameters
The following hyperparameters were used during training:

 - learning_rate: 2e-5
 - train_batch_size: 32
 - eval_batch_size: 32
 - num_epochs: 5
 - warmup_steps: 500
 - weight_decay: 0.03
 - random seed: 42
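
These values map onto a `transformers` `TrainingArguments` configuration roughly as sketched below. The exact training script was not published, so `output_dir` and any arguments not listed above are assumptions:

```python
from transformers import TrainingArguments  # assumes transformers is installed

# Sketch of the configuration implied by the hyperparameters above.
# output_dir is an assumption; it was not published in the model card.
args = TrainingArguments(
    output_dir="bert-fake-news",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    warmup_steps=500,
    weight_decay=0.03,
    seed=42,
)
```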
   
### Testing Data and Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

https://huggingface.co/datasets/GonzaloA/fake_news
https://github.com/GeorgeMcIntire/fake_real_news_dataset
https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/
https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification/data


#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
Accuracy
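
Accuracy here is simply the fraction of titles classified correctly. As a sketch (the labels below are hypothetical, not actual evaluation data):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical example: 0 = false (fake), 1 = true (real)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy(y_true, y_pred))  # 0.8
```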

### Results
For testing on the GonzaloA/fake_news test split:
```
              precision    recall  f1-score   support

           0       0.93      0.94      0.94      3782
           1       0.95      0.94      0.95      4335

    accuracy                           0.94      8117
   macro avg       0.94      0.94      0.94      8117
weighted avg       0.94      0.94      0.94      8117
```

For testing on https://github.com/GeorgeMcIntire/fake_real_news_dataset
```
              precision    recall  f1-score   support

           0       0.93      0.88      0.90      2297
           1       0.89      0.93      0.91      2297

    accuracy                           0.91      4594
   macro avg       0.91      0.91      0.91      4594
weighted avg       0.91      0.91      0.91      4594

```
For testing on https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/
```
              precision    recall  f1-score   support

           0     0.9736    0.9750    0.9743     10455
           1     0.9726    0.9711    0.9718      9541

    accuracy                         0.9731     19996
   macro avg     0.9731    0.9731    0.9731     19996
weighted avg     0.9731    0.9731    0.9731     19996
```

For testing on 1,000 random rows of https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification/data
```
              precision    recall  f1-score   support

           0       0.87      0.80      0.84       492
           1       0.82      0.89      0.85       508

    accuracy                           0.85      1000
   macro avg       0.85      0.85      0.85      1000
weighted avg       0.85      0.85      0.85      1000
```
#### Hardware

Tesla T4 GPU, available for free in Google Colab