---
language:
- en
license: mit
library_name: transformers
tags:
- fake news
metrics:
- accuracy
pipeline_tag: text-classification
---
# Model Card for bert-fake-news-recognition
<!-- Provide a quick summary of what the model is/does. -->
Predicts whether a news article's title is fake or real.

This is my first work; if you find the model interesting or useful, please like it, it will encourage me to do more research <3
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
This model classifies whether the information given in a news article's title is true or false. It was trained on two datasets,
combined and preprocessed. 0 (LABEL_0) stands for false and 1 (LABEL_1) stands for true.
- **Developed by:** Ostap Mykhailiv
- **Model type:** Classification
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** google-bert/bert-base-uncased
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
Since this is a BERT-based model, it inherits BERT's biases. Be careful when checking recent claims with this model, since
it was trained on pre-2023 data. Additionally, the lack of preprocessing for people's names in the training data may
bias predictions for or against certain persons.
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
To get better overall results, titles were truncated during training. Although this improved results for both longer and
shorter texts, you should give the model no fewer than 6 and no more than 12 words for predictions, excluding stopwords. The preprocessing
operations are described below. News in other languages can be translated into English first, though this may not give the expected results.
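As a rough illustration of the length recommendation above, here is a minimal, self-contained sketch for checking a title before prediction. It uses a small hardcoded stopword subset for demonstration; the actual preprocessing below uses the full NLTK English stopword list.

```python
import re
import string

# Small illustrative stopword subset; the card's preprocessing function
# uses the full NLTK English stopword list instead.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "on", "for"}

def content_word_count(title):
    """Count non-stopword tokens, mirroring the 6-to-12-word recommendation."""
    cleaned = re.sub('[%s]' % re.escape(string.punctuation), '', title.lower())
    return sum(1 for w in cleaned.split() if w not in STOPWORDS)

title = "Scientists discover new species of deep sea fish near the coast"
n = content_word_count(title)
print(n, 6 <= n <= 12)  # 9 True
```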
## How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import pipeline

pipe = pipeline("text-classification", model="omykhailiv/bert-fake-news-recognition")
pipe("Some text")
```
It will return something like this:
```
[{'label': 'LABEL_0', 'score': 0.7248537290096283}]
```
where 'LABEL_0' means false and 'score' is the model's confidence in that label.
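To turn the raw label into a readable verdict, a small helper can be used. This is only a sketch; `LABEL_NAMES` and `interpret` are illustrative names, not part of the model's API.

```python
# LABEL_0 = false (fake), LABEL_1 = true (real), per the card's convention.
LABEL_NAMES = {"LABEL_0": "fake", "LABEL_1": "real"}

def interpret(prediction):
    """Turn one pipeline result dict into a (verdict, score) pair."""
    return LABEL_NAMES[prediction["label"]], prediction["score"]

print(interpret({'label': 'LABEL_0', 'score': 0.7248537290096283}))
# ('fake', 0.7248537290096283)
```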
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
https://huggingface.co/datasets/GonzaloA/fake_news
https://github.com/GeorgeMcIntire/fake_real_news_dataset
#### Preprocessing
Preprocessing was done with the function below. Note that the evaluation data reported below was not truncated to
12 >= len(new_filtered_words) >= 6, but it was still preprocessed.
```python
import re
import string

import nltk
import spacy
from nltk.corpus import stopwords

nltk.download('stopwords')  # ensure the NLTK stopword list is available
lem = spacy.load('en_core_web_sm')

def testing_data_prep(text):
    """
    Args:
        text (str): The input text string.

    Returns:
        str: The preprocessed text string, or a single space if the word count
        does not meet the specified criteria (6 to 20 words).
    """
    # Convert text to lowercase for case-insensitive processing
    text = str(text).lower()
    # Remove HTML tags and their contents (e.g., "<tag>text</tag>")
    text = re.sub(r'<.*?>+\w+<.*?>', '', text)
    # Remove punctuation using regular expressions and string escaping
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Remove words containing digits (e.g., "model2023", "data10")
    text = re.sub(r'\w*\d\w*', '', text)
    # Remove newline characters
    text = re.sub('\n', '', text)
    # Replace multiple whitespace characters with a single space
    text = re.sub(r'\s+', ' ', text)
    # Lemmatize words (convert them to their base form)
    doc = lem(text)
    words = [word.lemma_ for word in doc]
    # Remove stopwords, such as do, not, as, etc. (https://gist.github.com/sebleier/554280)
    new_filtered_words = [
        word for word in words if word not in stopwords.words('english')]
    if 20 >= len(new_filtered_words) >= 6:
        return ' '.join(new_filtered_words)
    return ' '
```
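For instance, the digit-word and whitespace steps above behave like this, in a standalone snippet using only the regexes from the function (the trailing `.strip()` is added here just for tidy output):

```python
import re

# Any token containing a digit is dropped, then whitespace is collapsed,
# matching the corresponding steps in testing_data_prep above.
text = "model2023 results from data10 experiments"
cleaned = re.sub(r'\w*\d\w*', '', text)
cleaned = re.sub(r'\s+', ' ', cleaned).strip()
print(cleaned)  # results from experiments
```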
#### Training Hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-5
- train_batch_size: 32
- eval_batch_size: 32
- num_epochs: 5
- warmup_steps: 500
- weight_decay: 0.03
- random seed: 42
### Testing Data and Metrics
#### Testing Data
<!-- This should link to a Dataset Card if possible. -->
https://huggingface.co/datasets/GonzaloA/fake_news
https://github.com/GeorgeMcIntire/fake_real_news_dataset
https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/
https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification/data
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
Accuracy
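The result tables below appear to be scikit-learn `classification_report` outputs. A minimal sketch of how such a report is produced, using dummy labels rather than this card's evaluation data:

```python
from sklearn.metrics import accuracy_score, classification_report

# Dummy labels for illustration only; not the card's evaluation data.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
print(classification_report(y_true, y_pred))
print(accuracy_score(y_true, y_pred))
```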
### Results
For testing on GonzaloA/fake_news test split dataset
```
precision recall f1-score support
0 0.93 0.94 0.94 3782
1 0.95 0.94 0.95 4335
accuracy 0.94 8117
macro avg 0.94 0.94 0.94 8117
weighted avg 0.94 0.94 0.94 8117
```
For testing on https://github.com/GeorgeMcIntire/fake_real_news_dataset
```
precision recall f1-score support
0 0.93 0.88 0.90 2297
1 0.89 0.93 0.91 2297
accuracy 0.91 4594
macro avg 0.91 0.91 0.91 4594
weighted avg 0.91 0.91 0.91 4594
```
For testing on https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/
```
precision recall f1-score support
0 0.9736 0.9750 0.9743 10455
1 0.9726 0.9711 0.9718 9541
accuracy 0.9731 19996
macro avg 0.9731 0.9731 0.9731 19996
weighted avg 0.9731 0.9731 0.9731 19996
```
For testing on random 1k rows of https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification/data
```
precision recall f1-score support
0 0.87 0.80 0.84 492
1 0.82 0.89 0.85 508
accuracy 0.85 1000
macro avg 0.85 0.85 0.85 1000
weighted avg 0.85 0.85 0.85 1000
```
#### Hardware
Tesla T4 GPU, available for free in Google Colab