Twitter Spam classification
This model classifies Tweets from X (formerly known as Twitter) as either 'Spam' (1) or 'Quality' (0).
It was fine-tuned on UtkMl's Twitter Spam Detection dataset, with microsoft/deberta-v3-large
serving as the base model.
Here is some source code to get you started on using the model to classify spam Tweets.
import torch
from datasets import Dataset
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoModelForSequenceClassification, AutoTokenizer


def classify_texts(df, text_col, model_path="cja5553/deberta-Twitter-spam-classification", batch_size=24):
    '''
    Classifies texts as either "Quality" or "Spam" using a pre-trained sequence classification model.

    Parameters:
    -----------
    df : pandas.DataFrame
        DataFrame containing the texts to classify.
    text_col : str
        Name of the column in `df` that contains the text data to be classified.
    model_path : str, default="cja5553/deberta-Twitter-spam-classification"
        Path to the pre-trained model for sequence classification.
    batch_size : int, optional, default=24
        Batch size for loading and processing data in batches. Adjust based on available GPU memory.

    Returns:
    --------
    pandas.DataFrame
        The original DataFrame with an additional column `spam_prediction`, containing the predicted labels ("Quality" or "Spam") for each text.
    '''
    # Use the GPU when one is available, otherwise fall back to the CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSequenceClassification.from_pretrained(model_path).to(device)
    model.eval()  # Set model to evaluation mode

    # Prepare the text data for classification (ensure every entry is a string).
    # Only the text is copied into the Dataset, so no helper column is added to df.
    texts = df[text_col].astype(str).tolist()

    # Convert the data to a Hugging Face Dataset and tokenize
    text_dataset = Dataset.from_dict({"text": texts})

    def tokenize_function(example):
        return tokenizer(
            example["text"],
            padding="max_length",
            truncation=True,
            max_length=512
        )

    text_dataset = text_dataset.map(tokenize_function, batched=True)
    text_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

    # DataLoader for the text data (order is preserved because shuffle defaults to False)
    text_loader = DataLoader(text_dataset, batch_size=batch_size)

    # Make predictions
    predictions = []
    with torch.no_grad():
        for batch in tqdm(text_loader):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            preds = torch.argmax(logits, dim=-1).cpu().tolist()  # Get predicted label ids
            predictions.extend(preds)

    # Map predictions to labels
    id2label = {0: "Quality", 1: "Spam"}
    predicted_labels = [id2label[pred] for pred in predictions]

    # Add predictions to the original DataFrame
    df["spam_prediction"] = predicted_labels
    return df


# `df` is your own DataFrame and "text_col" is the name of its text column
spam_df_classification = classify_texts(df, "text_col")
print(spam_df_classification)
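The function above returns hard labels by taking the argmax of the logits. If a confidence score or a custom decision threshold is useful, the logits can first be turned into probabilities with a softmax. The following is a minimal sketch in plain Python so it runs without a GPU or the model itself. The helper `label_with_threshold`, its `threshold` parameter, and the example logits are hypothetical and not part of the model repository.

```python
import math

ID2LABEL = {0: "Quality", 1: "Spam"}  # same mapping as in classify_texts


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_threshold(logit_rows, threshold=0.5):
    """Return (label, spam_probability) for each [quality_logit, spam_logit] row.

    `threshold` is a hypothetical knob: raising it flags fewer Tweets as spam.
    """
    results = []
    for row in logit_rows:
        spam_prob = softmax(row)[1]
        label = ID2LABEL[1] if spam_prob >= threshold else ID2LABEL[0]
        results.append((label, spam_prob))
    return results


# Made-up logits for illustration only (not real model outputs)
example_logits = [[2.0, -1.0], [-0.5, 1.5], [1.0, 0.5]]
for label, prob in label_with_threshold(example_logits):
    print(f"{label}: spam probability {prob:.3f}")
```

To use this with the real model, the rows could be collected inside the prediction loop with `outputs.logits.cpu().tolist()` instead of calling `torch.argmax`.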
Based on an 80-10-10 train-val-test split, the following results were obtained on the test set:
The code used to train these models is available on GitHub at github.com/cja5553/Twitter_spam_detection
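For readers who want to build an 80-10-10 train-val-test split like the one mentioned above, here is a minimal sketch. It is not taken from the author's training code (the GitHub repository is the reference for that). The function name `split_80_10_10`, the seed, and the plain shuffle without stratification are all assumptions made for illustration.

```python
import random


def split_80_10_10(items, seed=42):
    """Shuffle `items` and split them into train/val/test parts (80/10/10).

    Hypothetical helper: the seed and the unstratified shuffle are assumptions.
    """
    items = list(items)            # copy so the caller's data is not reordered
    rng = random.Random(seed)      # local RNG for a reproducible shuffle
    rng.shuffle(items)

    n = len(items)
    n_train = n * 8 // 10          # integer arithmetic avoids float rounding
    n_val = n // 10

    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


train, val, test = split_80_10_10(range(1000))
print(len(train), len(val), len(test))
```

For an imbalanced spam dataset, a stratified split that keeps the Spam/Quality ratio equal across the three parts may be preferable. This sketch does not do that.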
Contact me at [email protected]