Hi Team,
This is a report from Giskard Bot Scan 🐢.
We have identified 8 potential vulnerabilities in your model based on an automated scan.
This automated analysis evaluated the model on the dataset tweet_eval (subset irony, split validation).
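For context, a scan like this one can typically be reproduced locally with the open-source giskard library. The sketch below is a hedged outline rather than the exact configuration used for this report: the model checkpoint, the label ordering, and some API details (which may vary by Giskard version) are assumptions.

```python
# Hedged reproduction sketch -- the checkpoint name, label ordering, and exact
# Giskard API details are assumptions and may differ from this report's setup.
import numpy as np
import pandas as pd
import giskard
from datasets import load_dataset
from transformers import pipeline

# Same data as the scan: tweet_eval, subset "irony", validation split.
raw = load_dataset("tweet_eval", "irony", split="validation")
df = pd.DataFrame({"text": raw["text"],
                   "label": [["non_irony", "irony"][i] for i in raw["label"]]})

# Placeholder irony classifier; substitute your own model here. We assume the
# checkpoint exposes "non_irony"/"irony" as its output labels.
clf = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-irony", top_k=None)
LABELS = ["non_irony", "irony"]

def predict_proba(batch: pd.DataFrame) -> np.ndarray:
    """Return an (n_samples, n_classes) probability array in LABELS order."""
    scores = clf(batch["text"].tolist())
    return np.array([[dict((d["label"], d["score"]) for d in row).get(lab, 0.0)
                      for lab in LABELS] for row in scores])

model = giskard.Model(model=predict_proba, model_type="classification",
                      classification_labels=LABELS, feature_names=["text"])
dataset = giskard.Dataset(df=df, target="label", column_types={"text": "text"})

report = giskard.scan(model, dataset)   # runs the automated detectors behind this report
report.to_html("scan_report.html")      # inspect the findings locally
```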
👉Ethical issues (1)
When feature “text” is perturbed with the transformation “Switch countries from high- to low-income and vice versa”, the model changes its prediction in 6.06% of the cases. We expected the predictions not to be affected by this transformation.
| Level | Data slice | Metric | Deviation |
|-------|------------|--------|-----------|
| medium 🟡 | — | Fail rate = 0.061 | 2/33 tested samples (6.06%) changed prediction after perturbation |
Taxonomy: avid-effect:ethics:E0101, avid-effect:performance:P0201
🔍✨Examples
|  | text | Switch countries from high- to low-income and vice versa(text) | Original prediction | Prediction after perturbation |
|---|------|------|------|------|
| 485 | @user @user it's like you're in the Maldives #seaandwhitesands | @user @user it's like you're in the Burkina Faso #seaandwhitesands | irony (p = 0.61) | non_irony (p = 0.61) |
| 686 | AAP said will declare AK candidate in last list but declared it before.This issue affecting India's GDP is termed as U-Turn by BJP #AK4Delhi | AAP said will declare AK candidate in last list but declared it before.This issue affecting United States's GDP is termed as U-Turn by BJP #AK4Delhi | irony (p = 0.50) | non_irony (p = 0.52) |
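To make this check concrete, here is a minimal, self-contained sketch of the metamorphic test it describes: swap country mentions between income groups and count how often the predicted label flips. The swap table and helper names are illustrative assumptions, not Giskard's internal implementation.

```python
# Minimal sketch of the country-swap invariance check; the swap table below is a
# tiny illustrative subset, not the full high-/low-income mapping used by the scanner.
COUNTRY_SWAP = {
    "Maldives": "Burkina Faso", "Burkina Faso": "Maldives",
    "United States": "India", "India": "United States",
}

def swap_countries(text: str) -> str:
    """Replace the first recognised country mention with its counterpart."""
    for src, dst in COUNTRY_SWAP.items():
        if src in text:
            return text.replace(src, dst)
    return text

def fail_rate(texts, predict_label) -> float:
    """Share of perturbable samples whose predicted label changes after the swap."""
    pairs = [(t, swap_countries(t)) for t in texts]
    tested = [(t, p) for t, p in pairs if p != t]          # only texts that actually changed
    if not tested:
        return 0.0
    flips = sum(predict_label(t) != predict_label(p) for t, p in tested)
    return flips / len(tested)                             # e.g. 2 / 33 ≈ 0.061 in this report
```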
👉Overconfidence issues (1)
For records in the dataset where text_length(text) < 87.500, we found a significantly higher number of overconfident wrong predictions (64 samples, corresponding to 55.17% of the wrong predictions in the data slice).
| Level | Data slice | Metric | Deviation |
|-------|------------|--------|-----------|
| medium 🟡 | text_length(text) < 87.500 | Overconfidence rate = 0.552 | +12.47% than global |
Taxonomy: avid-effect:performance:P0204
🔍✨Examples
|  | text | text_length(text) | label | Predicted label |
|---|------|------|------|------|
| 470 | Today has been a blast | 22 | non_irony | irony (p = 0.98)<br>non_irony (p = 0.02) |
| 771 | My dad's such a big kid on Christmas morning waking everyone up so bloody early | 79 | non_irony | irony (p = 0.97)<br>non_irony (p = 0.03) |
| 902 | When one ear breaks on your headphones it's so frustrating! #today | 67 | non_irony | irony (p = 0.97)<br>non_irony (p = 0.03) |
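For reference, one way to approximate the metric behind this finding is sketched below. This formulation (a wrong prediction counts as overconfident when its predicted-class probability exceeds the true-class probability by more than a threshold) is an assumption and may not match Giskard's exact definition; the roughly 116 wrong predictions implied for the slice follow from 64 / 0.5517.

```python
# Illustrative overconfidence metric on a data slice; the exact Giskard definition may differ.
import numpy as np

def overconfidence_rate(proba: np.ndarray, y_true: np.ndarray, threshold: float = 0.5) -> float:
    """Among wrong predictions, the share whose predicted-class probability exceeds
    the true-class probability by more than `threshold`. `y_true` holds integer
    class indices aligned with the columns of `proba`."""
    pred = proba.argmax(axis=1)
    wrong = pred != y_true
    if not wrong.any():
        return 0.0
    idx = np.arange(len(y_true))
    margin = proba[idx, pred] - proba[idx, y_true]
    return float((wrong & (margin > threshold)).sum() / wrong.sum())

# Slice as in the finding above: tweets shorter than 87.5 characters.
# mask = df["text"].str.len() < 87.5
# overconfidence_rate(proba[mask.to_numpy()], y[mask.to_numpy()])  # ≈ 0.552 (64 of ~116 wrong predictions)
# overconfidence_rate(proba, y)                                    # global baseline for the +12.47% comparison
```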
👉Robustness issues (5)
When feature “text” is perturbed with the transformation “Transform to uppercase”, the model changes its prediction in 20.15% of the cases. We expected the predictions not to be affected by this transformation.
| Level | Data slice | Metric | Deviation |
|-------|------------|--------|-----------|
| major 🔴 | — | Fail rate = 0.201 | 192/953 tested samples (20.15%) changed prediction after perturbation |
Taxonomy: avid-effect:performance:P0201
🔍✨Examples
|  | text | Transform to uppercase(text) | Original prediction | Prediction after perturbation |
|---|------|------|------|------|
| 4 | !!! RT @user Of all the places to get stuck in a traffic jam | !!! RT @USER OF ALL THE PLACES TO GET STUCK IN A TRAFFIC JAM | irony (p = 0.51) | non_irony (p = 0.78) |
| 13 | Workaholics: if you're sick, don't let that stop you from bringing your germs into the office. We all appreciate your commitment. | WORKAHOLICS: IF YOU'RE SICK, DON'T LET THAT STOP YOU FROM BRINGING YOUR GERMS INTO THE OFFICE. WE ALL APPRECIATE YOUR COMMITMENT. | irony (p = 0.90) | non_irony (p = 0.89) |
| 19 | Flight diverted over boiling water incident | FLIGHT DIVERTED OVER BOILING WATER INCIDENT | irony (p = 0.70) | non_irony (p = 0.88) |
When feature “text” is perturbed with the transformation “Transform to title case”, the model changes its prediction in 15.53% of the cases. We expected the predictions not to be affected by this transformation.
| Level | Data slice | Metric | Deviation |
|-------|------------|--------|-----------|
| major 🔴 | — | Fail rate = 0.155 | 148/953 tested samples (15.53%) changed prediction after perturbation |
Taxonomy: avid-effect:performance:P0201
🔍✨Examples
|  | text | Transform to title case(text) | Original prediction | Prediction after perturbation |
|---|------|------|------|------|
| 4 | !!! RT @user Of all the places to get stuck in a traffic jam | !!! Rt @User Of All The Places To Get Stuck In A Traffic Jam | irony (p = 0.51) | non_irony (p = 0.80) |
| 13 | Workaholics: if you're sick, don't let that stop you from bringing your germs into the office. We all appreciate your commitment. | Workaholics: If You'Re Sick, Don'T Let That Stop You From Bringing Your Germs Into The Office. We All Appreciate Your Commitment. | irony (p = 0.90) | non_irony (p = 0.82) |
| 19 | Flight diverted over boiling water incident | Flight Diverted Over Boiling Water Incident | irony (p = 0.70) | non_irony (p = 0.79) |
When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 11.99% of the cases. We expected the predictions not to be affected by this transformation.
| Level | Data slice | Metric | Deviation |
|-------|------------|--------|-----------|
| major 🔴 | — | Fail rate = 0.120 | 102/851 tested samples (11.99%) changed prediction after perturbation |
Taxonomy: avid-effect:performance:P0201
🔍✨Examples
|  | text | Add typos(text) | Original prediction | Prediction after perturbation |
|---|------|------|------|------|
| 7 | #notcies #eu EU backs 328 top early-career researchers with 485 million | #niotcies #eu EU backs 328 top early-carere researchers with 485 million | non_irony (p = 0.64) | irony (p = 0.54) |
| 22 | @user @user @user Well done. You have more Twitter followers than me. You have succeeded in life | @user @user @user Well dkone. You have morre Twitter folloers than me. You have sucdeeded in life | irony (p = 0.89) | non_irony (p = 0.92) |
| 55 | @user @user you can't reason with someone with a bio as moronic as his. "So should everyone else" #SoDemocratic | @user @usdr you can't reason with someone with a bip as moronic as his. "So shou everyone else" #SoDemoctatic | irony (p = 0.59) | non_irony (p = 0.55) |
When feature “text” is perturbed with the transformation “Punctuation Removal”, the model changes its prediction in 8.80% of the cases. We expected the predictions not to be affected by this transformation.
| Level | Data slice | Metric | Deviation |
|-------|------------|--------|-----------|
| medium 🟡 | — | Fail rate = 0.088 | 68/773 tested samples (8.80%) changed prediction after perturbation |
Taxonomy: avid-effect:performance:P0201
🔍✨Examples
|  | text | Punctuation Removal(text) | Original prediction | Prediction after perturbation |
|---|------|------|------|------|
| 4 | !!! RT @user Of all the places to get stuck in a traffic jam | RT @user Of all the places to get stuck in a traffic jam | irony (p = 0.51) | non_irony (p = 0.63) |
| 15 | What else would you do on friday? #TGIF #8crap | What else would you do on friday #TGIF #8crap |  |  |
| 55 | @user @user you can't reason with someone with a bio as moronic as his. "So should everyone else" #SoDemocratic | @user @user you can t reason with someone with a bio as moronic as his So should everyone else #SoDemocratic | irony (p = 0.59) | non_irony (p = 0.53) |
When feature “text” is perturbed with the transformation “Transform to lowercase”, the model changes its prediction in 5.16% of the cases. We expected the predictions not to be affected by this transformation.
| Level | Data slice | Metric | Deviation |
|-------|------------|--------|-----------|
| medium 🟡 | — | Fail rate = 0.052 | 44/852 tested samples (5.16%) changed prediction after perturbation |
Taxonomy: avid-effect:performance:P0201
🔍✨Examples
|  | text | Transform to lowercase(text) | Original prediction | Prediction after perturbation |
|---|------|------|------|------|
| 4 | !!! RT @user Of all the places to get stuck in a traffic jam | !!! rt @user of all the places to get stuck in a traffic jam | irony (p = 0.51) | non_irony (p = 0.62) |
| 29 | @user Frisky at 2am? That's nothing new. | @user frisky at 2am? that's nothing new. | non_irony (p = 0.57) | irony (p = 0.65) |
| 74 | Honking at me whilst you drive past - so romantic, it makes me want to trace you through your number plate and be with you forever | honking at me whilst you drive past - so romantic, it makes me want to trace you through your number plate and be with you forever | non_irony (p = 0.53) | irony (p = 0.51) |
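The five robustness findings above all follow the same metamorphic pattern: apply a label-preserving text transformation and count prediction flips. The sketch below reuses that pattern with simple stand-ins for the transformations; the typo generator and the punctuation handling are illustrative assumptions, not Giskard's implementations.

```python
# Sketch of the case/typo/punctuation robustness checks above; the transformation
# implementations are simplified stand-ins for Giskard's built-in ones.
import random
import re

def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly drop, duplicate or swap a few alphabetic characters."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and rng.random() < rate:
            op = rng.choice(("drop", "dup", "swap"))
            if op == "drop":
                chars[i] = ""
            elif op == "dup":
                chars[i] = chars[i] * 2
            else:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

TRANSFORMATIONS = {
    "Transform to uppercase": str.upper,
    "Transform to title case": str.title,
    "Transform to lowercase": str.lower,
    "Add typos": add_typos,
    # Replace punctuation with spaces, keeping hashtags and mentions intact.
    "Punctuation Removal": lambda t: " ".join(re.sub(r"[^\w\s#@]", " ", t).split()),
}

def robustness_fail_rates(texts, predict_label) -> dict:
    """For each transformation, the share of actually-perturbed samples whose label flips."""
    rates = {}
    for name, transform in TRANSFORMATIONS.items():
        pairs = [(t, transform(t)) for t in texts]
        tested = [(t, p) for t, p in pairs if p != t]          # skip no-op perturbations
        flips = sum(predict_label(t) != predict_label(p) for t, p in tested)
        rates[name] = flips / len(tested) if tested else 0.0   # e.g. 192/953 ≈ 0.201 for uppercase
    return rates
```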
👉Performance issues (1)
For records in the dataset where text contains "user", the Recall is 22.76% lower than the global Recall.
| Level | Data slice | Metric | Deviation |
|-------|------------|--------|-----------|
| major 🔴 | text contains "user" | Recall = 0.556 | -22.76% than global |
Taxonomy: avid-effect:performance:P0204
🔍✨Examples
|  | text | label | Predicted label |
|---|------|------|------|
| 35 | @user hahaha such a 1% town | non_irony | irony (p = 0.58) |
| 53 | @user Just abt 2 say d same :) I'm not sure whether Oxford Brookes Uni is part of Oxford Uni. yet his CV is impressive still! | irony | non_irony (p = 0.83) |
| 64 | @user even your link to the service alert is down. | irony | non_irony (p = 0.65) |
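For reference, the slice-vs-global recall comparison behind this finding can be reproduced roughly as below. The helper name and the assumption that "irony" is the positive class are ours; the implied global recall of about 0.72 follows from 0.556 and the -22.76% relative deviation reported above.

```python
# Sketch of the slice-vs-global recall comparison; assumes "irony" is the positive class.
import numpy as np
import pandas as pd
from sklearn.metrics import recall_score

def recall_deviation(df: pd.DataFrame, y_pred, positive_label: str = "irony") -> float:
    """Relative recall change on tweets containing 'user' vs. the full dataset."""
    y_pred = np.asarray(y_pred)
    mask = df["text"].str.contains("user", case=False).to_numpy()
    recall_slice = recall_score(df.loc[mask, "label"], y_pred[mask], pos_label=positive_label)
    recall_global = recall_score(df["label"], y_pred, pos_label=positive_label)
    # With recall_slice = 0.556 and a -22.76% deviation, the implied global recall is ~0.72.
    return (recall_slice - recall_global) / recall_global
```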
Check out the Giskard Space and test your model.
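If you want to keep tracking these findings after model changes, a scan result can typically be turned into a reusable test suite. The snippet below assumes the `report` object from the reproduction sketch at the top of this report, and method availability may vary by Giskard version.

```python
# Hypothetical follow-up, assuming the `report` object from the reproduction sketch above.
suite = report.generate_test_suite("tweet_eval irony - scan follow-up")
results = suite.run()   # re-run the derived tests locally after retraining or data fixes
```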
Disclaimer: automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess their impact accordingly.