Hi Team,
This is a report from Giskard Bot Scan 🐢.
We have identified 11 potential vulnerabilities in your model based on an automated scan.
This automated analysis evaluated the model on the dataset sst2 (subset default, split train).
👉Underconfidence issues (3)
For records in your dataset where text_length(text) >= 50.500, we found a significantly higher number of underconfident predictions (1275 samples, corresponding to 4.77% of the predictions in the data slice).

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | text_length(text) >= 50.500 | Underconfidence rate = 0.048 | +105.32% than global |

Taxonomy
avid-effect:performance:P0204
🔍✨Examples
| | text | text_length(text) | label | Predicted label |
|---|---|---|---|---|
| 46185 | richly detailed , deftly executed and utterly absorbing | 56 | informal | informal (p = 0.50); formal (p = 0.50) |
| 35430 | an inelegant combination of two unrelated shorts that falls far short | 70 | formal | informal (p = 0.50); formal (p = 0.50) |
| 41573 | covers huge , heavy topics in a bland , surfacey way that does n't offer any insight into why , for instance , good things happen to bad people | 144 | formal | formal (p = 0.50); informal (p = 0.50) |

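The near 0.50 / 0.50 splits in these examples are what an underconfident prediction looks like: the top two class probabilities are almost tied. The report does not state how the flag is computed, so the following is only a minimal sketch, assuming the indicator is the ratio of the second-highest to the highest probability. The function name and the 0.95 cutoff are illustrative assumptions, not values taken from this report.

```python
from typing import Sequence


def is_underconfident(probs: Sequence[float], ratio_threshold: float = 0.95) -> bool:
    """Flag a prediction whose top two class probabilities are nearly tied.

    ratio_threshold is a hypothetical cutoff chosen for illustration.
    """
    ranked = sorted(probs, reverse=True)
    top, runner_up = ranked[0], ranked[1]
    return runner_up / top >= ratio_threshold


# Illustrative probabilities, not rows from the report.
print(is_underconfident([0.502, 0.498]))  # near tie between the two classes
print(is_underconfident([0.93, 0.07]))    # clear winner
```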
For records in your dataset where avg_word_length(text) < 6.969 AND avg_word_length(text) >= 3.506, we found a significantly higher number of underconfident predictions (1462 samples, corresponding to 2.80% of the predictions in the data slice).

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | avg_word_length(text) < 6.969 AND avg_word_length(text) >= 3.506 | Underconfidence rate = 0.028 | +20.70% than global |

Taxonomy
avid-effect:performance:P0204
🔍✨Examples
| | text | avg_word_length(text) | label | Predicted label |
|---|---|---|---|---|
| 46185 | richly detailed , deftly executed and utterly absorbing | 6 | informal | informal (p = 0.50); formal (p = 0.50) |
| 35430 | an inelegant combination of two unrelated shorts that falls far short | 5.36364 | formal | informal (p = 0.50); formal (p = 0.50) |
| 41573 | covers huge , heavy topics in a bland , surfacey way that does n't offer any insight into why , for instance , good things happen to bad people | 3.96552 | formal | formal (p = 0.50); informal (p = 0.50) |

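The "Deviation" column is not defined in the report. If it is read as the relative difference between the slice rate and the global rate (an assumption), the two underconfidence slices reported so far should imply the same global rate. A short check of that arithmetic, with hypothetical helper names:

```python
def relative_deviation(slice_rate: float, global_rate: float) -> float:
    """Relative difference of a slice rate from the global rate (assumed formula)."""
    return (slice_rate - global_rate) / global_rate


def implied_global_rate(slice_rate: float, deviation_pct: float) -> float:
    """Invert the assumed formula to recover the global rate."""
    return slice_rate / (1.0 + deviation_pct / 100.0)


# Slice rates and deviations as printed in the report.
g_length = implied_global_rate(0.0477, 105.32)      # text_length slice
g_word_length = implied_global_rate(0.0280, 20.70)  # avg_word_length slice

print(round(g_length, 4), round(g_word_length, 4))
```

Both slices point to a global underconfidence rate of roughly 2.3%. That figure is derived here under the stated assumption; the report itself does not print it.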
For records in your dataset where avg_whitespace(text) >= 0.125 AND avg_whitespace(text) < 0.222, we found a significantly higher number of underconfident predictions (1462 samples, corresponding to 2.80% of the predictions in the data slice).

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | avg_whitespace(text) >= 0.125 AND avg_whitespace(text) < 0.222 | Underconfidence rate = 0.028 | +20.70% than global |

Taxonomy
avid-effect:performance:P0204
🔍✨Examples
| | text | avg_whitespace(text) | label | Predicted label |
|---|---|---|---|---|
| 46185 | richly detailed , deftly executed and utterly absorbing | 0.142857 | informal | informal (p = 0.50); formal (p = 0.50) |
| 35430 | an inelegant combination of two unrelated shorts that falls far short | 0.157143 | formal | informal (p = 0.50); formal (p = 0.50) |
| 41573 | covers huge , heavy topics in a bland , surfacey way that does n't offer any insight into why , for instance , good things happen to bad people | 0.201389 | formal | formal (p = 0.50); informal (p = 0.50) |

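The report names three slicing features, text_length, avg_word_length and avg_whitespace, without defining them. The definitions below are inferred from the example values and are an assumption, not the scanner's code. They reproduce the figures for record 46185 only if the stored text ends with one trailing space, which is also an assumption.

```python
def text_length(text: str) -> int:
    """Number of characters, including any trailing whitespace."""
    return len(text)


def avg_word_length(text: str) -> float:
    """Mean length of whitespace-separated tokens (punctuation counts as a token)."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def avg_whitespace(text: str) -> float:
    """Share of characters that are whitespace."""
    return sum(ch.isspace() for ch in text) / len(text) if text else 0.0


# Record 46185, with an assumed trailing space.
sample = "richly detailed , deftly executed and utterly absorbing "
print(text_length(sample))               # 56
print(avg_word_length(sample))           # 6.0
print(round(avg_whitespace(sample), 6))  # 0.142857
```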
👉Ethical issues (1)
When feature “text” is perturbed with the transformation “Switch countries from high- to low-income and vice versa”, the model changes its prediction in 5.8% of the cases. We expected the predictions not to be affected by this transformation.
| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| medium 🟡 | — | Fail rate = 0.058 | 58/1000 tested samples (5.8%) changed prediction after perturbation |

Taxonomy
avid-effect:ethics:E0101
avid-effect:performance:P0201
🔍✨Examples
| | text | Switch countries from high- to low-income and vice versa(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 65960 | point the way for adventurous indian filmmakers toward a crossover | point the way for adventurous Tongan filmmakers toward a crossover | informal (p = 0.52) | formal (p = 0.55) |
| 63459 | : land beyond time is an enjoyable big movie primarily because australia is a weirdly beautiful place . | : land beyond time is an enjoyable big movie primarily because Yemen is a weirdly beautiful place . | informal (p = 0.52) | formal (p = 0.51) |
| 65238 | is on his way to becoming the american indian spike lee . | is on his way to becoming the Yemeni French spike lee . | informal (p = 0.55) | formal (p = 0.51) |

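The fail rate reported for a perturbation is the share of tested samples whose predicted label changes, here 58 out of 1000. A minimal sketch of that calculation follows. `toy_predict` and `swap_country` are hypothetical stand-ins written for illustration; they are not the scanned model or the scanner's transformation.

```python
from typing import Callable, Sequence


def fail_rate(
    texts: Sequence[str],
    predict: Callable[[str], str],
    perturb: Callable[[str], str],
) -> float:
    """Share of texts whose predicted label changes after perturbation."""
    changed = sum(1 for t in texts if predict(t) != predict(perturb(t)))
    return changed / len(texts)


def toy_predict(text: str) -> str:
    # Hypothetical classifier: reacts only to the presence of uppercase letters.
    return "formal" if any(ch.isupper() for ch in text) else "informal"


def swap_country(text: str) -> str:
    # Single replacement copied from record 65960 in the report.
    return text.replace("indian", "Tongan")


texts = [
    "point the way for adventurous indian filmmakers toward a crossover",
    "a weirdly beautiful place",
]
print(fail_rate(texts, toy_predict, swap_country))  # 0.5
```

In the reported examples the replacement words are capitalized (Tongan, Yemen) while the source text is lowercase, so some of these flips may reflect sensitivity to casing rather than to the country itself. That is a possibility worth checking, not a finding of the scan.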
👉Robustness issues (3)
When feature “text” is perturbed with the transformation “Transform to uppercase”, the model changes its prediction in 14.3% of the cases. We expected the predictions not to be affected by this transformation.
| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | — | Fail rate = 0.143 | 143/1000 tested samples (14.3%) changed prediction after perturbation |

Taxonomy
avid-effect:performance:P0201
🔍✨Examples
| | text | Transform to uppercase(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 27065 | affecting story about four sisters who are coping , in one way or another , with life | AFFECTING STORY ABOUT FOUR SISTERS WHO ARE COPING , IN ONE WAY OR ANOTHER , WITH LIFE | formal (p = 0.51) | informal (p = 0.79) |
| 4718 | is n't afraid to provoke introspection in both its characters and its audience | IS N'T AFRAID TO PROVOKE INTROSPECTION IN BOTH ITS CHARACTERS AND ITS AUDIENCE | formal (p = 0.51) | informal (p = 0.66) |
| 26573 | lacks the visual flair and bouncing bravado that characterizes better hip-hop clips and is content to recycle images and characters that were already tired 10 years ago | LACKS THE VISUAL FLAIR AND BOUNCING BRAVADO THAT CHARACTERIZES BETTER HIP-HOP CLIPS AND IS CONTENT TO RECYCLE IMAGES AND CHARACTERS THAT WERE ALREADY TIRED 10 YEARS AGO | formal (p = 0.76) | informal (p = 0.66) |

When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 12.5% of the cases. We expected the predictions not to be affected by this transformation.
| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | — | Fail rate = 0.125 | 125/1000 tested samples (12.5%) changed prediction after perturbation |

Taxonomy
avid-effect:performance:P0201
🔍✨Examples
| | text | Add typos(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 61132 | pared down its plots and characters to a few rather than dozens | pared down its plots and characters to a few rather than xdozens | formal (p = 0.53) | informal (p = 0.59) |
| 61447 | 's so crammed with scenes and vistas and pretty moments that it 's left a few crucial things out , like character development and coherence . | 's so crzammed with scenes and vistas and pretty momenys that it 's left af ew crucial things out , like character development and coherencde . | formal (p = 0.50) | informal (p = 0.74) |
| 10747 | the breathtakingly beautiful outer-space documentary space station 3d | the breathtakingly beautifuo outer-space documentary dpace station 3d | formal (p = 0.62) | informal (p = 0.54) |

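The perturbed strings in these examples show three kinds of character edit: an inserted character (dozens to xdozens), a substituted character (beautiful to beautifuo), and two adjacent characters swapped (a few to af ew). A minimal sketch of those edits follows. The helper names are hypothetical, and how the scanner picks positions and characters is not stated in the report, so positions are passed in explicitly here.

```python
def insert_char(text: str, index: int, char: str) -> str:
    """Insert char before position index."""
    return text[:index] + char + text[index:]


def substitute_char(text: str, index: int, char: str) -> str:
    """Replace the character at position index with char."""
    return text[:index] + char + text[index + 1:]


def swap_adjacent(text: str, index: int) -> str:
    """Swap the characters at positions index and index + 1."""
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


# Each call reproduces one edit visible in the report's examples.
print(insert_char("dozens", 0, "x"))         # xdozens
print(substitute_char("beautiful", 8, "o"))  # beautifuo
print(swap_adjacent("a few", 1))             # af ew
```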
When feature “text” is perturbed with the transformation “Transform to title case”, the model changes its prediction in 9.1% of the cases. We expected the predictions not to be affected by this transformation.
| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| medium 🟡 | — | Fail rate = 0.091 | 91/1000 tested samples (9.1%) changed prediction after perturbation |

Taxonomy
avid-effect:performance:P0201
🔍✨Examples
| | text | Transform to title case(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 27065 | affecting story about four sisters who are coping , in one way or another , with life | Affecting Story About Four Sisters Who Are Coping , In One Way Or Another , With Life | formal (p = 0.51) | informal (p = 0.60) |
| 4718 | is n't afraid to provoke introspection in both its characters and its audience | Is N'T Afraid To Provoke Introspection In Both Its Characters And Its Audience | formal (p = 0.51) | informal (p = 0.61) |
| 54137 | demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop | Demonstrates That The Director Of Such Hollywood Blockbusters As Patriot Games Can Still Turn Out A Small , Personal Film With An Emotional Wallop | formal (p = 0.54) | informal (p = 0.51) |

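The uppercase and title-case outputs shown for record 4718 match what Python's built-in `str.upper()` and `str.title()` produce, including the "N'T" that `str.title()` yields because it capitalizes the letter after an apostrophe. Whether the scanner actually calls these methods is an assumption. The sketch below only reproduces the two transformed strings so the same check can be run locally.

```python
# Record 4718, copied from the report.
sample = "is n't afraid to provoke introspection in both its characters and its audience"

upper = sample.upper()
title = sample.title()  # capitalizes after any non-letter, hence "N'T"

print(upper)
print(title)
```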
👉Overconfidence issues (4)
For records in the dataset where text_length(text) < 22.500, we found a significantly higher number of overconfident wrong predictions (6722 samples, corresponding to 87.20% of the wrong predictions in the data slice).

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | text_length(text) < 22.500 | Overconfidence rate = 0.872 | +86.80% than global |

Taxonomy
avid-effect:performance:P0204
🔍✨Examples
| | text | text_length(text) | label | Predicted label |
|---|---|---|---|---|
| 2129 | one baaaaaaaaad movie | 22 | formal | informal (p = 0.93); formal (p = 0.07) |
| 21336 | baaaaaaaaad | 12 | formal | informal (p = 0.93); formal (p = 0.07) |
| 30102 | baaaaaaaaad movie | 18 | formal | informal (p = 0.92); formal (p = 0.08) |

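An overconfident wrong prediction is one where the model picks the wrong label and still assigns it a high probability, as in record 2129 (informal at 0.93 against a true label of formal at 0.07). The report does not give the scoring rule, so this is a sketch under one plausible assumption: the score is the gap between the probability of the predicted label and that of the true label. The 0.5 cutoff and the function names are hypothetical.

```python
from typing import Dict


def overconfidence_score(probs: Dict[str, float], true_label: str) -> float:
    """Gap between predicted-label and true-label probability; 0.0 if correct."""
    predicted = max(probs, key=probs.get)
    if predicted == true_label:
        return 0.0
    return probs[predicted] - probs[true_label]


def is_overconfident(probs: Dict[str, float], true_label: str, threshold: float = 0.5) -> bool:
    """threshold is an illustrative cutoff, not a value from the report."""
    return overconfidence_score(probs, true_label) > threshold


# Record 2129 from the report: true label "formal".
probs_2129 = {"informal": 0.93, "formal": 0.07}
print(round(overconfidence_score(probs_2129, "formal"), 2))  # 0.86
print(is_overconfident(probs_2129, "formal"))                # True
```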
For records in the dataset where avg_word_length(text) >= 6.969, we found a significantly higher number of overconfident wrong predictions (2734 samples, corresponding to 69.53% of the wrong predictions in the data slice).

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | avg_word_length(text) >= 6.969 | Overconfidence rate = 0.695 | +48.95% than global |

Taxonomy
avid-effect:performance:P0204
🔍✨Examples
| | text | avg_word_length(text) | label | Predicted label |
|---|---|---|---|---|
| 19179 | compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children . | 7.33333 | informal | formal (p = 1.00); informal (p = 0.00) |
| 26630 | the intelligence or sincerity it unequivocally deserves | 7 | informal | formal (p = 0.98); informal (p = 0.02) |
| 45560 | the performances and the cinematography to the outstanding soundtrack and unconventional narrative | 7.25 | informal | formal (p = 0.96); informal (p = 0.04) |

For records in the dataset where avg_whitespace(text) < 0.125, we found a significantly higher number of overconfident wrong predictions (2734 samples, corresponding to 69.53% of the wrong predictions in the data slice).

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | avg_whitespace(text) < 0.125 | Overconfidence rate = 0.695 | +48.95% than global |

Taxonomy
avid-effect:performance:P0204
🔍✨Examples
| | text | avg_whitespace(text) | label | Predicted label |
|---|---|---|---|---|
| 19179 | compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children . | 0.12 | informal | formal (p = 1.00); informal (p = 0.00) |
| 26630 | the intelligence or sincerity it unequivocally deserves | 0.125 | informal | formal (p = 0.98); informal (p = 0.02) |
| 45560 | the performances and the cinematography to the outstanding soundtrack and unconventional narrative | 0.121212 | informal | formal (p = 0.96); informal (p = 0.04) |

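The avg_word_length >= 6.969 slice and the avg_whitespace < 0.125 slice report identical figures (2734 samples, 69.53%), which suggests they select the same records. One possible explanation, under the assumption that tokens are separated by single spaces and each text ends with one trailing space: a text with n tokens and C non-space characters has length C + n, so avg_word_length = C / n and avg_whitespace = n / (C + n) = 1 / (1 + avg_word_length). The sketch below checks that relation against the values printed in the report; the function name is hypothetical.

```python
def whitespace_from_word_length(avg_word_length: float) -> float:
    """avg_whitespace implied by avg_word_length under the stated assumption."""
    return 1.0 / (1.0 + avg_word_length)


# (avg_word_length, avg_whitespace) pairs as printed for records 19179, 26630, 45560.
reported = [(7.33333, 0.12), (7.0, 0.125), (7.25, 0.121212)]
for word_length, whitespace in reported:
    print(round(whitespace_from_word_length(word_length), 6), whitespace)

# The slice boundaries map onto each other in the same way.
print(round(whitespace_from_word_length(6.969), 3))  # 0.125
print(round(whitespace_from_word_length(3.506), 3))  # 0.222
```

If the relation holds across the dataset, the paired word-length and whitespace findings in this report describe one underlying issue each rather than two independent ones.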
For records in the dataset where text_length(text) < 34.500 AND text_length(text) >= 22.500, we found a significantly higher number of overconfident wrong predictions (3118 samples, corresponding to 63.83% of the wrong predictions in the data slice).

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | text_length(text) < 34.500 AND text_length(text) >= 22.500 | Overconfidence rate = 0.638 | +36.73% than global |

Taxonomy
avid-effect:performance:P0204
🔍✨Examples
| | text | text_length(text) | label | Predicted label |
|---|---|---|---|---|
| 15121 | is one baaaaaaaaad movie | 25 | formal | informal (p = 0.92); formal (p = 0.08) |
| 49755 | is one baaaaaaaaad movie . | 27 | formal | informal (p = 0.92); formal (p = 0.08) |
| 32815 | this is one baaaaaaaaad movie . | 32 | formal | informal (p = 0.92); formal (p = 0.08) |

Check out the Giskard Space and test your model.
Disclaimer: it's important to note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.