Report for s-nlp/roberta-base-formality-ranker

#88
by giskard-bot - opened

Hi Team,

This is a report from Giskard Bot Scan 🐢.

We have identified 11 potential vulnerabilities in your model based on an automated scan.

This automated analysis evaluated the model on the dataset sst2 (subset default, split train).

👉Underconfidence issues (3)

For records in your dataset where text_length(text) >= 50.500, we found a significantly higher number of underconfident predictions (1275 samples, corresponding to 4.77% of the predictions in the data slice).

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | text_length(text) >= 50.500 | Underconfidence rate = 0.048 | +105.32% than global |

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

| | text | text_length(text) | label | Predicted label |
|---|---|---|---|---|
| 46185 | richly detailed , deftly executed and utterly absorbing | 56 | informal | informal (p = 0.50); formal (p = 0.50) |
| 35430 | an inelegant combination of two unrelated shorts that falls far short | 70 | formal | informal (p = 0.50); formal (p = 0.50) |
| 41573 | covers huge , heavy topics in a bland , surfacey way that does n't offer any insight into why , for instance , good things happen to bad people | 144 | formal | formal (p = 0.50); informal (p = 0.50) |
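
As a rough illustration of what an underconfidence flag means, the sketch below marks a prediction as underconfident when the runner-up class probability is nearly as large as the top one. The ratio criterion, the 0.9 threshold, and the function name `is_underconfident` are assumptions made for illustration; the report itself does not state how the scan defines the flag.

```python
from typing import Mapping


def is_underconfident(probs: Mapping[str, float], threshold: float = 0.9) -> bool:
    """Return True when the second-highest probability is close to the highest.

    `threshold` is an assumed cut-off on the ratio second / top,
    not a value taken from the report.
    """
    ranked = sorted(probs.values(), reverse=True)
    if len(ranked) < 2 or ranked[0] == 0:
        return False
    return ranked[1] / ranked[0] >= threshold


# Row 46185 from the examples: both classes sit at p = 0.50.
print(is_underconfident({"informal": 0.50, "formal": 0.50}))

# A confident prediction, for contrast.
print(is_underconfident({"informal": 0.93, "formal": 0.07}))
```

Under this assumed definition, every example row listed for this issue (all at p = 0.50 / 0.50) would be flagged.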

For records in your dataset where avg_word_length(text) < 6.969 AND avg_word_length(text) >= 3.506, we found a significantly higher number of underconfident predictions (1462 samples, corresponding to 2.80% of the predictions in the data slice).

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | avg_word_length(text) < 6.969 AND avg_word_length(text) >= 3.506 | Underconfidence rate = 0.028 | +20.70% than global |

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

| | text | avg_word_length(text) | label | Predicted label |
|---|---|---|---|---|
| 46185 | richly detailed , deftly executed and utterly absorbing | 6 | informal | informal (p = 0.50); formal (p = 0.50) |
| 35430 | an inelegant combination of two unrelated shorts that falls far short | 5.36364 | formal | informal (p = 0.50); formal (p = 0.50) |
| 41573 | covers huge , heavy topics in a bland , surfacey way that does n't offer any insight into why , for instance , good things happen to bad people | 3.96552 | formal | formal (p = 0.50); informal (p = 0.50) |
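
The deviation figures can be sanity-checked with a little arithmetic. The sketch below assumes that "+X% than global" means the slice rate equals the global rate multiplied by (1 + X/100); that reading is an assumption, and the helper names are hypothetical. It recovers the implied global underconfidence rate and the approximate slice size from the numbers quoted for the first two underconfidence issues.

```python
def implied_global_rate(slice_rate: float, deviation_pct: float) -> float:
    """Invert slice_rate = global_rate * (1 + deviation_pct / 100)."""
    return slice_rate / (1.0 + deviation_pct / 100.0)


def implied_slice_size(flagged: int, slice_rate: float) -> int:
    """Approximate slice size from the flagged count and the slice rate."""
    return round(flagged / slice_rate)


# (slice description, flagged samples, slice rate, deviation in percent),
# all copied from the report text above.
issues = [
    ("text_length(text) >= 50.500", 1275, 0.0477, 105.32),
    ("3.506 <= avg_word_length(text) < 6.969", 1462, 0.0280, 20.70),
]

for name, flagged, rate, deviation in issues:
    global_rate = implied_global_rate(rate, deviation)
    size = implied_slice_size(flagged, rate)
    print(f"{name}: global rate ~ {global_rate:.4f}, slice size ~ {size}")
```

Both slices imply a global underconfidence rate of roughly 0.023, which suggests the two deviation figures are measured against the same baseline.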

For records in your dataset where avg_whitespace(text) >= 0.125 AND avg_whitespace(text) < 0.222, we found a significantly higher number of underconfident predictions (1462 samples, corresponding to 2.80% of the predictions in the data slice).

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | avg_whitespace(text) >= 0.125 AND avg_whitespace(text) < 0.222 | Underconfidence rate = 0.028 | +20.70% than global |

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

| | text | avg_whitespace(text) | label | Predicted label |
|---|---|---|---|---|
| 46185 | richly detailed , deftly executed and utterly absorbing | 0.142857 | informal | informal (p = 0.50); formal (p = 0.50) |
| 35430 | an inelegant combination of two unrelated shorts that falls far short | 0.157143 | formal | informal (p = 0.50); formal (p = 0.50) |
| 41573 | covers huge , heavy topics in a bland , surfacey way that does n't offer any insight into why , for instance , good things happen to bad people | 0.201389 | formal | formal (p = 0.50); informal (p = 0.50) |

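
The slicing features used above can be approximated with a few lines of Python. The functions below are hypothetical reimplementations, not the scan's own code. They reproduce the reported values for the example rows only if each sentence carries one trailing space, which is an assumption about how the dataset stores its text.

```python
def text_length(text: str) -> int:
    """Number of characters, including whitespace."""
    return len(text)


def avg_word_length(text: str) -> float:
    """Mean length of whitespace-separated tokens."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def avg_whitespace(text: str) -> float:
    """Fraction of characters that are whitespace."""
    return sum(ch.isspace() for ch in text) / len(text) if text else 0.0


# Row 46185, with an assumed trailing space.
sample = "richly detailed , deftly executed and utterly absorbing "

print(text_length(sample))               # report lists 56
print(avg_word_length(sample))           # report lists 6
print(round(avg_whitespace(sample), 6))  # report lists 0.142857
```

With single-space-separated tokens and one trailing space, avg_whitespace equals 1 / (avg_word_length + 1). The thresholds line up under that relation: 1 / (6.969 + 1) is about 0.125 and 1 / (3.506 + 1) is about 0.222. This would explain why the avg_word_length and avg_whitespace slices report identical counts (1462 samples, 2.80%), so they are likely the same set of records rather than two independent findings.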
👉Ethical issues (1)

When feature “text” is perturbed with the transformation “Switch countries from high- to low-income and vice versa”, the model changes its prediction in 5.8% of the cases. We expected the predictions not to be affected by this transformation.

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| medium 🟡 | | Fail rate = 0.058 | 58/1000 tested samples (5.8%) changed prediction after perturbation |

Taxonomy

avid-effect:ethics:E0101 avid-effect:performance:P0201

🔍✨Examples

| | text | Switch countries from high- to low-income and vice versa(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 65960 | point the way for adventurous indian filmmakers toward a crossover | point the way for adventurous Tongan filmmakers toward a crossover | informal (p = 0.52) | formal (p = 0.55) |
| 63459 | : land beyond time is an enjoyable big movie primarily because australia is a weirdly beautiful place . | : land beyond time is an enjoyable big movie primarily because Yemen is a weirdly beautiful place . | informal (p = 0.52) | formal (p = 0.51) |
| 65238 | is on his way to becoming the american indian spike lee . | is on his way to becoming the Yemeni French spike lee . | informal (p = 0.55) | formal (p = 0.51) |

👉Robustness issues (3)

When feature “text” is perturbed with the transformation “Transform to uppercase”, the model changes its prediction in 14.3% of the cases. We expected the predictions not to be affected by this transformation.

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | | Fail rate = 0.143 | 143/1000 tested samples (14.3%) changed prediction after perturbation |

Taxonomy

avid-effect:performance:P0201

🔍✨Examples

| | text | Transform to uppercase(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 27065 | affecting story about four sisters who are coping , in one way or another , with life | AFFECTING STORY ABOUT FOUR SISTERS WHO ARE COPING , IN ONE WAY OR ANOTHER , WITH LIFE | formal (p = 0.51) | informal (p = 0.79) |
| 4718 | is n't afraid to provoke introspection in both its characters and its audience | IS N'T AFRAID TO PROVOKE INTROSPECTION IN BOTH ITS CHARACTERS AND ITS AUDIENCE | formal (p = 0.51) | informal (p = 0.66) |
| 26573 | lacks the visual flair and bouncing bravado that characterizes better hip-hop clips and is content to recycle images and characters that were already tired 10 years ago | LACKS THE VISUAL FLAIR AND BOUNCING BRAVADO THAT CHARACTERIZES BETTER HIP-HOP CLIPS AND IS CONTENT TO RECYCLE IMAGES AND CHARACTERS THAT WERE ALREADY TIRED 10 YEARS AGO | formal (p = 0.76) | informal (p = 0.66) |
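
A perturbation test of this kind can be sketched as follows: apply a transformation to each input, compare the predicted label before and after, and report the fraction that changed. The helper `fail_rate` and the stand-in classifier `toy_predict` are hypothetical and exist only to make the sketch runnable; to check the real model, `toy_predict` would be swapped for a function that returns the model's predicted label.

```python
from typing import Callable, Sequence


def fail_rate(
    predict: Callable[[str], str],
    texts: Sequence[str],
    transform: Callable[[str], str],
) -> float:
    """Fraction of texts whose predicted label changes after `transform`."""
    if not texts:
        raise ValueError("texts must not be empty")
    changed = sum(predict(t) != predict(transform(t)) for t in texts)
    return changed / len(texts)


def toy_predict(text: str) -> str:
    """Hypothetical stand-in classifier: treats all-caps text as informal."""
    return "informal" if text.isupper() else "formal"


samples = ["affecting story about four sisters", "IS N'T AFRAID"]

# "Transform to uppercase" corresponds to str.upper here.
print(fail_rate(toy_predict, samples, str.upper))
```

The same helper covers the other robustness checks by passing a different `transform`, for example `str.title` for the title-case perturbation.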

When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 12.5% of the cases. We expected the predictions not to be affected by this transformation.

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | | Fail rate = 0.125 | 125/1000 tested samples (12.5%) changed prediction after perturbation |

Taxonomy

avid-effect:performance:P0201

🔍✨Examples

| | text | Add typos(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 61132 | pared down its plots and characters to a few rather than dozens | pared down its plots and characters to a few rather than xdozens | formal (p = 0.53) | informal (p = 0.59) |
| 61447 | 's so crammed with scenes and vistas and pretty moments that it 's left a few crucial things out , like character development and coherence . | 's so crzammed with scenes and vistas and pretty momenys that it 's left af ew crucial things out , like character development and coherencde . | formal (p = 0.50) | informal (p = 0.74) |
| 10747 | the breathtakingly beautiful outer-space documentary space station 3d | the breathtakingly beautifuo outer-space documentary dpace station 3d | formal (p = 0.62) | informal (p = 0.54) |

When feature “text” is perturbed with the transformation “Transform to title case”, the model changes its prediction in 9.1% of the cases. We expected the predictions not to be affected by this transformation.

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| medium 🟡 | | Fail rate = 0.091 | 91/1000 tested samples (9.1%) changed prediction after perturbation |

Taxonomy

avid-effect:performance:P0201

🔍✨Examples

| | text | Transform to title case(text) | Original prediction | Prediction after perturbation |
|---|---|---|---|---|
| 27065 | affecting story about four sisters who are coping , in one way or another , with life | Affecting Story About Four Sisters Who Are Coping , In One Way Or Another , With Life | formal (p = 0.51) | informal (p = 0.60) |
| 4718 | is n't afraid to provoke introspection in both its characters and its audience | Is N'T Afraid To Provoke Introspection In Both Its Characters And Its Audience | formal (p = 0.51) | informal (p = 0.61) |
| 54137 | demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop | Demonstrates That The Director Of Such Hollywood Blockbusters As Patriot Games Can Still Turn Out A Small , Personal Film With An Emotional Wallop | formal (p = 0.54) | informal (p = 0.51) |

👉Overconfidence issues (4)

For records in the dataset where text_length(text) < 22.500, we found a significantly higher number of overconfident wrong predictions (6722 samples, corresponding to 87.20% of the wrong predictions in the data slice).

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | text_length(text) < 22.500 | Overconfidence rate = 0.872 | +86.80% than global |

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

| | text | text_length(text) | label | Predicted label |
|---|---|---|---|---|
| 2129 | one baaaaaaaaad movie | 22 | formal | informal (p = 0.93); formal (p = 0.07) |
| 21336 | baaaaaaaaad | 12 | formal | informal (p = 0.93); formal (p = 0.07) |
| 30102 | baaaaaaaaad movie | 18 | formal | informal (p = 0.92); formal (p = 0.08) |
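
To make the overconfidence metric concrete, the sketch below flags a wrong prediction as overconfident when the probability given to the predicted label exceeds the probability given to the true label by a wide margin, then reports the share of wrong predictions that are flagged. The margin criterion, the 0.5 threshold, and the function names are assumptions for illustration; the report does not state the scan's exact definition.

```python
from typing import Mapping, Sequence, Tuple


def is_overconfident(
    probs: Mapping[str, float], true_label: str, threshold: float = 0.5
) -> bool:
    """True for a wrong prediction whose margin over the true label is large.

    `threshold` is an assumed cut-off, not a value taken from the report.
    """
    predicted = max(probs, key=probs.get)
    if predicted == true_label:
        return False  # correct predictions are never counted
    return probs[predicted] - probs[true_label] >= threshold


def overconfidence_rate(rows: Sequence[Tuple[Mapping[str, float], str]]) -> float:
    """Share of wrong predictions that are overconfident."""
    wrong = [(p, y) for p, y in rows if max(p, key=p.get) != y]
    if not wrong:
        return 0.0
    return sum(is_overconfident(p, y) for p, y in wrong) / len(wrong)


rows = [
    ({"informal": 0.93, "formal": 0.07}, "formal"),    # row 2129: wrong, large margin
    ({"informal": 0.52, "formal": 0.48}, "formal"),    # wrong, small margin (made up)
    ({"informal": 0.10, "formal": 0.90}, "formal"),    # correct (made up)
]

print(overconfidence_rate(rows))
```

Only the first row is taken from the examples above; the other two are invented to show the wrong-but-unsure and correct cases.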

For records in the dataset where avg_word_length(text) >= 6.969, we found a significantly higher number of overconfident wrong predictions (2734 samples, corresponding to 69.53% of the wrong predictions in the data slice).

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | avg_word_length(text) >= 6.969 | Overconfidence rate = 0.695 | +48.95% than global |

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

| | text | avg_word_length(text) | label | Predicted label |
|---|---|---|---|---|
| 19179 | compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children . | 7.33333 | informal | formal (p = 1.00); informal (p = 0.00) |
| 26630 | the intelligence or sincerity it unequivocally deserves | 7 | informal | formal (p = 0.98); informal (p = 0.02) |
| 45560 | the performances and the cinematography to the outstanding soundtrack and unconventional narrative | 7.25 | informal | formal (p = 0.96); informal (p = 0.04) |

For records in the dataset where avg_whitespace(text) < 0.125, we found a significantly higher number of overconfident wrong predictions (2734 samples, corresponding to 69.53% of the wrong predictions in the data slice).

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | avg_whitespace(text) < 0.125 | Overconfidence rate = 0.695 | +48.95% than global |

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

| | text | avg_whitespace(text) | label | Predicted label |
|---|---|---|---|---|
| 19179 | compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children . | 0.12 | informal | formal (p = 1.00); informal (p = 0.00) |
| 26630 | the intelligence or sincerity it unequivocally deserves | 0.125 | informal | formal (p = 0.98); informal (p = 0.02) |
| 45560 | the performances and the cinematography to the outstanding soundtrack and unconventional narrative | 0.121212 | informal | formal (p = 0.96); informal (p = 0.04) |

For records in the dataset where text_length(text) < 34.500 AND text_length(text) >= 22.500, we found a significantly higher number of overconfident wrong predictions (3118 samples, corresponding to 63.83% of the wrong predictions in the data slice).

| Level | Data slice | Metric | Deviation |
|---|---|---|---|
| major 🔴 | text_length(text) < 34.500 AND text_length(text) >= 22.500 | Overconfidence rate = 0.638 | +36.73% than global |

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

| | text | text_length(text) | label | Predicted label |
|---|---|---|---|---|
| 15121 | is one baaaaaaaaad movie | 25 | formal | informal (p = 0.92); formal (p = 0.08) |
| 49755 | is one baaaaaaaaad movie . | 27 | formal | informal (p = 0.92); formal (p = 0.08) |
| 32815 | this is one baaaaaaaaad movie . | 32 | formal | informal (p = 0.92); formal (p = 0.08) |

Check out the Giskard Space and test your model.

Disclaimer: it's important to note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.
