giskardai/giskard-evaluator · Report for s-nlp/roberta-base-formality-ranker

Hi Team,

This is a report from Giskard Bot Scan 🐢.

We have identified 11 potential vulnerabilities in your model based on an automated scan.

This automated analysis evaluated the model on the dataset sst2 (subset default, split train).

👉Underconfidence issues (3)

For records in your dataset where text_length(text) >= 50.500, we found a significantly higher number of underconfident predictions (1275 samples, corresponding to 4.77% of the predictions in the data slice).

Level	Data slice	Metric	Deviation
major 🔴	`text_length(text)` >= 50.500	Underconfidence rate = 0.048	+105.32% vs. global

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

	text	text_length(text)	label	Predicted `label`
46185	richly detailed , deftly executed and utterly absorbing	56	informal	informal (p = 0.50)
				formal (p = 0.50)
35430	an inelegant combination of two unrelated shorts that falls far short	70	formal	informal (p = 0.50)
				formal (p = 0.50)
41573	covers huge , heavy topics in a bland , surfacey way that does n't offer any insight into why , for instance , good things happen to bad people	144	formal	formal (p = 0.50)
				informal (p = 0.50)
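
The Deviation column reads as a relative difference between the slice metric and the global metric. The report does not state the formula, so the sketch below assumes it is (slice - global) / global and backs out the global rate from the figures above. The function names are illustrative only.

```python
def relative_deviation(slice_rate: float, global_rate: float) -> float:
    """Relative difference of a slice metric from the global metric, in percent."""
    return (slice_rate - global_rate) / global_rate * 100.0


def implied_global_rate(slice_rate: float, deviation_pct: float) -> float:
    """Invert relative_deviation: recover the global rate from a reported deviation."""
    return slice_rate / (1.0 + deviation_pct / 100.0)


# Figures from the text_length(text) >= 50.500 slice: 4.77% rate, +105.32% deviation.
global_rate = implied_global_rate(0.0477, 105.32)
print(f"implied global underconfidence rate: {global_rate:.4f}")
```

Under this assumption the global underconfidence rate comes out at roughly 2.3%, which gives a baseline for judging how far each slice departs from it.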

For records in your dataset where avg_word_length(text) < 6.969 AND avg_word_length(text) >= 3.506, we found a significantly higher number of underconfident predictions (1462 samples, corresponding to 2.80% of the predictions in the data slice).

Level	Data slice	Metric	Deviation
major 🔴	`avg_word_length(text)` < 6.969 AND `avg_word_length(text)` >= 3.506	Underconfidence rate = 0.028	+20.70% vs. global

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

	text	avg_word_length(text)	label	Predicted `label`
46185	richly detailed , deftly executed and utterly absorbing	6	informal	informal (p = 0.50)
				formal (p = 0.50)
35430	an inelegant combination of two unrelated shorts that falls far short	5.36364	formal	informal (p = 0.50)
				formal (p = 0.50)
41573	covers huge , heavy topics in a bland , surfacey way that does n't offer any insight into why , for instance , good things happen to bad people	3.96552	formal	formal (p = 0.50)
				informal (p = 0.50)
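
Every example above splits the probability 0.50/0.50 between the two labels. One way to flag such predictions is to compare the runner-up probability with the top one, as in the minimal sketch below. The 0.95 threshold and the function names are assumptions for illustration, not values taken from this report.

```python
def underconfidence_score(probs: list[float]) -> float:
    """Ratio of the second-highest to the highest class probability (1.0 = a tie)."""
    top, runner_up = sorted(probs, reverse=True)[:2]
    return runner_up / top


def is_underconfident(probs: list[float], threshold: float = 0.95) -> bool:
    # threshold is a hypothetical cut-off; tune it for your own use case
    return underconfidence_score(probs) >= threshold


print(is_underconfident([0.50, 0.50]))  # tie between formal and informal
print(is_underconfident([0.93, 0.07]))  # clear preference for one label
```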

For records in your dataset where avg_whitespace(text) >= 0.125 AND avg_whitespace(text) < 0.222, we found a significantly higher number of underconfident predictions (1462 samples, corresponding to 2.80% of the predictions in the data slice).

Level	Data slice	Metric	Deviation
major 🔴	`avg_whitespace(text)` >= 0.125 AND `avg_whitespace(text)` < 0.222	Underconfidence rate = 0.028	+20.70% vs. global

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

	text	avg_whitespace(text)	label	Predicted `label`
46185	richly detailed , deftly executed and utterly absorbing	0.142857	informal	informal (p = 0.50)
				formal (p = 0.50)
35430	an inelegant combination of two unrelated shorts that falls far short	0.157143	formal	informal (p = 0.50)
				formal (p = 0.50)
41573	covers huge , heavy topics in a bland , surfacey way that does n't offer any insight into why , for instance , good things happen to bad people	0.201389	formal	formal (p = 0.50)
				informal (p = 0.50)
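
The three slicing features in this section can be reproduced approximately with plain string operations. The definitions below are inferred from the example rows, not taken from Giskard's source, and they match the reported values only if the stored sentence ends with a single trailing space (an assumption).

```python
def text_length(text: str) -> int:
    """Number of characters, whitespace included."""
    return len(text)


def avg_word_length(text: str) -> float:
    """Mean length of whitespace-separated tokens (punctuation tokens count)."""
    tokens = text.split()
    return sum(len(t) for t in tokens) / len(tokens)


def avg_whitespace(text: str) -> float:
    """Fraction of characters that are whitespace."""
    return sum(ch.isspace() for ch in text) / len(text)


# Assumption: the dataset stores this sentence with one trailing space.
sample = "richly detailed , deftly executed and utterly absorbing "
print(text_length(sample), avg_word_length(sample), round(avg_whitespace(sample), 6))
# prints: 56 6.0 0.142857
```

These agree with the values listed for row 46185 (56, 6 and 0.142857). The same sentences appear in all three underconfidence slices, and the matching sample counts (1462) suggest the `avg_word_length` and `avg_whitespace` slices select the same records.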

👉Ethical issues (1)

When feature “text” is perturbed with the transformation “Switch countries from high- to low-income and vice versa”, the model changes its prediction in 5.8% of the cases. We expected the predictions not to be affected by this transformation.

Level	Data slice	Metric	Deviation
medium 🟡	—	Fail rate = 0.058	58/1000 tested samples (5.8%) changed prediction after perturbation

Taxonomy

avid-effect:ethics:E0101 avid-effect:performance:P0201

🔍✨Examples

	text	Switch countries from high- to low-income and vice versa(text)	Original prediction	Prediction after perturbation
65960	point the way for adventurous indian filmmakers toward a crossover	point the way for adventurous Tongan filmmakers toward a crossover	informal (p = 0.52)	formal (p = 0.55)
63459	: land beyond time is an enjoyable big movie primarily because australia is a weirdly beautiful place .	: land beyond time is an enjoyable big movie primarily because Yemen is a weirdly beautiful place .	informal (p = 0.52)	formal (p = 0.51)
65238	is on his way to becoming the american indian spike lee .	is on his way to becoming the Yemeni French spike lee .	informal (p = 0.55)	formal (p = 0.51)

👉Robustness issues (3)

When feature “text” is perturbed with the transformation “Transform to uppercase”, the model changes its prediction in 14.3% of the cases. We expected the predictions not to be affected by this transformation.

Level	Data slice	Metric	Deviation
major 🔴	—	Fail rate = 0.143	143/1000 tested samples (14.3%) changed prediction after perturbation

Taxonomy

avid-effect:performance:P0201

🔍✨Examples

	text	Transform to uppercase(text)	Original prediction	Prediction after perturbation
27065	affecting story about four sisters who are coping , in one way or another , with life	AFFECTING STORY ABOUT FOUR SISTERS WHO ARE COPING , IN ONE WAY OR ANOTHER , WITH LIFE	formal (p = 0.51)	informal (p = 0.79)
4718	is n't afraid to provoke introspection in both its characters and its audience	IS N'T AFRAID TO PROVOKE INTROSPECTION IN BOTH ITS CHARACTERS AND ITS AUDIENCE	formal (p = 0.51)	informal (p = 0.66)
26573	lacks the visual flair and bouncing bravado that characterizes better hip-hop clips and is content to recycle images and characters that were already tired 10 years ago	LACKS THE VISUAL FLAIR AND BOUNCING BRAVADO THAT CHARACTERIZES BETTER HIP-HOP CLIPS AND IS CONTENT TO RECYCLE IMAGES AND CHARACTERS THAT WERE ALREADY TIRED 10 YEARS AGO	formal (p = 0.76)	informal (p = 0.66)
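
The fail rate reported for each perturbation is the share of tested samples whose predicted label changes once the transformation is applied. A minimal sketch of that check follows. `toy_predict` is a stand-in classifier invented for the example, not the scanned model, and `fail_rate` is an illustrative helper rather than part of any library.

```python
from typing import Callable, Sequence


def fail_rate(
    predict: Callable[[str], str],
    transform: Callable[[str], str],
    texts: Sequence[str],
) -> float:
    """Share of texts whose predicted label changes after the transformation."""
    changed = sum(predict(t) != predict(transform(t)) for t in texts)
    return changed / len(texts)


def toy_predict(text: str) -> str:
    # Stand-in classifier, NOT the scanned model: labels all-caps text as informal.
    return "informal" if text.isupper() else "formal"


texts = ["affecting story about four sisters", "IS N'T AFRAID", "lacks the visual flair"]
print(fail_rate(toy_predict, str.upper, texts))  # 2 of 3 predictions flip
```

To reproduce a figure such as 143/1000, `predict` would wrap the real model and `texts` would be the sampled records. The same function works for the other transformations by swapping `transform`.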

When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 12.5% of the cases. We expected the predictions not to be affected by this transformation.

Level	Data slice	Metric	Deviation
major 🔴	—	Fail rate = 0.125	125/1000 tested samples (12.5%) changed prediction after perturbation

Taxonomy

avid-effect:performance:P0201

🔍✨Examples

	text	Add typos(text)	Original prediction	Prediction after perturbation
61132	pared down its plots and characters to a few rather than dozens	pared down its plots and characters to a few rather than xdozens	formal (p = 0.53)	informal (p = 0.59)
61447	's so crammed with scenes and vistas and pretty moments that it 's left a few crucial things out , like character development and coherence .	's so crzammed with scenes and vistas and pretty momenys that it 's left af ew crucial things out , like character development and coherencde .	formal (p = 0.50)	informal (p = 0.74)
10747	the breathtakingly beautiful outer-space documentary space station 3d	the breathtakingly beautifuo outer-space documentary dpace station 3d	formal (p = 0.62)	informal (p = 0.54)

When feature “text” is perturbed with the transformation “Transform to title case”, the model changes its prediction in 9.1% of the cases. We expected the predictions not to be affected by this transformation.

Level	Data slice	Metric	Deviation
medium 🟡	—	Fail rate = 0.091	91/1000 tested samples (9.1%) changed prediction after perturbation

Taxonomy

avid-effect:performance:P0201

🔍✨Examples

	text	Transform to title case(text)	Original prediction	Prediction after perturbation
27065	affecting story about four sisters who are coping , in one way or another , with life	Affecting Story About Four Sisters Who Are Coping , In One Way Or Another , With Life	formal (p = 0.51)	informal (p = 0.60)
4718	is n't afraid to provoke introspection in both its characters and its audience	Is N'T Afraid To Provoke Introspection In Both Its Characters And Its Audience	formal (p = 0.51)	informal (p = 0.61)
54137	demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop	Demonstrates That The Director Of Such Hollywood Blockbusters As Patriot Games Can Still Turn Out A Small , Personal Film With An Emotional Wallop	formal (p = 0.54)	informal (p = 0.51)

👉Overconfidence issues (4)

For records in the dataset where text_length(text) < 22.500, we found a significantly higher number of overconfident wrong predictions (6722 samples, corresponding to 87.20% of the wrong predictions in the data slice).

Level	Data slice	Metric	Deviation
major 🔴	`text_length(text)` < 22.500	Overconfidence rate = 0.872	+86.80% vs. global

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

	text	text_length(text)	label	Predicted `label`
2129	one baaaaaaaaad movie	22	formal	informal (p = 0.93)
				formal (p = 0.07)
21336	baaaaaaaaad	12	formal	informal (p = 0.93)
				formal (p = 0.07)
30102	baaaaaaaaad movie	18	formal	informal (p = 0.92)
				formal (p = 0.08)
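
In the rows above the model assigns about 0.93 to the wrong label and 0.07 to the labelled one. One way to quantify this is the gap between the predicted-label probability and the true-label probability, counted only over wrong predictions. The sketch below assumes that definition and a hypothetical threshold of 0.5; neither is confirmed by this report.

```python
def overconfidence_gap(probs: dict[str, float], true_label: str) -> float:
    """Probability of the predicted label minus probability of the true label."""
    return max(probs.values()) - probs[true_label]


def overconfidence_rate(rows, threshold: float = 0.5) -> float:
    """Among wrong predictions, share whose gap exceeds the (hypothetical) threshold."""
    wrong = [(p, y) for p, y in rows if max(p, key=p.get) != y]
    if not wrong:
        return 0.0
    return sum(overconfidence_gap(p, y) > threshold for p, y in wrong) / len(wrong)


rows = [
    ({"informal": 0.93, "formal": 0.07}, "formal"),  # row 2129 above: wrong, gap 0.86
    ({"informal": 0.52, "formal": 0.48}, "formal"),  # made-up row: wrong, gap 0.04
    ({"informal": 0.10, "formal": 0.90}, "formal"),  # made-up row: correct, ignored
]
print(overconfidence_rate(rows))  # 1 of the 2 wrong predictions is overconfident
```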

For records in the dataset where avg_word_length(text) >= 6.969, we found a significantly higher number of overconfident wrong predictions (2734 samples, corresponding to 69.53% of the wrong predictions in the data slice).

Level	Data slice	Metric	Deviation
major 🔴	`avg_word_length(text)` >= 6.969	Overconfidence rate = 0.695	+48.95% vs. global

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

	text	avg_word_length(text)	label	Predicted `label`
19179	compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .	7.33333	informal	formal (p = 1.00)
				informal (p = 0.00)
26630	the intelligence or sincerity it unequivocally deserves	7	informal	formal (p = 0.98)
				informal (p = 0.02)
45560	the performances and the cinematography to the outstanding soundtrack and unconventional narrative	7.25	informal	formal (p = 0.96)
				informal (p = 0.04)

For records in the dataset where avg_whitespace(text) < 0.125, we found a significantly higher number of overconfident wrong predictions (2734 samples, corresponding to 69.53% of the wrong predictions in the data slice).

Level	Data slice	Metric	Deviation
major 🔴	`avg_whitespace(text)` < 0.125	Overconfidence rate = 0.695	+48.95% vs. global

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

	text	avg_whitespace(text)	label	Predicted `label`
19179	compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .	0.12	informal	formal (p = 1.00)
				informal (p = 0.00)
26630	the intelligence or sincerity it unequivocally deserves	0.125	informal	formal (p = 0.98)
				informal (p = 0.02)
45560	the performances and the cinematography to the outstanding soundtrack and unconventional narrative	0.121212	informal	formal (p = 0.96)
				informal (p = 0.04)

For records in the dataset where text_length(text) < 34.500 AND text_length(text) >= 22.500, we found a significantly higher number of overconfident wrong predictions (3118 samples, corresponding to 63.83% of the wrong predictions in the data slice).

Level	Data slice	Metric	Deviation
major 🔴	`text_length(text)` < 34.500 AND `text_length(text)` >= 22.500	Overconfidence rate = 0.638	+36.73% vs. global

Taxonomy

avid-effect:performance:P0204

🔍✨Examples

	text	text_length(text)	label	Predicted `label`
15121	is one baaaaaaaaad movie	25	formal	informal (p = 0.92)
				formal (p = 0.08)
49755	is one baaaaaaaaad movie .	27	formal	informal (p = 0.92)
				formal (p = 0.08)
32815	this is one baaaaaaaaad movie .	32	formal	informal (p = 0.92)
				formal (p = 0.08)

Check out the Giskard Space and test your model.

Disclaimer: it's important to note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.