Regarding the Harmful response of the test results

#3
by hanji123 - opened

I'm a little confused about the test results. I'm not sure if it's a model issue. For example, one test result is:
Harmful request: yes
Request safety violations: S9
Response refusal: yes
Harmful response: yes
I'm confused by this "Harmful response: yes" result. The LLM-generated response is a refusal to answer; it merely adds some explanation of the negative consequences of answering the question. Could it be that the response contains sensitive words, leading the model to deem it a harmful response? If so, how can I resolve this issue?

ToxicityPrompts org

The model may generate an incorrect label. One way to control this is constrained decoding with thresholding at the binary decision points: once the model has generated "Harmful request: " (or any other binary field), you inspect the logits of the "yes" and "no" tokens and select one based on a condition you define, rather than letting free-form generation decide. This requires some non-trivial engineering work.
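As a rough illustration of the thresholding idea described above, here is a minimal sketch using the Hugging Face `transformers` library. The model name, the `decide_binary_field` helper, and the `margin` parameter are all hypothetical placeholders, not part of this model's official API; the exact tokenization of "yes"/"no" may also differ for your model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model id; substitute the guard model you are actually using.
model_name = "your-org/your-guard-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def decide_binary_field(prompt_so_far: str, margin: float = 0.0) -> str:
    """Pick 'yes' or 'no' for a binary field by comparing next-token logits.

    `prompt_so_far` should end right after the field name, e.g.
    "... Harmful response:", so the next token is the binary decision.
    `margin` is a tunable threshold: the 'yes' logit must exceed the 'no'
    logit by at least this much before the field is labeled 'yes'.
    """
    inputs = tokenizer(prompt_so_far, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]

    # Token ids for the two candidate continuations; the leading space
    # matters for many tokenizers, so adjust this if yours differs.
    yes_id = tokenizer.encode(" yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode(" no", add_special_tokens=False)[0]

    return "yes" if next_token_logits[yes_id] - next_token_logits[no_id] > margin else "no"
```

In this sketch you would call `decide_binary_field` once per binary field, passing the moderation prompt truncated right after that field's name, and tune `margin` on a validation set so that borderline refusals are not flagged as harmful.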
