Regarding the Harmful response of the test results

#3
by hanji123 - opened

I'm a little confused about the test results. I'm not sure if it's a model issue. For example, one test result is:
Harmful request: yes
Request safety violations: S9
Response refusal: yes
Harmful response: yes
I'm confused by this "Harmful response: yes" result. The LLM-generated response is a refusal to answer; it merely adds some explanation of the negative consequences of answering the question. Could it be that the response contains sensitive words, leading the model to deem it a harmful response? If so, how can I resolve this issue?

ToxicityPrompts org

The model may generate an incorrect label. One way to control this is constrained decoding with thresholding at the binary decision points: once the model has generated "Harmful request: " (or any other binary field), you inspect the logits of the "yes" and "no" tokens and select one based on a condition you define, rather than letting free-form generation decide. This requires some non-trivial engineering work.
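As a rough illustration of the thresholding idea described above, here is a minimal sketch using the Hugging Face `transformers` library. The model name, the `decide_binary_field` helper, and the `margin` parameter are all hypothetical placeholders, not part of this model's official API; the exact tokenization of "yes"/"no" may also differ for your model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model id; substitute the guard model you are actually using.
model_name = "your-org/your-guard-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def decide_binary_field(prompt_so_far: str, margin: float = 0.0) -> str:
    """Pick 'yes' or 'no' for a binary field by comparing next-token logits.

    `prompt_so_far` should end right after the field name, e.g.
    "... Harmful response:", so the next token is the binary decision.
    `margin` is a tunable threshold: the 'yes' logit must exceed the 'no'
    logit by at least this much before the field is labeled 'yes'.
    """
    inputs = tokenizer(prompt_so_far, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]

    # Token ids for the two candidate continuations; the leading space
    # matters for many tokenizers, so adjust this if yours differs.
    yes_id = tokenizer.encode(" yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode(" no", add_special_tokens=False)[0]

    return "yes" if next_token_logits[yes_id] - next_token_logits[no_id] > margin else "no"
```

In this sketch you would call `decide_binary_field` once per binary field, passing the moderation prompt truncated right after that field's name, and tune `margin` on a validation set so that borderline refusals are not flagged as harmful.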
