---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: peft
---
# Intrinsics for Answer Relevance Rewriter
## Model Summary
This is a RAG-specific intrinsic for the answer relevance rewrite task.
The model takes as input the chat completion produced by the answer relevance classifier,
consisting of the conversation together with the answer relevance classification, plus the grounding documents,
and produces a rewritten assistant response that is more relevant to the user's final inquiry.
We provide the intrinsic implemented as LoRA and aLoRA adapters trained over
Granite-3.3-2b-instruct and Granite-3.3-8b-instruct.
- **Developer:** IBM Research
- **Model type:** LoRA and aLoRA adapter for
[ibm-granite/granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct),
[ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct)
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
## Intended use
This RAG-specific intrinsic is intended to be used to post-process a generated assistant response.
It should be used following the answer relevance classifier intrinsic, and should be applied in
cases where the `answer_relevance_likelihood` is below a threshold chosen according to application criteria.
For cases where the assistant response is deemed not relevant (i.e., where `answer_relevance_likelihood` is below
the chosen threshold), the answer relevance rewriter intrinsic can be used to rewrite the assistant response
into a more relevant one. It takes as input the chat completion
from the answer relevance classifier output and the grounding documents. Its output is of the form

```
{
    "answer_relevance_rewrite": <Rewritten response>
}
```

The rewriter is instructed to only correct deficiencies in relevance identified by the classifier,
and to ensure that the rewritten response is grounded in the conversation and the given documents.
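As a minimal illustration of this gating step, the sketch below parses the classifier's JSON output and compares the likelihood against a threshold; the `RELEVANCE_THRESHOLD` value and the `classifier_message` string are assumptions made for the example, not part of granite-common.

```
import json

# Hypothetical application-specific threshold; choose a value that suits your use case.
RELEVANCE_THRESHOLD = 0.5

# Content of the final assistant turn returned by the answer relevance classifier
# (illustrative value; in practice this comes from the classifier's chat completion).
classifier_message = (
    '{"answer_relevance_analysis": "The response does not address the inquiry.", '
    '"answer_relevance_category": "No attempt", '
    '"answer_relevance_likelihood": 0.0}'
)

classification = json.loads(classifier_message)
if classification["answer_relevance_likelihood"] < RELEVANCE_THRESHOLD:
    # The response is likely irrelevant: invoke the answer relevance rewriter
    # intrinsic (see the Quickstart Example below) to obtain a rewritten response.
    ...
```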
**Model input**: The input to the answer relevance rewriter intrinsic is an
OpenAI-compatible chat completion request containing a list of conversation
turns plus the grounding documents. The conversation turns consist of:
- A conversation between the user and the assistant, alternating between the `user` and `assistant` roles and ending with an `assistant` response
- An additional `user` turn with content "answer_relevance"
- An additional `assistant` turn whose content is the JSON output of the answer relevance classifier

**Model output**: The output of the answer relevance rewriter intrinsic is the result of the
original chat completion request, formatted as a JSON object with the following schema:

```
{
    "answer_relevance_rewrite": <Rewritten response>
}
```
Please see the code snippets in the Quickstart Example section below for
examples that illustrate the intrinsic's input/output.
## Quickstart Example
To run the answer relevance rewriter intrinsic through granite-common, you can either (a)
use an OpenAI-compatible inference backend, such as vLLM, or (b) use the Hugging
Face transformers library. Below we provide instructions for each of the two
approaches. Note that running inference using vLLM or another scalable
OpenAI-compatible inference backend should be significantly faster than using
the Hugging Face transformers library directly.
### Using an OpenAI-Compatible Inference Backend
To run the intrinsic using an OpenAI-compatible inference backend, such as vLLM,
follow the steps below.
1. Install the granite-common library:

```
pip install git+https://github.com/ibm-granite/granite-common.git
pip install granite_common[nltk]
```

2. Install the Hugging Face CLI:

```
pip install -U "huggingface_hub[cli]"
```

3. Install vLLM:

```
pip install vllm
```

4. Download the intrinsics library:

```
hf download ibm-granite/rag-intrinsics-lib --local-dir ./rag-intrinsics-lib
```
5. Edit the vLLM startup script found in `./rag-intrinsics-lib/run_vllm.sh`
using your favorite editor.
Set the constants `BASE_MODEL_NAME` and `BASE_MODEL_ORG` according to the
base model on which the desired LoRA adapter has been trained. Optionally,
edit the constant `PORT` to change the port on which vLLM will run. Save the
modified file and exit the editor.
6. Start vLLM through the startup script. The first time you run the script,
you may have to change the permissions to allow execution:

```
cd rag-intrinsics-lib
chmod u+x ./run_vllm.sh
./run_vllm.sh &
```
7. Run the following code snippet:
```
import json

import openai

import granite_common

intrinsic_name = "answer_relevance_rewriter"

# Change the following constant to select a different base model
base_model_name = "granite-3.3-8b-instruct"

# Change the following constants as needed to reflect the location of the vLLM server.
# The selected port should be identical to the one you specified in the vLLM startup script.
openai_base_url = "http://localhost:55555/v1"
openai_api_key = "rag_intrinsics_1234"

# Fetch IO configuration file from Hugging Face Hub
io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
    intrinsic_name, base_model_name
)

# Instantiate input/output processors
rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)

# Sample request
request_json = {
    "messages": [
        {
            "role": "user",
            "content": "Who attended the meeting?"
        },
        {
            "role": "assistant",
            "content": "Many people attended the meeting."
        },
        {
            "role": "user",
            "content": "answer_relevance"
        },
        {
            "role": "assistant",
            "content": "{\"answer_relevance_analysis\": \"The inquiry asks for the attendees of the meeting. The response provides a vague and non-specific answer that does not address the inquiry.\", \"answer_relevance_category\": \"No attempt\", \"answer_relevance_likelihood\": 0.0}"
        }
    ],
    "extra_body": {
        "documents": [
            {
                "doc_id": "1",
                "text": "Meeting attendees: Alice, Bob, Carol."
            },
            {
                "doc_id": "2",
                "text": "Meeting time: 9:00 am to 11:00 am."
            }
        ]
    }
}

# Add other parameters
request_json["model"] = intrinsic_name
request_json["temperature"] = 0.0

# Apply input processor
intrinsic_kwargs = {}
rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)

# Run inference
client = openai.OpenAI(base_url=openai_base_url, api_key=openai_api_key)
chat_completion = client.chat.completions.create(**rewritten_request.model_dump())

# Apply output processor
processed_chat_completion = result_processor.transform(
    chat_completion, rewritten_request
)

# Verify that the contents of the completion are valid JSON and pretty-print the JSON
parsed_contents = json.loads(processed_chat_completion.choices[0].message.content)
print("JSON output:")
print(json.dumps(parsed_contents, indent=2))
```
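If the rewrite succeeds, the printed output is a JSON object with a single `answer_relevance_rewrite` field. For the sample request above it should look similar to the following (the exact wording of the rewritten response will vary from run to run):

```
JSON output:
{
  "answer_relevance_rewrite": "Alice, Bob, and Carol attended the meeting."
}
```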
### Using the Hugging Face Transformers Library
To run the intrinsic using the Hugging Face transformers library directly,
follow the steps below.
1. Install the granite-common library:

```
pip install git+https://github.com/ibm-granite/granite-common.git
pip install granite_common[nltk]
```

2. Install the Hugging Face CLI:

```
pip install -U "huggingface_hub[cli]"
```

3. Install PEFT:

```
pip install peft
```

4. Install xgrammar:

```
pip install xgrammar
```
5. Run the following code snippet:
```
import json

import granite_common.util

import peft

intrinsic_name = "answer_relevance_rewriter"

# Change the following constant to select a different base model
base_model_name = "granite-3.3-8b-instruct"

use_cuda = True  # Set to False to use the default PyTorch device for this machine + model

# Fetch IO configuration file from Hugging Face Hub
io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
    intrinsic_name, base_model_name
)

# Fetch LoRA directory from Hugging Face Hub
lora_dir = granite_common.intrinsics.util.obtain_lora(
    intrinsic_name, base_model_name
)

# Instantiate input/output processors
rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)

# Sample request
request_json = {
    "messages": [
        {
            "role": "user",
            "content": "Who attended the meeting?"
        },
        {
            "role": "assistant",
            "content": "Many people attended the meeting."
        },
        {
            "role": "user",
            "content": "answer_relevance"
        },
        {
            "role": "assistant",
            "content": "{\"answer_relevance_analysis\": \"The inquiry asks for the attendees of the meeting. The response provides a vague and non-specific answer that does not address the inquiry.\", \"answer_relevance_category\": \"No attempt\", \"answer_relevance_likelihood\": 0.0}"
        }
    ],
    "extra_body": {
        "documents": [
            {
                "doc_id": "1",
                "text": "Meeting attendees: Alice, Bob, Carol."
            },
            {
                "doc_id": "2",
                "text": "Meeting time: 9:00 am to 11:00 am."
            }
        ]
    }
}

# Add additional parameters
request_json["model"] = intrinsic_name
request_json["temperature"] = 0.0

# Apply input processor
intrinsic_kwargs = {}
rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)

# Load the base model and merge LoRA weights
model, tokenizer = granite_common.util.load_transformers_lora(lora_dir)
if use_cuda:
    model = model.cuda()

# Convert the chat completion request into the Transformers library's proprietary format
generate_input, other_input = (
    granite_common.util.chat_completion_request_to_transformers_inputs(
        rewritten_request,
        tokenizer,
        model,
    )
)

# Use the Transformers library's APIs to generate one or more completions,
# then convert those completions into OpenAI-compatible chat completion responses
responses = granite_common.util.generate_with_transformers(
    tokenizer, model, generate_input, other_input
)

# Apply output processor
transformed_responses = result_processor.transform(responses, rewritten_request)

# Verify that the contents of the completion are valid JSON and pretty-print the JSON
parsed_contents = json.loads(transformed_responses.choices[0].message.content)
print("JSON output:")
print(json.dumps(parsed_contents, indent=2))
```
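As in the vLLM example above, the printed output should be a JSON object with a single `answer_relevance_rewrite` field containing the rewritten response.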
## Training Details
### Training Data
The training data was created by the following process:
1. Take the synthetic rag-data-granite dataset, consisting of conversations between a user and an assistant.
2. Replace the assistant response by running granite-3.2-intrinsics at temperature 1.0.
3. Produce the answer_relevance_rewriter target output using mixtral-large with prompts containing in-context examples.
The conversations created in steps 1 and 2 are taken as the training input. The JSON string from step 3
is taken as the training target output.
#### Training Hyperparameters
The LoRA adapter was fine-tuned using PEFT under the following regime: rank =
32, learning rate = 1.0e-04, number of epochs = 5.
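For reference, a comparable setup could be expressed with PEFT roughly as follows. This is a sketch, not the exact training script; `lora_alpha` and `target_modules` are assumptions not stated in this card.

```
from peft import LoraConfig

# A rough sketch of a comparable LoRA configuration (not the exact training setup).
lora_config = LoraConfig(
    r=32,                                  # rank reported above
    lora_alpha=32,                         # assumed; not reported in this card
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
# Fine-tuning was then run for 5 epochs with a learning rate of 1.0e-04.
```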
## Evaluation
### Answer Relevance Rewriter
We evaluated the model on a test data set generated by the same procedure as the training data,
using GPT-4o as the judge.
The following table presents results comparing baselines and frontier models
on the answer relevance rewrite task. The data set consists of responses classified as irrelevant by
mixtral-large. The evaluation is first divided into two parts: responses that are truly irrelevant,
for which we measure the rate of the rewrite becoming relevant, and responses that are falsely classified as irrelevant,
for which we measure the rate of the rewrite becoming irrelevant. From these, the overall rates of flipping
irrelevant to relevant and of flipping relevant to irrelevant are calculated, as well as the net gain
in relevance and the resulting final relevance. For example, in the first row below, the net gain of 0.254 is the
overall flip from irrelevant to relevant (0.286) minus the overall flip from relevant to irrelevant (0.032).
The LoRAs outperform the best of the frontier models.
| | True irrelevant <br> flip to relevant | False irrelevant <br> flip to irrelevant| Overall <br> flip irrelevant <br> to relevant | Overall <br> flip relevant <br> to irrelevant| net gain | Result <br>relevance |
|:---------------------|:--------------|:---------|:------------------------------|:---------|:---------|:--------------|
| mixtral-8x22b-v0.1 | 0.416 | 0.101 | 0.286 | 0.032 | 0.254 | 0.566 |
| llama-3.3-70b | 0.804 | 0.041 | 0.554 | 0.013 | 0.541 | 0.853 |
| gpt-oss-20b | 0.902 | 0.034 | 0.621 | 0.011 | 0.610 | 0.922 |
| gpt-4o | 0.960 | 0.014 | 0.661 | 0.004 | 0.657 | 0.968 |
| gpt-4o-mini | 0.758 | 0.027 | 0.522 | 0.008 | 0.514 | 0.825 |
| | | | | | | |
| granite-3.3-2b/lora | 0.972 | 0.027 | 0.669 | 0.008 | 0.661 | 0.973 |
| granite-3.3-2b/alora | 0.972 | 0.007 | 0.669 | 0.002 | 0.667 | 0.979 |
| granite-3.3-8b/lora | 0.969 | 0.014 | 0.667 | 0.004 | 0.663 | 0.975 |
| granite-3.3-8b/alora | 0.966 | 0.027 | 0.665 | 0.008 | 0.657 | 0.968 |
| | | | | | | |
### Comparing the Answer Relevance Rewriter Intrinsics vs. Vanilla Granite Models
We compare the performance of the vanilla Granite 3.3 2b and 8b Instruct models
against the answer relevance rewriter intrinsics implemented as LoRA adapters.
The LoRAs significantly outperform the base models.
| | True irrelevant <br> flip to relevant | False irrelevant <br> flip to irrelevant| Overall <br> flip irrelevant <br> to relevant | Overall <br> flip relevant <br> to irrelevant| net gain | Result relevance |
|:---------------------|:--------------|:---------|:------------------------------|:---------|:---------|:--------------|
| granite-3.3-2b | 0.346 | 0.169 | 0.238 | 0.053 | 0.185 | 0.497 |
| granite-3.3-2b/lora | 0.972 | 0.027 | 0.669 | 0.008 | 0.661 | 0.973 |
| granite-3.3-2b/alora | 0.972 | 0.007 | 0.669 | 0.002 | 0.667 | 0.979 |
| | | | | | | |
| granite-3.3-8b | 0.266 | 0.277 | 0.183 | 0.086 | 0.097 | 0.408 |
| granite-3.3-8b/lora | 0.969 | 0.014 | 0.667 | 0.004 | 0.663 | 0.975 |
| granite-3.3-8b/alora | 0.966 | 0.027 | 0.665 | 0.008 | 0.657 | 0.968 |
| | | | | | | |
## Model Card Authors
[Huaiyu Zhu](mailto:[email protected])
### Framework versions
- PEFT 0.14.0