---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: peft
---

# Intrinsics for Answer Relevance Rewriter

## Model Summary
This is a RAG-specific intrinsic for the answer relevance rewrite task.
The model takes as input the chat completion produced by the answer relevance classifier,
consisting of the conversation together with its answer relevance classification, plus the grounding documents,
and produces a rewritten assistant response that is more relevant to the user's final inquiry.


We provide the intrinsic as LoRA and aLoRA adapters trained over
Granite-3.3-2b-instruct and Granite-3.3-8b-instruct.

- **Developer:** IBM Research
- **Model type:** LoRA and aLoRA adapter for
  [ibm-granite/granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct),
  [ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct)
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Intended use
This RAG-specific intrinsic is intended to post-process a generated assistant response.
It should be used after the answer relevance classifier intrinsic, and applied to
the cases where the `answer_relevance_likelihood` is below a threshold chosen according to application criteria.

For cases where the assistant answer is deemed not relevant (i.e., `answer_relevance_likelihood` falls below
the chosen threshold), the answer relevance rewriter intrinsic can be used to rewrite the assistant response
into a more relevant one. It takes as input the chat completion produced by the answer relevance classifier
together with the grounding documents. Its output is of the form

    {
        "answer_relevance_rewrite": <Rewritten response>
    }

The rewriter is instructed to correct only the deficiencies in relevance identified by the classifier,
and to ensure that the rewritten response is grounded in the conversation and the given documents.
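
For example, a minimal gating sketch might look as follows (the threshold value and variable names are illustrative and not part of the intrinsic's API):

    import json

    RELEVANCE_THRESHOLD = 0.5  # application-specific choice

    # Content of the final assistant turn produced by the answer relevance classifier
    classifier_output = (
        '{"answer_relevance_analysis": "The response does not address the inquiry.", '
        '"answer_relevance_category": "No attempt", '
        '"answer_relevance_likelihood": 0.0}'
    )

    classification = json.loads(classifier_output)
    if classification["answer_relevance_likelihood"] < RELEVANCE_THRESHOLD:
        # Invoke the answer relevance rewriter intrinsic (see the Quickstart below)
        # with the full conversation and the grounding documents.
        ...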

**Model input**: The input to the answer relevance rewriter intrinsic is an
OpenAI-compatible chat completion request containing the output of the answer
relevance classifier, i.e. a list of messages consisting of:
- A conversation between user and assistant, with turns alternating between the `user` and `assistant` roles and ending with an assistant response
- An additional user turn with content "answer_relevance"
- An additional assistant turn whose content is the JSON classification produced by the answer relevance classifier

The grounding documents are passed alongside the messages (see the `documents` field in the sample requests below).

**Model output**: The output of the answer relevance rewriter intrinsic is the result of the
chat completion request, formatted as a JSON object with the following schema:

    {
        "answer_relevance_rewrite": <Rewritten response>
    }

Please see the code snippets in the Quickstart Example section below for
examples that illustrate the intrinsic's input/output.

## Quickstart Example

To run the answer relevance rewriter intrinsic through granite-common, you can either (a)
use an OpenAI-compatible inference backend, such as vLLM, or (b) use the Hugging
Face transformers library. We provide instructions for each of the two
approaches below. Note that running inference using vLLM or another scalable
OpenAI-compatible inference backend should be significantly faster than using
the Hugging Face transformers library directly.

### Using an OpenAI-Compatible Inference Backend

To run the intrinsic using an OpenAI-compatible inference backend, such as vLLM,
follow the steps below.

1.  Install the granite-common library:

        pip install git+https://github.com/ibm-granite/granite-common.git
        pip install granite_common[nltk]

2.  Install the Hugging Face CLI:

        pip install -U "huggingface_hub[cli]"

3.  Install vLLM:

        pip install vllm

4.  Download the intrinsics library:

        hf download ibm-granite/rag-intrinsics-lib --local-dir ./rag-intrinsics-lib

5.  Edit the vLLM startup script found in `./rag-intrinsics-lib/run_vllm.sh`
    using your favorite editor:

    Edit the constants `BASE_MODEL_NAME` and `BASE_MODEL_ORG` depending on the
    base model on which the desired LoRA adapter has been trained. Optionally,
    edit the constant `PORT` to change the port on which vLLM will run. Save the
    modified file and exit the editor.

6.  Start vLLM through the startup script. The first time you run the script,
    you may have to change the permissions to allow execution:

        cd rag-intrinsics-lib
        chmod u+x ./run_vllm.sh
        ./run_vllm.sh &

7.  Run the following code snippet:

        import json
        import openai
        import granite_common

        intrinsic_name = "answer_relevance_classifier"

        # Change the following constant to select a different base model
        base_model_name = "granite-3.3-8b-instruct"

        # Change the following constants as needed to reflect the location of the vLLM server
        # The selected port should be identical to the one you specified in the vLLM startup script
        openai_base_url = "http://localhost:55555/v1"
        openai_api_key = "rag_intrinsics_1234"

        # Fetch IO configuration file from Hugging Face Hub
        io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
            intrinsic_name, base_model_name
        )

        # Instantiate input/output processors
        rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
        result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)

        # Sample request
        request_json = {
            "messages": [
                {
                "role": "user",
                "content": "Who attended the meeting?"
                },
                {
                "role": "assistant",
                "content": "Many people attended the meeting."
                },
                {
                "role": "user",
                "content": "answer_relevance"
                },
                {
                "role": "assistant",
                "content": "{\"answer_relevance_analysis\": \"The inquiry asks for the attendees of the meeting. The response provides a vague and non-specific answer that does not address the inquiry.\", \"answer_relevance_category\": \"No attempt\", \"answer_relevance_likelihood\": 0.0}"
                }
            ],
            "extra_body": {
                "documents": [
                {
                    "doc_id": "1",
                    "text": "Meeting attendees: Alice, Bob, Carol."
                },
                {
                    "doc_id": "2",
                    "text": "Meeting time: 9:00 am to 11:00 am."
                }
                ]
            }
         }

        # Add other parameters
        request_json["model"] = intrinsic_name
        request_json["temperature"] = 0.0

        # Apply input processor
        intrinsic_kwargs = {}
        rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)

        # Run inference
        client = openai.OpenAI(base_url=openai_base_url, api_key=openai_api_key)
        chat_completion = client.chat.completions.create(**rewritten_request.model_dump())

        # Apply output processor
        processed_chat_completion = result_processor.transform(
            chat_completion, rewritten_request
        )

        # Verify that the contents of the completion are valid JSON, then pretty-print the JSON.
        parsed_contents = json.loads(processed_chat_completion.choices[0].message.content)
        print("JSON output:")
        print(json.dumps(parsed_contents, indent=2))
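
    For this sample request, the printed JSON should contain a single
    `answer_relevance_rewrite` field following the schema shown above, along the
    lines of (the exact wording of the rewritten response depends on the adapter used):

        {
          "answer_relevance_rewrite": "<a response naming the attendees listed in document 1>"
        }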

### Using the Hugging Face Transformers Library

To run the intrinsic using the Hugging Face transformers library directly,
follow the steps below.

1.  Install the granite-common library:

        pip install git+https://github.com/ibm-granite/granite-common.git
        pip install granite_common[nltk]

2.  Install the Hugging Face CLI:

        pip install -U "huggingface_hub[cli]"

3.  Install PEFT:

        pip install peft

4.  Install xgrammar:

        pip install xgrammar

5.  Run the following code snippet:

        import json
        import granite_common.util
        import peft

        intrinsic_name = "answer_relevance_classifier"

        # Change the following constant to select a different base model
        base_model_name = "granite-3.3-8b-instruct"

        use_cuda = True  # Set to False to use default PyTorch device for this machine + model

        # Fetch IO configuration file from Hugging Face Hub
        io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
            intrinsic_name, base_model_name
        )

        # Fetch LoRA directory from Hugging Face Hub
        lora_dir = granite_common.intrinsics.util.obtain_lora(
            intrinsic_name, base_model_name
        )

        # Instantiate input/output processors
        rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
        result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)

        # Sample request
        request_json = {
            "messages": [
                {
                "role": "user",
                "content": "Who attended the meeting?"
                },
                {
                "role": "assistant",
                "content": "Many people attended the meeting."
                },
                {
                "role": "user",
                "content": "answer_relevance"
                },
                {
                "role": "assistant",
                "content": "{\"answer_relevance_analysis\": \"The inquiry asks for the attendees of the meeting. The response provides a vague and non-specific answer that does not address the inquiry.\", \"answer_relevance_category\": \"No attempt\", \"answer_relevance_likelihood\": 0.0}"
                }
            ],
            "extra_body": {
                "documents": [
                {
                    "doc_id": "1",
                    "text": "Meeting attendees: Alice, Bob, Carol."
                },
                {
                    "doc_id": "2",
                    "text": "Meeting time: 9:00 am to 11:00 am."
                }
                ]
            }
         }

        # Add additional parameters
        request_json["model"] = intrinsic_name
        request_json["temperature"] = 0.0

        # Apply input processor
        intrinsic_kwargs = {}
        rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)

        # Load the base model and merge LoRA weights
        model, tokenizer = granite_common.util.load_transformers_lora(lora_dir)
        if use_cuda:
            model = model.cuda()

        # Convert the chat completion request into the Transformers library's native
        # input format.
        generate_input, other_input = (
            granite_common.util.chat_completion_request_to_transformers_inputs(
                rewritten_request,
                tokenizer,
                model,
            )
        )

        # Use the Transformers library's APIs to generate one or more completions,
        # then convert those completions into OpenAI-compatible chat completion responses.
        responses = granite_common.util.generate_with_transformers(
            tokenizer, model, generate_input, other_input
        )

        # Apply output processor
        transformed_responses = result_processor.transform(responses, rewritten_request)

        # Verify that the contents of the completion are valid JSON, then pretty-print the JSON.
        parsed_contents = json.loads(transformed_responses.choices[0].message.content)
        print("JSON output:")
        print(json.dumps(parsed_contents, indent=2))

## Training Details

### Training Data

The training data was created using the following process:
1. Take the synthetic rag-data-granite dataset, consisting of conversations between a user and an assistant.
2. Replace the assistant response by running granite-3.2-intrinsics at temperature 1.0.
3. Produce the answer_relevance_rewriter target output using mixtral-large with prompts containing in-context examples.

The conversations created in steps 1 and 2 are taken as the training input. The JSON string from step 3
is taken as the training target output.

#### Training Hyperparameters

The LoRA adapter was fine-tuned using PEFT under the following regime: rank =
32, learning rate = 1.0e-04, number of epochs = 5. 
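
A minimal sketch of a corresponding PEFT setup is shown below; only the rank, learning rate, and epoch count come from this card, while the remaining fields (alpha, dropout, target modules, output directory) are illustrative assumptions:

    from peft import LoraConfig
    from transformers import TrainingArguments

    # Values reported above: rank = 32, learning rate = 1.0e-04, epochs = 5.
    # All other fields are illustrative assumptions.
    lora_config = LoraConfig(
        r=32,
        lora_alpha=32,                          # assumption
        lora_dropout=0.05,                      # assumption
        target_modules=["q_proj", "v_proj"],    # assumption
        task_type="CAUSAL_LM",
    )

    training_args = TrainingArguments(
        output_dir="./answer_relevance_rewriter_lora",  # assumption
        learning_rate=1.0e-4,
        num_train_epochs=5,
    )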

## Evaluation

### Answer Relevance Rewriter

We evaluated the model on a test data set generated by the same procedure as the training data,
using GPT-4o as the judge.


The following table presents results comparing baselines and frontier models
on the answer relevance rewrite task. The data set consists of the responses classified as irrelevant by
mixtral-large. The evaluation is first divided into two parts: responses that are truly irrelevant,
for which we measure the rate at which the rewrite becomes relevant, and responses that are falsely flagged
as irrelevant, for which we measure the rate at which the rewrite becomes irrelevant. From these we calculate
the overall rates of flipping irrelevant to relevant and flipping relevant to irrelevant, as well as the net gain
in relevance and the resulting final relevance (a computation sketch is given below).
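
The following is a minimal sketch of how these quantities can be computed, assuming each test item carries two boolean judgments from the judge (field names are illustrative):

    # Each test item carries two boolean judgments from the GPT-4o judge:
    # "relevant_before" (the response prior to rewriting) and "relevant_after"
    # (the rewritten response).
    def rewrite_metrics(items):
        n = len(items)
        truly_irrelevant = [x for x in items if not x["relevant_before"]]
        false_irrelevant = [x for x in items if x["relevant_before"]]

        irr_to_rel = sum(x["relevant_after"] for x in truly_irrelevant)
        rel_to_irr = sum(not x["relevant_after"] for x in false_irrelevant)

        return {
            "true_irrelevant_flip_to_relevant": irr_to_rel / len(truly_irrelevant),
            "false_irrelevant_flip_to_irrelevant": rel_to_irr / len(false_irrelevant),
            "overall_flip_irrelevant_to_relevant": irr_to_rel / n,
            "overall_flip_relevant_to_irrelevant": rel_to_irr / n,
            "net_gain": (irr_to_rel - rel_to_irr) / n,
            "result_relevance": sum(x["relevant_after"] for x in items) / n,
        }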

The LoRAs outperform the best of the frontier models.

|                      | True irrelevant <br> flip to relevant | False irrelevant <br> flip to irrelevant| Overall <br> flip irrelevant <br> to relevant | Overall <br> flip relevant <br> to irrelevant| net gain | Result <br>relevance |
|:---------------------|:--------------|:---------|:------------------------------|:---------|:---------|:--------------|
| mixtral-8x22b-v0.1   | 0.416         | 0.101    | 0.286                         | 0.032    | 0.254    | 0.566         |
| llama-3.3-70b        | 0.804         | 0.041    | 0.554                         | 0.013    | 0.541    | 0.853         |
| gpt-oss-20b          | 0.902         | 0.034    | 0.621                         | 0.011    | 0.610    | 0.922         |
| gpt-4o               | 0.960         | 0.014    | 0.661                         | 0.004    | 0.657    | 0.968         |
| gpt-4o-mini          | 0.758         | 0.027    | 0.522                         | 0.008    | 0.514    | 0.825         |
|                      |               |          |                               |          |          |               |
| granite-3.3-2b/lora  | 0.972         | 0.027    | 0.669                         | 0.008    | 0.661    | 0.973         |
| granite-3.3-2b/alora | 0.972         | 0.007    | 0.669                         | 0.002    | 0.667    | 0.979         |
| granite-3.3-8b/lora  | 0.969         | 0.014    | 0.667                         | 0.004    | 0.663    | 0.975         |
| granite-3.3-8b/alora | 0.966         | 0.027    | 0.665                         | 0.008    | 0.657    | 0.968         |
|                      |               |          |                               |          |          |               |

### Comparing the Answer Relevance Rewriter Intrinsics vs. Vanilla Granite Models

We compare the performance of the vanilla Granite 3.3-2b Instruct and Granite 3.3-8b Instruct models
against the answer relevance rewriter intrinsics implemented as LoRA adapters.
The LoRAs significantly outperform the base models.

|                      | True irrelevant <br> flip to relevant | False irrelevant <br> flip to irrelevant| Overall <br> flip irrelevant <br> to relevant | Overall <br> flip relevant <br> to irrelevant| net gain | Result relevance |
|:---------------------|:--------------|:---------|:------------------------------|:---------|:---------|:--------------|
| granite-3.3-2b       | 0.346         | 0.169    | 0.238                         | 0.053    | 0.185    | 0.497         |
| granite-3.3-2b/lora  | 0.972         | 0.027    | 0.669                         | 0.008    | 0.661    | 0.973         |
| granite-3.3-2b/alora | 0.972         | 0.007    | 0.669                         | 0.002    | 0.667    | 0.979         |
|                      |               |          |                               |          |          |               |
| granite-3.3-8b       | 0.266         | 0.277    | 0.183                         | 0.086    | 0.097    | 0.408         |
| granite-3.3-8b/lora  | 0.969         | 0.014    | 0.667                         | 0.004    | 0.663    | 0.975         |
| granite-3.3-8b/alora | 0.966         | 0.027    | 0.665                         | 0.008    | 0.657    | 0.968         |
|                      |               |          |                               |          |          |               |


## Model Card Authors

[Huaiyu Zhu](mailto:[email protected])

### Framework versions

- PEFT 0.14.0