huaiyu-zhu committed on
Commit cde002d · 1 Parent(s): ee3a7d7

Add answer relevance model cards

.gitignore CHANGED
@@ -2,3 +2,4 @@
 # No MacOS image cache files, please.
 **/.DS_Store
 
+.venv/
answer_relevance_classifier/README.md ADDED
@@ -0,0 +1,363 @@
---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: peft
---

# Intrinsics for Answer Relevance Classifier

## Model Summary
This is a RAG-specific intrinsic for the answer relevance classification task.
The model takes as input a multi-turn conversation ending with an assistant response,
and provides a classification of whether the assistant's response is relevant to the
user's final inquiry, as well as a categorization of the relevance and the reasoning behind the conclusion.

We provide two intrinsics implemented as LoRA adapters (LoRA/aLoRA) trained over
Granite-3.3-2b-instruct and Granite-3.3-8b-instruct.

- **Developer:** IBM Research
- **Model type:** LoRA and aLoRA adapter for
[ibm-granite/granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct),
[ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct)
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Intended use
This RAG-specific intrinsic is intended to post-process a generated assistant response.

- The binary relevance classification can be used to determine whether the assistant response is suitable
to be given to the user, or whether a rewrite into a more relevant response is necessary.
- The category and the analysis explaining the conclusion can be
incorporated into the prompt for the answer relevance rewriter, indicating specific directions
the rewrite must take to overcome the perceived deficiency in relevance.

**Model input**: The input to the answer relevance classifier intrinsic is an
OpenAI-compatible chat completion request, containing a list of conversation
turns that can alternate between the `user` and `assistant` roles and ending with
an `assistant` turn.

**Model output**: The output of the answer relevance classifier intrinsic is the result of the
original chat completion request, formatted as a JSON object with the following schema:

    {
        answer_relevance_analysis: <Free-text analysis of whether and in which ways the assistant response is relevant or not>
        answer_relevance_category: <One of a set of labels>
        answer_relevance_likelihood: <float between 0.0 and 1.0>
    }

The set of labels for `answer_relevance_category` is:
"Pertinent",
"Pertinent with relevant extra",
"Excessive unnecessary information",
"Unduly restrictive",
"Too vague or generic",
"Contextual misalignment",
"Misinterpreted inquiry",
"No attempt"

Please see the code snippets in the Quickstart Example section below for
examples that illustrate the intrinsic's input/output.

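For illustration, a parsed classifier result for the sample conversation used in the Quickstart below might look like the following (shown as a Python dictionary; the analysis text, category, and likelihood mirror the classifier output embedded in the companion answer relevance rewriter card and are illustrative rather than guaranteed model output):

    # Hypothetical parsed result for the Quickstart conversation, in which
    # "Who attended the meeting?" is answered with "Many people attended the meeting."
    example_result = {
        "answer_relevance_analysis": (
            "The inquiry asks for the attendees of the meeting. The response provides "
            "a vague and non-specific answer that does not address the inquiry."
        ),
        "answer_relevance_category": "No attempt",
        "answer_relevance_likelihood": 0.0,
    }
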
## Quickstart Example

To run the answer relevance classifier intrinsic through granite-common, you can either (a)
use an OpenAI-compatible inference backend, such as vLLM, or (b) use the Hugging
Face transformers library. We provide instructions for each of the two
approaches below. Note that running inference using vLLM or another scalable
OpenAI-compatible inference backend should be significantly faster than using
the Hugging Face transformers library directly.

### Using an OpenAI-Compatible Inference Backend

To run the intrinsic using an OpenAI-compatible inference backend, such as vLLM,
follow the steps below.

1. Install the granite-common library:

        pip install git+https://github.com/ibm-granite/granite-common.git
        pip install granite_common[nltk]

2. Install the Hugging Face CLI:

        pip install -U "huggingface_hub[cli]"

3. Install vLLM:

        pip install vllm

4. Download the intrinsics library:

        hf download ibm-granite/rag-intrinsics-lib --local-dir ./rag-intrinsics-lib

5. Edit the vLLM startup script found in `./rag-intrinsics-lib/run_vllm.sh`
   using your favorite editor:

   Edit the constants `BASE_MODEL_NAME` and `BASE_MODEL_ORG` depending on the
   base model on which the desired LoRA adapter has been trained. Optionally,
   edit the constant `PORT` to change the port on which vLLM will run. Save the
   modified file and exit the editor.

6. Start vLLM through the startup script. The first time you run the script,
   you may have to change the permissions to allow execution:

        cd rag-intrinsics-lib
        chmod u+x ./run_vllm.sh
        ./run_vllm.sh &

7. Run the following code snippet:

        import json
        import openai
        import granite_common

        intrinsic_name = "answer_relevance_classifier"

        # Change the following constant to select a different base model
        base_model_name = "granite-3.3-8b-instruct"

        # Change the following constants as needed to reflect the location of the vLLM server.
        # The selected port should be identical to the one you specified in the vLLM startup script.
        openai_base_url = "http://localhost:55555/v1"
        openai_api_key = "rag_intrinsics_1234"

        # Fetch IO configuration file from Hugging Face Hub
        io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
            intrinsic_name, base_model_name
        )

        # Instantiate input/output processors
        rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
        result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)

        # Sample request
        request_json = {
            "messages": [
                {
                    "role": "user",
                    "content": "Who attended the meeting?"
                },
                {
                    "role": "assistant",
                    "content": "Many people attended the meeting."
                }
            ],
            "extra_body": {
                "documents": [
                    {
                        "doc_id": "1",
                        "text": "Meeting attendees: Alice, Bob, Carol."
                    },
                    {
                        "doc_id": "2",
                        "text": "Meeting time: 9:00 am to 11:00 am."
                    }
                ]
            }
        }

        # Add other parameters
        request_json["model"] = intrinsic_name
        request_json["temperature"] = 0.0

        # Apply input processor
        intrinsic_kwargs = {}
        rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)

        # Run inference
        client = openai.OpenAI(base_url=openai_base_url, api_key=openai_api_key)
        chat_completion = client.chat.completions.create(**rewritten_request.model_dump())

        # Apply output processor
        processed_chat_completion = result_processor.transform(
            chat_completion, rewritten_request
        )

        # Verify that the contents of the completion are valid JSON and pretty-print the JSON.
        parsed_contents = json.loads(processed_chat_completion.choices[0].message.content)
        print("JSON output:")
        print(json.dumps(parsed_contents, indent=2))

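As described under Intended use, an application will typically gate on `answer_relevance_likelihood` to decide whether the generated response can be returned as-is or should be sent to the answer relevance rewriter. Below is a minimal sketch of such a gate, continuing from `parsed_contents` in the snippet above; the threshold value and the helper name are illustrative assumptions, not part of the intrinsic itself.

    # Assumed application-specific cut-off; this model card does not prescribe a value.
    RELEVANCE_THRESHOLD = 0.5

    def needs_rewrite(classifier_result: dict, threshold: float = RELEVANCE_THRESHOLD) -> bool:
        """Return True when the assistant response should be sent to the rewriter."""
        return classifier_result["answer_relevance_likelihood"] < threshold

    # Continuing from the snippet above, where `parsed_contents` holds the classifier output:
    if needs_rewrite(parsed_contents):
        # The category and analysis explain the deficiency and can be passed to the rewriter.
        print("Rewrite needed:", parsed_contents["answer_relevance_category"])
        print("Reason:", parsed_contents["answer_relevance_analysis"])
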
### Using the Hugging Face Transformers Library

To run the intrinsic using the Hugging Face transformers library directly,
follow the steps below.

1. Install the granite-common library:

        pip install git+https://github.com/ibm-granite/granite-common.git
        pip install granite_common[nltk]

2. Install the Hugging Face CLI:

        pip install -U "huggingface_hub[cli]"

3. Install PEFT:

        pip install peft

4. Install xgrammar:

        pip install xgrammar

5. Run the following code snippet:

        import json
        import granite_common.util
        import peft

        intrinsic_name = "answer_relevance_classifier"

        # Change the following constant to select a different base model
        base_model_name = "granite-3.3-8b-instruct"

        use_cuda = True  # Set to False to use the default PyTorch device for this machine + model

        # Fetch IO configuration file from Hugging Face Hub
        io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
            intrinsic_name, base_model_name
        )

        # Fetch LoRA directory from Hugging Face Hub
        lora_dir = granite_common.intrinsics.util.obtain_lora(
            intrinsic_name, base_model_name
        )

        # Instantiate input/output processors
        rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
        result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)

        # Sample request
        request_json = {
            "messages": [
                {
                    "role": "user",
                    "content": "Who attended the meeting?"
                },
                {
                    "role": "assistant",
                    "content": "Many people attended the meeting."
                }
            ],
            "extra_body": {
                "documents": [
                    {
                        "doc_id": "1",
                        "text": "Meeting attendees: Alice, Bob, Carol."
                    },
                    {
                        "doc_id": "2",
                        "text": "Meeting time: 9:00 am to 11:00 am."
                    }
                ]
            }
        }

        # Add additional parameters
        request_json["model"] = intrinsic_name
        request_json["temperature"] = 0.0

        # Apply input processor
        intrinsic_kwargs = {}
        rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)

        # Load the base model and merge LoRA weights
        model, tokenizer = granite_common.util.load_transformers_lora(lora_dir)
        if use_cuda:
            model = model.cuda()

        # Convert the chat completion request into the Transformers library's proprietary
        # format.
        generate_input, other_input = (
            granite_common.util.chat_completion_request_to_transformers_inputs(
                rewritten_request,
                tokenizer,
                model,
            )
        )

        # Use the Transformers library's APIs to generate one or more completions,
        # then convert those completions into OpenAI-compatible chat completion responses.
        responses = granite_common.util.generate_with_transformers(
            tokenizer, model, generate_input, other_input
        )

        # Apply output processor
        transformed_responses = result_processor.transform(responses, rewritten_request)

        # Verify that the contents of the completion are valid JSON and pretty-print the JSON.
        parsed_contents = json.loads(transformed_responses.choices[0].message.content)
        print("JSON output:")
        print(json.dumps(parsed_contents, indent=2))

## Training Details

### Training Data

The training data was created by the following process:
1. Take the synthetic rag-data-granite dataset, consisting of conversations between a user and an assistant.
2. Replace the assistant responses by running granite-3.2-intrinsics at temperature 1.0.
3. Produce the answer_relevance_classifier target output using mixtral-large with prompts containing in-context examples.

The conversations created in steps 1 and 2 are used as the training input. The JSON string from step 3
is used as the training target output.

#### Training Hyperparameters

The LoRA adapter was fine-tuned using PEFT under the following regime: rank =
32, learning rate = 3.0e-06, number of epochs = 50.

## Evaluation

### Answer Relevance Classifier

We evaluated the model on a test data set generated by the same procedure as the training data,
using GPT-4o as the judge.

The following table presents results comparing baselines and frontier models
on the answer relevance classification task. The LoRAs perform on par with frontier models
of much larger size and outperform frontier models of comparable size.

| | Not relevant | | | Relevant | | |
|:------------------------|:----------|:-------|:------|:----------|:-------|:------|
| | precision | recall | f1 | precision | recall | f1 |
| mixtral-8x22b-v0.1 | 0.934 | 0.592 | 0.725 | 0.886 | 0.880 | 0.883 |
| llama-3.3-70b | 0.895 | 0.829 | 0.861 | 0.898 | 0.939 | 0.918 |
| gpt-oss-20b | 0.747 | 0.745 | 0.746 | 0.969 | 0.782 | 0.865 |
| gpt-4o | 0.775 | 0.945 | 0.852 | 0.974 | 0.690 | 0.808 |
| gpt-4o-mini | 0.818 | 0.921 | 0.866 | 0.948 | 0.872 | 0.908 |
| | | | | | | |
| granite-3.3-2b/lora | 0.743 | 0.861 | 0.798 | 0.909 | 0.806 | 0.855 |
| granite-3.3-2b/alora | 0.761 | 0.821 | 0.790 | 0.894 | 0.833 | 0.862 |
| granite-3.3-8b/lora | 0.783 | 0.900 | 0.837 | 0.931 | 0.842 | 0.884 |
| granite-3.3-8b/alora | 0.793 | 0.879 | 0.834 | 0.919 | 0.856 | 0.886 |

### Comparing the Answer Relevance Classifier Intrinsics vs. Vanilla Granite Models

We compare the performance of Granite 3.3 2b Instruct and Granite 3.3 8b Instruct
against the answer relevance classifier intrinsics implemented as LoRA adapters.
The LoRAs significantly outperform the base models.

| | Not relevant | | | Relevant | | |
|:------------------------|:----------|:-------|:------|:----------|:-------|:------|
| | precision | recall | f1 | precision | recall | f1 |
| granite-3.3-2b | | | | | | |
| granite-3.3-2b/lora | 0.743 | 0.861 | 0.798 | 0.909 | 0.806 | 0.855 |
| granite-3.3-2b/alora | 0.761 | 0.821 | 0.790 | 0.894 | 0.833 | 0.862 |
| | | | | | | |
| granite-3.3-8b | 0.798 | 0.542 | 0.646 | 0.813 | 0.770 | 0.791 |
| granite-3.3-8b/lora | 0.783 | 0.900 | 0.837 | 0.931 | 0.842 | 0.884 |
| granite-3.3-8b/alora | 0.793 | 0.879 | 0.834 | 0.919 | 0.856 | 0.886 |

## Model Card Authors

[Huaiyu Zhu](mailto:[email protected])

### Framework versions

- PEFT 0.14.0
answer_relevance_rewriter/README.md ADDED
@@ -0,0 +1,380 @@
---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: peft
---

# Intrinsics for Answer Relevance Rewriter

## Model Summary
This is a RAG-specific intrinsic for the answer relevance rewrite task.
The model takes as input the chat completion output of the answer relevance classifier,
consisting of the conversation as well as the answer relevance classification, together with the grounding documents,
and provides a rewritten assistant response that is more relevant to the user's final inquiry.

We provide two intrinsics implemented as LoRA adapters (LoRA/aLoRA) trained over
Granite-3.3-2b-instruct and Granite-3.3-8b-instruct.

- **Developer:** IBM Research
- **Model type:** LoRA and aLoRA adapter for
[ibm-granite/granite-3.3-2b-instruct](https://huggingface.co/ibm-granite/granite-3.3-2b-instruct),
[ibm-granite/granite-3.3-8b-instruct](https://huggingface.co/ibm-granite/granite-3.3-8b-instruct)
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Intended use
This RAG-specific intrinsic is intended to post-process a generated assistant response.
It should be used after the answer relevance classifier intrinsic, and should be applied to
the cases where the `answer_relevance_likelihood` is below a certain threshold chosen according to application criteria.

For cases where the assistant answer is deemed not relevant (where `answer_relevance_likelihood` is below a
given threshold), the answer relevance rewriter intrinsic can be used to rewrite the assistant response
into a more relevant response. It takes as input the chat completion
output of the answer relevance classifier and the grounding documents. Its output is of the form

    {
        answer_relevance_rewrite: <Rewritten response>
    }

The rewriter is instructed to correct only the deficiencies in relevance identified by the classifier,
and to ensure the rewritten response is grounded in the conversation and the given documents.

**Model input**: The input to the answer relevance rewriter intrinsic is an
OpenAI-compatible chat completion request whose list of conversation turns is made up of:
- a conversation that can alternate between the `user` and `assistant` roles and ends with an assistant response,
- an additional `user` turn with content "answer_relevance", and
- an additional `assistant` turn whose content is the JSON output of the answer relevance classifier.

**Model output**: The output of the answer relevance rewriter intrinsic is the result of the
original chat completion request, formatted as a JSON object with the following schema:

    {
        answer_relevance_rewrite: <Rewritten response>
    }

Please see the code snippets in the Quickstart Example section below for
examples that illustrate the intrinsic's input/output.

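As a concrete illustration of this input layout, the sketch below builds a rewriter request from a finished conversation, its grounding documents, and the classifier's JSON output. The threshold value, variable names, and helper function are assumptions for illustration; the message layout follows the sample request in the Quickstart below.

    import json

    # Assumed application-specific cut-off; this model card does not prescribe a value.
    RELEVANCE_THRESHOLD = 0.5

    def build_rewriter_request(conversation: list, documents: list, classifier_json: str):
        """Append the classifier verdict to the conversation in the layout the rewriter
        expects: a user turn "answer_relevance" followed by an assistant turn carrying
        the classifier's JSON output. Returns None when no rewrite is needed."""
        verdict = json.loads(classifier_json)
        if verdict["answer_relevance_likelihood"] >= RELEVANCE_THRESHOLD:
            return None
        return {
            "messages": conversation
            + [
                {"role": "user", "content": "answer_relevance"},
                {"role": "assistant", "content": classifier_json},
            ],
            "extra_body": {"documents": documents},
        }
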
## Quickstart Example

To run the answer relevance rewriter intrinsic through granite-common, you can either (a)
use an OpenAI-compatible inference backend, such as vLLM, or (b) use the Hugging
Face transformers library. We provide instructions for each of the two
approaches below. Note that running inference using vLLM or another scalable
OpenAI-compatible inference backend should be significantly faster than using
the Hugging Face transformers library directly.

### Using an OpenAI-Compatible Inference Backend

To run the intrinsic using an OpenAI-compatible inference backend, such as vLLM,
follow the steps below.

1. Install the granite-common library:

        pip install git+https://github.com/ibm-granite/granite-common.git
        pip install granite_common[nltk]

2. Install the Hugging Face CLI:

        pip install -U "huggingface_hub[cli]"

3. Install vLLM:

        pip install vllm

4. Download the intrinsics library:

        hf download ibm-granite/rag-intrinsics-lib --local-dir ./rag-intrinsics-lib

5. Edit the vLLM startup script found in `./rag-intrinsics-lib/run_vllm.sh`
   using your favorite editor:

   Edit the constants `BASE_MODEL_NAME` and `BASE_MODEL_ORG` depending on the
   base model on which the desired LoRA adapter has been trained. Optionally,
   edit the constant `PORT` to change the port on which vLLM will run. Save the
   modified file and exit the editor.

6. Start vLLM through the startup script. The first time you run the script,
   you may have to change the permissions to allow execution:

        cd rag-intrinsics-lib
        chmod u+x ./run_vllm.sh
        ./run_vllm.sh &

7. Run the following code snippet:

        import json
        import openai
        import granite_common

        intrinsic_name = "answer_relevance_rewriter"

        # Change the following constant to select a different base model
        base_model_name = "granite-3.3-8b-instruct"

        # Change the following constants as needed to reflect the location of the vLLM server.
        # The selected port should be identical to the one you specified in the vLLM startup script.
        openai_base_url = "http://localhost:55555/v1"
        openai_api_key = "rag_intrinsics_1234"

        # Fetch IO configuration file from Hugging Face Hub
        io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
            intrinsic_name, base_model_name
        )

        # Instantiate input/output processors
        rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
        result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)

        # Sample request
        request_json = {
            "messages": [
                {
                    "role": "user",
                    "content": "Who attended the meeting?"
                },
                {
                    "role": "assistant",
                    "content": "Many people attended the meeting."
                },
                {
                    "role": "user",
                    "content": "answer_relevance"
                },
                {
                    "role": "assistant",
                    "content": "{\"answer_relevance_analysis\": \"The inquiry asks for the attendees of the meeting. The response provides a vague and non-specific answer that does not address the inquiry.\", \"answer_relevance_category\": \"No attempt\", \"answer_relevance_likelihood\": 0.0}"
                }
            ],
            "extra_body": {
                "documents": [
                    {
                        "doc_id": "1",
                        "text": "Meeting attendees: Alice, Bob, Carol."
                    },
                    {
                        "doc_id": "2",
                        "text": "Meeting time: 9:00 am to 11:00 am."
                    }
                ]
            }
        }

        # Add other parameters
        request_json["model"] = intrinsic_name
        request_json["temperature"] = 0.0

        # Apply input processor
        intrinsic_kwargs = {}
        rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)

        # Run inference
        client = openai.OpenAI(base_url=openai_base_url, api_key=openai_api_key)
        chat_completion = client.chat.completions.create(**rewritten_request.model_dump())

        # Apply output processor
        processed_chat_completion = result_processor.transform(
            chat_completion, rewritten_request
        )

        # Verify that the contents of the completion are valid JSON and pretty-print the JSON.
        parsed_contents = json.loads(processed_chat_completion.choices[0].message.content)
        print("JSON output:")
        print(json.dumps(parsed_contents, indent=2))

+ ### Using the Hugging Face Transformers Library
190
+
191
+ To run the intrinsic using the Hugging Face transformers library directly,
192
+ follow the steps below.
193
+
194
+ 1. Install the granite-common library:
195
+
196
+ pip install git+https://github.com/ibm-granite/granite-common.git
197
+ pip install granite_common[nltk]
198
+
199
+ 2. Install the Hugging Face CLI:
200
+
201
+ pip install -U "huggingface_hub[cli]"
202
+
203
+ 3. Install PEFT:
204
+
205
+ pip install peft
206
+
207
+ 4. Install xgrammar:
208
+
209
+ pip install xgrammar
210
+
211
+ 5. Run the following code snippet:
212
+
213
+ import json
214
+ import granite_common.util
215
+ import peft
216
+
217
+ intrinsic_name = "answer_relevance_classifier"
218
+
219
+ # Change the following constant to select a different base model
220
+ base_model_name = "granite-3.3-8b-instruct"
221
+
222
+ use_cuda = True # Set to False to use default PyTorch device for this machine + model
223
+
224
+ # Fetch IO configuration file from Hugging Face Hub
225
+ io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
226
+ intrinsic_name, base_model_name
227
+ )
228
+
229
+ # Fetch LoRA directory from Hugging Face Hub
230
+ lora_dir = granite_common.intrinsics.util.obtain_lora(
231
+ intrinsic_name, base_model_name
232
+ )
233
+
234
+ # Instantiate input/output processors
235
+ rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
236
+ result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)
237
+
238
+ # Sample request
239
+ request_json = {
240
+ "messages": [
241
+ {
242
+ "role": "user",
243
+ "content": "Who attended the meeting?"
244
+ },
245
+ {
246
+ "role": "assistant",
247
+ "content": "Many people attended the meeting."
248
+ },
249
+ {
250
+ "role": "user",
251
+ "content": "answer_relevance"
252
+ },
253
+ {
254
+ "role": "assistant",
255
+ "content": "{\"answer_relevance_analysis\": \"The inquiry asks for the attendees of the meeting. The response provides a vague and non-specific answer that does not address the inquiry.\", \"answer_relevance_category\": \"No attempt\", \"answer_relevance_likelihood\": 0.0}"
256
+ }
257
+ ],
258
+ "extra_body": {
259
+ "documents": [
260
+ {
261
+ "doc_id": "1",
262
+ "text": "Meeting attendees: Alice, Bob, Carol."
263
+ },
264
+ {
265
+ "doc_id": "2",
266
+ "text": "Meeting time: 9:00 am to 11:00 am."
267
+ }
268
+ ]
269
+ }
270
+ }
271
+
272
+ # Add additional parameters
273
+ request_json["model"] = intrinsic_name
274
+ request_json["temperature"] = 0.0
275
+
276
+ # Apply input processor
277
+ intrinsic_kwargs = {}
278
+ rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)
279
+
280
+ # Load the base model and merge LoRA weights
281
+ model, tokenizer = granite_common.util.load_transformers_lora(lora_dir)
282
+ if use_cuda:
283
+ model = model.cuda()
284
+
285
+ # Convert the chat completion request into a the Transformers library's proprietary
286
+ # format.
287
+ generate_input, other_input = (
288
+ granite_common.util.chat_completion_request_to_transformers_inputs(
289
+ rewritten_request,
290
+ tokenizer,
291
+ model,
292
+ )
293
+ )
294
+
295
+ # Use the Transformers library's APIs to generate one or more completions,
296
+ # then convert those completions into OpenAI-compatible chat completion
297
+ responses = granite_common.util.generate_with_transformers(
298
+ tokenizer, model, generate_input, other_input
299
+ )
300
+
301
+ # Apply output processor
302
+ transformed_responses = result_processor.transform(responses, rewritten_request)
303
+
304
+ # Verify that the contents of the completion is valid JSON and pretty-print the JSON.
305
+ parsed_contents = json.loads(transformed_responses.choices[0].message.content)
306
+ print("JSON output:")
307
+ print(json.dumps(parsed_contents, indent=2))
308
+
309
+ ## Training Details
310
+
311
+ ### Training Data
312
+
313
+ The training data is created in the following process
314
+ 1. Take the synthetic rag-data-granite dataset, consisting of conversations between user and assistant.
315
+ 2. Replace the assistant response by running granite-3.2-intrinsics at temperature 1.0.
316
+ 3. Produce answer_relevance_rewriter target output using mixtral-large with prompts with in-context examples.
317
+ The conversation created in steps 1 and 2 are taken as training input. The json string from step 3
318
+ is taken as train target output.
319
+
320
+ #### Training Hyperparameters
321
+
322
+ The LoRA adapter was fine-tuned using PEFT under the following regime: rank =
323
+ 32, learning rate = 1.0e-04, number of epochs = 5.
324
+
325
+ ## Evaluation
326
+
327
+ ### Answer Relevance Rewriter
328
+
329
+ We evaluated the model on test data set generated by the same procedure as the training process,
330
+ using GPT-4o as judge.
331
+
332
+
333
+ The following table presents results comparing baselines and frontier models
334
+ on the answer relevance rewrite task. The data sets consists of those classified as irrelevant by
335
+ mixtral-large. The evaluations are first divided into two parts, those that are truly irrelevant,
336
+ for which we measure the rate of rewrite becoming relevant, and those that are false irrelevant,
337
+ for which we measure the rate of rewrite becoming irrelevant. Then the overall rate flipping
338
+ irrelevant to relevant and flipping relevant to irrelevant are calculated, and well as the net gain
339
+ of relevance and resulting final relevance.
340
+
341
+ The LoRAs out perform the best of frontier models
342
+
343
+ | | True irrelevant <br> flip to relevant | False irrelevant <br> flip to irrelevant| Overall <br> flip irrelevant <br> to relevant | Overall <br> flip relevant <br> to irrelevant| net gain | Result <br>relevance |
344
+ |:---------------------|:--------------|:---------|:------------------------------|:---------|:---------|:--------------|
345
+ | mixtral-8x22b-v0.1 | 0.416 | 0.101 | 0.286 | 0.032 | 0.254 | 0.566 |
346
+ | llama-3.3-70b | 0.804 | 0.041 | 0.554 | 0.013 | 0.541 | 0.853 |
347
+ | gpt-oss-20b | 0.902 | 0.034 | 0.621 | 0.011 | 0.610 | 0.922 |
348
+ | gpt-4o | 0.960 | 0.014 | 0.661 | 0.004 | 0.657 | 0.968 |
349
+ | gpt-4o-mini | 0.758 | 0.027 | 0.522 | 0.008 | 0.514 | 0.825 |
350
+ | | | | | | | |
351
+ | granite-3.3-2b/lora | 0.972 | 0.027 | 0.669 | 0.008 | 0.661 | 0.973 |
352
+ | granite-3.3-2b/alora | 0.972 | 0.007 | 0.669 | 0.002 | 0.667 | 0.979 |
353
+ | granite-3.3-8b/lora | 0.969 | 0.014 | 0.667 | 0.004 | 0.663 | 0.975 |
354
+ | granite-3.3-8b/alora | 0.966 | 0.027 | 0.665 | 0.008 | 0.657 | 0.968 |
355
+ | | | | | | | |
356
+
357
+ ### Comparing the Answer Relevance Rewriter Intrinsics vs. Vanilla Granite Models
358
+
359
+ We compare the performance of Granite 3.3-2b, Granite 3.3-8b Instruct
360
+ vs. answer relevance rewriter intrinsics implemented as LoRA adapters.
361
+ It is seen that the LoRAs significantly out perform the base models.
362
+ | | True irrelevant <br> flip to relevant | False irrelevant <br> flip to irrelevant| Overall <br> flip irrelevant <br> to relevant | Overall <br> flip relevant <br> to irrelevant| net gain | Result relevance |
363
+ |:---------------------|:--------------|:---------|:------------------------------|:---------|:---------|:--------------|
364
+ | granite-3.3-2b | 0.346 | 0.169 | 0.238 | 0.053 | 0.185 | 0.497 |
365
+ | granite-3.3-2b/lora | 0.972 | 0.027 | 0.669 | 0.008 | 0.661 | 0.973 |
366
+ | granite-3.3-2b/alora | 0.972 | 0.007 | 0.669 | 0.002 | 0.667 | 0.979 |
367
+ | | | | | | | |
368
+ | granite-3.3-8b | 0.266 | 0.277 | 0.183 | 0.086 | 0.097 | 0.408 |
369
+ | granite-3.3-8b/lora | 0.969 | 0.014 | 0.667 | 0.004 | 0.663 | 0.975 |
370
+ | granite-3.3-8b/alora | 0.966 | 0.027 | 0.665 | 0.008 | 0.657 | 0.968 |
371
+ | | | | | | | |
372
+
373
+
374
+ ## Model Card Authors
375
+
376
+ [Huaiyu Zhu](mailto:[email protected])
377
+
378
+ ### Framework versions
379
+
380
+ - PEFT 0.14.0