LoRA Adapter for Query Rewrite
Welcome to Granite Experiments!
Think of Experiments as a preview of what's to come. These projects are still under development, but we wanted to let the open-source community take them for a spin! Use them, break them, and help us build what's next for Granite – we'll keep an eye out for feedback and questions. Happy exploring!
Just a heads-up: Experiments are forever evolving, so we can't commit to ongoing support or guarantee performance.
Model Summary
This is a LoRA adapter for ibm-granite/granite-3.2-8b-instruct that is fine-tuned for the query rewrite task:
Given a multi-turn conversation between a user and an AI assistant, decontextualize the last
user utterance (query) by rewriting it (whenever necessary) into an equivalent version that
is standalone and can be understood by itself.
While this adapter is general purpose, it is especially effective in RAG settings, where its ability to rewrite a user query into a standalone version directly improves retrieval performance, which in turn improves answer generation quality.
- Developer: IBM Research
- Model type: LoRA adapter for ibm-granite/granite-3.2-8b-instruct
- License: Apache 2.0
Intended use
This is a LoRA adapter that adds the ability to rewrite the last user query in a multi-turn conversation. Typically, the rewrite is a form of expansion that inlines into the query any implicit references to entities, concepts, or even parts of the conversation that occur in the previous turns (either by the user or the AI assistant). Such expansion includes coreference resolution (i.e., replacement of pronouns with the actual entities) as well as handling of ellipsis, the common linguistic phenomenon where parts of a sentence or phrase are omitted by the user but can be understood from the context (e.g., for whom, of what, with respect to something discussed above).
As a result of the expansion, the query becomes a standalone query that is still equivalent in meaning to what the user asked in the last turn. The rewritten query can be sent to downstream tasks (e.g., to a retriever in a RAG setting) as a better replacement for the original user query, without the need for the (potentially very long) conversation context.
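For example (a hypothetical conversation, not taken from the training data or evaluation sets), coreference resolution inlines the entity that a pronoun refers to:

```python
# Hypothetical example (illustrative only): the last user turn uses the
# pronoun "she" to refer to an entity mentioned earlier in the conversation.
conversation = [
    {"role": "user", "content": "Who wrote Pride and Prejudice?"},
    {"role": "assistant", "content": "Pride and Prejudice was written by Jane Austen."},
    {"role": "user", "content": "What else did she write?"},
]

# A standalone rewrite of the last turn would be along the lines of:
#   "What else did Jane Austen write besides Pride and Prejudice?"
```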
Model input: The input to the model is a list of conversational turns converted to a string using the apply_chat_template function. These turns can alternate between the user and assistant roles, and the last turn is assumed to be from the user.
To prompt the LoRA adapter to rewrite the last user turn, a special rewrite role is used to trigger this capability of the model. The role includes the keyword "rewrite" followed by a short description of the query rewrite task:
```
<|start_of_role|>rewrite: Reword the final utterance from the USER into a single utterance that doesn't need the prior conversation history to understand the user's intent. If the final utterance is a clear and standalone question, please DO NOT attempt to rewrite it, rather output the last utterance as is. Your output format should be in JSON: { "rewritten_question": <REWRITE> }<|end_of_role|>
```
Model output: When prompted with the above special rewrite role, the model generates a JSON object containing a field with the actual rewritten question.
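For instance, for the conversation used in the Quickstart Example below, the generation ends with a JSON object of the following form (the exact wording of the rewrite may vary from run to run):

```
{ "rewritten_question": "Who is the CEO of Microsoft?" }
```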
Note: Even though one main application of query rewrite is in RAG settings, this LoRA adapter can be used to rewrite user questions for other conversational use cases (e.g., to access a database, or other APIs, or tools). As such, the adapter does not need any RAG documents (which may be present in the context, in a RAG setting) and uses only the dialog turns, i.e., what is being said between the user and the assistant.
Quickstart Example
Use the code below to get started with the model.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import json, re
INSTRUCTION_TEXT = "Reword the final utterance from the USER into a single utterance that doesn't need the prior conversation history to understand the user's intent. If the final utterance is a clear and standalone question, please DO NOT attempt to rewrite it, rather output the last user utterance as is. "
JSON = "Your output format should be in JSON: { \"rewritten_question\": <REWRITE> }"
REWRITE_PROMPT = "<|start_of_role|>rewrite: " + INSTRUCTION_TEXT + JSON + "<|end_of_role|>"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BASE_NAME = "ibm-granite/granite-3.2-8b-instruct"
LORA_NAME = "ibm-granite/granite-3.2-8b-lora-rag-query-rewrite"
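# Load the tokenizer, the base model, and the LoRA rewrite adapter on top of it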
tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side='left', trust_remote_code=True)
model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME, device_map='auto')
model_rewrite = PeftModel.from_pretrained(model_base, LORA_NAME)
# Input conversation
conv = [
    {"role": "user", "content": "Tim Cook is the CEO of Apple Inc."},
    {"role": "assistant", "content": "Yes, Tim Cook is the Chief Executive Officer of Apple Inc."},
    {"role": "user", "content": "and for Microsoft?"},
]
# Generate the query rewrite for the last turn in the above conversation
conv = [{"role":"system", "content":""}] + conv
input_text = tokenizer.apply_chat_template(conv, tokenize=False) + REWRITE_PROMPT
inputs = tokenizer(input_text, return_tensors="pt")
output = model_rewrite.generate(inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"].to(device), max_new_tokens=80)
output_text = tokenizer.decode(output[0])
# Regex pattern to extract the JSON with the rewrite from the output of the model
pattern = r'\{\s*"[^"]+"\s*:\s*"[^"]*"\s*\}'
match_js = re.findall(pattern, output_text)[0]
try:
    # Parse the JSON and extract the rewrite
    rewrite = json.loads(match_js)['rewritten_question']
except Exception:
    rewrite = match_js.split("\"rewritten_question\": ", 1)[1]
print(f"Rewrite: {rewrite}\n")
# Rewrite: Who is the CEO of Microsoft?
```
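The steps above can be wrapped into a small helper for arbitrary conversations. The sketch below reuses the objects defined in the Quickstart (tokenizer, model_rewrite, REWRITE_PROMPT, device); the function name and the fallback behavior are our own choices, not part of the model's API.

```python
def rewrite_last_turn(conversation, max_new_tokens=80):
    """Return a standalone rewrite of the last user turn in `conversation`."""
    turns = [{"role": "system", "content": ""}] + conversation
    prompt = tokenizer.apply_chat_template(turns, tokenize=False) + REWRITE_PROMPT
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output = model_rewrite.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=max_new_tokens,
    )
    text = tokenizer.decode(output[0])
    # Extract the last JSON object emitted by the model
    matches = re.findall(r'\{\s*"[^"]+"\s*:\s*"[^"]*"\s*\}', text)
    if not matches:
        # No JSON found: fall back to the original user turn
        return conversation[-1]["content"]
    try:
        return json.loads(matches[-1])["rewritten_question"]
    except (json.JSONDecodeError, KeyError):
        return conversation[-1]["content"]

# Example: rewrite the last turn of the Quickstart conversation
# print(rewrite_last_turn(conv))
```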
Training Details
The training data contains two types of examples: 1) standalone examples, which teach the adapter to refrain from rewriting user questions that are already standalone, and 2) non-standalone examples, covering a diversity of patterns, which teach the adapter to expand the user turn so that it becomes standalone.
Training Data
The training data uses the publicly available Cloud corpus of technical documentation pages from MT-RAG. Based on this corpus of documents, we constructed a dataset consisting of high-quality, human-created conversations, where the last turn of each conversation comes in two versions: a non-standalone version and a corresponding standalone version. The training dataset is proprietary and was obtained in collaboration with a third-party company that contracted the human annotators.
Training Hyperparameters
The LoRA adapter was fine-tuned using PEFT under the following regime: rank = 32, learning rate = 3e-6, 25 epochs with early stopping based on the validation set, and a 90/10 split between training and validation.
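For reference, a minimal PEFT configuration consistent with the hyperparameters above might look like the sketch below. Only the rank, learning rate, and number of epochs are taken from this card; lora_alpha, lora_dropout, target_modules, and all other settings are illustrative assumptions, not the exact training recipe.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Rank r = 32 is stated above; the remaining LoRA settings are assumptions
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,                        # assumption
    lora_dropout=0.05,                    # assumption
    target_modules=["q_proj", "v_proj"],  # assumption
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.2-8b-instruct")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# Training would then use learning rate 3e-6 for up to 25 epochs, with early
# stopping on the 10% validation split (e.g., via the transformers Trainer)
```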
Evaluation
Evaluation of retriever
We evaluate Recall@k on the MT-RAG benchmark, under various query rewrite strategies for the retriever. All retrieved passages are obtained using the Elser retriever with the same settings as in the MT-RAG paper. In addition to the LoRA adapter, we include several other baselines: no-rewrite (where the last user turn is sent to the retriever as-is), Mixtral rewrites, and gold rewrites (human-created). We evaluate on three test sets: a) the full MT-RAG dataset (842 data points with last user turns); b) the non-standalone subset of MT-RAG, comprising the 260 (out of 842) last user turns annotated by humans as non-standalone (i.e., dependent on the prior context); c) the standalone subset of MT-RAG, the complementary subset containing all last user turns annotated by humans as standalone.
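Recall@k is typically computed as the fraction of the gold (relevant) passages for a query that appear among the top-k retrieved passages, averaged over all queries. A minimal per-query sketch (our own helper, with hypothetical passage IDs) is:

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant passages found in the top-k retrieved passages."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

# Hypothetical example: 2 of the 3 relevant passages appear in the top-5 results
retrieved = ["p7", "p2", "p9", "p4", "p1", "p3"]
relevant = {"p2", "p1", "p8"}
print(recall_at_k(retrieved, relevant, k=5))  # 0.666...
```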
a. Evaluation of Recall@k on full MT-RAG dataset.
| Strategy | Recall@5 | Recall@10 | Recall@20 |
|---|---|---|---|
| No rewrite | 0.486 | 0.587 | 0.665 |
| Mixtral 8x7b rewrite | 0.522 | 0.642 | 0.720 |
| Granite 3.2-8b LoRA rewrite | 0.557 | 0.680 | 0.760 |
| Gold rewrite | 0.563 | 0.674 | 0.747 |
b. Evaluation of Recall@k on the non-standalone subset of MT-RAG.
| Strategy | Recall@5 | Recall@10 | Recall@20 |
|---|---|---|---|
| No rewrite | 0.263 | 0.338 | 0.435 |
| Mixtral 8x7b rewrite | 0.362 | 0.488 | 0.574 |
| Granite 3.2-8b LoRA rewrite | 0.444 | 0.567 | 0.661 |
| Gold rewrite | 0.479 | 0.582 | 0.662 |
c. Evaluation of Recall@k on the standalone subset of MT-RAG.
| Strategy | Recall@5 | Recall@10 | Recall@20 |
|---|---|---|---|
| No rewrite | 0.609 | 0.723 | 0.792 |
| Mixtral 8x7b rewrite | 0.613 | 0.733 | 0.809 |
| Granite 3.2-8b LoRA rewrite | 0.632 | 0.754 | 0.828 |
| Gold rewrite | 0.609 | 0.723 | 0.792 |
Focusing on Recall@20 as one instance of the metric, query rewrite with the Granite 3.2-8b LoRA adapter yields an overall improvement of 9.5 percentage points over the no-rewrite strategy. The improvement is more pronounced on the non-standalone subset, where the LoRA rewrites lead to an almost 23 percentage point gain over no-rewrite. We also observe that the LoRA rewrites come very close to the gold rewrites on the non-standalone subset, and are slightly better than gold on the standalone subset: human annotators were instructed to leave a query unchanged when classifying it as standalone, whereas the LoRA adapter may still perform some rewriting, which turns out to further improve recall.
Evaluation of answer generation
We evaluate answer generation quality using the top-k passages retrieved under each of the query rewrite strategies described above. We report results for k = 20, but similar trends hold for other values of k. We use Granite 3.2-8b Instruct as the answer generator, and RAGAS Faithfulness and the RAD-Bench score as answer quality metrics. We evaluate on the same three test sets as above.
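As a rough illustration of how the faithfulness score can be computed, the sketch below uses the open-source ragas package; the interface shown corresponds to ragas 0.1.x and may differ in other versions, the question, answer, and contexts are hypothetical placeholders, and RAD-Bench scoring is not shown. Note that ragas relies on an LLM judge under the hood (OpenAI by default), which must be configured separately.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

# Hypothetical evaluation sample: a rewritten query, the generated answer,
# and the top-k retrieved passages used as grounding contexts
samples = Dataset.from_dict({
    "question": ["Who is the CEO of Microsoft?"],
    "answer": ["Satya Nadella is the CEO of Microsoft."],
    "contexts": [["Satya Nadella has served as CEO of Microsoft since 2014."]],
})

# Faithfulness measures how well the generated answer is supported by the contexts
result = evaluate(samples, metrics=[faithfulness])
print(result)
```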
a. Evaluation of answer quality on full MT-RAG dataset.
| Strategy | RAGAS-F | RAD-Bench |
|---|---|---|
| No rewrite | 0.73 | 0.66 |
| Mixtral 8x7b rewrite | 0.80 | 0.68 |
| Granite 3.2-8b LoRA rewrite | 0.81 | 0.70 |
| Gold rewrite | 0.79 | 0.69 |
b. Evaluation of answer quality on the non-standalone subset of MT-RAG.
| Strategy | RAGAS-F | RAD-Bench |
|---|---|---|
| No rewrite | 0.61 | 0.62 |
| Mixtral 8x7b rewrite | 0.76 | 0.65 |
| Granite 3.2-8b LoRA rewrite | 0.79 | 0.69 |
| Gold rewrite | 0.80 | 0.69 |
c. Evaluation of answer quality on the standalone subset of MT-RAG.
| Strategy | RAGAS-F | RAD-Bench |
|---|---|---|
| No rewrite | 0.79 | 0.68 |
| Mixtral 8x7b rewrite | 0.82 | 0.70 |
| Granite 3.2-8b LoRA rewrite | 0.83 | 0.71 |
| Gold rewrite | 0.79 | 0.69 |
As with Recall, similar observations can be made here. On the full dataset, query rewrite with the Granite 3.2-8b LoRA adapter yields an 8 percentage point improvement in RAGAS Faithfulness and a 4 percentage point improvement in RAD-Bench score over the no-rewrite strategy. The improvement is more pronounced on the non-standalone subset, where the LoRA rewrites lead to an 18 percentage point gain in RAGAS Faithfulness and a 7 percentage point gain in RAD-Bench score.
Contact
Framework versions
- PEFT 0.14.0