LoRA Adapter for Answerability Classification
Welcome to Granite Experiments!
Think of Experiments as a preview of what's to come. These projects are still under development, but we wanted to let the open-source community take them for a spin! Use them, break them, and help us build what's next for Granite – we'll keep an eye out for feedback and questions. Happy exploring!
Just a heads-up: Experiments are forever evolving, so we can't commit to ongoing support or guarantee performance.
Model Summary
This is a LoRA adapter for ibm-granite/granite-3.2-8b-instruct that is fine-tuned for the binary answerability classification task. The model takes as input a multi-turn conversation and a set of documents, and classifies whether the user's final query is answerable or unanswerable based on the information available in the documents.
- Developer: IBM Research
- Model type: LoRA adapter for ibm-granite/granite-3.2-8b-instruct
- License: Apache 2.0
Intended use
This is a LoRA adapter that enables answerability classification for the final user query in a multi-turn conversation, with respect to a set of provided documents. The model is trained to determine whether the last user query is answerable or unanswerable, based solely on the information present in the documents. This makes it suitable for applications involving RAG and document-grounded chatbots, where knowing whether sufficient information exists to answer a query is crucial. The classification output from the answerability model can be used in several downstream applications, including but not limited to:
- Filtering out unanswerable questions before they reach the generation stage in a RAG setting. By classifying a query as unanswerable upfront, the system can prevent hallucinated or misleading responses.
- Re-querying the retriever to get more relevant documents. If a query is initially deemed unanswerable, the retriever can be re-invoked with alternate formulations to fetch more relevant documents (see the sketch after this list).
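As a rough illustration of both use cases, the sketch below gates a simple RAG loop on the classifier's decision. The helpers `retrieve`, `generate`, `rewrite_query`, and `is_answerable` are hypothetical placeholders for your own pipeline components, not part of this adapter's API; `is_answerable` stands in for a call to this adapter as shown in the Quickstart section.

```python
# Hypothetical RAG gating loop. `retrieve`, `generate`, `rewrite_query`, and
# `is_answerable` are placeholders for your own pipeline components;
# `is_answerable` would call this LoRA adapter as shown in the Quickstart.
def answer_with_gating(conversation, max_retries=1):
    query = conversation[-1]["content"]
    documents = retrieve(query)

    for _ in range(max_retries + 1):
        if is_answerable(conversation, documents):
            # Sufficient grounding found: produce a document-grounded answer.
            return generate(conversation, documents)
        # Deemed unanswerable: reformulate the query and retrieve again.
        query = rewrite_query(conversation, query)
        documents = retrieve(query)

    return "I don't have enough information in the provided documents to answer this."
```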
Model input: The input to the model is a list of conversational turns and a list of documents, converted to a string using the `apply_chat_template` function. The turns can alternate between the `user` and `assistant` roles, and the last turn must come from the `user`. Each document in the list is a dictionary with a `text` field, which contains the text of the corresponding document.
To prompt the LoRA adapter to determine answerability, a special answerability role is used to trigger this capability of the model. The role includes the keyword "answerability": <|start_of_role|>answerability<|end_of_role|>
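For orientation, the string passed to the model ends with this answerability role appended after the rendered conversation. The layout below is only a rough schematic; the exact rendering of the turns and documents (including where the documents appear) is produced by the tokenizer's chat template and may differ.

```python
# Rough schematic of the final prompt (not the exact chat-template output).
schematic_prompt = (
    "<|start_of_role|>user<|end_of_role|> ... <|end_of_text|>\n"
    "<|start_of_role|>assistant<|end_of_role|> ... <|end_of_text|>\n"
    "<|start_of_role|>user<|end_of_role|>final user query<|end_of_text|>\n"
    "<|start_of_role|>answerability<|end_of_role|>"  # triggers the classifier
)
```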
Model output: When prompted with the above input, the model generates answerable or unanswerable as its output.
Quickstart Example
Use the code below to get started with the model.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

ANSWERABILITY_PROMPT = "<|start_of_role|>answerability<|end_of_role|>"
BASE_NAME = "ibm-granite/granite-3.2-8b-instruct"
LORA_NAME = "ibm-granite/granite-3.2-8b-lora-rag-answerability-prediction"

# Load the base model and attach the answerability LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side='left', trust_remote_code=True)
model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME, device_map="auto")
model_answerability = PeftModel.from_pretrained(model_base, LORA_NAME)

# Multi-turn conversation; the last turn is the user query to classify.
convo = [
    {"role": "user", "content": "name the viceroy at the time of quit india movement"},
    {"role": "assistant", "content": "The Viceroy at the time of the Quit India Movement was Lord Linlithgow."},
    {"role": "user", "content": "opposition leader for the movement"},
    {"role": "assistant", "content": "A movement was launched at the Bombay session of the All - India Congress Committee by Mahatma Gandhi on 8 August 1942. Many students paid attention to Subhas Chandra Bose as a leader, who was in exile. He organized the Indian Legion in Germany , reorganized the Indian National Army with Japanese assistance , and soliciting help from the Axis Powers , conducted a guerrilla war against the British authorities."},
    {"role": "user", "content": "who is Patel?"},
    {"role": "assistant", "content": "Vallabhbhai Patel, popularly known as Sardar Patel , was the first Deputy Prime Minister of India . He was an Indian barrister and statesman , a senior leader of the Indian National Congress and a founding father of the Republic of India who played a leading role in the country 's struggle for independence and guided its integration into a united , independent nation . He acted as de facto Supreme Commander - in - chief of Indian army during the political integration of India and the Indo - Pakistani War of 1947 . He is credited by historians with ensuring the success of the rebellion across India. He was arrested and was imprisoned with the entire Congress Working Committee from 1942 to 1945"},
    {"role": "user", "content": "how do you pronounce Vallabhbhai?"},
]

# Grounding documents; each entry is a dictionary with a "text" field.
documents = [
    {'text': "Vallabhbhai Patel\nAmong Patel 's surviving family , Maniben Patel lived in a flat in Mumbai for the rest of her life following her father 's death ; she often led the work of the Sardar Patel Memorial Trust , which organises the prestigious annual Sardar Patel Memorial Lectures , and other charitable organisations . Dahyabhai Patel was a businessman who was elected to serve in the Lok Sabha ( the lower house of the Indian Parliament ) as an MP in the 1960s ."},
    {'text': "Vallabhbhai Patel\nPatel 's date of birth was never officially recorded ; Patel entered it as 31 October on his matriculation examination papers . He belonged to the Leuva Patel Patidar community of Central Gujarat , although the Leuva Patels and Kadava Patels have also claimed him as one of their own ."},
    {'text': "Vallabhbhai Patel\nIn April 2015 the Government of India declassified surveillance reports suggesting that Patel , while Home Minister , and Nehru were among officials involved in alleged government - authorised spying on the family of Subhas Chandra Bose ."}
]

# Prepend an empty system turn, render the conversation and documents with the
# chat template, then strip the standalone rendering of that system turn from
# the front of the prompt so no default system prompt is inserted.
convo = [{"role": "system", "content": ""}] + convo
string = tokenizer.apply_chat_template(convo, documents=documents, tokenize=False, add_generation_prompt=False)
string_to_remove = tokenizer.apply_chat_template(convo[0:1], tokenize=False, add_generation_prompt=False)
string = string[len(string_to_remove):]

# Append the answerability role to trigger the classification capability.
inputs = string + ANSWERABILITY_PROMPT

# Generate the label: "answerable" or "unanswerable".
inputT = tokenizer(inputs, return_tensors="pt")
output = model_answerability.generate(
    inputT["input_ids"].to(device),
    attention_mask=inputT["attention_mask"].to(device),
    max_new_tokens=3,
)
output_text = tokenizer.decode(output[0])
answer = output_text.split(ANSWERABILITY_PROMPT)[1]
print(answer)
```
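For repeated use, the same steps can be wrapped in a small helper that reuses the tokenizer and models loaded above. This is a minimal sketch, not part of the released assets; it decodes only the newly generated tokens so the returned label is free of special tokens. For the example conversation above, the expected label should be unanswerable, since none of the documents discuss pronunciation.

```python
def classify_answerability(conversation, documents, max_new_tokens=3):
    """Classify the last user turn as "answerable" or "unanswerable"."""
    # Prepend an empty system turn and strip its rendering, as in the quickstart.
    turns = [t for t in conversation if t["role"] != "system"]
    convo = [{"role": "system", "content": ""}] + turns
    rendered = tokenizer.apply_chat_template(
        convo, documents=documents, tokenize=False, add_generation_prompt=False
    )
    prefix = tokenizer.apply_chat_template(convo[0:1], tokenize=False, add_generation_prompt=False)
    prompt = rendered[len(prefix):] + ANSWERABILITY_PROMPT

    encoded = tokenizer(prompt, return_tensors="pt")
    generated = model_answerability.generate(
        encoded["input_ids"].to(device),
        attention_mask=encoded["attention_mask"].to(device),
        max_new_tokens=max_new_tokens,
    )
    # Decode only the tokens generated after the prompt.
    new_tokens = generated[0][encoded["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

print(classify_answerability(convo, documents))  # should print "unanswerable" for the example above
```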
Training Details
Training Data
The training data uses the publicly available Government corpus from MT-RAG as the source of documents. Based on this corpus, we constructed a dataset consisting of a mix of human-created and synthetically generated multi-turn conversations. It includes two types of examples: (1) Answerable queries, where the final user question can be answered based on the provided documents. These examples teach the adapter to recognize when sufficient information is present to support an answer. (2) Unanswerable queries, where the documents lack the necessary information to answer the final user query. We used Mixtral as an automatic judge to validate the answerability labels and filter out noisy samples.
Training Hyperparameters
The LoRA adapter was fine-tuned using PEFT under the following regime: rank = 32, learning rate = 5e-6, up to 25 epochs with early stopping based on a validation set, and a 90/10 split between training and validation data.
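As a rough sketch, a comparable setup could be expressed with PEFT and the Hugging Face Trainer as below. Only the rank, learning rate, epoch count, early stopping, and the existence of a train/validation split are taken from this card; the target modules, LoRA alpha/dropout, batch size, dataset objects, and all other settings are illustrative assumptions, not the actual training recipe.

```python
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, TrainingArguments,
                          Trainer, EarlyStoppingCallback)

base = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.2-8b-instruct")

# Rank from the card; alpha, dropout, and target modules are assumptions.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

args = TrainingArguments(
    output_dir="answerability-lora",
    learning_rate=5e-6,              # from the card
    num_train_epochs=25,             # upper bound; early stopping may halt sooner
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,     # required for early stopping
    metric_for_best_model="eval_loss",
    per_device_train_batch_size=1,   # illustrative
)

# train_dataset / eval_dataset are placeholders for the 90/10 split described above.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # patience is an assumption
)
trainer.train()
```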
Evaluation
Answerability Classification
We evaluated the model against baselines on binary answerability classification using two separate benchmarks:
- Single-turn Setting (SQUADRun Benchmark): In this setting, the user query and the supporting documents are provided. Our model was evaluated against standard baselines to measure its ability to determine whether a standalone question is answerable based on the document set.
| Model | Unanswerable Precision | Unanswerable Recall | Unanswerable F1 | Answerable Precision | Answerable Recall | Answerable F1 | Classification Accuracy | Weighted F1 |
|---|---|---|---|---|---|---|---|---|
| BigBird (pre-trained embeddings) w/ MLP | 49.2 | 68.5 | 57.3 | 48 | 29.2 | 36.3 | 48.9 | 46.8 |
| llama2-7b as classifier (Full SFT) | 72.2 | 71 | 71.6 | 71.4 | 72.6 | 72 | 71.8 | 71.8 |
| Granite 3.2-8b LoRA | 84.2 | 68 | 75.2 | 73.1 | 87.2 | 79.5 | 77.6 | 77.4 |
- Multi-turn Setting (MT-RAG Benchmark): In this setting, the model is given the full multi-turn conversation history along with the supporting documents. This benchmark evaluates the model's ability to assess answerability when the final user query can also depend on prior turns for context.
| Model | Unanswerable Precision | Unanswerable Recall | Unanswerable F1 | Answerable Precision | Answerable Recall | Answerable F1 | Classification Accuracy | Weighted F1 Score |
|---|---|---|---|---|---|---|---|---|
| BigBird (pre-trained embeddings) w/ MLP | 69.6 | 77.6 | 73.4 | 70.1 | 60.8 | 65.2 | 69.8 | 69.6 |
| llama2-7b as classifier (Full SFT) | 86.9 | 89.4 | 88.2 | 87.3 | 84.5 | 85.9 | 87.1 | 87.1 |
| Granite 3.2-8b LoRA | 85.4 | 89.3 | 87.3 | 87 | 82.4 | 84.6 | 86.1 | 86.1 |
Comparing LoRA Adapter vs. Vanilla Granite for Answer Quality
We compare the performance of Granite 3.2-8b Instruct vs. the Granite 3.2-8b LoRA adapter on a subset of the MT-RAG Benchmark. In this setup, each query is paired with only 5 retrieved passages as context.
Answerability Classification Performance: The LoRA adapter outperforms the vanilla model in overall F1 on both answerable and unanswerable queries. The LoRA adapter achieves higher recall on unanswerable queries, making it better at identifying questions that should not be answered. However, this comes at the cost of lower recall on answerable queries.
The RAGAS Faithfulness score (on truly answerable queries) drops slightly with the LoRA adapter. This is not due to degraded generation quality, but rather because the model labels more truly answerable queries as unanswerable and abstains from answering.
Joint Answerability-Faithfulness Score, computed per query as:
- 1, if the model prediction is IDK/unanswerable and the ground truth is unanswerable
- RAGAS Faithfulness, if the model prediction is non-IDK/answerable and the ground truth is answerable
- 0, otherwise
This score rewards the model for correctly abstaining on unanswerable queries (full credit) and for providing faithful answers on answerable queries (partial credit based on RAGAS Faithfulness). No credit is given for incorrect or unfaithful predictions.
The LoRA adapter achieves a 7-point lift on this metric (57 vs. 50), rewarding the model for correctly abstaining on unanswerable queries and for being faithful when it chooses to answer.
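For reference, a minimal sketch of the per-query computation, assuming predictions and ground-truth labels are normalized to "answerable"/"unanswerable" (with IDK responses counted as unanswerable) and a RAGAS Faithfulness score in [0, 1] is available for generated answers:

```python
def joint_answerability_faithfulness(prediction, ground_truth, ragas_faithfulness=0.0):
    """Per-query joint score as defined above.

    prediction / ground_truth: "answerable" or "unanswerable" (IDK counts as unanswerable).
    ragas_faithfulness: RAGAS Faithfulness in [0, 1] for the generated answer.
    """
    if prediction == "unanswerable" and ground_truth == "unanswerable":
        return 1.0                 # full credit for correctly abstaining
    if prediction == "answerable" and ground_truth == "answerable":
        return ragas_faithfulness  # partial credit for faithful answers
    return 0.0                     # no credit otherwise

# The reported score is presumably the mean of this value over all evaluated queries.
```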
| Model | F1 Score Unanswerable | F1 Score Answerable | Recall Unanswerable | Recall Answerable | RAGAS Faithfulness (on Truly Answerable) | Joint Answerability-Faithfulness Score |
|---|---|---|---|---|---|---|
| Granite 3.2-8b Instruct | 14 | 76 | 8 | 97 | 75 | 50 |
| Granite 3.2-8b LoRA | 47 | 77 | 37 | 88 | 70 | 57 |
Model Card Authors
Framework versions
- PEFT 0.14.0