LoRA Adapter for Answerability Classification

Welcome to Granite Experiments!

Think of Experiments as a preview of what's to come. These projects are still under development, but we wanted to let the open-source community take them for a spin! Use them, break them, and help us build what's next for Granite – we'll keep an eye out for feedback and questions. Happy exploring!

Just a heads-up: Experiments are forever evolving, so we can't commit to ongoing support or guarantee performance.

Model Summary

This is a LoRA adapter for ibm-granite/granite-3.2-8b-instruct that is fine-tuned for the binary answerability classification task. The model takes as input a multi-turn conversation and a set of documents, and classifies whether the user's final query is answerable or unanswerable based on the information available in the documents.

Intended use

This is a LoRA adapter that enables answerability classification for the final user query in a multi-turn conversation, with respect to a set of provided documents. The model is trained to determine whether the last user query is answerable or unanswerable, based solely on the information present in the documents. This makes it suitable for applications involving RAG and document-grounded chatbots, where knowing whether sufficient information exists to answer a query is crucial. The classification output from the answerability model can be used in several downstream applications, including but not limited to:

  • Filter out unanswerable questions before sending them to generation in a RAG setting. By classifying a query as unanswerable upfront, the system can prevent hallucinated or misleading responses.
  • Re-query the retriever to get more relevant documents. If a query is initially deemed unanswerable, the retriever can be re-invoked with alternate formulations to fetch better-matching context (a minimal sketch of this gating pattern follows this list).
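
The snippet below is a minimal, hypothetical sketch of how the classifier's output could gate a RAG pipeline. The classify, generate, and retrieve_more callables are assumptions standing in for your own retrieval and generation components; classify is assumed to wrap the adapter call shown in the Quickstart below and return "answerable" or "unanswerable".

def answer_with_gating(conversation, documents, classify, generate, retrieve_more, max_retries=1):
    """Gate generation on the answerability classifier (illustrative sketch only)."""
    for attempt in range(max_retries + 1):
        # Only generate when the final user query is judged answerable
        # from the currently retrieved documents.
        if classify(conversation, documents) == "answerable":
            return generate(conversation, documents)
        if attempt < max_retries:
            # Otherwise, re-invoke retrieval with an alternate query formulation.
            documents = retrieve_more(conversation)
    # Abstain rather than risk a hallucinated answer.
    return "I don't have enough information in the provided documents to answer that."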

Model input: The input to the model is a list of conversational turns and a list of documents, converted to a string using the apply_chat_template function. The turns can alternate between the user and assistant roles, and the last turn is from the user. Each document in the list is a dictionary with a text field, which contains the text of the corresponding document.

To prompt the LoRA adapter to determine answerability, a special answerability role is used to trigger this capability of the model. The role includes the keyword "answerability": <|start_of_role|>answerability<|end_of_role|>

Model output: When prompted with the above input, the model generates answerable or unanswerable as its output.

Quickstart Example

Use the code below to get started with the model.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

ANSWERABILITY_PROMPT = "<|start_of_role|>answerability<|end_of_role|>"
BASE_NAME = "ibm-granite/granite-3.2-8b-instruct"
LORA_NAME = "ibm-granite/granite-3.2-8b-lora-rag-answerability-prediction"

# Load the tokenizer and base model, then attach the LoRA adapter.
tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side='left', trust_remote_code=True)
model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME, device_map="auto")
model_answerability = PeftModel.from_pretrained(model_base, LORA_NAME)

convo = [
    {"role": "user", "content": "name the viceroy at the time of quit india movement"}, 
    {"role": "assistant", "content": "The Viceroy at the time of the Quit India Movement was Lord Linlithgow."}, 
    {"role": "user", "content": "opposition leader for the movement"}, 
    {"role": "assistant", "content": "A movement was launched at the Bombay session of the All - India Congress Committee by Mahatma Gandhi on 8 August 1942.  Many students paid attention to Subhas Chandra Bose as a leader,  who was in exile. He organized the Indian Legion in Germany , reorganized the Indian National Army with Japanese assistance , and soliciting help from the Axis Powers , conducted a guerrilla war against the British authorities."}, 
    {"role": "user", "content": "who is Patel?"}, 
    {"role": "assistant", "content": "Vallabhbhai Patel, popularly known as Sardar Patel , was the first Deputy Prime Minister of India . He was an Indian barrister and statesman , a senior leader of the Indian National Congress and a founding father of the Republic of India who played a leading role in the country 's struggle for independence and guided its integration into a united , independent nation . He acted as de facto Supreme Commander - in - chief of Indian army during the political integration of India and the Indo - Pakistani War of 1947 . He is credited by historians with ensuring the success of the rebellion across India. He was arrested and was imprisoned with the entire Congress Working Committee from 1942 to 1945"},
    {"role": "user", "content": "how do you pronounce Vallabhbhai?"}, 
 ]


documents = [
    {'text': "Vallabhbhai Patel\nAmong Patel 's surviving family , Maniben Patel lived in a flat in Mumbai for the rest of her life following her father 's death ; she often led the work of the Sardar Patel Memorial Trust , which organises the prestigious annual Sardar Patel Memorial Lectures , and other charitable organisations . Dahyabhai Patel was a businessman who was elected to serve in the Lok Sabha ( the lower house of the Indian Parliament ) as an MP in the 1960s ."},
    {'text': "Vallabhbhai Patel\nPatel 's date of birth was never officially recorded ; Patel entered it as 31 October on his matriculation examination papers . He belonged to the Leuva Patel Patidar community of Central Gujarat , although the Leuva Patels and Kadava Patels have also claimed him as one of their own ."},
    {'text': "Vallabhbhai Patel\nIn April 2015 the Government of India declassified surveillance reports suggesting that Patel , while Home Minister , and Nehru were among officials involved in alleged government - authorised spying on the family of Subhas Chandra Bose ."}
]

# Prepend an empty system turn, render the conversation and documents with the
# chat template, then strip the rendered system-turn prefix from the result.
convo = [{"role": "system", "content": ""}] + convo

string = tokenizer.apply_chat_template(convo, documents=documents, tokenize=False, add_generation_prompt=False)
string_to_remove = tokenizer.apply_chat_template(convo[0:1], tokenize=False, add_generation_prompt=False)
string = string[len(string_to_remove):]

# Append the special answerability role to trigger the classification.
inputs = string + ANSWERABILITY_PROMPT

inputT = tokenizer(inputs, return_tensors="pt")

output = model_answerability.generate(inputT["input_ids"].to(device), attention_mask=inputT["attention_mask"].to(device), max_new_tokens=3)
output_text = tokenizer.decode(output[0])

# The text generated after the answerability prompt is either "answerable" or "unanswerable".
answer = output_text.split(ANSWERABILITY_PROMPT)[1]
print(answer)

Training Details

Training Data

The training data uses the publicly available Government corpus from MT-RAG as the source of documents. Based on this corpus, we constructed a dataset consisting of a mix of human-created and synthetically generated multi-turn conversations. It includes two types of examples: (1) Answerable queries, where the final user question can be answered based on the provided documents. These examples teach the adapter to recognize when sufficient information is present to support an answer. (2) Unanswerable queries, where the documents lack the necessary information to answer the final user query. We used Mixtral as an automatic judge to validate the answerability labels and filter out noisy samples.

Training Hyperparameters

The LoRA adapter was fine-tuned using PEFT under the following regime: rank = 32, learning rate = 5e-6, number of epochs = 25 with early stopping based on the validation set, and a 90/10 split between training and validation data.
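
For reference, the following is a rough sketch of a PEFT configuration consistent with these hyperparameters, not the exact training recipe; the target modules, LoRA alpha, and dropout values are assumptions.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Sketch of a LoRA setup matching the reported hyperparameters (rank 32,
# learning rate 5e-6, up to 25 epochs, early stopping on a 10% validation split).
base = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.2-8b-instruct")
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,                          # assumed
    lora_dropout=0.05,                      # assumed
    target_modules=["q_proj", "v_proj"],    # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
# Training would then run at learning rate 5e-6 for up to 25 epochs,
# with early stopping on the held-out 10% validation split.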

Evaluation

Answerability Classification

We evaluated the model against baselines on binary answerability classification using two separate benchmarks:

  • Single-turn Setting (SQUADRun Benchmark): In this setting, the user query and the supporting documents are provided. Our model was evaluated against standard baselines to measure its ability to determine whether a standalone question is answerable based on the document set.
| Model | Unanswerable Precision | Unanswerable Recall | Unanswerable F1 | Answerable Precision | Answerable Recall | Answerable F1 | Classification Accuracy | Weighted F1 |
|---|---|---|---|---|---|---|---|---|
| BigBird (pre-trained embeddings) w/ MLP | 49.2 | 68.5 | 57.3 | 48 | 29.2 | 36.3 | 48.9 | 46.8 |
| llama2-7b as classifier (Full SFT) | 72.2 | 71 | 71.6 | 71.4 | 72.6 | 72 | 71.8 | 71.8 |
| Granite 3.2-8b LoRA | 84.2 | 68 | 75.2 | 73.1 | 87.2 | 79.5 | 77.6 | 77.4 |
  • Multi-turn Setting (MT-RAG Benchmark): In this setting, the model is given the full multi-turn conversation history along with the supporting documents. This benchmark evaluates the model's ability to assess answerability when the final user query can also depend on prior turns for context.
| Model | Unanswerable Precision | Unanswerable Recall | Unanswerable F1 | Answerable Precision | Answerable Recall | Answerable F1 | Classification Accuracy | Weighted F1 |
|---|---|---|---|---|---|---|---|---|
| BigBird (pre-trained embeddings) w/ MLP | 69.6 | 77.6 | 73.4 | 70.1 | 60.8 | 65.2 | 69.8 | 69.6 |
| llama2-7b as classifier (Full SFT) | 86.9 | 89.4 | 88.2 | 87.3 | 84.5 | 85.9 | 87.1 | 87.1 |
| Granite 3.2-8b LoRA | 85.4 | 89.3 | 87.3 | 87 | 82.4 | 84.6 | 86.1 | 86.1 |

Comparing LoRA Adapter vs. Vanilla Granite for Answer Quality

We compare the performance of Granite 3.2-8b Instruct vs. the Granite 3.2-8b LoRA adapter on a subset of the MT-RAG Benchmark. In this setup, each query is paired with only 5 retrieved passages as context.

  • Answerability Classification Performance: The LoRA adapter outperforms the vanilla model in overall F1 on both answerables and unanswerables. The LoRA adapter achieves higher recall on unanswerable queries, making it better at identifying questions that should not be answered. However, this comes at the cost of lower recall on answerable queries.

  • The RAGAS Faithfulness score (on truly answerable queries) drops slightly with the LoRA adapter. This is not due to degraded generation quality, but rather because the model labels more truly answerable queries as unanswerable and abstains from answering.

  • Joint Answerability-Faithfulness Score computed as:

    = 1 (if model prediction = IDK/unanswerable ∩ ground truth = unanswerable)

    = RAGAS Faithfulness (if model prediction = non-IDK/answerable ∩ ground truth = answerable)

    = 0 (otherwise)

    This score rewards the model for correctly abstaining on unanswerable queries (full credit) and for providing faithful answers on answerable queries (partial credit based on RAGAS Faithfulness). No credit is given for incorrect or unfaithful predictions. A minimal code sketch of this scoring rule follows this list.
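
A minimal sketch of this scoring rule in code; the per-example faithfulness value is assumed to come from RAGAS.

def joint_score(predicted_unanswerable, ground_truth_unanswerable, faithfulness):
    # Full credit for correctly abstaining on an unanswerable query.
    if predicted_unanswerable and ground_truth_unanswerable:
        return 1.0
    # Partial credit (the RAGAS Faithfulness value) for answering an answerable query.
    if not predicted_unanswerable and not ground_truth_unanswerable:
        return faithfulness
    # No credit when the prediction disagrees with the ground truth.
    return 0.0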

The LoRA adapter achieves a 7% lift on this metric, reflecting that it more often abstains correctly on unanswerable queries while remaining faithful when it chooses to answer.

| Model | F1 Score Unanswerable | F1 Score Answerable | Recall Unanswerable | Recall Answerable | RAGAS Faithfulness (on Truly Answerable) | Joint Answerability-Faithfulness Score |
|---|---|---|---|---|---|---|
| Granite 3.2-8b Instruct | 14 | 76 | 8 | 97 | 75 | 50 |
| Granite 3.2-8b LoRA | 47 | 77 | 37 | 88 | 70 | 57 |

Model Card Authors

Vraj Shah

Framework versions

  • PEFT 0.14.0