
Intrinsics for Answer Relevance Rewriter

Model Summary

This is a RAG-specific intrinsic for the answer relevance rewrite task. The model takes as input the chat completion output of the answer relevance classifier, consisting of the conversation as well as the answer_relevance_classification, together with the grounding documents, and produces a rewritten assistant response that is more relevant to the user's final inquiry.

We provide two variants of this intrinsic, implemented as LoRA and aLoRA adapters, trained over Granite-3.3-2b-instruct and Granite-3.3-8b-instruct.

Intended use

This RAG-specific intrinsic is intended to post-process the generated assistant response. It should be used after the answer relevance classifier intrinsic and applied to cases where the answer_relevance_likelihood is below a threshold chosen according to application criteria.

For cases where the assistant answer is deemed not relevant (i.e., where answer_relevance_likelihood is below the chosen threshold), the answer relevance rewriter intrinsic can be used to rewrite the assistant response into a more relevant one. It takes as input the chat completion output of the answer relevance classifier and the grounding documents. Its output is of the form

{
    answer_relevance_rewrite: <Rewritten response>
}

The rewriter is instructed to correct only the deficiencies in relevance identified by the classifier and to ensure that the rewritten response is grounded in the conversation and the given documents.
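
For illustration, the following minimal sketch shows how an application might gate the rewriter on the classifier's answer_relevance_likelihood. The threshold value and the helper function are hypothetical, chosen here for illustration, and are not part of granite-common:

    import json

    # Application-specific threshold; the value below is a placeholder
    ANSWER_RELEVANCE_THRESHOLD = 0.5

    def needs_rewrite(classifier_json: str) -> bool:
        """Return True when the classifier deems the response insufficiently relevant."""
        classification = json.loads(classifier_json)
        return classification["answer_relevance_likelihood"] < ANSWER_RELEVANCE_THRESHOLD

    # Invoke the answer relevance rewriter intrinsic only when needs_rewrite(...) returns True;
    # otherwise keep the original assistant response.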

Model input: The input to the answer relevance rewriter intrinsic is an OpenAI-compatible chat completion request containing a list of conversation turns that can alternate between the user and assistant roles and ends with an assistant turn, plus two additional turns carrying the classifier result:

  • A conversation between user and assistant ending with an assistant response
  • An additional user turn with content "answer_relevance"
  • An additional assistant turn whose content is the JSON output of the answer relevance classifier

The grounding documents are supplied in the documents field of the request (passed via extra_body in the examples below).

Model output: The output of the answer relevance rewriter intrinsic is the result of the original chat completion request, formatted as a JSON object with the following schema:

{
    answer_relevance_rewrite: <Rewritten response>
}

Please see the code snippets in the Quickstart Example section below for examples that illustrate the intrinsic's input/output.

Quickstart Example

To run the answer relevance rewriter intrinsic through granite-common, you can either (a) use an OpenAI-compatible inference backend, such as vLLM, or (b) use the Hugging Face transformers library. Instructions for each of the two approaches are provided below. Note that running inference using vLLM or another scalable OpenAI-compatible inference backend should be significantly faster than using the Hugging Face transformers library directly.

Using an OpenAI-Compatible Inference Backend

To run the intrinsic using an OpenAI-compatible inference backend, such as vLLM, follow the steps below.

  1. Install the granite-common library:

    pip install git+https://github.com/ibm-granite/granite-common.git
    pip install granite_common[nltk]
    
  2. Install the Hugging Face CLI:

    pip install -U "huggingface_hub[cli]"
    
  3. Install vLLM:

    pip install vllm
    
  4. Download the intrinsics library:

    hf download ibm-granite/rag-intrinsics-lib --local-dir ./rag-intrinsics-lib
    
  5. Edit the vLLM startup script found in ./rag-intrinsics-lib/run_vllm.sh using your favorite editor:

    Edit the constants BASE_MODEL_NAME and BASE_MODEL_ORG depending on the base model on which the desired LoRA adapter has been trained. Optionally, edit the constant PORT to change the port on which vLLM will run. Save the modified file and exit the editor.

  6. Start vLLM through the startup script. The first time you run the script, you may have to change the permissions to allow execution:

    cd rag-intrinsics-lib
    chmod u+x ./run_vllm.sh
    ./run_vllm.sh &
    
  7. Run the following code snippet:

    import json
    import openai
    import granite_common
    
    intrinsic_name = "answer_relevance_rewriter"
    
    # Change the following constant to select a different base model
    base_model_name = "granite-3.3-8b-instruct"
    
    # Change the following constants as needed to reflect the location of the vLLM server
    # The selected port should be identical to the one you specified in the vLLM startup script
    openai_base_url = "http://localhost:55555/v1"
    openai_api_key = "rag_intrinsics_1234"
    
    # Fetch IO configuration file from Hugging Face Hub
    io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
        intrinsic_name, base_model_name
    )
    
    # Instantiate input/output processors
    rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
    result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)
    
    # Sample request
    request_json = {
        "messages": [
            {
            "role": "user",
            "content": "Who attended the meeting?"
            },
            {
            "role": "assistant",
            "content": "Many people attended the meeting."
            },
            {
            "role": "user",
            "content": "answer_relevance"
            },
            {
            "role": "assistant",
            "content": "{\"answer_relevance_analysis\": \"The inquiry asks for the attendees of the meeting. The response provides a vague and non-specific answer that does not address the inquiry.\", \"answer_relevance_category\": \"No attempt\", \"answer_relevance_likelihood\": 0.0}"
            }
        ],
        "extra_body": {
            "documents": [
            {
                "doc_id": "1",
                "text": "Meeting attendees: Alice, Bob, Carol."
            },
            {
                "doc_id": "2",
                "text": "Meeting time: 9:00 am to 11:00 am."
            }
            ]
        }
     }
    
    # Add other parameters
    request_json["model"] = intrinsic_name
    request_json["temperature"] = 0.0
    
    # Apply input processor
    intrinsic_kwargs = {}
    rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)
    
    # Run inference
    client = openai.OpenAI(base_url=openai_base_url, api_key=openai_api_key)
    chat_completion = client.chat.completions.create(**rewritten_request.model_dump())
    
    # Apply output processor
    processed_chat_completion = result_processor.transform(
        chat_completion, rewritten_request
    )
    
    # Verify that the content of the completion is valid JSON and pretty-print the JSON.
    parsed_contents = json.loads(processed_chat_completion.choices[0].message.content)
    print("JSON output:")
    print(json.dumps(parsed_contents, indent=2))
    
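    If the request succeeds, the printed JSON should resemble the following. The exact wording of the rewritten response will vary; this example is illustrative only:

    {
      "answer_relevance_rewrite": "Alice, Bob, and Carol attended the meeting."
    }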

Using the Hugging Face Transformers Library

To run the intrinsic using the Hugging Face transformers library directly, follow the steps below.

  1. Install the granite-common library:

    pip install git+https://github.com/ibm-granite/granite-common.git
    pip install granite_common[nltk]
    
  2. Install the Hugging Face CLI:

    pip install -U "huggingface_hub[cli]"
    
  3. Install PEFT:

    pip install peft
    
  4. Install xgrammar:

    pip install xgrammar
    
  5. Run the following code snippet:

    import json
    import granite_common.util
    import peft
    
    intrinsic_name = "answer_relevance_rewriter"
    
    # Change the following constant to select a different base model
    base_model_name = "granite-3.3-8b-instruct"
    
    use_cuda = True  # Set to False to use default PyTorch device for this machine + model
    
    # Fetch IO configuration file from Hugging Face Hub
    io_yaml_file = granite_common.intrinsics.util.obtain_io_yaml(
        intrinsic_name, base_model_name
    )
    
    # Fetch LoRA directory from Hugging Face Hub
    lora_dir = granite_common.intrinsics.util.obtain_lora(
        intrinsic_name, base_model_name
    )
    
    # Instantiate input/output processors
    rewriter = granite_common.IntrinsicsRewriter(config_file=io_yaml_file)
    result_processor = granite_common.IntrinsicsResultProcessor(config_file=io_yaml_file)
    
    # Sample request
    request_json = {
        "messages": [
            {
            "role": "user",
            "content": "Who attended the meeting?"
            },
            {
            "role": "assistant",
            "content": "Many people attended the meeting."
            },
            {
            "role": "user",
            "content": "answer_relevance"
            },
            {
            "role": "assistant",
            "content": "{\"answer_relevance_analysis\": \"The inquiry asks for the attendees of the meeting. The response provides a vague and non-specific answer that does not address the inquiry.\", \"answer_relevance_category\": \"No attempt\", \"answer_relevance_likelihood\": 0.0}"
            }
        ],
        "extra_body": {
            "documents": [
            {
                "doc_id": "1",
                "text": "Meeting attendees: Alice, Bob, Carol."
            },
            {
                "doc_id": "2",
                "text": "Meeting time: 9:00 am to 11:00 am."
            }
            ]
        }
     }
    
    # Add additional parameters
    request_json["model"] = intrinsic_name
    request_json["temperature"] = 0.0
    
    # Apply input processor
    intrinsic_kwargs = {}
    rewritten_request = rewriter.transform(request_json, **intrinsic_kwargs)
    
    # Load the base model and merge LoRA weights
    model, tokenizer = granite_common.util.load_transformers_lora(lora_dir)
    if use_cuda:
        model = model.cuda()
    
    # Convert the chat completion request into the Transformers library's proprietary
    # format.
    generate_input, other_input = (
        granite_common.util.chat_completion_request_to_transformers_inputs(
            rewritten_request,
            tokenizer,
            model,
        )
    )
    
    # Use the Transformers library's APIs to generate one or more completions,
    # then convert those completions into OpenAI-compatible chat completion responses.
    responses = granite_common.util.generate_with_transformers(
        tokenizer, model, generate_input, other_input
    )
    
    # Apply output processor
    transformed_responses = result_processor.transform(responses, rewritten_request)
    
    # Verify that the content of the completion is valid JSON and pretty-print the JSON.
    parsed_contents = json.loads(transformed_responses.choices[0].message.content)
    print("JSON output:")
    print(json.dumps(parsed_contents, indent=2))
    

Training Details

Training Data

The training data was created using the following process:

  1. Take the synthetic rag-data-granite dataset, consisting of conversations between a user and an assistant.
  2. Replace the assistant response with one generated by running granite-3.2-intrinsics at temperature 1.0.
  3. Produce the answer_relevance_rewriter target output using mixtral-large with prompts containing in-context examples. The conversations created in steps 1 and 2 are used as training input; the JSON string from step 3 is used as the training target output.

Training Hyperparameters

The LoRA adapter was fine-tuned using PEFT under the following regime: rank = 32, learning rate = 1.0e-04, number of epochs = 5.
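
For reference, a PEFT setup matching these hyperparameters might look like the sketch below. Only the rank, learning rate, and epoch count come from this card; the LoRA alpha, target modules, batch size, and output directory are assumptions for illustration, not the actual training recipe:

    from peft import LoraConfig
    from transformers import TrainingArguments

    # Rank from this card; alpha and target modules are assumptions
    lora_config = LoraConfig(
        r=32,
        lora_alpha=32,                        # assumption
        target_modules=["q_proj", "v_proj"],  # assumption
        task_type="CAUSAL_LM",
    )

    # Learning rate and epoch count from this card; the remaining settings are assumptions
    training_args = TrainingArguments(
        output_dir="./answer_relevance_rewriter_lora",  # assumption
        learning_rate=1.0e-4,
        num_train_epochs=5,
        per_device_train_batch_size=8,                  # assumption
    )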

Evaluation

Answer Relevance Rewriter

We evaluated the model on a test set generated by the same procedure as the training data, using GPT-4o as judge.

The following table presents results comparing baselines and frontier models on the answer relevance rewrite task. The data set consists of cases classified as irrelevant by mixtral-large. The evaluation first splits these cases into two parts: those that are truly irrelevant, for which we measure the rate at which the rewrite becomes relevant, and those that are falsely classified as irrelevant, for which we measure the rate at which the rewrite becomes irrelevant. From these, the overall rates of flipping irrelevant to relevant and flipping relevant to irrelevant are calculated, as well as the net gain in relevance and the resulting final relevance.
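
As a minimal sketch of how the derived columns relate to the per-part flip rates, assuming the fraction of truly irrelevant cases in the test set is known (all numbers below are placeholders, not evaluation results):

    # Placeholder inputs, not actual evaluation numbers
    p_truly_irrelevant = 0.7      # fraction of evaluated cases that are truly irrelevant (assumption)
    flip_to_relevant = 0.90       # rate measured on truly irrelevant cases
    flip_to_irrelevant = 0.05     # rate measured on falsely irrelevant cases

    overall_flip_to_relevant = p_truly_irrelevant * flip_to_relevant
    overall_flip_to_irrelevant = (1 - p_truly_irrelevant) * flip_to_irrelevant
    net_gain = overall_flip_to_relevant - overall_flip_to_irrelevant
    # Cases that were already relevant, plus the net gain from rewriting
    resulting_relevance = (1 - p_truly_irrelevant) + net_gain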

The LoRA adapters outperform the best of the frontier models.

| Model | True irrelevant flip to relevant | False irrelevant flip to irrelevant | Overall flip irrelevant to relevant | Overall flip relevant to irrelevant | Net gain | Resulting relevance |
|---|---|---|---|---|---|---|
| mixtral-8x22b-v0.1 | 0.416 | 0.101 | 0.286 | 0.032 | 0.254 | 0.566 |
| llama-3.3-70b | 0.804 | 0.041 | 0.554 | 0.013 | 0.541 | 0.853 |
| gpt-oss-20b | 0.902 | 0.034 | 0.621 | 0.011 | 0.610 | 0.922 |
| gpt-4o | 0.960 | 0.014 | 0.661 | 0.004 | 0.657 | 0.968 |
| gpt-4o-mini | 0.758 | 0.027 | 0.522 | 0.008 | 0.514 | 0.825 |
| granite-3.3-2b/lora | 0.972 | 0.027 | 0.669 | 0.008 | 0.661 | 0.973 |
| granite-3.3-2b/alora | 0.972 | 0.007 | 0.669 | 0.002 | 0.667 | 0.979 |
| granite-3.3-8b/lora | 0.969 | 0.014 | 0.667 | 0.004 | 0.663 | 0.975 |
| granite-3.3-8b/alora | 0.966 | 0.027 | 0.665 | 0.008 | 0.657 | 0.968 |

Comparing the Answer Relevance Rewriter Intrinsics vs. Vanilla Granite Models

We compare the performance of the vanilla Granite-3.3-2b-instruct and Granite-3.3-8b-instruct models against the answer relevance rewriter intrinsics implemented as LoRA adapters. The LoRA adapters significantly outperform the base models.

| Model | True irrelevant flip to relevant | False irrelevant flip to irrelevant | Overall flip irrelevant to relevant | Overall flip relevant to irrelevant | Net gain | Resulting relevance |
|---|---|---|---|---|---|---|
| granite-3.3-2b | 0.346 | 0.169 | 0.238 | 0.053 | 0.185 | 0.497 |
| granite-3.3-2b/lora | 0.972 | 0.027 | 0.669 | 0.008 | 0.661 | 0.973 |
| granite-3.3-2b/alora | 0.972 | 0.007 | 0.669 | 0.002 | 0.667 | 0.979 |
| granite-3.3-8b | 0.266 | 0.277 | 0.183 | 0.086 | 0.097 | 0.408 |
| granite-3.3-8b/lora | 0.969 | 0.014 | 0.667 | 0.004 | 0.663 | 0.975 |
| granite-3.3-8b/alora | 0.966 | 0.027 | 0.665 | 0.008 | 0.657 | 0.968 |

Model Card Authors

Huaiyu Zhu

Framework versions

  • PEFT 0.14.0