Paper
LastingBench: Defend Benchmarks Against Knowledge Leakage.
Welcome to the repository for the research paper "LastingBench: Defend Benchmarks Against Knowledge Leakage." The project addresses a growing concern: large language models (LLMs) can "cheat" on standard Question Answering (QA) benchmarks by memorizing task-specific data. Such leakage undermines the validity of benchmark evaluations, which then reflect memorization rather than genuine model capability.
Project Overview
LastingBench introduces a novel framework designed to continuously reinforce and safeguard existing benchmarks against knowledge leakage. The project aims to:
- Detect knowledge leakage through context and question perturbation techniques
- Rewrite leaked content to counterfactual alternatives that disrupt memorization while preserving the benchmark's original evaluative intent
- Evaluate model responses to contextual evidence and reasoning patterns
- Provide practical solutions to ensure benchmark robustness over time, promoting fairer and more interpretable evaluations of LLMs
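The core detection idea is simple: a model that answers a context-dependent question correctly even after the supporting context is removed is likely relying on memorized benchmark data rather than the evidence. A minimal, illustrative sketch of this check (not part of the repository's code; the model name, prompt format, and matching rule below are placeholders):
# Illustrative sketch only: flag a question as likely leaked if the model
# answers it correctly both with and without the supporting context.
from openai import OpenAI  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment

client = OpenAI()

def answer(question: str, context: str = "") -> str:
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer briefly."
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=30,
    )
    return resp.choices[0].message.content.strip()

def likely_leaked(question: str, context: str, gold: str) -> bool:
    with_ctx = gold.lower() in answer(question, context).lower()
    without_ctx = gold.lower() in answer(question).lower()
    return with_ctx and without_ctx  # correct even without the evidence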
Installation
- Clone the repository:
git clone https://github.com/Seriousss/lastingbench
- Create and activate conda environment:
conda create -n lastingbench python=3.12
conda activate lastingbench
- Install dependencies:
pip install -r requirements.txt
- Set up environment variables:
export OPENAI_BASE_URL="your-api-base-url"
export OPENAI_API_KEY="your-api-key"
export CUDA_VISIBLE_DEVICES="0,1,2,3" # Adjust based on your GPU setup
Usage
LastingBench provides three main functionalities: Detection, Rewrite, and Training Comparison.
Detection
Detect knowledge leakage through various perturbation techniques.
1. Context Leakage Detection
Evaluate models using exact-match scoring on benchmark datasets:
# Using vLLM for most models
python -m detect.contextleakage --hf_model "Qwen/Qwen2.5-7B-Instruct" \
--dataset_subset "hotpotqa" --cuda_devices "0,1"
# Using Transformers for Qwen3 models
python -m detect.contextleakage --hf_model "Qwen/Qwen3-8B" \
--is_qwen3 --max_new_tokens 30
# Using the API-based variant for API-served models
python -m detect.contextleakage_api --model "deepseek-r1" --dataset_subset "hotpotqa"
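For reference, exact match is typically computed over normalized strings (lowercased, with punctuation and articles removed); the repository's own metric implementation may differ in details:
# Standard SQuAD/HotpotQA-style exact-match scoring (illustrative)
import re, string

def normalize(text: str) -> str:
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))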
2. Question Perturbation Detection
Rephrase questions to opposite meanings and test model consistency:
# Using OpenAI API
python -m detect.question_rephrase_answer_api \
--model_name "gpt-4o" --dataset_subset "2wikimqa" \
--rephrase_type "opposite" --sample_count 100
# Using local vLLM models
python -m detect.question_rephrase_answer_vllm \
--model_name "Qwen/Qwen2.5-7B-Instruct" --dataset_subset "hotpotqa" --rephrase_type "similar"
# Using Qwen3 with Transformers
python -m detect.question_rephrase_answer_qwen3 \
--model_name "Qwen/Qwen3-8B" --dataset_subset "2wikimqa"
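Under the hood, the "opposite" rephrase can be produced with a single chat-completion call. The prompt below is an illustrative assumption, not the repository's exact prompt:
# Sketch of the "opposite" rephrase step (prompt wording is an assumption)
from openai import OpenAI

client = OpenAI()

def rephrase_opposite(question: str, model: str = "gpt-4o") -> str:
    prompt = (
        "Rewrite the following question so that it asks for the opposite "
        f"of the original, keeping the same entities:\n{question}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# A model that gives the same answer to the original and the opposite question
# is likely recalling a memorized answer rather than reading the question.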
Rewrite
Generate counterfactual answers and rewrite leaked evidence to create robust benchmarks.
1. Evidence Finding and Counterfactual Rewriting Pipeline
Run the full evidence-finding and counterfactual-rewriting pipeline:
# Specify custom output file and dataset
python main_gpu.py --output custom_output.jsonl \
--dataset_subset "hotpotqa" --start_idx 0 --max_samples 100
Convert and merge JSONL files with question-answer mappings:
# Merge single mapping file with original dataset
python utils/convert.py original.jsonl revised.jsonl custom_output.jsonl
The original and revised datasets can be found in the data folder.
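Conceptually, the merge step pairs each original record with its revised answer by question. A minimal sketch, assuming one JSON object per line and hypothetical field names ("input" for the question, "answers" for the gold answers); check the files under data for the real schema:
# Illustrative merge of a revised-answer mapping into the original dataset
import json

def merge(original_path, revised_path, output_path):
    revised = {}
    with open(revised_path) as f:
        for line in f:
            rec = json.loads(line)
            revised[rec["input"]] = rec["answers"]
    with open(original_path) as fin, open(output_path, "w") as fout:
        for line in fin:
            rec = json.loads(line)
            rec["answers"] = revised.get(rec["input"], rec["answers"])
            fout.write(json.dumps(rec, ensure_ascii=False) + "\n")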
2. Random Answer Rewriting
Create random alternatives to disrupt memorization:
# Specify custom output file and dataset
python random_alternative_answer.py --output random_hotpot.jsonl \
--dataset_subset "hotpotqa" --start_idx 0 --max_samples 50
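Conceptually, the random rewrite swaps each gold answer for one sampled from a different question. A minimal sketch using the same hypothetical field names as above:
# Illustrative random-answer rewriting to disrupt memorization
import json, random

def randomize_answers(in_path, out_path, seed=0):
    rng = random.Random(seed)
    records = [json.loads(line) for line in open(in_path)]
    pool = [r["answers"] for r in records]
    with open(out_path, "w") as fout:
        for rec in records:
            candidates = [a for a in pool if a != rec["answers"]] or pool
            rec["answers"] = rng.choice(candidates)
            fout.write(json.dumps(rec, ensure_ascii=False) + "\n")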
Dataset Evaluations on Model Inference and Training
1. Model Inference Evaluation
Comprehensive evaluation on original and revised benchmarks:
# Transformers-based evaluation
python -m eval.evaluation -i data/hotpotqa.jsonl -model "Qwen/Qwen3-8B" -k 40 -t 0.5
# API-based evaluation
python -m eval.eval_with_api --input data/hotpotqa_antifact.jsonl \
--model "deepseek-r1" --max_tokens 30 --temperature 0.5
2. Model Training Evaluation
Compare training dynamics between original and rewritten datasets:
The training loss data can be found under the training_result folder.
To reproduce the figure in our paper:
python utils/draw.py training_result/training_loss_qwen38.csv training_result/training_loss_antifact_qwen38.csv \
--title "Original vs Rewritten Training Loss"
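If you prefer to plot the curves yourself instead of using utils/draw.py, a minimal sketch is below; it assumes each CSV exposes "step" and "loss" columns (check the files under training_result for the actual column names):
# Illustrative comparison plot of the two training-loss curves
import pandas as pd
import matplotlib.pyplot as plt

orig = pd.read_csv("training_result/training_loss_qwen38.csv")
rew = pd.read_csv("training_result/training_loss_antifact_qwen38.csv")
plt.plot(orig["step"], orig["loss"], label="original")
plt.plot(rew["step"], rew["loss"], label="rewritten")
plt.xlabel("step")
plt.ylabel("training loss")
plt.legend()
plt.title("Original vs Rewritten Training Loss")
plt.savefig("training_loss_comparison.png")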
π Utility Functions
Additional tools for analysis and metrics:
- Metrics Calculation: F1 scores, EM scores, and custom evaluation metrics
- Document Retrieval: BM25-based retrieval for evidence analysis
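For reference, the token-level F1 commonly used for such QA metrics looks like the following; the repository's metrics module may apply additional normalization:
# Standard token-level QA F1 (illustrative)
from collections import Counter
import re, string

def _norm_tokens(text: str):
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()

def f1_score(prediction: str, gold: str) -> float:
    pred, gold_t = _norm_tokens(prediction), _norm_tokens(gold)
    common = Counter(pred) & Counter(gold_t)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold_t)
    return 2 * precision * recall / (precision + recall)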
All scripts support various parameters for customization. Use --help
with any script to see available options.