Model Details

LoRA finetuned checkpoint of a meta-llama/Llama-3.2-3B-Instruct base model. This model can be loaded on an M3 Macbook Air with 16GB unified memory.

Model Description

This model assists users with searching for research papers. It assists in creating a query that is compatible with a search API. The model is finetuned to output structured markdown corresponding to the user query. This makes it possible to parse the output and construct a query for a search API.

Model Sources

Repository: https://github.com/shaikh58/llm-paper-retriever
Developed by: Mustafa Shaikh
Language(s) (NLP): English
License: MIT
Finetuned from model: meta-llama/Llama-3.2-3B-Instruct

Uses

This model is intended to be used with the MCP server released in the repository linked above. It is complete with search functionality and is integrated with Cursor.

How to Get Started with the Model

If you wish to use the model directly, rather than through Cursor, you can use the code below to load it.

from transformers import AutoModelForCausalLM
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",
    trust_remote_code=True,
    device_map="auto"
)

model = PeftModel.from_pretrained(
    base_model,
    "Shaikh58/llama-3.2-3b-instruct-lora-arxiv-query"
)

Training Details

Training Data

Input query	Label
"Find recent papers on transformer architectures in NLP published since 2023 with at least 100 citations"	`"## QUERY PARAMETERS\n\n- Topic: NLP\n\n## CONSTRAINTS\n\n- Citations: (>=, 100)\n- Keyword: transformers\n- Year: (>=, 2023)\n\n## OPTIONS\n\n- Limit: 10\n- Sort By: relevance\n- Sort Order: descending"`

During training, the input query is also augmented with a system prompt (not shown) to guide the model to output structured markdown.

Training Procedure

LoRA finetuned on 50,000 synthetically generated training data points.

Training Hyperparameters

Training regime:
fp16 mixed precision
LoRA: r = 16, alpha = 32, dropout = 0.05

Evaluation

Testing Data, Factors & Metrics

Testing Data

Same format as training data.

Metrics

The model was evaluated with the rouge metric. This is because the expected output is known in advance.

Results

Several versions of the model were evaluated, each with a different number of trianing samples used during fine tuning. The plots show that finetuning with as low as 1000 samples leads to a major improvement in model performance. Empirically, we see that the model trained on 50,000 samples performs better in production, even though the rouge score is similar to models trained on less data. This is because the rouge score does not penalize minor differences to the expected output. However, minor differences can lead to very different parsing of the output and query result.

Shaikh58
/

llama-3.2-3b-instruct-lora-arxiv-query