πŸ“ˆ MediMaven LambdaMART Learning-to-Rank (v1.1)

A gradient-boosted decision-tree ranker that fuses lexical, semantic, and structural signals into a single, final relevance score for our medical RAG pipeline.


πŸ’‘ Why this model?

Algorithm          LightGBM LambdaMART (lambdarank objective)
Features (15)      BM25 score, cosine similarity (BGE embeddings), cross-encoder score, passage length, section depth, URL authority, …
Training data      200 k synthetic triplets (query, positive, negative) auto-mined from the MediMaven dataset (WebMD, NHS, NIH)
Metric optimised   nDCG@10

πŸš€ Quick start

import lightgbm as lgb
import numpy as np
from huggingface_hub import hf_hub_download

# 1️⃣  load the model (the weights are a plain LightGBM text dump on the Hub)
model_path = hf_hub_download(
    repo_id="dranreb1660/medimaven-ltr-lambdamart",
    filename="ltr_lambdamart.txt",
)
booster = lgb.Booster(model_file=model_path)

# 2️⃣  prepare a feature matrix for a single query
#     each row must supply all 15 features, in training order
features = np.array([
    [8.7, 0.82, 0.75, 120, 2, 0.91, ...],   # candidate doc 1 (remaining features elided)
    [7.2, 0.67, 0.55, 300, 3, 0.80, ...],   # candidate doc 2 (remaining features elided)
])
scores = booster.predict(features)

# 3️⃣  sort passages by `scores` (higher = better)
best_idx = np.argsort(-scores)
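
To finish the re-rank, apply the sorted indices back to the candidate texts. A minimal sketch, where passages is a hypothetical list aligned row-for-row with features:

passages = ["passage text 1", "passage text 2"]   # hypothetical candidates, same order as `features`
reranked = [passages[i] for i in best_idx]        # best match first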

πŸ“Š Validation

Metric      BM25 only   BM25 β†’ Cross-Encoder   BM25 β†’ LambdaMART
nDCG@10     0.38        0.46                   0.55
Recall@20   0.71        0.81                   0.88

Evaluated on 1 k manually judged medical queries (Aug 2025).
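
For reference, per-query nDCG@10 can be reproduced with scikit-learn; a minimal sketch in which the judgments and scores are invented:

from sklearn.metrics import ndcg_score
import numpy as np

# graded relevance judgments vs. model scores for one query's candidates (made-up values)
y_true  = np.array([[3, 2, 0, 1, 0]])
y_score = np.array([[0.9, 0.7, 0.3, 0.5, 0.1]])
print(ndcg_score(y_true, y_score, k=10))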

πŸ—οΈ Training recipe

num_leaves:        255
learning_rate:     0.05
n_estimators:      800
min_data_in_leaf:  20
feature_fraction:  0.9
lambda_l1:         0.0
lambda_l2:         0.1
metric:            ndcg
ndcg_eval_at:      10

Hardware: 1 Γ— Intel Xeon 6258R, ~4 min training time.
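
A minimal reproduction of this recipe with LightGBM's native API looks roughly as follows. This is a sketch: X, y, and groups are random placeholders, whereas real training uses the 15-feature matrix and graded relevance labels grouped by query.

import lightgbm as lgb
import numpy as np

# placeholder data: 100 queries Γ— 10 candidates, 15 features each
X = np.random.rand(1000, 15)
y = np.random.randint(0, 3, size=1000)   # graded relevance labels
groups = [10] * 100                      # candidates per query (sums to len(X))

params = {
    "objective":        "lambdarank",
    "metric":           "ndcg",
    "ndcg_eval_at":     [10],
    "num_leaves":       255,
    "learning_rate":    0.05,
    "min_data_in_leaf": 20,
    "feature_fraction": 0.9,
    "lambda_l1":        0.0,
    "lambda_l2":        0.1,
}

train_set = lgb.Dataset(X, label=y, group=groups)
booster = lgb.train(params, train_set, num_boost_round=800)  # n_estimators above
booster.save_model("ltr_lambdamart.txt")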

✍️ Citation

@misc{medimaven2025ltr,
  title = {MediMaven LambdaMART LTR},
  author = {Kyei-Mensah, Bernard},
  year   = {2025},
  howpublished = {\url{https://huggingface.co/dranreb1660/medimaven-ltr-lambdamart}}
}