Some questions about the results in Table 5

#17
by begonie - opened

I am trying to reproduce the evaluation metrics reported for gte-multilingual-reranker-base.

Retrieval model: gte-multilingual-base
Reranker model: gte-multilingual-reranker-base
Dataset: MLDR (nDCG@10, 13 languages)

CMD:

python -m FlagEmbedding.evaluation.mldr \
    --eval_name mldr \
    --dataset_dir ./mldr/data \
    --dataset_names ar de en es fr hi it ja ko pt ru th zh \
    --splits test \
    --corpus_embd_save_dir ./mldr/corpus_embd \
    --output_dir ./mldr/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --overwrite False \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./mldr/mldr_eval_results.md \
    --eval_metrics ndcg_at_10 \
    --embedder_name_or_path Alibaba-NLP/gte-multilingual-base \
    --reranker_name_or_path Alibaba-NLP/gte-multilingual-reranker-base \
    --embedder_passage_max_length 8192 \
    --reranker_max_length 8192 \
    --trust_remote_code True \
    --embedder_batch_size 64 \
    --reranker_batch_size 64
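
As a sanity check on the reranker itself (outside the evaluation harness), I also scored a few query–passage pairs directly. This is only a minimal sketch following the usage pattern from the model card; the pairs below are made up for illustration, and max_length mirrors --reranker_max_length above.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Alibaba-NLP/gte-multilingual-reranker-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# The checkpoint ships custom modeling code, hence trust_remote_code=True
# (mirroring --trust_remote_code True in the command above).
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, trust_remote_code=True
)
model.eval()

# Made-up (query, passage) pairs, purely for illustration.
pairs = [
    ["what is the capital of France?", "Paris is the capital and largest city of France."],
    ["what is the capital of France?", "Quicksort is a divide-and-conquer sorting algorithm."],
]

with torch.no_grad():
    inputs = tokenizer(
        pairs, padding=True, truncation=True,
        return_tensors="pt", max_length=8192,  # matches --reranker_max_length
    )
    # One relevance logit per pair; a higher score means "more relevant".
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
    print(scores)
```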

Result:

| Model | Reranker | average | ar-test | de-test | en-test | es-test | fr-test | hi-test | it-test | ja-test | ko-test | pt-test | ru-test | th-test | zh-test |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gte-multilingual-base | gte-multilingual-reranker-base | 72.875 | 77.082 | 68.048 | 69.663 | 94.798 | 88.294 | 65.428 | 82.078 | 67.169 | 70.880 | 88.400 | 83.732 | 47.039 | 44.763 |
| gte-multilingual-base | NoReranker | 56.602 | 54.981 | 55.155 | 51.032 | 81.228 | 76.218 | 45.197 | 66.926 | 52.053 | 46.773 | 79.298 | 64.037 | 35.472 | 27.461 |

I have a question: the gte-multilingual-base score of 56.6 is consistent with the table. However, after adding gte-multilingual-reranker-base, the average is only 72.875, which does not match the 78.7 reported in the paper. Is there something wrong with my usage?
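
For reference, the numbers being compared are averages of per-language nDCG@10. Below is a small textbook-style sketch of the metric (not necessarily the exact implementation the evaluation toolkit uses internally); the example relevance labels are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain over the top-k results, gain = 2^rel - 1.
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # nDCG@k = DCG@k of the produced ranking / DCG@k of the ideal ranking.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical example: relevance labels of the top-10 documents in ranked order.
print(ndcg_at_k([1, 0, 1, 0, 0, 0, 0, 0, 0, 0], k=10))
```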
