About model_max_length
#1 opened by hongwen11
In tokenizer_config.json it says: `"model_max_length": 1000000000000000019884624838656`.
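For context, that huge value is the `transformers` default sentinel used when no real limit was recorded in the config: `VERY_LARGE_INTEGER`, defined as `int(1e30)`, whose floating-point representation produces exactly this number. A quick check:

```python
# The odd-looking model_max_length is int(1e30): 1e30 is a float,
# and the nearest representable double converts to this integer.
sentinel = int(1e30)
print(sentinel)  # 1000000000000000019884624838656
```

So the field simply means "unset" and does not reflect the model's actual training length.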
Can you kindly tell me the length distribution of the training set? That would help me choose a better chunk length when testing your model.
Our queries and passages are trained with a length of 512, but the maximum length of queries with examples is set to 2048.
Thanks for the reply. Do you train all your passages at a length of 512, or do their lengths vary with an average around 512?
Like other LLM-based embedding models, we set the passage length to 512 for all passages.
But the model can handle documents longer than 512 tokens, right? If so, would it be better to truncate to 512 or not?
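One common workaround when documents exceed the 512-token training length is to split the token sequence into 512-token windows and embed each window separately, then pool or score the chunks. A minimal sketch of the chunking step (the function name, stride choice, and downstream pooling are assumptions, not part of this model's documented usage):

```python
def chunk_ids(ids, max_len=512, stride=512):
    """Split a token-id sequence into windows of at most max_len tokens.

    A stride smaller than max_len would give overlapping windows;
    stride == max_len gives non-overlapping chunks.
    """
    if len(ids) <= max_len:
        return [ids]
    return [ids[i:i + max_len] for i in range(0, len(ids), stride)]

# Example: a 1200-token document at stride 512 yields three chunks.
chunks = chunk_ids(list(range(1200)))
print([len(c) for c in chunks])  # [512, 512, 176]
```

Each chunk can then be embedded on its own and the results combined (e.g. max similarity over chunks), which keeps every input within the length the model was trained on.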