Question about long-context evaluation in DeepSeek-V3.2-Exp
According to the release notes, DeepSeek-V3.2-Exp improves cost efficiency for long contexts without performance regression on standard benchmarks.
Looking at the cost-vs-context-length chart, I noticed an interesting pattern: around the 4k-8k token range, the slope of the curve changes noticeably, suggesting that some form of attention optimization kicks in around that point. Interestingly, below this range V3.2-Exp appears slightly more costly than the previous version, which makes the long-context improvements even more intriguing.
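For intuition on why such an inflection could appear, here is a minimal back-of-the-envelope sketch comparing per-token attention cost under dense attention versus a hypothetical top-k sparse attention. The `top_k`, `d_head`, and selection-overhead values are my own illustrative assumptions, not DeepSeek's actual parameters:

```python
# Back-of-envelope comparison of per-token attention cost for dense vs. a
# hypothetical top-k sparse attention, to illustrate why a cost curve could
# bend from roughly linear-in-L to near-flat once sparsity kicks in.
# All constants below are illustration-only assumptions.

def dense_attention_cost(context_len, d_head=128):
    """Dense attention: each new token attends to all previous tokens, O(L * d)."""
    return context_len * d_head

def sparse_attention_cost(context_len, top_k=2048, d_head=128):
    """Hypothetical sparse attention: each new token attends to at most top_k
    selected tokens, plus a small selection/indexing overhead that grows with L."""
    selection_overhead = 0.1 * context_len  # assumed lightweight scoring pass
    return min(context_len, top_k) * d_head + selection_overhead

if __name__ == "__main__":
    for L in (1_000, 4_000, 8_000, 32_000, 128_000):
        print(f"L={L:>7}: dense={dense_attention_cost(L):>12,.0f}  "
              f"sparse={sparse_attention_cost(L):>12,.0f}")
```

Under these assumptions, below the crossover the sparse path still attends to essentially the full context while also paying the selection overhead, which would be consistent with the slightly higher cost at short lengths.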
However, since common benchmarks like MMLU and GSM8K primarily evaluate short-context tasks, I'm curious whether any dedicated long-context benchmark results are available to validate the model's actual long-context capabilities.
Could you share whether evaluations were conducted on long-context benchmarks such as LongBench or L-Eval, and if so, what the results were? This would help clarify whether the observed attention optimization translates into measurable performance gains on long-context tasks.
Note: This question was refined with the assistance of DeepSeek-V3.2-Exp itself to improve clarity and precision.