Question about long-context evaluation in DeepSeek-V3.2-Exp
According to the release notes, DeepSeek-V3.2-Exp improves cost efficiency for long contexts without performance regression on standard benchmarks.
Looking at the cost-vs-context-length chart, I noticed an interesting pattern: around the 4k-8k token range, the slope of the curve changes noticeably, suggesting that some form of attention optimization kicks in around that point. Interestingly, below this range V3.2-Exp appears slightly more costly than the previous version, which makes the long-context improvements even more intriguing.
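For intuition on why such an inflection could appear, here is a minimal back-of-the-envelope sketch comparing per-token attention cost under dense attention versus a hypothetical top-k sparse attention. The `top_k`, `d_head`, and selection-overhead values are my own illustrative assumptions, not DeepSeek's actual parameters:

```python
# Back-of-envelope comparison of per-token attention cost for dense vs. a
# hypothetical top-k sparse attention, to illustrate why a cost curve could
# bend from roughly linear-in-L to near-flat once sparsity kicks in.
# All constants below are illustration-only assumptions.

def dense_attention_cost(context_len, d_head=128):
    """Dense attention: each new token attends to all previous tokens, O(L * d)."""
    return context_len * d_head

def sparse_attention_cost(context_len, top_k=2048, d_head=128):
    """Hypothetical sparse attention: each new token attends to at most top_k
    selected tokens, plus a small selection/indexing overhead that grows with L."""
    selection_overhead = 0.1 * context_len  # assumed lightweight scoring pass
    return min(context_len, top_k) * d_head + selection_overhead

if __name__ == "__main__":
    for L in (1_000, 4_000, 8_000, 32_000, 128_000):
        print(f"L={L:>7}: dense={dense_attention_cost(L):>12,.0f}  "
              f"sparse={sparse_attention_cost(L):>12,.0f}")
```

Under these assumptions, below the crossover the sparse path still attends to essentially the full context while also paying the selection overhead, which would be consistent with the slightly higher cost at short lengths.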
However, since common benchmarks like MMLU and GSM8K primarily evaluate short-context tasks, I'm curious whether any dedicated long-context benchmark results are available to validate the model's actual long-context capabilities.
Could you share whether evaluations were conducted on long-context benchmarks such as LongBench or L-Eval, and if so, what the results were? This would help clarify whether the observed attention optimization translates into measurable performance gains on long-context tasks.
Note: This question was refined with the assistance of DeepSeek-V3.2-Exp itself to improve clarity and precision.