Papers
arxiv:2510.09776

Why Do Transformers Fail to Forecast Time Series In-Context?

Published on Oct 10
· Submitted by Yufa Zhou on Oct 15
Abstract

Theoretical analysis reveals that Transformers, particularly Linear Self-Attention models, have limitations in time series forecasting compared to classical linear models, with predictions collapsing to the mean under Chain-of-Thought inference.

AI-generated summary

Time series forecasting (TSF) remains a challenging and largely unsolved problem in machine learning, despite significant recent efforts leveraging Large Language Models (LLMs), which predominantly rely on Transformer architectures. Empirical evidence consistently shows that even powerful Transformers often fail to outperform much simpler models, e.g., linear models, on TSF tasks; however, a rigorous theoretical understanding of this phenomenon remains limited. In this paper, we provide a theoretical analysis of Transformers' limitations for TSF through the lens of In-Context Learning (ICL) theory. Specifically, under AR(p) data, we establish that: (1) Linear Self-Attention (LSA) models cannot achieve lower expected MSE than classical linear models for in-context forecasting; (2) as the context length approaches infinity, LSA asymptotically recovers the optimal linear predictor; and (3) under Chain-of-Thought (CoT) style inference, predictions collapse to the mean exponentially. We empirically validate these findings through carefully designed experiments. Our theory not only sheds light on several previously underexplored phenomena but also offers practical insights for designing more effective forecasting architectures. We hope our work encourages the broader research community to revisit the fundamental theoretical limitations of TSF and to critically evaluate the direct application of increasingly sophisticated architectures without deeper scrutiny.

Community

Paper author · Paper submitter

“Why Do Transformers Fail to Forecast Time Series In-Context?”

📄 arxiv.org/abs/2510.09776
💻 github.com/MasterZhou1/ICL-Time-Series

Transformers dominate NLP and vision, yet consistently underperform simple linear models in time-series forecasting (TSF).
Why does this happen, despite vastly more parameters and compute?

Our paper provides the first theoretical explanation of this phenomenon.
We analyze Transformers through the lens of In-Context Learning (ICL) theory on AR(p) processes, offering rigorous insights and faithful explanations for several previously underexplored empirical observations.
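To make the setup concrete, here is a minimal NumPy sketch (not the paper's code; see the GitHub repo for the actual implementation) of in-context forecasting on an AR(p) process with the classical OLS linear baseline:

```python
# Minimal NumPy sketch (not the paper's code): in-context forecasting on an
# AR(p) process, with the classical OLS linear baseline the paper compares
# against. Coefficients and sizes below are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar(coeffs, n, sigma=1.0):
    """Simulate x_t = sum_i coeffs[i] * x_{t-1-i} + sigma * eps_t, eps_t ~ N(0, 1)."""
    p = len(coeffs)
    x = np.zeros(n + p)
    for t in range(p, n + p):
        x[t] = coeffs @ x[t - p:t][::-1] + sigma * rng.standard_normal()
    return x[p:]

def ols_forecast(context, p):
    """Fit AR(p) coefficients by least squares on the context, predict one step ahead."""
    X = np.stack([context[i:i + p][::-1] for i in range(len(context) - p)])
    y = context[p:]
    a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a_hat, float(a_hat @ context[-p:][::-1])

true_coeffs = np.array([0.6, -0.3])        # illustrative AR(2) coefficients
series = simulate_ar(true_coeffs, n=256)
a_hat, x_next = ols_forecast(series, p=2)
print("estimated coefficients:", a_hat)
print("one-step forecast:", x_next)
```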

We derive that:
1️⃣ Linear Self-Attention (LSA) ≈ compressed linear regression → cannot outperform OLS in expectation.
2️⃣ A strict finite-sample gap exists between LSA and the optimal linear predictor.
3️⃣ The gap vanishes only at a 1/n rate in the context length n, so longer contexts help slowly.
4️⃣ Under Chain-of-Thought (CoT) rollout, predictions collapse exponentially toward the mean (see the toy sketch after this list).
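As a toy illustration of point 4️⃣ (assuming a zero-mean AR(1), a simplification of the paper's setting): rolling a fitted one-step linear predictor out autoregressively feeds its own outputs back in, so the h-step forecast shrinks geometrically toward the mean instead of tracking future fluctuations:

```python
# Toy illustration (assumed setting: zero-mean AR(1)) of the CoT-rollout
# collapse: after h rollout steps the forecast equals a_hat**h * x[-1],
# which decays geometrically toward the process mean (0).
import numpy as np

rng = np.random.default_rng(1)
a, sigma, horizon = 0.8, 1.0, 20

# Simulate a long AR(1) context and fit a_hat by least squares.
x = np.zeros(500)
for t in range(1, len(x)):
    x[t] = a * x[t - 1] + sigma * rng.standard_normal()
a_hat = np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])

# CoT-style rollout: each prediction is fed back as the next input.
preds, cur = [], x[-1]
for _ in range(horizon):
    cur = a_hat * cur            # equals a_hat**h * x[-1] after h steps
    preds.append(cur)

print("fitted a_hat:", round(a_hat, 3))
print("rollout forecasts:", np.round(preds, 3))   # shrink toward the mean
```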


Empirical Validation
Synthetic AR benchmarks confirm our theory (a rough sketch of the two evaluation protocols follows this list):

  • Under Teacher Forcing, LSA tracks the ground truth but never outperforms OLS.
  • Under CoT rollout, both collapse, with LSA failing earlier.
  • Increasing context length or depth yields diminishing returns.
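The sketch below is not the paper's experiment code; it uses the OLS predictor as a stand-in for a trained LSA model simply to contrast the two evaluation protocols on a synthetic AR(2) series:

```python
# Rough sketch (assumptions: AR(2) data, OLS as a stand-in predictor) of the
# two evaluation protocols. Teacher forcing feeds true past values at every
# step; CoT-style rollout feeds the model's own predictions back in.
import numpy as np

rng = np.random.default_rng(2)
coeffs, p, n = np.array([0.6, -0.3]), 2, 600

x = np.zeros(n + p)
for t in range(p, n + p):
    x[t] = coeffs @ x[t - p:t][::-1] + rng.standard_normal()
x = x[p:]

train, test = x[:400], x[400:]
X = np.stack([train[i:i + p][::-1] for i in range(len(train) - p)])
a_hat, *_ = np.linalg.lstsq(X, train[p:], rcond=None)

# Teacher forcing: one-step predictions from true lags.
tf_preds = [a_hat @ test[t - p:t][::-1] for t in range(p, len(test))]
tf_mse = np.mean((np.array(tf_preds) - test[p:]) ** 2)

# CoT-style rollout: start from the last true lags, then feed predictions back.
lags, ro_preds = list(test[:p]), []
for _ in range(len(test) - p):
    nxt = a_hat @ np.array(lags[-p:])[::-1]
    ro_preds.append(nxt)
    lags.append(nxt)
ro_mse = np.mean((np.array(ro_preds) - test[p:]) ** 2)

print(f"teacher-forcing MSE: {tf_mse:.3f}   rollout MSE: {ro_mse:.3f}")
```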

Takeaways

  • “Attention is not all you need for time-series forecasting.”
  • The TSF bottleneck stems from architectural representation limits, not training or optimization issues.

This work bridges ICL theory and classical time-series analysis, laying a foundation for the next generation of forecasting architectures.

#MachineLearning #Transformers #TimeSeries #ICL #MLTheory #DeepLearning #TSF
