Why Do Transformers Fail to Forecast Time Series In-Context?
Abstract
Theoretical analysis shows that Transformers, in particular Linear Self-Attention models, cannot achieve lower expected MSE than classical linear models for in-context time series forecasting, and that their predictions collapse exponentially to the mean under Chain-of-Thought inference.
Time series forecasting (TSF) remains a challenging and largely unsolved problem in machine learning, despite significant recent efforts leveraging Large Language Models (LLMs), which predominantly rely on Transformer architectures. Empirical evidence consistently shows that even powerful Transformers often fail to outperform much simpler models, e.g., linear models, on TSF tasks; however, a rigorous theoretical understanding of this phenomenon remains limited. In this paper, we provide a theoretical analysis of Transformers' limitations for TSF through the lens of In-Context Learning (ICL) theory. Specifically, under AR(p) data, we establish that: (1) Linear Self-Attention (LSA) models cannot achieve lower expected MSE than classical linear models for in-context forecasting; (2) as the context length approaches infinity, LSA asymptotically recovers the optimal linear predictor; and (3) under Chain-of-Thought (CoT) style inference, predictions collapse exponentially to the mean. We empirically validate these findings through carefully designed experiments. Our theory not only sheds light on several previously underexplored phenomena but also offers practical insights for designing more effective forecasting architectures. We hope our work encourages the broader research community to revisit the fundamental theoretical limitations of TSF and to critically evaluate the direct application of increasingly sophisticated architectures without deeper scrutiny.
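To make the setting concrete, here is a minimal sketch of the AR(p) data-generating process and the classical linear (OLS) baseline that the theory compares against. The helper names `simulate_ar` and `ols_forecast` are our own illustrations, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar(coeffs, n, sigma=0.1):
    """Simulate a zero-mean AR(p) process x_t = sum_i coeffs[i] * x_{t-1-i} + noise."""
    p = len(coeffs)
    x = np.zeros(n + p)                        # zero-initialized burn-in prefix
    for t in range(p, n + p):
        x[t] = np.dot(coeffs, x[t - p:t][::-1]) + sigma * rng.standard_normal()
    return x[p:]

def ols_forecast(context, p):
    """Fit the classical linear (OLS) predictor on the context window
    and return (weights, one-step-ahead forecast)."""
    X = np.stack([context[i:i + p] for i in range(len(context) - p)])
    y = context[p:]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w, float(context[-p:] @ w)

series = simulate_ar([0.6, 0.3], n=512)        # stationary AR(2)
w_hat, forecast = ols_forecast(series[:-1], p=2)
print("recovered coefficients:", w_hat[::-1])  # ~ [0.6, 0.3]
print("one-step forecast:", forecast, "| truth:", series[-1])
```

With enough context, the OLS weights recover the generating coefficients; per result (1), this is the benchmark an LSA model cannot beat in expectation.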
Community
“Why Do Transformers Fail to Forecast Time Series In-Context?”
📄 arxiv.org/abs/2510.09776
💻 github.com/MasterZhou1/ICL-Time-Series
Transformers dominate NLP and vision, yet consistently underperform simple linear models in time-series forecasting (TSF).
Why does this happen, despite vastly more parameters and compute?
Our paper provides the first theoretical explanation of this phenomenon.
We analyze Transformers through the lens of In-Context Learning (ICL) theory on AR(p) processes, offering rigorous insight and faithful explanations for many previously unexplained empirical observations.
We derive that:
1️⃣ Linear Self-Attention (LSA) ≈ compressed linear regression → cannot outperform OLS in expectation.
2️⃣ A strict finite-sample gap exists between LSA and the optimal linear predictor.
3️⃣ The gap vanishes only at a 1/n rate, so the optimal linear predictor is matched only in the infinite-context limit.
4️⃣ Under Chain-of-Thought (CoT) rollout, predictions collapse exponentially toward the mean (see the sketch below).
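Claim 4️⃣ admits a very concrete illustration: when predictions are fed back as inputs and no fresh noise enters, iterating any stable linear recursion contracts the state toward the process mean at a geometric rate set by the spectral radius of its companion matrix. Below is a toy NumPy sketch of this mechanism, our own illustration using the true AR(2) coefficients as the predictor rather than the paper's trained LSA model:

```python
import numpy as np

# Toy illustration of claim 4: CoT rollout of even the *true* AR(2) coefficients
# collapses to the mean (an assumption for illustration; the paper analyzes LSA).
coeffs = np.array([0.6, 0.3])               # stationary AR(2)
companion = np.array([[0.6, 0.3],
                      [1.0, 0.0]])          # companion matrix of the recursion
rho = float(max(abs(np.linalg.eigvals(companion))))
print(f"spectral radius = {rho:.3f}")       # ~0.925 < 1

state = np.array([1.0, 0.5])                # [x_{t-1}, x_{t-2}] when rollout starts
for step in range(1, 31):
    pred = float(coeffs @ state)            # one CoT step: no fresh noise enters
    state = np.array([pred, state[0]])      # feed the prediction back as input
    if step % 5 == 0:
        print(f"step {step:2d}: prediction = {pred:+.5f}")
```

Because the spectral radius here is about 0.925, each CoT step shrinks the prediction magnitude by roughly that factor, which is exactly the exponential collapse the theory predicts.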
Empirical Validation
Synthetic AR benchmarks confirm our theory (a minimal reproduction sketch follows this list):
- Under Teacher Forcing, LSA tracks the ground truth but never exceeds OLS.
- Under CoT rollout, both collapse, with LSA failing earlier.
- Increasing context length or depth yields diminishing returns.
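A minimal way to reproduce the teacher-forcing vs. rollout contrast, using an OLS predictor as a hypothetical stand-in for a trained LSA model (an assumption for illustration; the paper's experiments use trained Transformers):

```python
import numpy as np

rng = np.random.default_rng(1)
coeffs, p, sigma = np.array([0.6, 0.3]), 2, 0.1   # stationary AR(2), noise scale
n, horizon = 512, 50

# Simulate the series (first n points for fitting, the rest for evaluation).
x = np.zeros(n + horizon)
for t in range(p, n + horizon):
    x[t] = coeffs @ x[t - p:t][::-1] + sigma * rng.standard_normal()

# Fit the OLS predictor on the context.
X = np.stack([x[i:i + p] for i in range(n - p)])
w, *_ = np.linalg.lstsq(X, x[p:n], rcond=None)

# Teacher forcing: every prediction sees the true past.
tf_pred = np.array([x[t - p:t] @ w for t in range(n, n + horizon)])

# CoT-style rollout: each prediction is fed back as input.
buf, ro_pred = list(x[n - p:n]), []
for _ in range(horizon):
    nxt = float(np.array(buf[-p:]) @ w)
    ro_pred.append(nxt)
    buf.append(nxt)

truth = x[n:n + horizon]
print("teacher-forcing MSE:", np.mean((tf_pred - truth) ** 2))           # ~ sigma**2
print("rollout MSE:        ", np.mean((np.array(ro_pred) - truth) ** 2))  # larger
```

Under teacher forcing the one-step MSE stays near the noise floor sigma², while the rollout's error grows as its trajectory decays toward the mean and the true series keeps fluctuating.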
Takeaways
- “Attention is not all you need for time-series forecasting.”
- The TSF bottleneck stems from architectural representation limits, not training or optimization issues.
This work bridges ICL theory and classical time-series analysis, laying a foundation for the next generation of forecasting architectures.
#MachineLearning #Transformers #TimeSeries #ICL #MLTheory #DeepLearning #TSF
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Trained Mamba Emulates Online Gradient Descent in In-Context Linear Regression (2025)
- Integrating Time Series into LLMs via Multi-layer Steerable Embedding Fusion for Enhanced Forecasting (2025)
- VARMA-Enhanced Transformer for Time Series Forecasting (2025)
- Super-Linear: A Lightweight Pretrained Mixture of Linear Experts for Time Series Forecasting (2025)
- Characteristic Root Analysis and Regularization for Linear Time Series Forecasting (2025)
- Lightweight and Data-Efficient Multivariate Time Series Forecasting using Residual-Stacked Gaussian (RS-GLinear) Architecture (2025)
- Why Attention Fails: The Degeneration of Transformers into MLPs in Time Series Forecasting (2025)