TemporalSelfAttention - A Time-Biased Attention Module

Give Transformers a sense of time - not by scaling, but by structure.


Why?

Standard attention treats all tokens as if they occurred at the same moment: the score for a token pair does not depend on how much time separates them.
This works for syntax, but breaks down for:

  • Temporal event ordering
  • Causal reasoning
  • Timeline consistency
  • Long-range narrative coherence

💡 Insight: Standard Transformers simulate time via token position. We inject it structurally with a tiny inductive bias.


Core Equation

The time-aware attention score is computed as:

$$
\text{score}_{ij} = \frac{Q_i \cdot K_j^\top}{\sqrt{d_k}} + \gamma \cdot f(t_j - t_i)
$$

Notation

| Symbol | Description |
|--------|-------------|
| $\text{score}_{ij}$ | Attention score between the query at position $i$ and the key at position $j$ |
| $Q_i$ | Query vector for position $i$ |
| $K_j$ | Key vector for position $j$ |
| $d_k$ | Dimension of the key vectors |
| $\gamma$ | Learnable time-bias strength |
| $f(\cdot)$ | Time-difference function |
| $t_j - t_i$ | Relative time difference between key and query |
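
To make the equation concrete, here is a minimal PyTorch sketch of the biased score computation. It is not this module's implementation; the bias function used in the example, f(Δt) = -|Δt| (attention decays with the time gap), is an assumption for illustration.

import torch

def time_biased_scores(Q, K, t, f, gamma=1.0):
    # Q, K: (B, T, d_k) query/key tensors; t: (B, T) timestamps
    d_k = Q.size(-1)
    content = Q @ K.transpose(-2, -1) / d_k ** 0.5  # [b, i, j] = Q_i . K_j / sqrt(d_k)
    dt = t.unsqueeze(1) - t.unsqueeze(2)            # [b, i, j] = t_j - t_i
    return content + gamma * f(dt)                  # add the time bias gamma * f(t_j - t_i)

B, T, d_k = 2, 5, 16
Q, K = torch.randn(B, T, d_k), torch.randn(B, T, d_k)
t = torch.sort(torch.rand(B, T), dim=-1).values               # increasing event times
scores = time_biased_scores(Q, K, t, f=lambda dt: -dt.abs())  # illustrative linear decay
weights = torch.softmax(scores, dim=-1)                       # (B, T, T) attention weights

Setting gamma to zero recovers standard scaled dot-product attention.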

How To Use

import torch
from temporal_attention import TemporalSelfAttention

model = TemporalSelfAttention(
    embed_dim=64,
    num_heads=1,
    bias_type="linear",  # or 'gaussian'
    gamma=1.0,
    causal=False
)

# x: token embeddings of shape (B, T, D); timestamps: event times of shape (B, T)
x = torch.randn(2, 10, 64)
timestamps = torch.arange(10, dtype=torch.float32).expand(2, 10)

output, weights = model(x, timestamps)
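
The bias_type flag selects the shape of f. As a rough mental model (the exact forms below are assumptions for illustration, not taken from this module's source), a linear bias penalizes attention in proportion to the time gap, while a Gaussian bias keeps attention concentrated inside a soft time window:

import torch

def f_linear(dt):
    # penalty grows linearly with the absolute time gap (assumed form)
    return -dt.abs()

def f_gaussian(dt, sigma=1.0):
    # quadratic penalty; after the softmax this acts like a Gaussian window in time (assumed form)
    return -(dt ** 2) / (2 * sigma ** 2)

dt = torch.linspace(-3, 3, 7)  # example time differences t_j - t_i
print(f_linear(dt))            # -3, -2, -1, 0, -1, -2, -3
print(f_gaussian(dt))          # -4.5, -2.0, -0.5, 0, -0.5, -2.0, -4.5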