TemporalSelfAttention - A Time-Biased Attention Module

Give Transformers a sense of time - not by scaling, but by structure.


Why?

Standard attention treats all tokens as if they occurred at the same moment: the score for a token pair does not depend on how much time separates them.
This works for syntax, but breaks down for:

  • Temporal event ordering
  • Causal reasoning
  • Timeline consistency
  • Long-range narrative coherence

💡 Insight: Standard Transformers simulate time via token position. We inject it structurally with a tiny inductive bias.


Core Equation

The time-aware attention score is computed as:

$$
\text{score}_{ij} = \frac{Q_i \cdot K_j^\top}{\sqrt{d_k}} + \gamma \cdot f(t_j - t_i)
$$

Notation

| Symbol | Description |
|--------|-------------|
| $\text{score}_{ij}$ | Attention score between the query at position $i$ and the key at position $j$ |
| $Q_i$ | Query vector for position $i$ |
| $K_j$ | Key vector for position $j$ |
| $d_k$ | Dimension of the key vectors |
| $\gamma$ | Learnable time-bias strength |
| $f(\cdot)$ | Time-difference function |
| $t_j - t_i$ | Relative time difference between key and query |
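
To make the equation concrete, here is a minimal PyTorch sketch of the biased score computation. It is not this module's implementation; the bias function used in the example, f(Δt) = -|Δt| (attention decays with the time gap), is an assumption for illustration.

import torch

def time_biased_scores(Q, K, t, f, gamma=1.0):
    # Q, K: (B, T, d_k) query/key tensors; t: (B, T) timestamps
    d_k = Q.size(-1)
    content = Q @ K.transpose(-2, -1) / d_k ** 0.5  # [b, i, j] = Q_i . K_j / sqrt(d_k)
    dt = t.unsqueeze(1) - t.unsqueeze(2)            # [b, i, j] = t_j - t_i
    return content + gamma * f(dt)                  # add the time bias gamma * f(t_j - t_i)

B, T, d_k = 2, 5, 16
Q, K = torch.randn(B, T, d_k), torch.randn(B, T, d_k)
t = torch.sort(torch.rand(B, T), dim=-1).values               # increasing event times
scores = time_biased_scores(Q, K, t, f=lambda dt: -dt.abs())  # illustrative linear decay
weights = torch.softmax(scores, dim=-1)                       # (B, T, T) attention weights

Setting gamma to zero recovers standard scaled dot-product attention.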

How To Use

import torch
from temporal_attention import TemporalSelfAttention

model = TemporalSelfAttention(
    embed_dim=64,
    num_heads=1,
    bias_type="linear",  # or 'gaussian'
    gamma=1.0,
    causal=False
)

# x: token embeddings of shape (B, T, D); timestamps: event times of shape (B, T)
x = torch.randn(2, 10, 64)
timestamps = torch.arange(10, dtype=torch.float32).expand(2, 10)

output, weights = model(x, timestamps)
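
The bias_type flag selects the shape of f. As a rough mental model (the exact forms below are assumptions for illustration, not taken from this module's source), a linear bias penalizes attention in proportion to the time gap, while a Gaussian bias keeps attention concentrated inside a soft time window:

import torch

def f_linear(dt):
    # penalty grows linearly with the absolute time gap (assumed form)
    return -dt.abs()

def f_gaussian(dt, sigma=1.0):
    # quadratic penalty; after the softmax this acts like a Gaussian window in time (assumed form)
    return -(dt ** 2) / (2 * sigma ** 2)

dt = torch.linspace(-3, 3, 7)  # example time differences t_j - t_i
print(f_linear(dt))            # -3, -2, -1, 0, -1, -2, -3
print(f_gaussian(dt))          # -4.5, -2.0, -0.5, 0, -0.5, -2.0, -4.5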