---
library_name: pytorch
license: mit
language:
- en
tags:
- chronologically consistent
- modded-nanogpt
- hellaswag
pipeline_tag: text-generation
inference: false
---
# ChronoGPT

## ChronoGPT Highlights

ChronoGPT is a series of **high-performance, chronologically consistent large language models (LLMs)** designed to eliminate lookahead bias and training leakage while maintaining strong language understanding in time-sensitive applications. Each model is pretrained on **diverse, high-quality, open-source, and timestamped text** to maintain chronological consistency.

All models in the series achieve **HellaSwag benchmark scores that surpass those of the GPT-2 124M model.** This approach preserves the integrity of historical analysis and enables more reliable economic and financial modeling.

- **Developed by:** Songrun He, Linying Lv, Asaf Manela, Jimmy Wu
- **Model type:** Transformer-based autoregressive decoder (modified modded-nanogpt architecture)
- **Language(s) (NLP):** English
- **License:** MIT License

## Model Overview

**ChronoGPT** has the following features:
- Type: Causal Language Models
- Training Stage: Pretraining
- Number of Parameters: ~1,552 million
- Encoder & Decoder Partitioning: 26 encoder and 26 decoder layers
- Tokenizer: GPT2Tokenizer from HuggingFace
- Context Length: 1,792
- Embedding Dimension: 1,536
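
To make the context length concrete, here is a minimal sketch (ours, not part of the official code) that tokenizes text with the same GPT-2 encoding ChronoGPT uses and truncates it to the 1,792-token context window; `sample_text` is just a placeholder.

```python
# Minimal sketch: tokenize with the GPT-2 encoding and truncate to
# ChronoGPT's 1,792-token context window before feeding the model.
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
context_length = 1792  # ChronoGPT's maximum context length

sample_text = "Your document text here ..."  # placeholder input
tokens = tokenizer.encode(sample_text)[:context_length]
print(f"Kept {len(tokens)} tokens (limit: {context_length})")
```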

## 🚀 Quickstart

You can try ChronoGPT directly in your browser via Google Colab:

<p align="left">
  <a href="https://colab.research.google.com/github/LinyingLyu/ChronoGPT/blob/main/ChronoGPT_tutorial.ipynb" target="_blank">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/>
  </a>
</p>

Or run it locally with:

```bash
pip install -r requirements.txt
```
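
Note: `requirements.txt` and the `ChronoGPT_inference` module imported in the snippets below are assumed to come from the ChronoGPT GitHub repository (the same one hosting the Colab notebook above), so clone it first when running locally.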

### Text Generation

The following code snippet illustrates how to use the model to generate text from a given prompt.

```python
import torch
import torch.nn.functional as F
import tiktoken
from ChronoGPT_inference import *

# ----------------------------- Setup -----------------------------
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cache_dir = 'cache'  # Update this path as needed

tokenizer = tiktoken.get_encoding("gpt2")
max_length = 50
num_return_sequences = 5
seed = 123

# -------------------------- Load Model --------------------------
model = ChronoGPT.from_pretrained(
    "manelalab/chrono-gpt-v1-20241231",
    trust_remote_code=True,
    cache_dir=cache_dir
).to(device)

# ------------------------ Prepare Input -------------------------
prompt = "Hello, I am a language model,"
tokens = tokenizer.encode(prompt)
tokens = torch.tensor(tokens, dtype=torch.long).unsqueeze(0)
tokens = tokens.repeat(num_return_sequences, 1).to(device)

# -------------------- Sampling Initialization -------------------
xgen = tokens.clone()
sample_rng = torch.Generator(device=device)
sample_rng.manual_seed(seed)

# ------------------------- Text Generation -----------------------
while xgen.size(1) < max_length:
    with torch.no_grad():
        with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
            logits, _ = model(xgen)

        logits = logits[:, -1, :]  # Last token logits
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)  # top-k sampling, k=50

        sampled_idx = torch.multinomial(topk_probs, 1, generator=sample_rng)
        next_token = torch.gather(topk_indices, -1, sampled_idx)

        xgen = torch.cat([xgen, next_token], dim=1)

# ------------------------- Decode Output -------------------------
for i in range(num_return_sequences):
    decoded_tokens = xgen[i, :max_length].tolist()
    decoded_text = tokenizer.decode(decoded_tokens)
    print(f"Rank sample {i}:\n{decoded_text}\n")
```
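
The loop above uses top-k sampling with k = 50: at each step the 50 most likely next tokens are renormalized and one is drawn at random, trading a little determinism for diversity across the five returned sequences. If you prefer deterministic output, a small sketch of a greedy variant (ours, not from the official example) would replace the sampling step inside the loop:

```python
# Greedy decoding sketch: pick the single most likely token instead of
# sampling from the top 50. Replaces the topk/multinomial/gather lines above.
next_token = torch.argmax(probs, dim=-1, keepdim=True)  # (batch, 1)
xgen = torch.cat([xgen, next_token], dim=1)
```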

### Extract Embeddings 

The following code snippet illustrates how to use the model to extract embeddings from all layers for a given input.

```python
import torch
import tiktoken
from ChronoGPT_inference import *

# ----------------------------- Setup -----------------------------
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cache_dir = 'cache'  # Update this path as needed

tokenizer = tiktoken.get_encoding("gpt2")
max_length = 1792  # ChronoGPT's maximum context length

# -------------------------- Load Model --------------------------
model = ChronoGPT.from_pretrained(
    "manelalab/chrono-gpt-v1-20241231",
    trust_remote_code=True,
    cache_dir=cache_dir
).to(device)

# ----------------------- Embedding Generation ---------------------
text = "Obviously, the time continuum has been disrupted, creating a new temporal event sequence resulting in this alternate reality."

inputs = torch.tensor(tokenizer.encode(text))[:max_length].reshape(1, -1).to(device)
with torch.no_grad():
    logits, emb = model(inputs)
print('Dimension of embeddings:', emb[0].shape)
```
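
For downstream tasks you often want a single vector per input rather than per-token, per-layer embeddings. A minimal sketch, assuming each element of `emb` is a `(batch, seq_len, 1536)` tensor of token embeddings for one layer (consistent with the `emb[0].shape` printout above), is to mean-pool the last layer over tokens:

```python
# Mean-pool the last-layer token embeddings into one vector per sequence.
# Assumes emb[-1] has shape (batch, seq_len, hidden_dim).
doc_embedding = emb[-1].mean(dim=1)  # -> (batch, hidden_dim)
print('Pooled embedding shape:', doc_embedding.shape)
```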

## Citation

```bibtex
@article{He2025ChronoBERT,
  title={Chronologically Consistent Large Language Models},
  author={He, Songrun and Lv, Linying and Manela, Asaf and Wu, Jimmy},
  journal={Working Paper},
  year={2025}
}
```

### Model Card Authors

- Songrun He (Washington University in St. Louis, [email protected])
- Linying Lv (Washington University in St. Louis, [email protected])
- Asaf Manela (Washington University in St. Louis, [email protected])
- Jimmy Wu (Washington University in St. Louis, [email protected])