---
library_name: pytorch
license: mit
language:
- en
tags:
- chronologically consistent
- instruction following
- modded-nanogpt
- large language model
- lookahead-bias-free
pipeline_tag: text-generation
inference: false
---

# ChronoGPT-Instruct

ChronoGPT-Instruct is a family of **chronologically consistent, instruction-following large language models (LLMs)** that eliminate lookahead bias by training exclusively on time-stamped data available **before a fixed knowledge-cutoff date τ**.
Each `ChronoGPT-Instruct-τ` extends the corresponding `ChronoGPT-τ` base model through supervised instruction fine-tuning while strictly maintaining temporal separation from all post-τ information.

These models provide the research community with a transparent, replicable benchmark for testing **lookahead-bias-free prediction** in economics, finance, and other time-sensitive domains.

---

## 🔍 Model Overview

| Property | Description |
|:--|:--|
| **Architecture** | Transformer decoder |
| **Parameters** | ≈ 1.55 B |
| **Layers** | 52 |
| **Embedding dim** | 1,536 |
| **Context length** | 1,792 tokens |
| **Tokenizer** | `GPT2Tokenizer` (Hugging Face) |
| **Training stage** | Pretraining + instruction fine-tuning (SFT) |
| **License** | MIT |
| **Languages** | English |
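
The parameter count can be roughly cross-checked from the layer count and embedding dimension. The sketch below is a back-of-the-envelope estimate only, assuming a standard GPT-2-style decoder block (≈ 12·d² weights per layer) and a tied ~50K-token GPT-2 vocabulary embedding; the exact modded-nanogpt parameterization may differ slightly.

```python
# Rough parameter estimate (assumptions stated above; not the exact architecture).
n_layers = 52        # from the table above
d_model = 1536       # embedding dimension
vocab_size = 50257   # GPT-2 tokenizer vocabulary, assumed tied with the output head

per_layer = 12 * d_model ** 2        # ~4·d² attention + ~8·d² MLP weights
transformer = n_layers * per_layer   # ≈ 1.47 B
embedding = vocab_size * d_model     # ≈ 0.08 B

print(f"≈ {(transformer + embedding) / 1e9:.2f} B parameters")  # ≈ 1.55 B
```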

---

## 🧠 Training & Data

### Chronological Consistency
Each model’s corpus satisfies chronological consistency in both the pretraining and instruction fine-tuning phases. Texts dated after the model year are excluded, ensuring zero overlap with evaluation data. A GPT-4.1 classifier screens every instruction-response pair.

### Instruction-Finetuning Corpus
| Stage | Source | # Examples | Avg. Length |
|:--|:--|:--:|:--:|
| 1 | LLMs-from-Scratch | 1,097 | 102 |
| 2 | GPT-3 Self-Instruct | 67,136 | 183 |
| 3 | AllenAI Tulu-3 Mixture | 356,886 | 2,513 |

Only English, non-code entries with pre-2000 content (classifier label = 0 and confidence = 10) are retained.

We release the SFT dataset at https://huggingface.co/datasets/manelalab/ChronoInstruct-SFT.
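
The released dataset can be inspected with the 🤗 `datasets` library. This is a minimal sketch; split and column names follow the dataset's actual schema on the Hub.

```python
# Minimal sketch: load and inspect the released ChronoInstruct-SFT dataset.
from datasets import load_dataset

sft = load_dataset("manelalab/ChronoInstruct-SFT")  # dataset linked above
print(sft)                                          # available splits and columns
first_split = next(iter(sft.values()))
print(first_split[0])                               # first instruction-response example
```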

---

## 🚀 Usage Examples

You can try ChronoGPT-Instruct directly in your browser via Google Colab:

<p align="left">
  <a href="https://colab.research.google.com/github/LinyingLyu/ChronoGPT/blob/main/ChronoGPT_instruct_tutorial.ipynb" target="_blank">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/>
  </a>
</p>
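
For programmatic use outside the notebook, a loading sketch along the following lines should apply. The repository id below is a placeholder (substitute an actual `ChronoGPT-Instruct-τ` checkpoint), loading through `transformers` with `trust_remote_code=True` is an assumption for the custom modded-nanogpt architecture, and the prompt format shown in the Colab tutorial remains the reference.

```python
# Minimal generation sketch. NOTE: "manelalab/ChronoGPT-Instruct-<cutoff>" is a
# placeholder repo id; loading via transformers with trust_remote_code=True is
# assumed rather than guaranteed -- see the Colab notebook for the supported workflow.
import torch
from transformers import AutoModelForCausalLM, GPT2Tokenizer

repo_id = "manelalab/ChronoGPT-Instruct-<cutoff>"   # placeholder checkpoint name

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # model card: GPT2Tokenizer
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True).eval()

prompt = "Summarize the key assumptions behind chronologically consistent training."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```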

---

## 👩‍💻 Citation

```bibtex
@article{He_Lv_Manela_Wu_chronogpt_2025,
  title={Chronologically Consistent Generative AI},
  author={He, Songrun and Lv, Linying and Manela, Asaf and Wu, Jimmy},
  journal={Working Paper},
  year={2025}
}
```