LinyingLyu committed · verified · Commit 7f1087e · 1 Parent(s): 4a9d6f4

Upload README.md with huggingface_hub

Files changed (1): README.md added (+80 −0)

---
library_name: pytorch
license: mit
language:
- en
tags:
- chronologically consistent
- instruction following
- modded-nanogpt
- large language model
- lookahead-bias-free
pipeline_tag: text-generation
inference: false
---

# ChronoGPT-Instruct

ChronoGPT-Instruct is a family of **chronologically consistent, instruction-following large language models (LLMs)** that eliminate lookahead bias by training exclusively on time-stamped data available **before a fixed knowledge-cutoff date τ**.
Each `ChronoGPT-Instruct-τ` extends the `ChronoGPT-τ` base models through supervised instruction fine-tuning while strictly maintaining temporal separation from all post-τ information.

These models provide the research community with a transparent, replicable benchmark for testing **lookahead-bias-free prediction** in economics, finance, and other time-sensitive domains.

---

## 🔍 Model Overview

| Property | Description |
|:--|:--|
| **Architecture** | Transformer decoder |
| **Parameters** | ≈ 1.55 B |
| **Layers** | 52 |
| **Embedding dim** | 1,536 |
| **Context length** | 1,792 tokens |
| **Tokenizer** | `GPT2Tokenizer` (Hugging Face) |
| **Training stage** | Pretraining + Instruction Fine-tuning (SFT) |
| **License** | MIT |
| **Languages** | English |

---

## 🧠 Training & Data

### Chronological Consistency
Each model’s corpus satisfies chronological consistency in both the pretraining and instruction fine-tuning phases: texts dated after the model’s cutoff year are excluded, ensuring zero overlap with evaluation data. A GPT-4.1 classifier additionally screens every instruction-response pair (see the retention criteria below).

### Instruction Fine-tuning Corpus
| Stage | Source | # Examples | Avg Length |
|:--|:--|:--:|:--:|
| 1 | LLMs-from-Scratch | 1,097 | 102 |
| 2 | GPT-3 Self-Instruct | 67,136 | 183 |
| 3 | AllenAI Tulu-3 Mixture | 356,886 | 2,513 |

Only English, non-code entries with pre-2000 content (classifier label = 0 and confidence = 10) are retained.

We release the SFT dataset at https://huggingface.co/datasets/manelalab/ChronoInstruct-SFT.

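The snippet below is a minimal sketch for inspecting that corpus with the 🤗 `datasets` library; the split name and record fields are assumptions here, so check the dataset card for the actual schema.

```python
# Minimal sketch: load and inspect the released ChronoInstruct-SFT corpus.
# Assumption: a "train" split exists; field names follow the dataset card.
from datasets import load_dataset

sft = load_dataset("manelalab/ChronoInstruct-SFT", split="train")

print(sft)     # number of rows and column names
print(sft[0])  # one instruction-response example
```
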
---

## 🚀 Usage Examples

You can try ChronoGPT-Instruct directly in your browser via Google Colab:

<p align="left">
<a href="https://colab.research.google.com/github/LinyingLyu/ChronoGPT/blob/main/ChronoGPT_instruct_tutorial.ipynb" target="_blank">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/>
</a>
</p>

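If you prefer to run a checkpoint locally, here is a minimal PyTorch/Transformers sketch. The repository ID is a placeholder, the prompt format is illustrative, and loading via `AutoModelForCausalLM` with `trust_remote_code=True` is an assumption (the card sets `inference: false`, and the Colab tutorial above is the authoritative usage reference).

```python
# Minimal local-inference sketch. Assumptions: placeholder repo ID, standard
# AutoModel* loading with trust_remote_code, and an illustrative prompt format;
# see the Colab tutorial for the authors' intended usage.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "manelalab/ChronoGPT-Instruct-<tau>"  # placeholder: substitute a released checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # assumption: the modded-nanogpt-style architecture may ship custom modeling code
).eval()

prompt = "Instruction: Explain diversification in one sentence.\nResponse:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens (prompts must fit the 1,792-token context).
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
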
---

## 👩‍💻 Citation

```bibtex
@article{He_Lv_Manela_Wu_chronogpt_2025,
  title={Chronologically Consistent Generative AI},
  author={He, Songrun and Lv, Linying and Manela, Asaf and Wu, Jimmy},
  journal={Working Paper},
  year={2025}
}
```