---
library_name: pytorch
license: mit
language:
- en
tags:
- chronologically consistent
- modded-nanogpt
- hellaswag
pipeline_tag: text-generation
inference: false
---
|
# ChronoGPT |
|
|
|
## ChronoGPT Highlights |
|
|
|
ChronoGPT is a series of **high-performance, chronologically consistent large language models (LLMs)** designed to eliminate lookahead bias and training leakage while maintaining strong language understanding in time-sensitive applications. The models are pretrained on **diverse, high-quality, open-source, and timestamped text** to maintain chronological consistency.
|
|
|
All models in the series achieve **HellaSwag benchmark scores that surpass those of the GPT-2 124M model.** Chronological consistency preserves the integrity of historical analysis and enables more reliable economic and financial modeling.
|
|
|
- **Developed by:** Songrun He, Linying Lv, Asaf Manela, Jimmy Wu |
|
- **Model type:** Transformer-based autoregressive decoder (Modified modded-NanoGPT architecture) |
|
- **Language(s) (NLP):** English |
|
- **License:** MIT License |
|
|
|
## Model Overview |
|
|
|
**ChronoGPT** has the following features: |
|
- Type: Causal Language Models |
|
- Training Stage: Pretraining |
|
- Number of Parameters: ~1,552 Million |
|
- Encoder & Decoder Partitioning: 26 encoder and 26 decoder layers |
|
- Tokenizer: GPT-2 BPE tokenizer (the examples below use the tiktoken `gpt2` encoding)

- Context Length: 1,792 tokens (see the tokenization sketch after this list)
|
- Embedding Dimension: 1,536 |
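
The tokenizer and context window determine how raw text must be prepared before it reaches the model. Below is a minimal sketch, assuming the tiktoken GPT-2 encoding used in the Quickstart examples (the sample text is illustrative):

```python
import tiktoken

# GPT-2 BPE tokenizer, as used in the Quickstart examples below
tokenizer = tiktoken.get_encoding("gpt2")

text = "Chronologically consistent language models avoid lookahead bias."
tokens = tokenizer.encode(text)

# Inputs longer than the 1,792-token context window should be truncated or chunked
context_length = 1792
tokens = tokens[:context_length]
print(f"{len(tokens)} tokens (limit {context_length})")
```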
|
|
|
## 🚀 Quickstart |
|
|
|
You can try ChronoGPT directly in your browser via Google Colab: |
|
|
|
<p align="left"> |
|
<a href="https://colab.research.google.com/github/LinyingLyu/ChronoGPT/blob/main/ChronoGPT_tutorial.ipynb" target="_blank"> |
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab"/> |
|
</a> |
|
</p> |
|
|
|
Or run it locally with: |
|
|
|
```bash |
|
pip install -r requirements.txt |
|
``` |
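
Both `requirements.txt` and the `ChronoGPT_inference` module imported in the snippets below ship with the code repository. A minimal setup sketch, assuming the GitHub repository linked in the Colab badge above:

```bash
# Repository inferred from the Colab badge above
git clone https://github.com/LinyingLyu/ChronoGPT.git
cd ChronoGPT
```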
|
|
|
### Text Generation |
|
|
|
The following code snippet illustrates how to use the model to generate text from a given prompt.
|
|
|
```python
import torch
import torch.nn.functional as F
import tiktoken
from ChronoGPT_inference import ChronoGPT

# ----------------------------- Setup -----------------------------
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cache_dir = 'cache'  # Update this path as needed

tokenizer = tiktoken.get_encoding("gpt2")
max_length = 50
num_return_sequences = 5
seed = 123

# -------------------------- Load Model --------------------------
model = ChronoGPT.from_pretrained(
    "manelalab/chrono-gpt-v1-20241231",
    trust_remote_code=True,
    cache_dir=cache_dir
).to(device)

# ------------------------ Prepare Input -------------------------
prompt = "Hello, I am a language model,"
tokens = tokenizer.encode(prompt)
tokens = torch.tensor(tokens, dtype=torch.long).unsqueeze(0)
tokens = tokens.repeat(num_return_sequences, 1).to(device)

# -------------------- Sampling Initialization -------------------
xgen = tokens.clone()
sample_rng = torch.Generator(device=device)
sample_rng.manual_seed(seed)

# ------------------------- Text Generation -----------------------
while xgen.size(1) < max_length:
    with torch.no_grad():
        with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
            logits, _ = model(xgen)

    logits = logits[:, -1, :]  # Last-token logits
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_indices = torch.topk(probs, 50, dim=-1)

    # Sample the next token from the top-k distribution
    sampled_idx = torch.multinomial(topk_probs, 1, generator=sample_rng)
    next_token = torch.gather(topk_indices, -1, sampled_idx)

    xgen = torch.cat([xgen, next_token], dim=1)

# ------------------------- Decode Output -------------------------
for i in range(num_return_sequences):
    decoded_tokens = xgen[i, :max_length].tolist()
    decoded_text = tokenizer.decode(decoded_tokens)
    print(f"Sample {i}:\n{decoded_text}\n")
```
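
The loop above implements top-k sampling (k = 50): at each step the 50 most likely next tokens are renormalized and one is drawn with `torch.multinomial`. The fixed `sample_rng` seed makes the generated continuations reproducible across runs.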
|
|
|
### Extract Embeddings |
|
|
|
The following code snippet illustrates how to use the model to extract embeddings from all layers for a given input.
|
|
|
```python
import torch
import tiktoken
from ChronoGPT_inference import ChronoGPT

# ----------------------------- Setup -----------------------------
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cache_dir = 'cache'   # Update this path as needed
max_length = 1792     # Model context length

tokenizer = tiktoken.get_encoding("gpt2")

# -------------------------- Load Model --------------------------
model = ChronoGPT.from_pretrained(
    "manelalab/chrono-gpt-v1-20241231",
    trust_remote_code=True,
    cache_dir=cache_dir
).to(device)

# ----------------------- Embedding Generation ---------------------
text = "Obviously, the time continuum has been disrupted, creating a new temporal event sequence resulting in this alternate reality."

# Tokenize, truncate to the context length, and add a batch dimension
inputs = torch.tensor(tokenizer.encode(text))[:max_length].reshape(1, -1).to(device)

with torch.no_grad():
    logits, emb = model(inputs)

print('Dimension of embeddings:', emb[0].shape)
```
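
`emb` holds the hidden states of each layer. If a single fixed-length vector per input is needed (e.g., as a text embedding for downstream economic or financial models), one option is to mean-pool a layer over the token dimension. A minimal sketch, assuming each entry of `emb` has a `(batch, seq_len, hidden)` layout (check the printed shape above and adjust if it differs):

```python
# Mean-pool the last layer's hidden states over the token dimension
# to obtain one 1,536-dimensional vector per input sequence.
# Assumption: emb[-1] has shape (batch, seq_len, hidden).
last_layer = emb[-1]
text_embedding = last_layer.mean(dim=1)
print('Pooled embedding shape:', text_embedding.shape)
```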
|
|
|
## Citation |
|
|
|
```
@article{He2025ChronoBERT,
  title={Chronologically Consistent Large Language Models},
  author={He, Songrun and Lv, Linying and Manela, Asaf and Wu, Jimmy},
  journal={Working Paper},
  year={2025}
}
```
|
|
|
### Model Card Authors |
|
|
|
- Songrun He (Washington University in St. Louis, [email protected]) |
|
- Linying Lv (Washington University in St. Louis, [email protected]) |
|
- Asaf Manela (Washington University in St. Louis, [email protected]) |
|
- Jimmy Wu (Washington University in St. Louis, [email protected]) |