Upload folder using huggingface_hub
- .ipynb_checkpoints/README-checkpoint.md +79 -0
- README.md +16 -6
    	
        .ipynb_checkpoints/README-checkpoint.md
    ADDED
    
@@ -0,0 +1,79 @@
# Model Card for ChronoBERT

## Model Details

### Model Description

ChronoBERT is a series of **high-performance, chronologically consistent large language models (LLMs)** designed to eliminate lookahead bias and training leakage while maintaining strong language understanding in time-sensitive applications. The models are pretrained on **diverse, high-quality, open-source, and timestamped text** to maintain chronological consistency.

All models in the series achieve **GLUE benchmark scores that surpass standard BERT.** This approach preserves the integrity of historical analysis and enables more reliable economic and financial modeling.

- **Developed by:** Songrun He, Linying Lv, Asaf Manela, Jimmy Wu
- **Model type:** Transformer-based bidirectional encoder (ModernBERT architecture)
- **Language(s) (NLP):** English
- **License:** MIT License

### Model Sources

- **Paper:** "Chronologically Consistent Large Language Models" (He, Lv, Manela, Wu, 2025)

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModel

# Load the chronologically consistent checkpoint trained only on pre-2000 text.
tokenizer = AutoTokenizer.from_pretrained("manelalab/chronobert-v1-19991231")
model = AutoModel.from_pretrained("manelalab/chronobert-v1-19991231")

text = "You've gotta be very careful not to mess with the space-time continuum. -- Dr. Brown, Back to the Future"

# Tokenize and run a forward pass; outputs.last_hidden_state holds the
# contextual token embeddings.
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
```
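
If a single vector per text is needed (for example, for the kind of downstream forecasting use described later in this card), one common approach is to mean-pool the token embeddings. The helper below is a minimal sketch of that idea; the `embed` name and the mean-pooling choice are illustrative assumptions, not something prescribed by the model card.

```python
import torch

# Minimal sketch (illustrative, not from the paper): mean-pool token embeddings
# into one vector per text, ignoring padding positions.
def embed(texts, tokenizer, model):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state         # (batch, seq_len, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
    summed = (hidden * mask).sum(dim=1)                   # sum over real tokens only
    return summed / mask.sum(dim=1).clamp(min=1e-9)       # (batch, hidden)

# Reuses the tokenizer and model loaded above.
vectors = embed(["Fed raises rates.", "Tech stocks rally on earnings."], tokenizer, model)
print(vectors.shape)
```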

## Training Details

### Training Data

- **Pretraining corpus:** Our initial model, $\text{ChronoBERT}_{1999}$, is pretrained on 460 billion tokens of pre-2000, diverse, high-quality, and open-source text data to ensure no leakage of data from later periods.
- **Incremental updates:** Yearly updates from 2000 to 2024 with an additional 65 billion tokens of timestamped text (see the sketch after this list for selecting a vintage by date).
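
For chronologically consistent backtests, the natural rule is to use only a vintage whose training cutoff precedes the analysis date. The snippet below is a rough sketch of that rule; the `chronobert-v1-YYYYMMDD` naming for post-1999 vintages and the December 31 cutoffs are assumptions extrapolated from the `manelalab/chronobert-v1-19991231` checkpoint named above, so verify the actual repository names on the Hub before relying on them.

```python
from datetime import date

# Assumed cutoffs: one vintage per year-end from 1999 through 2024.
VINTAGE_CUTOFFS = [date(y, 12, 31) for y in range(1999, 2025)]

def vintage_for(as_of: date) -> str:
    """Return the latest (assumed) repo name whose cutoff precedes `as_of`."""
    eligible = [c for c in VINTAGE_CUTOFFS if c < as_of]
    if not eligible:
        raise ValueError("No vintage predates the requested date.")
    return f"manelalab/chronobert-v1-{max(eligible):%Y%m%d}"

print(vintage_for(date(2012, 6, 1)))  # -> manelalab/chronobert-v1-20111231
```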

### Training Procedure

- **Architecture:** ModernBERT-based model with rotary position embeddings and flash attention.
- **Objective:** Masked token prediction (see the fill-mask sketch below).
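
Because the pretraining objective is masked token prediction (and the repository is tagged `fill-mask`), the checkpoint can be exercised directly through the `fill-mask` pipeline. A minimal sketch, assuming the repository ships masked-LM head weights loadable by the pipeline; if only the encoder is published, use `AutoModel` as shown earlier:

```python
from transformers import pipeline

# Minimal sketch: query the masked-language-modeling head directly.
fill = pipeline("fill-mask", model="manelalab/chronobert-v1-19991231")

prompt = f"The Federal Reserve raised interest {fill.tokenizer.mask_token} today."
for candidate in fill(prompt, top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))
```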

## Evaluation

### Testing Data, Factors & Metrics

- **Language understanding:** Evaluated on **GLUE benchmark** tasks.
- **Financial forecasting:** Evaluated using a **return prediction task** based on Dow Jones Newswire data (an illustrative sketch of this kind of pipeline follows this list).
- **Comparison models:** ChronoBERT was benchmarked against **BERT, FinBERT, StoriesLM-v1-1963, and Llama 3.1**.
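
To make the forecasting setup concrete, here is a highly simplified, hypothetical sketch of an embedding-based return prediction pipeline: embed each news item, then fit a linear predictor of the next-day return on a chronological train/test split. This is illustrative only and is not the paper's exact procedure; the Dow Jones Newswire data is proprietary, so the arrays below are random stand-ins.

```python
import numpy as np

# Hypothetical stand-ins: X holds one embedding per news item (e.g. from the
# embed() sketch above), y holds the corresponding next-day stock returns.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))    # 1000 news items, 768-dim embeddings
y = rng.normal(scale=0.02, size=1000)

# Chronological split: train on the earlier part, test on the later part,
# mirroring the lookahead-free evaluation the model card emphasizes.
split = 800
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Ridge regression in closed form: beta = (X'X + lam*I)^-1 X'y.
lam = 10.0
beta = np.linalg.solve(X_train.T @ X_train + lam * np.eye(X.shape[1]),
                       X_train.T @ y_train)
pred = X_test @ beta

# Out-of-sample correlation between predicted and realized returns.
print(np.corrcoef(pred, y_test)[0, 1])
```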

### Results

- **GLUE Score:** $\text{ChronoBERT}_{1999}$ and $\text{ChronoBERT}_{2024}$ achieved GLUE scores of 84.71 and 85.54, respectively, outperforming BERT (84.52).
- **Stock return predictions:** Over the sample period from January 2008 to July 2023, $\text{ChronoBERT}_{\text{Realtime}}$ achieves a long-short portfolio **Sharpe ratio of 4.80**, outperforming BERT, FinBERT, and StoriesLM-v1-1963, and performing comparably to **Llama 3.1 8B (4.90)**; a sketch of the Sharpe ratio calculation follows this list.
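
For reference, an annualized Sharpe ratio of this kind is the mean portfolio return divided by its standard deviation, scaled by the square root of the number of periods per year. The sketch below assumes daily long-short excess returns; the paper's exact portfolio construction is not described in this card, and the data here are purely illustrative.

```python
import numpy as np

def annualized_sharpe(returns, periods_per_year=252):
    """Mean return over volatility, scaled to an annual horizon."""
    returns = np.asarray(returns, dtype=float)
    return returns.mean() / returns.std(ddof=1) * np.sqrt(periods_per_year)

# Hypothetical daily long-short portfolio returns, for illustration only.
rng = np.random.default_rng(42)
daily_returns = rng.normal(loc=0.003, scale=0.01, size=252 * 15)
print(round(annualized_sharpe(daily_returns), 2))
```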

## Citation

```
@article{He2025ChronoBERT,
  title={Chronologically Consistent Large Language Models},
  author={He, Songrun and Lv, Linying and Manela, Asaf and Wu, Jimmy},
  journal={Working Paper},
  year={2025}
}
```

## Model Card Authors

- Songrun He (Washington University in St. Louis, [email protected])
- Linying Lv (Washington University in St. Louis, [email protected])
- Asaf Manela (Washington University in St. Louis, [email protected])
- Jimmy Wu (Washington University in St. Louis, [email protected])
    	
        README.md
    CHANGED
    
@@ -1,8 +1,18 @@
-
-
-
-
-
+---
+library_name: transformers
+license: mit
+language:
+- en
+tags:
+- chronologically consistent
+- modernbert
+- glue
+pipeline_tag: fill-mask
+inference: false
+---
+# ChronoBERT
+
+## Model Description
 
 ChronoBERT is a series of **high-performance, chronologically consistent large language models (LLMs)** designed to eliminate lookahead bias and training leakage while maintaining strong language understanding in time-sensitive applications. The models are pretrained on **diverse, high-quality, open-source, and timestamped text** to maintain chronological consistency.
 
@@ -13,7 +23,7 @@ All models in the series achieve **GLUE benchmark scores that surpass standard B
 - **Language(s) (NLP):** English
 - **License:** MIT License
 
-
+## Model Sources
 
 - **Paper:** "Chronologically Consistent Large Language Models" (He, Lv, Manela, Wu, 2025)
 

