Update README.md
README.md
CHANGED
@@ -15,7 +15,6 @@ This model was a joint collaboration of [Stanford CRFM](https://crfm.stanford.ed
 - [Model Details](#model-details)
 - [Model Description](#model-description)
 - [Uses](#uses)
-- [Direct Use](#direct-use)
 - [Downstream Use](#downstream-use)
 - [Out-of-Scope Use](#out-of-scope-use)
 - [Bias, Risks, and Limitations](#bias-risks-and-limitations)
@@ -24,23 +23,11 @@ This model was a joint collaboration of [Stanford CRFM](https://crfm.stanford.ed
 - [Training Data](#training-data)
 - [Training Procedure](#training-procedure)
 - [Preprocessing](#preprocessing)
-- [Speeds, Sizes, Times](#speeds-sizes-times)
-- [Evaluation](#evaluation)
-- [Testing Data, Factors & Metrics](#testing-data-factors--metrics)
-- [Testing Data](#testing-data)
-- [Factors](#factors)
-- [Metrics](#metrics)
-- [Results](#results)
-- [Model Examination](#model-examination)
 - [Environmental Impact](#environmental-impact)
 - [Technical Specifications](#technical-specifications)
 - [Model Architecture and Objective](#model-architecture-and-objective)
 - [Compute Infrastructure](#compute-infrastructure)
-
-- [Software](#software)
-- [Citation](#citation)
-- [Model Card Contact](#model-card-contact)
-- [How to Get Started with the Model](#how-to-get-started-with-the-model)
+
 
 
 # Model Details
@@ -61,6 +48,8 @@ This model was a joint collaboration of [Stanford CRFM](https://crfm.stanford.ed
 - **Language(s) (NLP):** en
 - **License:** openrail
 
+# Uses
+
 ## Direct Use
 
 <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
@@ -83,8 +72,6 @@ The main way we have used this model is finetuning for downstream question answe
 We do not recommend using this model for natural language generation in a production environment, finetuned or otherwise.
 
 
-
-
 # Bias, Risks, and Limitations
 
 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
@@ -155,19 +142,12 @@ This allows the model to encode information about these concepts in their indivi
 
 Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
 
-- **Hardware Type:** More information needed
-- **Hours used:** More information needed
-- **Cloud Provider:** More information needed
-- **Compute Region:** More information needed
-- **Carbon Emitted:** More information needed
-
 # Technical Specifications
 
 ## Model Architecture and Objective
 
 Pubmed GPT 2.7B is a standard GPT-2 implementation (trained with Flash Attention) with the following hyperparameters:
 
-
 | | |
 | ----------- | ----- |
 | hidden size | 2560 |
@@ -176,7 +156,6 @@ Pubmed GPT 2.7B is a standard GPT-2 implementation (trained with Flash Attention
 | vocab size | 28896 |
 | sequence length| 1024 |
 
-
 ## Compute Infrastructure
 
 The model was trained on [MosaicML Cloud](https://www.mosaicml.com/cloud), a platform designed for large workloads like LLMs. Using the [Composer](https://github.com/mosaicml/composer) training library and [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html), it was easy to enable multi-node training across 128 A100-40GB GPUs, and the total run was completed in ~6.25 days.
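Note on the removed "Hours used" placeholder: the Compute Infrastructure paragraph kept by this change already states the GPU count (128) and the approximate wall-clock time (~6.25 days), so a rough GPU-hours figure can be derived from it. The sketch below is only that arithmetic. It assumes all 128 GPUs were busy for the full run and treats 6.25 days as exact, neither of which the README confirms, and the function name `gpu_hours` is illustrative rather than part of any library.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """Total accelerator-hours for a run that keeps every GPU busy throughout."""
    return num_gpus * days * 24.0


# Figures quoted in the README's Compute Infrastructure paragraph.
NUM_GPUS = 128   # A100-40GB GPUs
RUN_DAYS = 6.25  # approximate ("~6.25 days" in the README)

total = gpu_hours(NUM_GPUS, RUN_DAYS)
print(f"{total:,.0f} GPU-hours")  # prints "19,200 GPU-hours"
```

A figure like this could serve as the hours input to an emissions estimate. The cloud provider and compute region would still be needed, and this change does not supply them.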