Update README.md
README.md
@@ -108,6 +108,26 @@ Note: the config has 300M in the model name but it is actually 500M due to the v
```
litgpt pretrain \
  --config microllama_v2.yaml \
  --resume <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO>
```

**IMPORTANT NOTE**

I have had various issues resuming training from checkpoints when moving from server to server, specifically when I switched from Lightning AI Studio to a private server. For example, Lightning AI Studio may look for your preprocessed data under `/root/.lightning/chunks/` if you store the preprocessed data on S3 and let Lightning AI Studio stream it while training. When I moved to a private server, litgpt instead looked for the same data under `/cache/chunks/`.

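If the preprocessed chunks do exist on the new machine, just under a different directory, a symlink may be enough to bridge the mismatch. A minimal sketch, assuming litgpt on the private server expects `/cache/chunks/` and the data was downloaded to a hypothetical `<LOCAL_PATH_TO_PREPROCESSED_DATA>` (a placeholder, not a path from this repo):

```
# litgpt on the private server looked for the streamed chunks under /cache/chunks/.
# Link the directory it expects to wherever the data actually lives.
mkdir -p /cache
ln -s <LOCAL_PATH_TO_PREPROCESSED_DATA> /cache/chunks
```
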
If you run into any issues with resuming training, just convert the checkpoint to an inference checkpoint; then you can load it again:

```
litgpt convert_pretrained_checkpoint <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> \
  --output_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>

litgpt pretrain \
  --config microllama_v2.yaml \
  --initial_checkpoint_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>
```

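Before kicking off the new run, it can be worth sanity-checking that the converted checkpoint loads at all. One quick check with a recent litgpt, assuming the tokenizer files are present in the output directory (copy them over from the original checkpoint if they are not):

```
litgpt generate <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> \
  --prompt "Hello, my name is" \
  --max_new_tokens 50
```
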
You will lose the index into the training dataset as well as other hyperparameters such as the learning rate, but this allows you to restart your pre-training quickly.

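Because those hyperparameters are not restored, you may want to set them explicitly when restarting. litgpt accepts dotted command-line overrides on top of a `--config`, so something like the following works (the `--train.max_tokens` value here is purely illustrative, not a recommendation from this repo):

```
litgpt pretrain \
  --config microllama_v2.yaml \
  --initial_checkpoint_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> \
  --train.max_tokens 10000000
```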