# 🚀 Continuous Pretraining of Llama 3.2 1B on SageMaker Hyperpod with Pre-built Containers

This tutorial demonstrates how to continuously pre-train the [Llama 3.2 1B](https://huggingface.co/meta-llama/Llama-3.2-1B) model using the Hugging Face [Optimum Neuron](https://huggingface.co/docs/optimum-neuron/index) library on [Amazon SageMaker Hyperpod](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-hyperpod.html). We leverage several performance optimizations such as tensor parallelism, sequence parallelism, and ZeRO-1 to efficiently train large language models on Trainium-powered instances.

One of the key benefits of using SageMaker Hyperpod is the ability to leverage the pre-built Optimum Neuron containers provided by Hugging Face. These containers come with all the necessary libraries and dependencies pre-installed, making it easy to get started with training on AWS Trainium instances.

By using the SageMaker pre-built containers, you can avoid the hassle of manually setting up the environment and focus on the core training and fine-tuning tasks. The containers are optimized for performance and include various optimization techniques, such as tensor parallelism and selective checkpointing, to efficiently train large language models like Llama 3.2 1B.

You will learn how to:

- [Continuous Pretraining of Llama 3.2 1B on SageMaker Hyperpod with Pre-built Containers](#continuous-pretraining-of-llama-32-1b-on-sagemaker-hyperpod-with-pre-built-containers)
  - [1. Setup AWS Environment](#1-setup-aws-environment)
  - [2. Prepare the Training Environment](#2-prepare-the-training-environment)
  - [3. Configure the Training Job](#3-configure-the-training-job)
  - [4. Launch Training on SageMaker Hyperpod](#4-launch-training-on-sagemaker-hyperpod)
  - [5. Monitor and Validate Training](#5-monitor-and-validate-training)

## 1. Setup AWS Environment

Before starting this tutorial, you need to set up your AWS environment:

1. Create an AWS SageMaker Hyperpod cluster with at least one `trn1.32xlarge` instance. You can follow the [Hyperpod EKS workshop](https://catalog.workshops.aws/sagemaker-hyperpod-eks/en-US/00-setup/own-account) to set up the cluster.
2. Since Llama 3.2 is a gated model users have to register in Hugging Face and obtain an [access token](https://huggingface.co/docs/hub/en/security-tokens) before running this example. You will also need to review and accept the license agreement on the [meta-llama/Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) model page.
3. Configure your AWS credentials. If you haven't already set up your AWS credentials, you can do this by installing the AWS CLI and running `aws configure`. You'll need to enter your AWS Access Key ID, Secret Access Key, default region, and output format.
   ```bash
   aws configure
   AWS Access Key ID [None]: YOUR_ACCESS_KEY
   AWS Secret Access Key [None]: YOUR_SECRET_KEY
   Default region name [None]: YOUR_REGION
   Default output format [None]: json
   ```

## 2. Prepare the Training Environment

Set up your training environment with the necessary dependencies:

```bash
git clone https://github.com/huggingface/optimum-neuron.git
mkdir ~/pre-training
cd pre-training

cp -r ../optimum-neuron/docs/source/training_tutorials/amazon_eks .
cd amazon_eks
```

Login to ECR and pull the `huggingface-pytorch-training-neuronx` image:

```bash
region=us-east-1
dlc_account_id=************
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com

docker pull ${dlc_account_id}.dkr.ecr.${region}.amazonaws.com/huggingface-pytorch-training-neuronx:2.1.2-transformers4.43.2-neuronx-py310-sdk2.20.0-ubuntu20.04-v1.0
```

Build and push the Docker image to your ECR registry:

```bash
export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=optimum-neuron-llama-pretraining
export TAG=:latest

docker build -t ${REGISTRY}${IMAGE}${TAG} .
```

Push the image to your private registry:

```bash
# Create registry if needed
export REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"${IMAGE}\" | wc -l)
if [ "${REGISTRY_COUNT//[!0-9]/}" == "0" ]; then
   echo "Creating repository ${REGISTRY}${IMAGE} ..."
   aws ecr create-repository --repository-name ${IMAGE}
else
   echo "Repository ${REGISTRY}${IMAGE} already exists"
fi

# Login to registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

# Push image to registry
docker image push ${REGISTRY}${IMAGE}${TAG}
```

## 3. Configure the Training Job

Next, you will generate the script to be used by the pre-training job. Begin by logging into Hugging Face using your access token mentioned in the prerequisite steps.
Modify the `generate-jobspec.sh` script to include the Hugging Face access token before running it:

```bash
export HF_ACCESS_TOKEN=""
```

Generate the Kubernetes job specification by executing `generate-jobspec.sh`. This will create a deployment manifest called `llama_train.yaml` for the Amazon SageMaker Hyperpod EKS cluster.

```bash
./generate-jobspec.sh
```

## 4. Launch Training on SageMaker Hyperpod

Deploy the training job to your Kubernetes cluster:

```bash
kubectl apply -f llama_train.yaml
```

The manifest runs the training script on the cluster using torchrun for distributed training. You can explore the complete training script at [run_clm.py](https://github.com/huggingface/optimum-neuron/blob/main/examples/language-modeling/run_clm.py).

You will use the following distributed training techniques in this script:
- Distributed Training: Uses torchrun with 8 processes per node for efficient multi-device training
- Model Parallelism: Implements both tensor parallelism (TP=8) and pipeline parallelism (PP=1)
- Mixed Precision: Utilizes BFloat16 for improved training efficiency
- Gradient Checkpointing: Enables memory-efficient training

The manifest runs the following command on the cluster. The environment variables are set when creating the manifest in `generate-jobspec.sh`.

```bash
torchrun --nproc_per_node=8 --nnodes=${NUM_NODES} run_clm.py \
    --model_name_or_path=${HF_MODEL_NAME}
    --token=${HF_ACCESS_TOKEN}
    --dataset_name=${DATASET_NAME}
    --dataset_config_name=${DATASET_CONFIG_NAME}
    --streaming=True
    --cache_dir=${TOKENIZED_DATA_PATH}
    --num_train_epochs=1
    --do_train
    --learning_rate=1e-4
    --max_steps=${MAX_STEPS}
    --per_device_train_batch_size=${BATCH_SIZE}
    --per_device_eval_batch_size=4
    --gradient_accumulation_steps=1
    --gradient_checkpointing
    --block_size=4096
    --bf16
    --max_grad_norm=1.0
    --lr_scheduler_type=linear
    --tensor_parallel_size=8
    --pipeline_parallel_size=1
    --logging_steps=1
    --save_total_limit=1
    --output_dir=${CHECKPOINT_DIR}
    --overwrite_output_dir
```

The training job will now start running on the SageMaker Hyperpod cluster.

This uses a pre-built script from Optimum-neuron. The script uses the Trainer class from the Optimum Neuron library, which is a specialized version of the Hugging Face Trainer optimized for training on AWS Trainium instances.

Here's an overview of the main components in the script:

   - Model Loading: The model is loaded using `AutoModelForCausalLM.from_pretrained()` with lazy loading for parallelism.

   - Data Processing: The dataset is tokenized and processed into chunks suitable for language modeling.

   - Training Arguments: The script uses `NeuronTrainingArguments` to configure training hyperparameters, including options for tensor parallelism and pipeline parallelism.

   - Trainer Setup: A Trainer instance `[optimum.neuron.NeuronTrainer]` is created with the model, training arguments, datasets, and other necessary components.

   - Training Loop: The `trainer.train()` method is called to start the continuous pretraining process.

## 5. Monitor and Validate Training

You can monitor the progress through Kubernetes logs:

```bash
# Monitor training logs
kubectl logs -f -n kubeflow llama-training-eks-worker-0

# Validate saved checkpoints
kubectl exec -it llama-training-eks-worker-0 -- ls -l /fsx/output
```

Once the pretraining is complete, you can fine-tune the model for specific tasks using the techniques covered in the previous tutorials. Congrats on pre-training Llama on AWS Trainium!