Continuous Pretraining of Llama 3.2 1B on SageMaker Hyperpod with Pre-built Containers
This tutorial demonstrates how to continuously pre-train the Llama 3.2 1B model using the Hugging Face Optimum Neuron library on Amazon SageMaker Hyperpod. We leverage several performance optimizations such as tensor parallelism, sequence parallelism, and ZeRO-1 to efficiently train large language models on Trainium-powered instances.
One of the key benefits of using SageMaker Hyperpod is the ability to leverage the pre-built Optimum Neuron containers provided by Hugging Face. These containers come with all the necessary libraries and dependencies pre-installed, making it easy to get started with training on AWS Trainium instances.
By using the SageMaker pre-built containers, you can avoid the hassle of manually setting up the environment and focus on the core training and fine-tuning tasks. The containers are optimized for performance and include various optimization techniques, such as tensor parallelism and selective checkpointing, to efficiently train large language models like Llama 3.2 1B.
You will learn how to:
- Set up your AWS environment
- Prepare the training environment using the pre-built Optimum Neuron container
- Configure the training job
- Launch training on SageMaker Hyperpod
- Monitor and validate training
1. Set Up the AWS Environment
Before starting this tutorial, you need to set up your AWS environment:
- Create an AWS SageMaker Hyperpod cluster with at least one trn1.32xlarge instance. You can follow the Hyperpod EKS workshop to set up the cluster.
- Since Llama 3.2 is a gated model, you must register with Hugging Face and obtain an access token before running this example. You will also need to review and accept the license agreement on the meta-llama/Llama-3.2-1B model page.
- Configure your AWS credentials. If you haven’t already set up your AWS credentials, you can do this by installing the AWS CLI and running aws configure. You’ll need to enter your AWS Access Key ID, Secret Access Key, default region, and output format.

aws configure
AWS Access Key ID [None]: YOUR_ACCESS_KEY
AWS Secret Access Key [None]: YOUR_SECRET_KEY
Default region name [None]: YOUR_REGION
Default output format [None]: json
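Before moving on, it helps to confirm that your credentials and cluster access work. The following is a quick sanity check; it assumes kubectl is already pointed at your Hyperpod EKS cluster (for example via aws eks update-kubeconfig):

# Confirm which AWS identity the CLI is using
aws sts get-caller-identity
# Confirm kubectl can reach the cluster and show each node's instance type
kubectl get nodes -L node.kubernetes.io/instance-type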
2. Prepare the Training Environment
Set up your training environment with the necessary dependencies:
git clone https://github.com/huggingface/optimum-neuron.git
mkdir ~/pre-training
cd pre-training
cp -r ../optimum-neuron/docs/source/training_tutorials/amazon_eks .
cd amazon_eks
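The amazon_eks directory contains the files used in the rest of this tutorial, including the Dockerfile, generate-jobspec.sh, and run_clm.py. You can list them to confirm everything was copied:

ls -l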
Log in to ECR and pull the huggingface-pytorch-training-neuronx image:
region=us-east-1
dlc_account_id=************
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com
docker pull ${dlc_account_id}.dkr.ecr.${region}.amazonaws.com/huggingface-pytorch-training-neuronx:2.1.2-transformers4.43.2-neuronx-py310-sdk2.20.0-ubuntu20.04-v1.0
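Optionally, confirm that the base image is now available locally:

docker images ${dlc_account_id}.dkr.ecr.${region}.amazonaws.com/huggingface-pytorch-training-neuronx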
Build and push the Docker image to your ECR registry:
export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=optimum-neuron-llama-pretraining
export TAG=:latest
docker build -t ${REGISTRY}${IMAGE}${TAG} .
Push the image to your private registry:
# Create registry if needed
export REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"${IMAGE}\" | wc -l)
if [ "${REGISTRY_COUNT//[!0-9]/}" == "0" ]; then
echo "Creating repository ${REGISTRY}${IMAGE} ..."
aws ecr create-repository --repository-name ${IMAGE}
else
echo "Repository ${REGISTRY}${IMAGE} already exists"
fi
# Login to registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY
# Push image to registry
docker image push ${REGISTRY}${IMAGE}${TAG}
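To verify the push, you can list the image tags stored in the repository:

aws ecr describe-images --repository-name ${IMAGE} --query 'imageDetails[].imageTags'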
3. Configure the Training Job
Next, you will generate the job specification used by the pre-training job. It needs the Hugging Face access token obtained in the prerequisite steps. Modify the generate-jobspec.sh script to include your access token before running it:
export HF_ACCESS_TOKEN="<your_HF_token_here>"
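Optionally, you can check that the token has access to the gated model before launching the job. One quick way, assuming the huggingface_hub CLI is installed locally, is to download a single file from the model repository:

# This only succeeds if the license was accepted and the token is valid
huggingface-cli download meta-llama/Llama-3.2-1B config.json --token "<your_HF_token_here>"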
Generate the Kubernetes job specification by executing generate-jobspec.sh. This will create a deployment manifest called llama_train.yaml for the Amazon SageMaker Hyperpod EKS cluster.
./generate-jobspec.sh
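You can inspect the generated manifest and, optionally, validate it client-side before submitting it:

# Review the generated job specification
cat llama_train.yaml
# Validate the manifest without creating any resources
kubectl apply --dry-run=client -f llama_train.yaml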
4. Launch Training on SageMaker Hyperpod
Deploy the training job to your Kubernetes cluster:
kubectl apply -f llama_train.yaml
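The pod names depend on the generated manifest; using the worker name referenced later in this tutorial, you can watch the pods come up and troubleshoot scheduling issues with:

# Watch the training pods until they reach the Running state
kubectl get pods -n kubeflow -w
# If a pod stays in Pending, describe it to inspect scheduling events
kubectl describe pod llama-training-eks-worker-0 -n kubeflow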
The manifest runs the training script on the cluster using torchrun for distributed training. You can explore the complete training script at run_clm.py.
You will use the following distributed training techniques in this script:
- Distributed Training: Uses torchrun with 8 processes per node for efficient multi-device training
- Model Parallelism: Implements both tensor parallelism (TP=8) and pipeline parallelism (PP=1)
- Mixed Precision: Utilizes BFloat16 for improved training efficiency
- Gradient Checkpointing: Enables memory-efficient training
The manifest runs the following command on the cluster. The environment variables are set when creating the manifest in generate-jobspec.sh.
torchrun --nproc_per_node=8 --nnodes=${NUM_NODES} run_clm.py \
  --model_name_or_path=${HF_MODEL_NAME} \
  --token=${HF_ACCESS_TOKEN} \
  --dataset_name=${DATASET_NAME} \
  --dataset_config_name=${DATASET_CONFIG_NAME} \
  --streaming=True \
  --cache_dir=${TOKENIZED_DATA_PATH} \
  --num_train_epochs=1 \
  --do_train \
  --learning_rate=1e-4 \
  --max_steps=${MAX_STEPS} \
  --per_device_train_batch_size=${BATCH_SIZE} \
  --per_device_eval_batch_size=4 \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --block_size=4096 \
  --bf16 \
  --max_grad_norm=1.0 \
  --lr_scheduler_type=linear \
  --tensor_parallel_size=8 \
  --pipeline_parallel_size=1 \
  --logging_steps=1 \
  --save_total_limit=1 \
  --output_dir=${CHECKPOINT_DIR} \
  --overwrite_output_dir
The training job will now start running on the SageMaker Hyperpod cluster.
This uses a pre-built script from Optimum Neuron. The script uses the NeuronTrainer class from the Optimum Neuron library, a specialized version of the Hugging Face Trainer optimized for training on AWS Trainium instances.
Here’s an overview of the main components in the script:
- Model Loading: The model is loaded using AutoModelForCausalLM.from_pretrained() with lazy loading for parallelism.
- Data Processing: The dataset is tokenized and processed into chunks suitable for language modeling.
- Training Arguments: The script uses NeuronTrainingArguments to configure training hyperparameters, including options for tensor parallelism and pipeline parallelism.
- Trainer Setup: A NeuronTrainer instance (optimum.neuron.NeuronTrainer) is created with the model, training arguments, datasets, and other necessary components.
- Training Loop: The trainer.train() method is called to start the continuous pretraining process.
5. Monitor and Validate Training
You can monitor the progress through Kubernetes logs:
# Monitor training logs
kubectl logs -f -n kubeflow llama-training-eks-worker-0
# Validate saved checkpoints
kubectl exec -it llama-training-eks-worker-0 -n kubeflow -- ls -l /fsx/output
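If you want to confirm that the Neuron devices are visible from inside a worker, you can run the Neuron tooling that ships with the container (this assumes the neuron-ls tool from the Neuron SDK is available on the container's PATH):

# List the Neuron devices visible to the worker pod
kubectl exec -it llama-training-eks-worker-0 -n kubeflow -- neuron-ls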
Once the pretraining is complete, you can fine-tune the model for specific tasks using the techniques covered in the previous tutorials. Congrats on pre-training Llama on AWS Trainium!