Continuous Pretraining of Llama 3.2 1B on SageMaker Hyperpod with Pre-built Containers
This tutorial demonstrates how to continuously pre-train the Llama 3.2 1B model using the Hugging Face Optimum Neuron library on Amazon SageMaker Hyperpod. We leverage several performance optimizations such as tensor parallelism, sequence parallelism, and ZeRO-1 to efficiently train large language models on Trainium-powered instances.
One of the key benefits of using SageMaker Hyperpod is the ability to leverage the pre-built Optimum Neuron containers provided by Hugging Face. These containers come with all the necessary libraries and dependencies pre-installed, making it easy to get started with training on AWS Trainium instances.
By using the SageMaker pre-built containers, you can avoid the hassle of manually setting up the environment and focus on the core training and fine-tuning tasks. The containers are optimized for performance and include various optimization techniques, such as tensor parallelism and selective checkpointing, to efficiently train large language models like Llama 3.2 1B.
You will learn how to:
- Set up your AWS environment
- Prepare the training environment using the pre-built Optimum Neuron container
- Configure the training job
- Launch training on SageMaker Hyperpod
- Monitor and validate training
1. Set Up the AWS Environment
Before starting this tutorial, you need to set up your AWS environment:
- Create an AWS SageMaker Hyperpod cluster with at least one trn1.32xlarge instance. You can follow the Hyperpod EKS workshop to set up the cluster.
- Since Llama 3.2 is a gated model, you must register with Hugging Face and obtain an access token before running this example. You will also need to review and accept the license agreement on the meta-llama/Llama-3.2-1B model page.
- Configure your AWS credentials. If you haven’t already set up your AWS credentials, you can do this by installing the AWS CLI and running aws configure. You’ll need to enter your AWS Access Key ID, Secret Access Key, default region, and output format.

aws configure
AWS Access Key ID [None]: YOUR_ACCESS_KEY
AWS Secret Access Key [None]: YOUR_SECRET_KEY
Default region name [None]: YOUR_REGION
Default output format [None]: json
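Before moving on, it helps to confirm that your credentials and cluster access work. The following is a quick sanity check; it assumes kubectl is already pointed at your Hyperpod EKS cluster (for example via aws eks update-kubeconfig):

# Confirm which AWS identity the CLI is using
aws sts get-caller-identity
# Confirm kubectl can reach the cluster and show each node's instance type
kubectl get nodes -L node.kubernetes.io/instance-type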
2. Prepare the Training Environment
Set up your training environment with the necessary dependencies:
git clone https://github.com/huggingface/optimum-neuron.git
mkdir ~/pre-training
cd pre-training
cp -r ../optimum-neuron/docs/source/training_tutorials/amazon_eks .
cd amazon_eks
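The amazon_eks directory contains the files used in the rest of this tutorial, including the Dockerfile, generate-jobspec.sh, and run_clm.py. You can list them to confirm everything was copied:

ls -l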
Log in to ECR and pull the huggingface-pytorch-training-neuronx image:
region=us-east-1
dlc_account_id=************
aws ecr get-login-password --region $region | docker login --username AWS --password-stdin $dlc_account_id.dkr.ecr.$region.amazonaws.com
docker pull ${dlc_account_id}.dkr.ecr.${region}.amazonaws.com/huggingface-pytorch-training-neuronx:2.1.2-transformers4.43.2-neuronx-py310-sdk2.20.0-ubuntu20.04-v1.0
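Optionally, confirm that the base image is now available locally:

docker images ${dlc_account_id}.dkr.ecr.${region}.amazonaws.com/huggingface-pytorch-training-neuronx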
Build and push the Docker image to your ECR registry:
export AWS_REGION=$(aws ec2 describe-availability-zones --output text --query 'AvailabilityZones[0].[RegionName]')
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=optimum-neuron-llama-pretraining
export TAG=:latest
docker build -t ${REGISTRY}${IMAGE}${TAG} .
Push the image to your private registry:
# Create registry if needed
export REGISTRY_COUNT=$(aws ecr describe-repositories | grep \"${IMAGE}\" | wc -l)
if [ "${REGISTRY_COUNT//[!0-9]/}" == "0" ]; then
echo "Creating repository ${REGISTRY}${IMAGE} ..."
aws ecr create-repository --repository-name ${IMAGE}
else
echo "Repository ${REGISTRY}${IMAGE} already exists"
fi
# Login to registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY
# Push image to registry
docker image push ${REGISTRY}${IMAGE}${TAG}
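To verify the push, you can list the image tags stored in the repository:

aws ecr describe-images --repository-name ${IMAGE} --query 'imageDetails[].imageTags'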
3. Configure the Training Job
Next, you will generate the job specification used by the pre-training job. It needs the Hugging Face access token obtained in the prerequisite steps. Modify the generate-jobspec.sh script to include your access token before running it:
export HF_ACCESS_TOKEN="<your_HF_token_here>"
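Optionally, you can check that the token has access to the gated model before launching the job. One quick way, assuming the huggingface_hub CLI is installed locally, is to download a single file from the model repository:

# This only succeeds if the license was accepted and the token is valid
huggingface-cli download meta-llama/Llama-3.2-1B config.json --token "<your_HF_token_here>"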
Generate the Kubernetes job specification by executing generate-jobspec.sh. This will create a deployment manifest called llama_train.yaml for the Amazon SageMaker Hyperpod EKS cluster.
./generate-jobspec.sh
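You can inspect the generated manifest and, optionally, validate it client-side before submitting it:

# Review the generated job specification
cat llama_train.yaml
# Validate the manifest without creating any resources
kubectl apply --dry-run=client -f llama_train.yaml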
4. Launch Training on SageMaker Hyperpod
Deploy the training job to your Kubernetes cluster:
kubectl apply -f llama_train.yaml
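The pod names depend on the generated manifest; using the worker name referenced later in this tutorial, you can watch the pods come up and troubleshoot scheduling issues with:

# Watch the training pods until they reach the Running state
kubectl get pods -n kubeflow -w
# If a pod stays in Pending, describe it to inspect scheduling events
kubectl describe pod llama-training-eks-worker-0 -n kubeflow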
The manifest runs the training script on the cluster using torchrun for distributed training. You can explore the complete training script at run_clm.py.
You will use the following distributed training techniques in this script:
- Distributed Training: Uses torchrun with 8 processes per node for efficient multi-device training
- Model Parallelism: Implements both tensor parallelism (TP=8) and pipeline parallelism (PP=1)
- Mixed Precision: Utilizes BFloat16 for improved training efficiency
- Gradient Checkpointing: Enables memory-efficient training
The manifest runs the following command on the cluster. The environment variables are set when creating the manifest in generate-jobspec.sh.
torchrun --nproc_per_node=8 --nnodes=${NUM_NODES} run_clm.py \
  --model_name_or_path=${HF_MODEL_NAME} \
  --token=${HF_ACCESS_TOKEN} \
  --dataset_name=${DATASET_NAME} \
  --dataset_config_name=${DATASET_CONFIG_NAME} \
  --streaming=True \
  --cache_dir=${TOKENIZED_DATA_PATH} \
  --num_train_epochs=1 \
  --do_train \
  --learning_rate=1e-4 \
  --max_steps=${MAX_STEPS} \
  --per_device_train_batch_size=${BATCH_SIZE} \
  --per_device_eval_batch_size=4 \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --block_size=4096 \
  --bf16 \
  --max_grad_norm=1.0 \
  --lr_scheduler_type=linear \
  --tensor_parallel_size=8 \
  --pipeline_parallel_size=1 \
  --logging_steps=1 \
  --save_total_limit=1 \
  --output_dir=${CHECKPOINT_DIR} \
  --overwrite_output_dir
The training job will now start running on the SageMaker Hyperpod cluster.
This uses a pre-built script from Optimum Neuron. The script uses the NeuronTrainer class from the Optimum Neuron library, a specialized version of the Hugging Face Trainer optimized for training on AWS Trainium instances.
Here’s an overview of the main components in the script:
- Model Loading: The model is loaded using AutoModelForCausalLM.from_pretrained() with lazy loading for parallelism.
- Data Processing: The dataset is tokenized and processed into chunks suitable for language modeling.
- Training Arguments: The script uses NeuronTrainingArguments to configure training hyperparameters, including options for tensor parallelism and pipeline parallelism.
- Trainer Setup: A NeuronTrainer instance (optimum.neuron.NeuronTrainer) is created with the model, training arguments, datasets, and other necessary components.
- Training Loop: The trainer.train() method is called to start the continuous pretraining process.
5. Monitor and Validate Training
You can monitor the progress through Kubernetes logs:
# Monitor training logs
kubectl logs -f -n kubeflow llama-training-eks-worker-0
# Validate saved checkpoints
kubectl exec -it llama-training-eks-worker-0 -n kubeflow -- ls -l /fsx/output
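If you want to confirm that the Neuron devices are visible from inside a worker, you can run the Neuron tooling that ships with the container (this assumes the neuron-ls tool from the Neuron SDK is available on the container's PATH):

# List the Neuron devices visible to the worker pod
kubectl exec -it llama-training-eks-worker-0 -n kubeflow -- neuron-ls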
Once the pretraining is complete, you can fine-tune the model for specific tasks using the techniques covered in the previous tutorials. Congrats on pre-training Llama on AWS Trainium!