Deploying in SageMaker
Hi there,
I'm experimenting with Dolly and I'm trying to deploy it in SageMaker. It all works fine but I'm struggling to run inference—there's something going on with the data format I'm passing, but cannot figure out what!
import json
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
# %% Deploy new model
role = sagemaker.get_execution_role()
hub = {"HF_MODEL_ID": "databricks/dolly-v2-12b", "HF_TASK": "text-generation"}
# Create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version="4.17.0",
    pytorch_version="1.10.2",
    py_version="py38",
    env=hub,
    role=role,
)
# Deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,  # number of instances
    instance_type="ml.m5.xlarge",  # ec2 instance type
)
predictor.predict({"inputs": "Once upon a time there "})
results in:
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
"code": 400,
"type": "InternalServerException",
"message": "\u0027gpt_neox\u0027"
}
I've tried using json strings but no luck either.
Any help appreciated!
Cheers.
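For reference, the request shape used in the post above ({"inputs": ...}) is the one the Hugging Face inference toolkit expects, with an optional "parameters" object for generation settings. The sketch below builds such a payload and shows how it could be sent through the low-level boto3 runtime client instead of predictor.predict. The endpoint name is a placeholder, max_new_tokens is only an example setting, and the invoke helper is a hypothetical name introduced here.

```python
import json


def build_payload(prompt, **generation_kwargs):
    """Build a request body with "inputs" and optional "parameters" keys."""
    payload = {"inputs": prompt}
    if generation_kwargs:
        payload["parameters"] = generation_kwargs
    return payload


def invoke(endpoint_name, payload):
    """Send the payload as JSON through the SageMaker runtime client."""
    import boto3  # imported lazily so build_payload works without AWS set up

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())


body = build_payload("Once upon a time there ", max_new_tokens=50)
print(json.dumps(body))
# invoke("my-dolly-endpoint", body)  # placeholder endpoint name
```

If the payload already has this shape, a 400 response is more likely to come from model loading inside the container than from the data format.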
That's really a question for HF / SageMaker; it doesn't look related to this model per se.
Hi there, you've got a few items to unpack here. First, you want to point to a more recent version of the transformers SDK, ideally one that has support for all of the model objects needed for dolly.
Second, this is a 12B parameter model. That means you are likely going to need more than one accelerator to host it. I'm testing this out on my end now, and will report back soon what seems to be the smallest number of accelerators. If you're compiling it, you need fewer.
Third, I would point to a hosting instance that uses accelerators, either inferentia (inf1) or NVIDIA (g's or p's).
I'll respond in a bit with more concrete guidance. In the meantime, Phillip has some great examples of doing this end-to-end here!
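As a rough illustration of the sizing point above, the memory needed just to hold the weights can be estimated as parameter count times bytes per parameter. This is a back-of-the-envelope sketch: it takes the nominal 12 billion parameters at face value and ignores activations, the generation cache and framework overhead, so real requirements are higher.

```python
# Bytes used per parameter for common storage dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gib(n_params, dtype):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in ("float32", "bfloat16", "int8"):
    size = weight_memory_gib(12e9, dtype)
    print(f"12B parameters in {dtype}: ~{size:.1f} GiB")
```

By this estimate the bfloat16 weights alone come to roughly 22 GiB and the int8 weights to roughly 11 GiB, which is why a CPU-only ml.m5.xlarge with 16 GiB of RAM is not a realistic host for the 12b model.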
Error [ModelError]: Received client error (400) from primary with message "{ "code": 400, "type": "InternalServerException", "message": "\u0027gpt_neox\u0027" }
I got the same error when trying to run inference after deploying it as a SageMaker endpoint. I was trying to find out what went wrong and stumbled across this. I'm using dolly-v2-3b rather than 12b.
From a quick search, it looks like this may be the fix (you need to tell it to use a newer transformers version or something): https://towardsdatascience.com/unlock-the-latest-transformer-models-with-amazon-sagemaker-7fe65130d993
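The message field in that 400 response is easier to read once the JSON escapes are decoded. The sketch below parses the error body quoted in this thread and compares the message with the way Python renders a KeyError. That it really is a KeyError raised while looking up the model type is an inference from the text, not something the log confirms, but it fits the advice to use a newer transformers version.

```python
import json

# Error body as quoted in the thread; the raw string keeps the \u0027 escapes intact.
raw = r'{"code": 400, "type": "InternalServerException", "message": "\u0027gpt_neox\u0027"}'

error = json.loads(raw)
print(error["message"])

# str() of a KeyError is the repr of the missing key, quotes included.
looks_like_key_error = error["message"] == str(KeyError("gpt_neox"))
print(looks_like_key_error)
```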
If I update my deployment configuration to newer transformers, PyTorch and Python versions:
huggingface_model = HuggingFaceModel(
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    env=hub,
    role=role,
)
I get a new error: Load model failed: databricks__dolly-v2-3b, error: Worker died.
As above. Likely you are provisioning something too small for the model.
Thanks for the responses.
I've been playing with EC2 directly (no SageMaker), and dolly-v2-12b runs fine on a p3.2xlarge instance (quick enough for my experiments, anyway!) running the following script:
import torch
from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline
from transformers import pipeline
print("Loading Dolly...")
generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    return_full_text=True,
)
print("Prompting Dolly...")
# template for an instruction with input
prompt_with_context = PromptTemplate(
    input_variables=["instruction", "context"],
    template="{instruction}\n\nInput:\n{context}",
)
hf_pipeline = HuggingFacePipeline(pipeline=generate_text)
llm_context_chain = LLMChain(llm=hf_pipeline, prompt=prompt_with_context)
context = """George Washington (February 22, 1732 - December 14, 1799) was an American military officer, statesman,
and Founding Father who served as the first president of the United States from 1789 to 1797."""
print(
    llm_context_chain.predict(
        instruction="When was George Washington president?", context=context
    ).lstrip()
)
Now, back to SageMaker: I've then updated dependency versions as per the comments above, and I'm now getting a new error regarding running out of disk space. I'm using the following code:
import json
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
role = sagemaker.get_execution_role()
hub = {
    'HF_MODEL_ID': 'databricks/dolly-v2-12b',
    'HF_TASK': 'text-generation',
}
# Create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    env=hub,
    role=role,
)
# Deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p3.2xlarge",
    volume_size=512,
)
This instance should have 512GB of storage, more than enough for dolly-v2-12b, so I'm not sure what's going on.
Cheers!
I am trying to deploy dolly-v2-12b to SageMaker. When I try to run inference, I run into the errors below.
from sagemaker.huggingface import HuggingFaceModel
import sagemaker
role = sagemaker.get_execution_role()
hub = {
    'HF_MODEL_ID': 'databricks/dolly-v2-12b',
    'HF_TASK': 'text-generation',
}
huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    env=hub,
    role=role,
)
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',
)
sample_input = {
    'inputs': 'Can you please let us know more details about your'
}
output = predictor.predict(sample_input)
print(output)
This is leading to,
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
"code": 400,
"type": "InternalServerException",
"message": "\u0027gpt_neox\u0027"
}
from sagemaker.huggingface import HuggingFaceModel
import sagemaker
role = sagemaker.get_execution_role()
hub = {
    'HF_MODEL_ID': 'databricks/dolly-v2-12b',
    'HF_TASK': 'text-generation',
}
huggingface_model = HuggingFaceModel(
    transformers_version='4.26.0',
    pytorch_version='1.13.1',
    py_version='py39',
    env=hub,
    role=role,
)
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge',
)
sample_input = {
    'inputs': 'Can you please let us know more details about your'
}
output = predictor.predict(sample_input)
print(output)
This is leading to,
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
"code": 400,
"type": "InternalServerException",
"message": "Loading this pipeline requires you to execute the code in the pipeline file in that repo on your local machine. Make sure you have read the code there to avoid malicious use, then set the option trust_remote_code\u003dTrue to remove this error."
}
I am not sure what is missing.
Any help appreciated!
Again, your hardware is far too small for this model. An m5.xlarge doesn't even have a GPU. See above.
That isn't the problem here. I'm not sure anyone has figured out here how to set trust_remote_code=True, which is needed to load the model's pipeline, in the SM integration.
I was able to set trust_remote_code=True by overriding the default method for loading a model following documentation here https://huggingface.co/docs/sagemaker/inference#user-defined-code-and-modules.
I created an inference.py with the following code:
from transformers import pipeline
import torch
def model_fn(model_dir):
    """
    Overrides the default model load function in the HuggingFace Deep Learning Container
    """
    instruct_pipeline = pipeline(
        model="databricks/dolly-v2-3b",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )
    return instruct_pipeline
and requirements.txt with:
accelerate==0.18.0
Then I followed instructions here for creating a model artifact and uploaded it to S3. Then you can deploy an endpoint with:
from sagemaker.huggingface.model import HuggingFaceModel
# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data="s3://your_bucket/your_dolly_path/model.tar.gz",  # path to your trained SageMaker model
    role=role,  # IAM role with permissions to create an endpoint
    transformers_version="4.26.0",  # Transformers version used
    pytorch_version="1.13.1",  # PyTorch version used
    py_version='py39',  # Python version used
)
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.4xlarge",
)
Note: I tested this with the databricks/dolly-v2-3b model, so the ml.g5.4xlarge may not be enough for the larger models.
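The packaging step mentioned above can be sketched as follows. The Hugging Face inference toolkit looks for custom code under a code/ directory at the root of model.tar.gz, so the archive here is assumed to contain code/inference.py and code/requirements.txt and nothing else, since model_fn pulls the weights from the Hub by ID. The helper names, the local directory and the bucket and key are hypothetical.

```python
import tarfile
from pathlib import Path


def build_model_archive(source_dir, output_path="model.tar.gz"):
    """Pack the contents of source_dir at the root of a gzip tarball.

    Write the archive outside source_dir so it does not include itself.
    """
    source = Path(source_dir)
    with tarfile.open(output_path, "w:gz") as archive:
        for path in sorted(source.rglob("*")):
            # Store paths relative to source_dir, e.g. "code/inference.py".
            archive.add(path, arcname=path.relative_to(source).as_posix(), recursive=False)
    return output_path


def upload_archive(archive_path, bucket, key):
    """Upload the archive to S3 and return its s3:// URI."""
    import boto3  # imported lazily so the packaging helper works offline

    boto3.client("s3").upload_file(str(archive_path), bucket, key)
    return f"s3://{bucket}/{key}"


# Expected layout of the hypothetical source directory:
#   dolly_artifact/code/inference.py
#   dolly_artifact/code/requirements.txt
# archive = build_model_archive("dolly_artifact", "model.tar.gz")
# model_data = upload_archive(archive, "your_bucket", "your_dolly_path/model.tar.gz")
```

The returned URI is what would be passed as model_data to HuggingFaceModel.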
Here's a gist showing a working method for deploying the dolly-v2-12b model on a g5.4xlarge instance.
https://gist.github.com/timesler/4b244a6b73d6e02d17fd220fd92dfaec
@alvaropp I believe the issue with running out of disk space was because the 512GB disk mount on SageMaker is at /home/ec2-user/SageMaker, but HuggingFace libraries default to storing files in a cache at /home/ec2-user/.cache/.... The solution is to set the HF_HOME env var to a location under /home/ec2-user/SageMaker. Importantly, if you set the env var in Python, make sure you do it before importing HuggingFace libraries so that it gets used. I've included that in the linked gist.
To get the 12b model running on a g5.4xlarge instance, I think you'll also need to set load_in_8bit to True.
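A minimal sketch of the HF_HOME ordering described above, assuming the variable is set from Python rather than in the shell. The helper name and the hf_cache directory name are arbitrary choices made for this example; the linked gist may do it differently.

```python
import os
import sys
import warnings


def configure_hf_cache(cache_dir):
    """Point the Hugging Face cache at cache_dir.

    Call this before importing transformers or huggingface_hub, because the
    cache location is read when those libraries are imported.
    """
    if "transformers" in sys.modules or "huggingface_hub" in sys.modules:
        warnings.warn("Hugging Face libraries are already imported; HF_HOME may be ignored")
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["HF_HOME"] = cache_dir
    return cache_dir


# Usage on a notebook instance, with the large volume mounted as described above:
# configure_hf_cache("/home/ec2-user/SageMaker/hf_cache")
# from transformers import pipeline  # import only after HF_HOME is set
```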
Right, so I've followed @timesler's instructions and I'm running into the following error, which seems to be some sort of overflow:
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{
"code": 400,
"type": "InternalServerException",
"message": "probability tensor contains either `inf`, `nan` or element \u003c 0"
}
I'm using an ml.p3.8xlarge instance, which is perfectly capable of running dolly-v2-12b in my experiments using EC2 directly, without SageMaker.
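That message is what the sampling step reports when the probabilities it receives are not valid. As an illustration only, and not a diagnosis of this endpoint, the pure-Python sketch below shows how a single logit that has overflowed to inf turns every softmax probability into nan.

```python
import math


def softmax(logits):
    """Numerically shifted softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


healthy = softmax([1.0, 2.0, 3.0])
broken = softmax([1.0, 2.0, float("inf")])  # inf - inf evaluates to nan

print(healthy)
print(broken)
```

With reduced-precision weights, values outside the representable range are one plausible way for such an inf to appear in the logits.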
Here's a working version for ya!
https://github.com/dhawalkp/dolly-12b/blob/main/dolly-12b-deepspeed-sagemaker.ipynb
That's great, thanks!
After a bit of trial and error, I noticed that @timesler's code (https://gist.github.com/timesler/4b244a6b73d6e02d17fd220fd92dfaec) works perfectly fine as well.
I'm not 100% sure why it works on g5.4xlarge and not on ml.p3.8xlarge; they seem to have similar specs!