DeepSeek R1 on AWS

A running document to showcase how to deploy and fine-tune DeepSeek R1 with AWS services — Amazon SageMaker and Amazon Bedrock

Davide Gallitelli
8 min read · Jan 28, 2025
Image from the author — as of Jan 28, 2025

Changelog

Jan 29, 2025: Added sections “Prompting for DeepSeek R1”, “Compiling for Amazon EC2 Trainium 2”, “Deployment using Inferentia 2 on Amazon SageMaker”
Jan 28, 2025: First version

What is DeepSeek R1 and what makes it so good

DeepSeek R1 is a groundbreaking AI model family that learned to think through practice and rewards, similar to how a child learns through trial and error. What makes it particularly impressive is its innovative Mixture of Experts (MoE) architecture, which contains 671 billion parameters but only activates 37 billion per task, making it highly efficient and cost-effective. The model excels at complex problem-solving, especially in mathematics and coding, achieving remarkable results — nearly 80% accuracy on advanced math tests and outperforming 96% of human competitors in programming challenges. DeepSeek R1 comes in two main versions: R1-Zero, which learned purely through reinforcement learning without examples, and R1, which combines example-based learning with reinforcement learning for better human-understandable explanations. Perhaps most notably, it offers similar or better performance than leading models like GPT-4 and Claude, while costing significantly less — about 27 times cheaper than OpenAI’s O1 for token output. As of Jan 28, 2025, DeepSeek has released the following R1 versions:

| **Model** | **Base Model** | **Download** |
| --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-1.5B | [Qwen2.5-Math-1.5B](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) |
| DeepSeek-R1-Distill-Qwen-7B | [Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) |
| DeepSeek-R1-Distill-Llama-8B | [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) |
| DeepSeek-R1-Distill-Qwen-14B | [Qwen2.5-14B](https://huggingface.co/Qwen/Qwen2.5-14B) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B) |
| DeepSeek-R1-Distill-Qwen-32B | [Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) |
| DeepSeek-R1-Distill-Llama-70B | [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) | [🤗 HuggingFace](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B) |

Deployment on Amazon SageMaker

Deployment using GPU

Thanks to Surya Kari for helping me figure this out! 😄️ Official sample available on GitHub: aws-samples/sagemaker-genai-hosting-examples. (Jan 28, 2025) Currently works for the distilled Llama 🦙️ and Qwen versions. The full R1 model is not supported yet.

Make sure to update your dependencies first:

%pip install sagemaker boto3 botocore --quiet --upgrade

Decide which container image you want to use — TGI or LMI. Deep dive on each coming soon.

from sagemaker.huggingface import get_huggingface_llm_image_uri
from sagemaker.image_uris import retrieve as sagemaker_retrieve_image_uri

chosen_image = "tgi"  # or "lmi"
if chosen_image == "tgi":
    # inference_image_uri = get_huggingface_llm_image_uri("huggingface", version="2.4.0", region=session.boto_session.region_name)  # TGI - Version 3.0.1 not supported by the API as of Jan 29
    inference_image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi3.0.1-gpu-py311-cu124-ubuntu22.04"  # Courtesy of Simon Pagezy
elif chosen_image == "lmi":
    # inference_image_uri = sagemaker_retrieve_image_uri(framework="djl-lmi", version="0.31.0", region=session.boto_session.region_name)  # DJL LMI - Version 0.31.0 not supported by the API as of Jan 29
    inference_image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"  # DJL LMI
print(f"using image to host: {inference_image_uri}")

Set the instance type and the number of GPUs. Note: different models will require different instance types. A table will come soon.

# Set instance type and number of GPUs
instance_type = "ml.g5.2xlarge"
if instance_type in ["ml.g5.12xlarge", "ml.g6e.24xlarge", "ml.g6e.12xlarge", "ml.g6.12xlarge", "ml.g6.24xlarge", "ml.g4dn.12xlarge", "ml.g4ad.16xlarge"]:
    number_of_gpu = 4
elif instance_type in ["ml.p5.48xlarge", "ml.p5e.48xlarge", "ml.p5en.48xlarge", "ml.p4d.24xlarge", "ml.p4de.24xlarge", "ml.g6e.48xlarge", "ml.g6.48xlarge"]:
    number_of_gpu = 8
elif instance_type.startswith("ml.inf"):
    raise ValueError("Inferentia instance types are not supported in this GPU example.")
else:
    number_of_gpu = 1

Create the SageMaker Model object with the Python SDK:

import json
import sagemaker

hf_model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # Replace with your model ID
model_name = hf_model_id.split("/")[-1].lower()

lmi_model = sagemaker.Model(
    image_uri=inference_image_uri,
    env={
        "HF_MODEL_ID": hf_model_id,
        "OPTION_MAX_MODEL_LEN": "10000",
        "OPTION_GPU_MEMORY_UTILIZATION": "0.95",
        "OPTION_ENABLE_STREAMING": "false",
        "OPTION_ROLLING_BATCH": "auto",
        "OPTION_MODEL_LOADING_TIMEOUT": "3600",
        "OPTION_PAGED_ATTENTION": "false",
        "OPTION_DTYPE": "fp16",
        "MAX_CONCURRENT_REQUESTS": "10",  # Reduce concurrent requests to increase context length
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",  # Avoid CUDA memory fragmentation
        "SM_NUM_GPUS": json.dumps(number_of_gpu),
    },
    role=sagemaker.get_execution_role(),
    name=model_name,
    sagemaker_session=sagemaker.Session(),
)

Deploy the model to the endpoint:

endpoint_name = f"{model_name}-ep"

# Deploy to a SageMaker Endpoint
predictor = lmi_model.deploy(
    endpoint_name=endpoint_name,
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=600,
)

Test the endpoint:

# Test the Endpoint
predictor.predict({"inputs": "What is the meaning of life?"})
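If you prefer to call the endpoint without the SageMaker Python SDK (for example from an application), here is a minimal boto3 sketch. It assumes the same payload format used by predictor.predict() above:

import json
import boto3

# Call the SageMaker endpoint through the runtime API instead of the Predictor object
smr_client = boto3.client("sagemaker-runtime")
response = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps({
        "inputs": "What is the meaning of life?",
        "parameters": {"max_new_tokens": 256, "temperature": 0.6},
    }),
)
print(json.loads(response["Body"].read()))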

Learn how to properly engineer the prompt in a section further down in this blog, “Prompting for DeepSeek R1”.

Deployment using Inferentia 2

A great example of how to achieve this is now available on GitHub: aws-samples/sagemaker-genai-hosting-examples.

In this example you will deploy your model using SageMaker’s Large Model Inference (LMI) Containers.

LMI containers are a set of high-performance Docker containers purpose-built for large language model (LLM) inference. With these containers, you can leverage high-performance open-source inference libraries like vLLM, TensorRT-LLM, and Transformers NeuronX to deploy LLMs on Amazon SageMaker endpoints. These containers bundle a model server together with open-source inference libraries to deliver an all-in-one LLM serving solution.

The model for this example can be deployed using the vLLM backend, which corresponds to the djl-neuronx container image.

import sagemaker
from sagemaker.image_uris import retrieve as sagemaker_retrieve_image_uri

session = sagemaker.Session()
inference_image_uri = sagemaker_retrieve_image_uri(
    framework="djl-neuronx",
    version="latest",
    region=session.boto_session.region_name,
)  # DJL LMI container for Neuron (Inferentia/Trainium)
print(f"using image to host: {inference_image_uri}")

There are two ways to supply configuration to the LMI container:

  1. Create a serving.properties file and include it inside the compressed model artifact. The benefit is that the configuration travels with the model artifact, so nothing else needs to be shared. The downside is rigidity: the configuration is tightly coupled to the artifact, which makes deploying the same model on different instance types more cumbersome. A sketch of this file is shown right after this list.
  2. Provide a set of environment variables to the SageMaker Model object. This is more flexible, since the LMI configuration lives in the SageMaker Model definition itself ← I will use this approach
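For reference, here is what a minimal serving.properties for method 1 could look like. The property names mirror the OPTION_* environment variables used below; this snippet is an illustration, not taken from the official sample:

# Hypothetical illustration of method 1: write a serving.properties file,
# then package it into the model artifact uploaded to S3.
serving_properties = "\n".join([
    "engine=Python",
    "option.model_id=deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    "option.rolling_batch=vllm",
    "option.tensor_parallel_degree=max",
    "option.max_rolling_batch_size=16",
])
with open("serving.properties", "w") as f:
    f.write(serving_properties)

Since I am going with the environment-variable approach, here is the configuration: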
hf_model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
# Add your Hugging Face model token below
vllm_config = {
    "HF_MODEL_ID": hf_model_id,
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "HF_TOKEN": "",
    "OPTION_ROLLING_BATCH": "vllm",
    "OPTION_OUTPUT_FORMATTER": "json",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_MODEL_LOADING_TIMEOUT": "1600",
}

All I need to do now is create the sagemaker.Model object, and deploy it:

import sagemaker
from sagemaker.model import Model

# Create the model
model_name = hf_model_id.split("/")[-1].lower()
lmi_model = Model(
    image_uri=inference_image_uri,
    env=vllm_config,
    role=sagemaker.get_execution_role(),
    name=model_name,
)

# Deploy it
instance_type = "ml.inf2.24xlarge"
endpoint_name = f"{model_name}-ep"
lmi_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=1600,
    endpoint_name=endpoint_name,
)

Once deployed (takes ~7 minutes), query as usual:

import sagemaker
from sagemaker.predictor import Predictor

llm = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker.Session(),
    serializer=sagemaker.serializers.JSONSerializer(),
    deserializer=sagemaker.deserializers.JSONDeserializer(),
)
# prompt_template is built as shown in the "Prompting for DeepSeek R1" section below
response = llm.predict(
    {
        "inputs": prompt_template,
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 1024,
            "top_p": 0.9,
            "temperature": 0.6,
        },
    }
)["generated_text"]
print(response)

Fine-Tuning on Amazon SageMaker

Bruno’s LinkedIn post announcing the fine-tuning examples — source

My colleague Bruno Pistone has written an amazing example for this. It is available on GitHub under aws-samples/amazon-sagemaker-llm-fine-tuning-remote-decorator. Check it out!
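To give a feel for the pattern the repository builds on, here is a minimal sketch of the SageMaker @remote decorator, which runs a plain Python function as a SageMaker training job. The instance type, dependencies file, and function body below are placeholders, not Bruno's actual code:

from sagemaker.remote_function import remote

# Hypothetical sketch: the decorated function is shipped to a SageMaker training job
# and executed on the chosen instance when it is called.
@remote(instance_type="ml.g5.12xlarge", dependencies="./requirements.txt")
def fine_tune(model_id: str) -> int:
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(model_id)
    # ... PEFT/LoRA setup and the actual training loop would go here (see Bruno's repo) ...
    return sum(p.numel() for p in model.parameters())

# Calling the function submits the training job and blocks until it returns
# n_params = fine_tune("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")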

Serverless Inference with Amazon Bedrock Custom Model Import

Thanks to Raghvender Arni for the inspiration! 😄️
(Jan 28, 2025) Note: Currently works for the distilled Llama versions only 🦙️. Qwen and the full R1 model are not supported yet. The only regions that support Amazon Bedrock Custom Model Import are us-east-1 and us-west-2 — ref: documentation.

Looking for a complete notebook? Refer to aws-samples/amazon-bedrock-samples/custom-models/import_models/llama-3/DeepSeek-R1-Distill-Llama-Noteb.ipynb

If you’re using the Llama distilled version, you can use Amazon Bedrock Custom Model Import to use DeepSeek R1 in a serverless fashion! 🚀

Make sure you have installed git lfs before proceeding. On Amazon SageMaker notebooks, you can run sudo apt-get install git-lfs. If you don't install it, you won't be able to pull the large model files from the Hugging Face Hub. Also make sure you have enough disk space to clone and package the files. Start by cloning locally the model you want to import into Bedrock:

hf_model_id = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
model_name = hf_model_id.split("/")[-1]
!git clone https://huggingface.co/$hf_model_id

Upload it to Amazon S3 — here I'm using the CLI, but a Python alternative is sketched right after the command:

# Replace the below with custom values if you're not using Amazon SageMaker
import sagemaker

session = sagemaker.Session()
default_bucket = session.default_bucket()
default_bucket_prefix = session.default_bucket_prefix
s3_model_uri = f"s3://{default_bucket}/{default_bucket_prefix}/{model_name}/"
!aws s3 sync $model_name/ s3://$default_bucket/$default_bucket_prefix/$model_name/
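If you prefer Python over the CLI, the SageMaker SDK's S3Uploader achieves the same result (a minimal sketch, assuming the variables defined above):

from sagemaker.s3 import S3Uploader

# Uploads the cloned model folder to the same destination as the `aws s3 sync` command
S3Uploader.upload(local_path=model_name, desired_s3_uri=s3_model_uri)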

Launch the CreateModelImportJob API:

# Create Bedrock Model Import job
import boto3
import json
import sagemaker

bedrock = boto3.client(service_name='bedrock')

JOB_NAME = f"{model_name}-import-job"
IMPORTED_MODEL_NAME = f"{model_name}-bedrock"
ROLE = sagemaker.get_execution_role()  # Replace with a custom IAM role if not using Amazon SageMaker for development

# CreateModelImportJob API
create_job_response = bedrock.create_model_import_job(
    jobName=JOB_NAME,
    importedModelName=IMPORTED_MODEL_NAME,
    roleArn=ROLE,
    modelDataSource={
        "s3DataSource": {
            "s3Uri": s3_model_uri
        }
    },
)
job_arn = create_job_response.get("jobArn")

Once the job is complete, you can test it with the API or in the Console:

Image courtesy of Sam Palani
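For the API route, a minimal sketch could look like the following. The exact request body depends on the imported model's native schema, so treat the payload below as an assumption and check the sample notebook for the exact format:

import json
import time
import boto3

bedrock = boto3.client("bedrock")
bedrock_runtime = boto3.client("bedrock-runtime")

# Wait for the import job to complete
while bedrock.get_model_import_job(jobIdentifier=job_arn)["status"] == "InProgress":
    time.sleep(60)

# Look up the ARN of the imported model by name
model_arn = next(
    m["modelArn"]
    for m in bedrock.list_imported_models()["modelSummaries"]
    if m["modelName"] == IMPORTED_MODEL_NAME
)

# Invoke it — the payload here assumes a Llama-style prompt schema (hypothetical)
response = bedrock_runtime.invoke_model(
    modelId=model_arn,
    contentType="application/json",
    accept="application/json",
    body=json.dumps({"prompt": "What is the meaning of life?", "max_gen_len": 512, "temperature": 0.6}),
)
print(json.loads(response["body"].read()))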

Compiling for Amazon EC2 Trainium 2

DeepSeek-R1-Distill-Llama models are compatible with our Inferentia 1 and 2 chips! This makes running them for inference even cheaper — by up to 70%. Pinak Panigrahi wrote an amazing guide on how to compile these models to run on Inferentia 2.
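As an illustration of what compilation looks like, here is a minimal sketch that assumes the optimum-neuron library; Pinak's guidance may use a different flow, so take the parameters below as placeholders:

from optimum.neuron import NeuronModelForCausalLM

# export=True triggers ahead-of-time compilation of the model for the Neuron cores
neuron_model = NeuronModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    export=True,
    batch_size=1,            # static shapes required by the Neuron compiler
    sequence_length=4096,
    num_cores=2,             # tensor parallelism across Neuron cores
    auto_cast_type="bf16",
)
neuron_model.save_pretrained("deepseek-r1-distill-llama-8b-neuron")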

Prompting for DeepSeek R1

The following tests have been performed with the Distill-Llama model hosted by Amazon SageMaker. Thanks Massimiliano Angelino for the collaboration on this topic!

To get a proper answer from DeepSeek R1, you need to format the prompt correctly. The Llama variant uses the following prompt template:

system_prompt = "You are a helpful AI Assistant."
query = "What is Machine Learning? What is DeepSeek R1?"

prompt_template = "<|begin▁of▁sentence|><|User|>{system_prompt} {query}<|Assistant|>"
prompt = prompt_template.format(system_prompt=system_prompt, query=query)

You can then send the query like so:

response = predictor.predict(
    {
        "inputs": prompt,
        "parameters": {
            "top_p": 0.9,
            "temperature": 0.1,
            "max_new_tokens": 1024,
            "do_sample": True,
            # "return_full_text": True,
        },
    }
)
print(response["generated_text"])

The answer generated by DeepSeek R1 will have this format:

YOUR-INPUT-PROMPT # here only if "return_full_text":True
<think>
MODEL-THINKING-PROCESS
</think>
THE-ACTUAL-ANSWER

Parse the response according to the information you want to retrieve.
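For example, a small helper like this separates the reasoning trace from the final answer (a minimal sketch, assuming the <think>…</think> format shown above):

import re

full_text = response["generated_text"]

# Extract the reasoning trace between <think> and </think>, if present
match = re.search(r"<think>(.*?)</think>", full_text, flags=re.DOTALL)
thinking = match.group(1).strip() if match else ""

# Everything after the closing </think> tag is the actual answer
answer = re.sub(r".*?</think>", "", full_text, count=1, flags=re.DOTALL).strip()

print("Reasoning:", thinking)
print("Answer:", answer)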

Alternatively, you can use the Tokenizer.apply_chat_template() method to create the prompt to pass as input to the .predict() function:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
messages = [
    # {"role": "system", "content": system_prompt},  # DeepSeek HF page suggests not to use system prompts
    {"role": "user", "content": system_prompt + " " + query},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(f"Prompt: {prompt}")

Happy coding! 🚀 If this content has been useful, please leave a clap 👏 or a comment 🗯. This will let us know that our work has been appreciated! 😄
