Deploying a Hugging Face embedding model to Amazon SageMaker and consuming it with Llama-Index

Davide Gallitelli
9 min read · Feb 16, 2024

TL;DR: In this blog post, you will learn how to choose an embedding model from the Hugging Face Hub and use it with Amazon SageMaker to generate embeddings. I will show you how to use SageMaker real-time endpoints and SageMaker Batch Transform, and how SageMaker integrates with Llama-Index.

Generating embeddings with Amazon SageMaker

What is an Embedding Model?

Embedding models are a type of machine learning model that converts input data, such as text or images, into numerical representations called embeddings. These embeddings capture the semantic meaning or context of the input data in a continuous vector space. By encoding data into embeddings, complex relationships and similarities between different data points can be captured and used for various tasks like similarity search, recommendation systems, and natural language processing. Embeddings are key when implementing architecture like Retrieval Augmented Generation (RAG) in the world of Large Language Models (LLM).

In other words, as humans we know that the word “king” is semantically closer to the word “queen” than it is to the word “zebra”, just like we know that “zebra” is closer to “lion” than “queen”. Embeddings are a way to represent this “semantic relationship” via numbers, which computers can understand. Similarity can be computed via metrics like cosine similarity, Euclidean distance, Manhattan distance, Kullback-Leibler Divergence, etc.
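To make this concrete, here is a minimal sketch of cosine similarity computed with NumPy. The three-dimensional vectors below are made up purely for illustration; real embedding models output hundreds or thousands of dimensions.

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings", made up for illustration only
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
zebra = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))  # close to 1.0 -> semantically similar
print(cosine_similarity(king, zebra))  # much lower -> semantically distant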

Where to find embedding models?

One of the best resources for finding embedding models is the Hugging Face Hub; in particular, I like to choose from the Massive Text Embedding Benchmark (MTEB) Leaderboard.

The MTEB Leaderboard contains a list of embedding models, ranked by overall performance across a series of benchmarks spanning multiple datasets and tasks. Some important parameters to look at when choosing an embedding model are:

  • Size of the model — bigger models require more resources to be loaded into memory, and may require a GPU to generate predictions fast enough to be usable
  • Max tokens — also known as the context window: how many tokens the model can accept as input; higher values are better, but come with bigger compute and memory requirements
  • Embedding dimension — how many floating-point numbers the model produces, i.e. how long each embedding is; higher dimensions capture more nuance of the language but introduce additional computational complexity
  • Support for multiple languages (unless you only work with English documents)

How to deploy to Amazon SageMaker?

Even before deploying our model to Amazon SageMaker, we need to consider the use case we're trying to solve and what kind of inference it requires. Amazon SageMaker supports four deployment types:

  • real-time endpoint: an instance that runs 24/7 and is paid by the hour; ideal for production workloads that require real-time predictions
  • batch transform: an ephemeral task that spins up when the API is called, performs prediction on a dataset uploaded to Amazon S3, then stops; ideal when you have big datasets, for the initial ingestion of vectors into a database, or when you are OK waiting for the execution to complete
  • asynchronous endpoint: a solution in between a real-time endpoint and batch transform: a real-time endpoint that loads inputs from S3 and scales down to zero when no requests are in the queue; great for large files that still need a quick response once the endpoint is up, and for cost optimization in general
  • serverless endpoint: a true serverless implementation of a real-time endpoint, loading the model only when needed, generating inferences, then shutting down the endpoint; great for smaller models in production (low cold-start time) and for cost optimization, and ideal during the development phase

In this blog, I will detail how to implement the first two approaches. Reach out if you’re interested in deploying to asynchronous endpoints, and I might write a new post on the topic ☺️️

Note that the code snippets below can be run anywhere; they don't have to run in Amazon SageMaker Studio!
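If you run them somewhere else, make sure the SageMaker Python SDK and boto3 are installed and that your AWS credentials are configured:

pip install sagemaker boto3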

Step 1 — Configure Amazon SageMaker

#### SAGEMAKER SETUP ####
import sagemaker
import boto3

try:
    # Works out of the box inside SageMaker Studio / Notebook Instances
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside of SageMaker, look up the execution role by name in IAM
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='[REPLACE WITH YOUR EXECUTION ROLE NAME]')['Role']['Arn']

session = sagemaker.Session()
default_bucket = session.default_bucket()

The first step is to set up the permissions for SageMaker. If you're running this code in a SageMaker Studio Notebook, the get_execution_role() API will automatically load the SageMaker execution role assigned to your user. Otherwise, you will need to retrieve the role from AWS IAM. You can provide it manually, or retrieve it dynamically, as in the code snippet above.
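For reference, providing the role manually is just a matter of pasting its ARN; the ARN below is a placeholder:

role = "arn:aws:iam::123456789012:role/MySageMakerExecutionRole"  # placeholder: replace with your own execution role ARN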

Step 2 — Load the model from the Hub

# Hub Model configuration. <https://huggingface.co/models>
hub = {
    'HF_MODEL_ID': 'intfloat/multilingual-e5-large-instruct',  # model_id from hf.co/models
    'HF_TASK': 'feature-extraction',  # NLP task you want to use for predictions
    # 'SM_NUM_GPUS': '1',
}

Here, we’ll configure everything that is needed by the model in the HF Hub to be loaded with SageMaker. In particular, you’ll need at least two parameters:

  1. HF_MODEL_ID — fill this with the repository and model name from the HF Hub, in my case intfloat/multilingual-e5-large-instruct
  2. HF_TASK — this parameter tells the Hugging Face container which pipeline to use; since we're loading an embedding model that should generate embeddings, let's set it to feature-extraction

Optionally, configure SM_NUM_GPUS, which tells the container how many GPUs are available in the instance, so that it can use them all. For this model, I don't actually need GPUs, so I'm not setting this parameter; we'll configure the instance in the next step. To figure out how many GPUs your instance has, head over to the SageMaker Pricing page and check the instance details towards the middle of the page.

Example: GPUs for G5 instances hosted by SageMaker

Step 3 — Configure the HF container

from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
    env=hub,  # configuration for loading model from Hub
    role=role,  # iam role with permissions to create an Endpoint
    py_version='py310',
    transformers_version="4.37.0",  # transformers version used
    pytorch_version="2.1.0",  # pytorch version used
)

Until the Text-Embedding-Inference container from Hugging Face officially comes to Amazon SageMaker, we need to use the generic Hugging Face Transformers container to load our model. To do that, we will need to specify, aside from the previous model configuration, the Python version, the Transformers library version, and the PyTorch version.

Trick: there is no way you can learn by heart the combination of these three parameters. I always refer to the list of available images for Hugging Face Inference containers from the official AWS Deep Learning Containers repository.
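Alternatively, the SageMaker Python SDK can look up a compatible image URI for you via sagemaker.image_uris.retrieve. A minimal sketch, assuming the version combination below exists in the DLC list linked above (it matches the one used in this post):

from sagemaker import image_uris

# Look up the URI of a Hugging Face inference DLC for a given framework combination
image_uri = image_uris.retrieve(
    framework="huggingface",
    region=session.boto_region_name,
    version="4.37.0",                       # transformers version
    py_version="py310",
    image_scope="inference",
    base_framework_version="pytorch2.1.0",  # underlying PyTorch version
    instance_type="ml.c5.2xlarge",
)
print(image_uri)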

Step 4 — Deploy

We’re now ready to deploy! Let’s start with real-time deployment:

embedding_predictor = huggingface_model.deploy(
    endpoint_name=sagemaker.utils.name_from_base("intfloat-multilingual-e5-large-instruct"),
    initial_instance_count=1,
    instance_type="ml.c5.2xlarge"
)

As you can see, for this relatively small but powerful model I chose a CPU instance, the ml.c5.2xlarge. This will be more than enough to load the model in memory, and significantly cheaper than a GPU instance, although a bit slower at generating inferences:

embedding_predictor.predict({"inputs": [
"In un giorno di pioggia, Andrea e Giuliano incontrano Licia per caso.",
"Poi Mirko, finita la pioggia, si incontra e si scontra con Licia."
]
})
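To sanity-check what the endpoint returns, you can inspect the structure of the response, which is deserialized into nested Python lists of floats. A quick sketch (the exact nesting depends on how the container pools token embeddings):

response = embedding_predictor.predict({"inputs": [
    "In un giorno di pioggia, Andrea e Giuliano incontrano Licia per caso."
]})

print(len(response))     # one entry per input sentence
print(len(response[0]))  # length of the first entry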

Since the instance is not very powerful, this prediction can take up to 10 seconds for two relatively short sentences! When you look at the operational metrics on the SageMaker endpoint page, you can see that the limiting factor here is memory:

Pretty high memory usage. We can scale up the instance or switch to one with more memory

The nice thing about the Amazon SageMaker and Hugging Face integration is that you can very easily change instance type, with very few code changes:

#### EMBEDDING MODEL DEPLOYMENT ####

# Hub Model configuration. <https://huggingface.co/models>
hub = {
    'HF_MODEL_ID': 'intfloat/multilingual-e5-large-instruct',  # model_id from hf.co/models
    'HF_TASK': 'feature-extraction',  # NLP task you want to use for predictions
    'SM_NUM_GPUS': '1',
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    env=hub,  # configuration for loading model from Hub
    role=role,  # iam role with permissions to create an Endpoint
    py_version='py310',
    transformers_version="4.37.0",  # transformers version used
    pytorch_version="2.1.0",  # pytorch version used
)

embedding_predictor = huggingface_model.deploy(
    endpoint_name=sagemaker.utils.name_from_base("intfloat-multilingual-e5-large-instruct"),
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge"
)

You can now test and see that predictions are generated much faster 🚀️ For comparison, predictions now take 1-2 seconds on the GPU instance.

The same configuration can also be used with Batch Transform to generate predictions in batches. Batch Transform requires a JSON Lines file in Amazon S3, where each line is a JSON object containing the text to embed. For example, if you want to obtain the embeddings of the song “Here Comes the Sun” by The Beatles, your JSON Lines file could look like this:


{"id":1,"text_inputs":"Here comes the sun (Doo-d-doo-doo)"}
{"id":2,"text_inputs":"Here comes the sun"}
{"id":3,"text_inputs":"And I say, 'It's alright'"}

Starting from the same huggingface_model object from before, we can now define a transformer, which will allow us to execute a Batch Transform job:

transformer = huggingface_model.transformer(
    instance_count=1,
    instance_type="ml.m5.4xlarge",
    assemble_with="Line",
    max_payload=1
)

transformer.transform(
    data=s3_path,
    split_type="Line",
    content_type="application/jsonlines",
    job_name=sagemaker.utils.name_from_base("intfloat-multilingual-e5-large-instruct")
)

This is going to take some time — a bit under 10 minutes. You will be able to check the outputs stored in Amazon S3 by copying the files from the output path:

!aws s3 sync {transformer.output_path} ./embedding-outputs/
!ls -la ./embedding-outputs/

The output format is:

{"id":1, "embedding":[0.025507507845759392, 0.009654928930103779, -0.01139055471867323, ………]}
{"id":2, "embedding":[-0.018594933673739433, -0.011756304651498795, -0.006888044998049736,…..]}
[...]
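Once downloaded, the output files can be parsed back into a dictionary mapping each id to its embedding. A minimal sketch, assuming every file in the local folder is a JSON Lines output file:

import json
import os

output_dir = "./embedding-outputs"
embeddings = {}
for filename in os.listdir(output_dir):
    with open(os.path.join(output_dir, filename)) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            embeddings[record["id"]] = record["embedding"]

print(f"Loaded {len(embeddings)} embeddings")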

Using embedding models in Amazon SageMaker with Llama-Index

Talking cricket? No, thanks, I want a talking llama!

You won't always have a nice text file already split line by line. Often, you just have a big document that you would like to embed, so that you can then perform searches on it. Without smart chunking/splitting of the text, we lose the semantic relations between words and sentences inside the document.

Let’s solve this by using Llama-Index, which introduced support for embedding models deployed on SageMaker endpoints in January 2024.

First, install llama-index (make sure you're running llama-index>=0.10.6; depending on your environment, you may also need the llama-index-embeddings-sagemaker-endpoint and llama-index-llms-sagemaker-endpoint integration packages):

pip install llama-index

You can manually query the SageMaker endpoint from llama-index:

from llama_index.embeddings.sagemaker_endpoint import SageMakerEmbedding

embedding_model = SageMakerEmbedding(
    endpoint_name=embedding_predictor.endpoint_name
)

# Get one embedding
embedding = embedding_model.get_text_embedding(
    "An Amazon SageMaker endpoint is a fully managed resource that enables the deployment of machine learning models, specifically LLM (Large Language Models), for making predictions on new data."
)
print(embedding)

# Get embeddings in batch
embeddings = embedding_model.get_text_embedding_batch(
    [
        "An Amazon SageMaker endpoint is a fully managed resource that enables the deployment of machine learning models",
        "Sagemaker is integrated with llamaIndex",
    ]
)
print(embeddings)

Or you can set it as the default embedding model for your pipelines:

from llama_index.core import Settings

# Configure Llama-index so that the default embedding model is our endpoint
Settings.embed_model = embedding_model

Then you can use this embedding model in your RAG pipelines.

For this example, I will be using the book “Le avventure di Pinocchio: Storia di un burattino” by C. Collodi, available for free on Project Gutenberg.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Step 1 - Load Data
documents = SimpleDirectoryReader("data").load_data()

# Step 2 - Index Data
index = VectorStoreIndex.from_documents(
    documents,
    transformations=[SentenceSplitter(chunk_size=250, chunk_overlap=50)]  # Max token size is 512 for this model
)

Once the index has been built, you can query it and use your favorite LLM to generate an answer:

# I have previously deployed an LLM on an endpoint
# I will use SageMaker endpoint for this step as well!
from llama_index.llms.sagemaker_endpoint import SageMakerLLM
Settings.llm = SageMakerLLM(endpoint_name=llm_predictor.endpoint_name)

# Step 3 - Query
query_engine = index.as_query_engine()
response = query_engine.query("Come si chiama il padre di Pinocchio?")
print(response)

Congratulations on making it this far! In this blog post, you’ve learned:

  • what embeddings are
  • how to deploy them on Amazon SageMaker
  • how to use SageMaker endpoints in Llama-Index

Check out the other blog posts I've written here on Medium; you will definitely find something to your liking! See you in the next Medium post!
