Using Aya-101 in Amazon SageMaker
What is Aya-101?
Quoting Cohere for AI from the model card page:
The Aya model is a massively multilingual generative language model that follows instructions in 101 languages. Aya outperforms mT0 and BLOOMZ on a wide variety of automatic and human evaluations despite covering double the number of languages. The Aya model is trained using xP3x, Aya Dataset, Aya Collection, a subset of DataProvenance collection and ShareGPT-Command. We release the checkpoints under an Apache-2.0 license to further our mission of multilingual technologies, empowering a multilingual world.
- ✅ 13B parameters — a relatively small model
- ✅ 101 languages supported — complete list here
- ✅ Apache 2.0 license — can be used for commercial purposes
- ✅ Paper available — Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
- ✅ Available on the HuggingFace Hub and works with the HuggingFace LLM Inference Container
- ❌ Tends to give rather short answers
- ❌ No public embedding-specific model
- ⚠️ No quantized version yet — requires a GPU with more than 24 GB of VRAM
- ⚠️ Based on the mT5 architecture (multilingual T5), the same encoder-decoder family as FLAN-T5. State of the art for multilingual tasks such as translation, but performance might be lacking on other tasks (a quick local-usage sketch follows this list)
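Since Aya-101 is a standard seq2seq checkpoint on the HuggingFace Hub, it can also be loaded locally with the transformers library, hardware permitting. A minimal local-usage sketch, following the model card (the prompt is just an example of mine):
# Minimal local-usage sketch (assumes a GPU with enough VRAM, per the model card)
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "CohereForAI/aya-101"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")

# Example prompt (my own): Aya-101 takes instructions directly in natural language
inputs = tokenizer.encode("Translate to Italian: How are you today?", return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))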
Deploying Aya-101 to Amazon SageMaker
Thanks to the HuggingFace LLM Inference Container, it is very easy to deploy the Aya-101 model to an Amazon SageMaker Endpoint:
1. Get your SageMaker Execution Role:
import boto3
import sagemaker

try:
    # Works when running inside a SageMaker notebook / Studio environment
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker, look up the execution role by name (replace with your own role name)
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='AmazonSageMaker-ExecutionRole-20210217T115101')['Role']['Arn']
2. Configure the model from the HF Hub:
import json

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Hub Model configuration from https://huggingface.co/CohereForAI/aya-101
hub = {
    'HF_MODEL_ID': 'CohereForAI/aya-101',
    'SM_NUM_GPUS': json.dumps(4)  # number of GPUs: this depends on the instance to be used
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.1.0"),
    env=hub,
    role=role,
)
3. Deploy the model (takes 5~7 minutes)
Note that you need a GPU with more than 24 GB of VRAM to load the model in memory, at least until a quantized version is available. For that, we will be using an ml.g5.12xlarge, which costs $7.09/hour; a cheaper option is the ml.g4dn.12xlarge, which costs $5.44/hour but with lower performance (untested).
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    endpoint_name=sagemaker.utils.name_from_base("aya-101"),
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=15 * 60,  # give the container up to 15 minutes to download and load the model
)
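If you come back to this later (for example after restarting the notebook kernel), there is no need to redeploy: you can attach to the running endpoint by name. A minimal sketch, assuming the endpoint is still InService; replace the placeholder with the name generated above:
# Re-attach to an already-deployed endpoint instead of redeploying
from sagemaker.huggingface import HuggingFacePredictor

predictor = HuggingFacePredictor(endpoint_name="<your-aya-101-endpoint-name>")  # placeholder endpoint name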
4. Use the model to generate predictions:
# send request
from textwrap import dedent

response = predictor.predict({
    "inputs": dedent("""
        Translate the following sentence to Thai:
        "Thanks to Cohere Aya-101, now I can make it look like I speak Thai!".
        """),
    "parameters": {
        "do_sample": True,
        "top_p": 0.7,
        "temperature": 0.2,
        "top_k": 50,
        "max_new_tokens": 512,
        "repetition_penalty": 1.03
    }
})
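The HuggingFace LLM Inference Container returns its generations as JSON; assuming the usual output format (a list with one dictionary per generated sequence), the text can be read like this:
# Assuming the default output format: a list of {"generated_text": ...} dictionaries
print(response[0]["generated_text"])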
Using Aya-101 with Llama-Index for RAG use cases
Once the model is up and running in Amazon SageMaker, it can be used with orchestrators to build LLM applications, for example Retrieval-Augmented Generation (RAG) use cases. In this case, I will use the text of the book “Le avventure di Pinocchio: Storia di un burattino” by C. Collodi, which is openly accessible online as part of Project Gutenberg.
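The SimpleDirectoryReader used below expects the book to be available as a plain-text file inside a local data directory. A minimal preparation sketch; the Gutenberg URL is a placeholder, replace it with the actual link to the Italian edition:
# Download the plain-text book into the "data" directory (placeholder URL, replace with the real one)
import os
import urllib.request

os.makedirs("data", exist_ok=True)
book_url = "https://www.gutenberg.org/files/<ebook-id>/<ebook-id>-0.txt"  # placeholder
urllib.request.urlretrieve(book_url, "data/pinocchio.txt")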
# Setting up the embedding model (Bedrock) and the LLM (SageMaker endpoint)
from llama_index.core import Settings
from llama_index.llms.sagemaker_endpoint import SageMakerLLM
from llama_index.embeddings.bedrock import BedrockEmbedding

# Model that will be used to generate the embeddings
# I'm using Cohere in Bedrock but could be using any other embedding model
Settings.embed_model = BedrockEmbedding(model="cohere.embed-multilingual-v3", region_name="us-west-2")

# Model that will be used to generate the answer given the most relevant chunks
Settings.llm = SageMakerLLM(endpoint_name=predictor.endpoint_name)
# Llama Index basic RAG example
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Step 1 - Load Data
documents = SimpleDirectoryReader("data").load_data()

# Step 2 - Index Data
index = VectorStoreIndex.from_documents(documents, transformations=[SentenceSplitter(chunk_size=300)])

# Step 3 - Query ("What is the name of Pinocchio's father?")
query_engine = index.as_query_engine()
response = query_engine.query("Come si chiama il padre di Pinocchio?")
print(response)
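Finally, since the endpoint keeps billing by the hour while it is running, remember to clean it up when you are done:
# Clean up to stop incurring costs
predictor.delete_model()
predictor.delete_endpoint()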