Using Aya-101 in Amazon SageMaker
What is Aya-101?
Quoting Cohere for AI from the model card page:
The Aya model is a massively multilingual generative language model that follows instructions in 101 languages. Aya outperforms mT0 and BLOOMZ on a wide variety of automatic and human evaluations despite covering double the number of languages. The Aya model is trained using xP3x, Aya Dataset, Aya Collection, a subset of DataProvenance collection and ShareGPT-Command. We release the checkpoints under an Apache-2.0 license to further our mission of multilingual technologies, empowering a multilingual world.
- ✅ 13B parameters — a relatively small model
- ✅ 101 languages supported — complete list here
- ✅ Apache 2.0 license — can be used for commercial purposes
- ✅ Paper available — Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
- ✅ Available on the HuggingFace Hub and works with the HuggingFace LLM Inference Container
- ❌ Tends to give rather short answers
- ❌ No public embedding-specific model
- ⚠️ No quantized version yet — requires a GPU with more than 24 GB of VRAM
- ⚠️ Based on the mT5 architecture (multilingual T5), the same encoder-decoder family as FLAN-T5. State of the art for multilingual tasks such as translation, but performance might be lacking on other tasks (a quick local-usage sketch follows this list)
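Since Aya-101 is a standard seq2seq checkpoint on the HuggingFace Hub, it can also be loaded locally with the transformers library, hardware permitting. A minimal local-usage sketch, following the model card (the prompt is just an example of mine):
# Minimal local-usage sketch (assumes a GPU with enough VRAM, per the model card)
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "CohereForAI/aya-101"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")

# Example prompt (my own): Aya-101 takes instructions directly in natural language
inputs = tokenizer.encode("Translate to Italian: How are you today?", return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))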
Deploying Aya-101 to Amazon SageMaker
Thanks to the HuggingFace LLM Inference Container, it is very easy to deploy the Aya-101 model to an Amazon SageMaker Endpoint:
1. Get your SageMaker Execution Role:
import boto3
import sagemaker

try:
    # Works when running inside a SageMaker notebook / Studio environment
    role = sagemaker.get_execution_role()
except ValueError:
    # Outside SageMaker, look up the execution role by name (replace with your own role name)
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='AmazonSageMaker-ExecutionRole-20210217T115101')['Role']['Arn']
2. Configure the model from the HF Hub:
import json

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Hub Model configuration from https://huggingface.co/CohereForAI/aya-101
hub = {
    'HF_MODEL_ID': 'CohereForAI/aya-101',
    'SM_NUM_GPUS': json.dumps(4)  # number of GPUs: this depends on the instance to be used
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface", version="1.1.0"),
    env=hub,
    role=role,
)
3. Deploy the model (takes 5~7 minutes)
Note that you need a GPU with more than 24 GB of VRAM to load the model in memory, at least until a quantized version is available. For that, we will be using an ml.g5.12xlarge, which costs $7.09/hour; a cheaper option is the ml.g4dn.12xlarge, which costs $5.44/hour but with lower performance (untested).
# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    endpoint_name=sagemaker.utils.name_from_base("aya-101"),
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=15 * 60,  # give the container up to 15 minutes to download and load the model
)
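If you come back to this later (for example after restarting the notebook kernel), there is no need to redeploy: you can attach to the running endpoint by name. A minimal sketch, assuming the endpoint is still InService; replace the placeholder with the name generated above:
# Re-attach to an already-deployed endpoint instead of redeploying
from sagemaker.huggingface import HuggingFacePredictor

predictor = HuggingFacePredictor(endpoint_name="<your-aya-101-endpoint-name>")  # placeholder endpoint name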
4. Use the model to generate predictions:
# send request
from textwrap import dedent

response = predictor.predict({
    "inputs": dedent("""
        Translate the following sentence to Thai:
        "Thanks to Cohere Aya-101, now I can make it look like I speak Thai!".
        """),
    "parameters": {
        "do_sample": True,
        "top_p": 0.7,
        "temperature": 0.2,
        "top_k": 50,
        "max_new_tokens": 512,
        "repetition_penalty": 1.03
    }
})
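The HuggingFace LLM Inference Container returns its generations as JSON; assuming the usual output format (a list with one dictionary per generated sequence), the text can be read like this:
# Assuming the default output format: a list of {"generated_text": ...} dictionaries
print(response[0]["generated_text"])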
Using Aya-101 with Llama-Index for RAG use cases
Once the model is up and running in Amazon SageMaker, it can be used with orchestrators to build LLM applications, for example Retrieval-Augmented Generation (RAG) use cases. In this case, I will use the text of the book “Le avventure di Pinocchio: Storia di un burattino” by C. Collodi, which is openly accessible online as part of Project Gutenberg.
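The SimpleDirectoryReader used below expects the book to be available as a plain-text file inside a local data directory. A minimal preparation sketch; the Gutenberg URL is a placeholder, replace it with the actual link to the Italian edition:
# Download the plain-text book into the "data" directory (placeholder URL, replace with the real one)
import os
import urllib.request

os.makedirs("data", exist_ok=True)
book_url = "https://www.gutenberg.org/files/<ebook-id>/<ebook-id>-0.txt"  # placeholder
urllib.request.urlretrieve(book_url, "data/pinocchio.txt")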
# Setting up the embedding model (Bedrock) and the LLM (SageMaker endpoint)
from llama_index.core import Settings
from llama_index.llms.sagemaker_endpoint import SageMakerLLM
from llama_index.embeddings.bedrock import BedrockEmbedding

# Model that will be used to generate the embeddings
# I'm using Cohere in Bedrock but could be using any other embedding model
Settings.embed_model = BedrockEmbedding(model="cohere.embed-multilingual-v3", region_name="us-west-2")

# Model that will be used to generate the answer given the most relevant chunks
Settings.llm = SageMakerLLM(endpoint_name=predictor.endpoint_name)
# Llama Index basic RAG example
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Step 1 - Load Data
documents = SimpleDirectoryReader("data").load_data()

# Step 2 - Index Data
index = VectorStoreIndex.from_documents(documents, transformations=[SentenceSplitter(chunk_size=300)])

# Step 3 - Query ("What is the name of Pinocchio's father?")
query_engine = index.as_query_engine()
response = query_engine.query("Come si chiama il padre di Pinocchio?")
print(response)
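Finally, since the endpoint keeps billing by the hour while it is running, remember to clean it up when you are done:
# Clean up to stop incurring costs
predictor.delete_model()
predictor.delete_endpoint()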