
Tool Calling with Amazon SageMaker AI and DJL Serving Inference

Blog post inspired by the Tool Calling Support in LMI doc.

5 min read · May 15, 2025

Agent-based workflows are emerging as one of the most powerful design patterns for enterprise AI. These workflows rely on the ability of Large Language Models (LLMs) to reason, make decisions, and call tools (functions) dynamically to retrieve external knowledge or perform operations. In this blog, we demonstrate how you can implement this pattern using Amazon SageMaker AI, DJL Serving v0.33, and the Mistral-Small-24B-Instruct-2501 model available in SageMaker JumpStart.

We will follow the official DIY Agents with SageMaker and Bedrock — Tool Calling Notebook, which provides a practical example of enabling tool calling using DJL Serving with the vLLM backend.

The Architecture

Function calling loop for autonomous AI agents — picture from the author
  1. Model Deployment: You deploy a model using the Amazon SageMaker LMI container powered by DJL Serving v0.33 (e.g., djl-inference:0.33.0-lmi15.0.0-cu128). The complete list of available images is here.
  2. Prompt with Function Schema: You send a prompt along with a schema that describes your tool (function) using the OpenAI-style JSON format supported by DJL Serving.
  3. Model Suggests a Tool Call: The model responds by requesting a call to one of your functions, together with the arguments to use.
  4. Function Execution and Response Injection: Your application executes the requested function, injects the result back into the conversation, and continues the interaction (a minimal sketch of this loop follows).
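
Before walking through the example, here is a minimal sketch of that loop in Python. It assumes a hypothetical chat helper that wraps the SageMaker endpoint invocation and a hypothetical tool_registry dict mapping tool names to Python callables; the concrete pieces are built step by step in the sections below.

import json

def agent_loop(messages, tools, chat, tool_registry, max_turns=5):
    # Drive the model until it stops asking for tools (hypothetical helper)
    for _ in range(max_turns):
        response = chat(messages=messages, tools=tools)  # call the SageMaker endpoint
        choice = response["choices"][0]
        message = choice["message"]
        messages.append(message)  # keep the assistant turn in the history
        if choice["finish_reason"] != "tool_calls":
            return message["content"]  # final answer reached
        for tool_call in message["tool_calls"]:  # run every requested tool
            fn = tool_registry[tool_call["function"]["name"]]
            args = json.loads(tool_call["function"]["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call["id"],
                "name": tool_call["function"]["name"],
                "content": json.dumps(fn(**args)),
            })
    raise RuntimeError("Agent did not produce a final answer within max_turns")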

Step-by-Step Example

1. Deploy the endpoint

In our example, we will use mistral-small-24b-instruct-2501, which is available on Amazon SageMaker JumpStart. You can deploy it via the UI or with the API:

from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
from sagemaker.enums import EndpointType
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements


resources = ResourceRequirements(
    requests={
        "num_accelerators": 4,  # Number of accelerators required
        "memory": 72 * 1024,    # Minimum memory required in MB (required)
        "copies": 1,
    }
)

model = JumpStartModel(
    model_id="huggingface-llm-mistral-small-24B-Instruct-2501",
    model_version="2.0.1",
    instance_type="ml.g6.12xlarge",
)
predictor = model.deploy(
    accept_eula=True,
    initial_instance_count=1,
    instance_type="ml.g6.12xlarge",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
    endpoint_type=EndpointType.INFERENCE_COMPONENT_BASED,
    resources=resources,
)
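
The predictor returned by deploy() already points at the new endpoint. If you plan to invoke it from another notebook or session, note the endpoint name (and, for inference-component-based endpoints, the component name) for later use. A quick check, assuming the predictor object from above (the component_name attribute depends on your SageMaker SDK version):

# Names you will need to invoke the endpoint from elsewhere
print(predictor.endpoint_name)
print(getattr(predictor, "component_name", None))  # set when using inference components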

You can also deploy an arbitrary model from the Hugging Face Hub. Here is the code to do so. I strongly recommend checking out the links in the comments to learn more about DJL LMI before proceeding with the deployment.

import sagemaker
from sagemaker.enums import EndpointType
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

model_id = "watt-ai/watt-tool-8B"
model_name = sagemaker.utils.name_from_base(model_id.split("/")[-1])
endpoint_name = sagemaker.utils.name_from_base("tool-calling-endpoint")

model = sagemaker.model.Model(
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.33.0-lmi15.0.0-cu128",
    name=model_name,
    env={
        "HF_MODEL_ID": model_id,
        "SM_NUM_GPUS": "4",
        # "OPTION_QUANTIZE": "fp8",  # Does not work with watt-tool-8B, but can be used with other models
        "OPTION_MAX_MODEL_LEN": f"{1024*5}",
        "OPTION_GPU_MEMORY_UTILIZATION": "0.95",
        "OPTION_ROLLING_BATCH": "vllm",  # Mandatory for Tool Calling with DJL LMI
        # Learn more about Tool Calling with DJL LMI using vLLM here: https://bit.ly/djl-lmi-tools
        "OPTION_ENABLE_AUTO_TOOL_CHOICE": "true",
        "OPTION_TOOL_CALL_PARSER": "pythonic",  # List of tool call parsers: https://bit.ly/vllm-tool-parsers
    },
    sagemaker_session=sagemaker.Session(),
    role=sagemaker.get_execution_role(),  # Provide your own IAM ARN if running outside of SageMaker Studio
)

resources = ResourceRequirements(
    requests={
        "num_accelerators": 4,  # Number of accelerators required
        "memory": 70 * 1024,    # Minimum memory required in MB (required)
        "copies": 1,
    }
)
predictor = model.deploy(
    endpoint_name=endpoint_name,
    instance_type="ml.g6.12xlarge",
    initial_instance_count=1,
    endpoint_type=EndpointType.INFERENCE_COMPONENT_BASED,
    inference_component_name=sagemaker.utils.name_from_base(model_id.split("/")[-1] + "-comp"),
    resources=resources,
)

In this example, I use SageMaker inference components for both the SageMaker JumpStart and the Hugging Face Hub models. This is the suggested approach, as it makes your endpoints future-proof and compatible with scale down to zero (ref: this blog), and it allows you to delete one component without tearing down the endpoint itself (faster deployment of new models).
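
For example, to retire one model while keeping the endpoint (and any other components hosted on it) running, you can delete just its inference component. Below is a minimal sketch with boto3; the component name is the one you passed (or that was generated) at deploy time, and the exact response shape may vary slightly across SDK versions.

import boto3

sm = boto3.client("sagemaker")

# Inspect which inference components are currently hosted on the endpoint
response = sm.list_inference_components(EndpointNameEquals=endpoint_name)
for comp in response["InferenceComponents"]:
    print(comp["InferenceComponentName"], comp["InferenceComponentStatus"])

# Remove a single component; the endpoint and its other components stay up
sm.delete_inference_component(
    InferenceComponentName="<your-inference-component-name>"  # hypothetical placeholder
)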

2. Define the Tool (Function) to be used and its Schema

First, define the function in Python. Then, describe your tool in a JSON format.

# Function definition
def get_top_song(sign):
    """Returns the most popular song for the requested station.

    Args:
        sign (str): The call sign of the station for which you want
            the most popular song.

    Returns:
        response (dict): The most popular song and artist.
    """
    song = ""
    artist = ""
    if sign == 'WZPZ':
        song = "Elemental Hotel"
        artist = "8 Storey Hike"
    else:
        raise Exception(f"Station {sign} not found.")

    return {
        "song": song,
        "artist": artist
    }

# Schema definition
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_top_song",
            "description": "Get the most popular song played on a radio station.",
            "parameters": {
                "type": "object",
                "properties": {
                    "sign": {
                        "type": "string",
                        "description": "The call sign for the radio station for which you want the most popular song. Example call signs are WZPZ and WKRP."
                    }
                },
                "required": ["sign"],
            },
        },
    }
]
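
Before wiring the tool into a request, it can help to sanity-check the function locally; the dictionary it returns is exactly what we will later send back to the model as a tool message.

# Quick local check of the tool the model will be allowed to call
print(get_top_song("WZPZ"))
# {'song': 'Elemental Hotel', 'artist': '8 Storey Hike'}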

3. Query the endpoint

You send the request to the deployed endpoint, using the input schema defined by DJL:

from sagemaker import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer
import json

# When re-attaching to an existing endpoint, make sure the predictor speaks JSON.
# For inference-component-based endpoints, recent SDK versions also accept component_name=...
predictor = Predictor(
    "your-endpoint-name",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

messages = [
    {"role": "user", "content": "What is the most played song on WZPZ?"},
]

payload = {
    "messages": messages,
    "tools": tools,
    "tool_choice": "auto",  # Requires: OPTION_TOOL_CALL_PARSER, OPTION_ENABLE_AUTO_TOOL_CHOICE
    "max_tokens": 1024,
    "temperature": 0.1,
    "top_p": 0.9,
}

response = predictor.predict(payload)
print(response)

This is the expected output:

{'id': 'chatcmpl-6fb61aeab4294502bb03dcb04b9e3c04',
 'object': 'chat.completion',
 'created': 1747317996,
 'model': 'lmi',
 'choices': [{'index': 0,
              'message': {'role': 'assistant',
                          'reasoning_content': None,
                          'content': None,
                          'tool_calls': [{'id': 'SsUEOjFNU',
                                          'type': 'function',
                                          'function': {'name': 'get_top_song',
                                                       'arguments': '{"sign": "WZPZ"}'}}]},
              'logprobs': None,
              'finish_reason': 'tool_calls',
              'stop_reason': None}],
 'usage': {'prompt_tokens': 224,
           'total_tokens': 249,
           'completion_tokens': 25,
           'prompt_tokens_details': None},
 'prompt_logprobs': None}

Note two things:

  • the tool_calls object in the assistant answer, with a tool call ID, the tool name, and the required arguments
  • the finish_reason = 'tool_calls', which indicates that the request is not complete but needs one or more functions to be executed

You can now parse this information and execute the function as required, then append an additional message with role tool once the tool execution is complete:

import sys
import json

choice = response['choices'][0]

if choice['finish_reason'] == "tool_calls":
    # Keep the assistant turn (with its tool_calls) in the conversation history
    messages.append(choice['message'])
    for tool_call in choice['message']['tool_calls']:
        if tool_call['type'] == 'function':
            name = tool_call['function']['name']
            args = json.loads(tool_call['function']['arguments'])
            tool_use_id = tool_call['id']
            # Execute the function whose name matches tool_call['function']['name']
            tool_foo = getattr(sys.modules[__name__], name)
            result = tool_foo(**args)
            print(result)

            tool_result_message = {
                "role": "tool",
                "tool_call_id": tool_use_id,
                "name": name,
                "content": json.dumps(result)
            }
            messages.append(tool_result_message)

Finally, invoke the endpoint again with the updated list of messages to complete the request:

payload["messages"] = messages
response = predictor.predict(payload)
print(response)

Note that in the final output the finish_reason is now stop, meaning that no more tool calls are required and the final answer has been reached.

## Output ##
{'id': 'chatcmpl-be1f5cb4e016437d8b91b2ffb8eca279',
 'object': 'chat.completion',
 'created': 1747318020,
 'model': 'lmi',
 'choices': [{'index': 0,
              'message': {'role': 'assistant',
                          'reasoning_content': None,
                          'content': 'The most popular song on WZPZ right now is "Elemental Hotel" by 8 Storey Hike.',
                          'tool_calls': []},
              'logprobs': None,
              'finish_reason': 'stop',
              'stop_reason': None}],
 'usage': {'prompt_tokens': 150,
           'total_tokens': 176,
           'completion_tokens': 26,
           'prompt_tokens_details': None},
 'prompt_logprobs': None}
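
If your application only needs the final text, you can pull it straight out of the last response:

final_answer = response['choices'][0]['message']['content']
print(final_answer)
# The most popular song on WZPZ right now is "Elemental Hotel" by 8 Storey Hike.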

Key Learnings

  • Model Awareness: DJL Serving v0.33 allows models to reason about what external tools are available by embedding the tool schema into the prompt.
  • Standardized Communication: By following the chat schema defined by DJL, your application can seamlessly manage multi-turn tool calls.
  • Agentic AI Workflows: This pattern enables your model to move beyond static Q&A, supporting dynamic workflows where the model reasons, calls external services, and incorporates results into the conversation.

Conclusion

This blog demonstrated how to implement Tool Calling on Amazon SageMaker AI using DJL Serving v0.33, following the official DIY Agents with SageMaker and Bedrock Workshop.

By leveraging this approach, you can unlock powerful, production-ready agentic AI patterns directly on AWS, integrating SageMaker-hosted LLMs with external tools and APIs.

Written by Davide Gallitelli

A young Data Engineer, tech enthusiast, and proud geek.
