Access ALL of your data with SageMaker Unified Studio using AWS SDK for pandas (awswrangler) or PySpark

Davide Gallitelli
5 min read · Jan 15, 2025


Authors: Davide Gallitelli, Bruno Pistone
TL;DR: Learn how to efficiently access and analyze datasets in SageMaker Unified Studio using three powerful methods: direct SQL querying, the AWS SDK for pandas (awswrangler), and PySpark with Glue Interactive Sessions. The latter two overcome the 10,000-row limit on SQL query results in JupyterLab notebooks and enable seamless data exploration and ML model training workflows at scale.

Figure 0 — Meet the next generation of SageMaker — Source

SageMaker Unified Studio represents a significant advancement in how data scientists and analysts work with their data and AI workflows. It introduces a connected workspace where teams can easily browse and access data assets, perform exploratory analysis, and seamlessly transition into training ML/GenAI models.

In today’s data-driven organizations, a common scenario unfolds like this: Data Engineering teams meticulously prepare and make data assets available in the SageMaker Data Catalog. As Data Scientists, we subscribe to these assets, and thanks to SageMaker Data Governance’s fine-grained access control, data access is securely managed through the SageMaker Lakehouse. This ensures data security and compliance while creating an interesting technical consideration: how do we efficiently access and analyze these datasets in our notebooks?

While the platform offers various ways to interact with data through JupyterLab notebooks, including SQL queries across different compute engines (Athena, Redshift, SageMaker Lakehouse), there are opportunities to enhance this experience even further using the AWS SDK for pandas (awswrangler) library, or by adopting the native serverless PySpark integration of SageMaker Unified Studio with Glue Interactive Sessions. These powerful tools can help you maximize the potential of your integrated analytics and AI environment.

In this blog, we’ll first look at direct SQL querying, then explore two straightforward methods that enable Data Scientists to:

✅ Access and query subscribed data assets effectively
✅ Work with complete datasets without size limitations
✅ Utilize both Python pandas and SQL approaches
✅ Leverage the full power of AWS SDK for pandas

Whether you’re performing exploratory data analysis or preparing data for ML model training, these approaches will help you establish an efficient workflow within your SageMaker Unified Studio environment.

Direct SQL Querying in the Notebook

SageMaker Unified Studio introduces a connected workspace where it’s possible to easily browse and access data assets, perform exploratory analysis and data preparation on them, and then use them for training ML/GenAI models.

Figure 1 — Some of the data sources available in your SageMaker Unified Studio project

One of the ways this can be achieved is directly from the JupyterLab notebooks available in SageMaker Unified Studio. JupyterLab notebooks support changing the connection type and the compute type at the cell level, and with this we can run SQL queries directly in the notebook on our preferred compute engine (Athena, Redshift, Lakehouse). You can retrieve the names for the database and the table from the Data tab in the JupyterLab sidebar, as shown in Figure 1.

Figure 2 — Native SQL Querying in JupyterLab notebooks, thanks to SageMaker Unified Studio

Make sure to select the right configuration (SQL and project.athena or project.lakehouse) before running your SQL code, as in Figure 2. Once the query completes, loading its result into a pandas DataFrame takes just a few lines of code:

import pandas as pd

# `_` holds the output of the previous notebook cell, i.e. the SQL result set
df = _.to_pandas()
df.head()

However, SQL access to data from Amazon Athena or Redshift directly from notebooks is currently (as of Jan 14, 2025) limited to 10,000 rows when querying from JupyterLab in SageMaker Unified Studio. This limitation can force us to either write multiple queries and combine the results, adding cost and complexity, or sample our data, potentially missing important patterns and signals. Neither workaround is ideal when we need complete datasets for analysis or model training. The following examples overcome this issue.

Using Python and AWS SDK for pandas (awswrangler)

Figure 3 — AWS SDK for pandas (awswrangler) logo — source

AWS SDK for pandas (awswrangler) is an open-source Python library that extends the power of pandas to AWS services. It provides an efficient way to move data between pandas DataFrames and AWS data services such as Amazon Athena, Amazon Redshift, AWS Glue, Amazon S3, and more. Let’s walk through how to set up and use AWS SDK for pandas in a SageMaker Unified Studio environment to read data from the Data Catalog using Amazon Athena.

First, install the AWS SDK for pandas library:

%pip install awswrangler

To access the SageMaker Unified Studio Project’s Athena workgroup, we’ll need to get the Environment ID. Here’s how:

from sagemaker_studio import Project

# Retrieve the current project's metadata, including its Athena connection
project = Project()
environment_id = project.connection("project.athena").environment_id
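
Behind the scenes, each project gets a dedicated Athena workgroup whose name is derived from this environment ID. The workgroup-<environment_id> pattern below is what we observed in our projects; verify it in your own environment:

# Assumption: project Athena workgroups are named "workgroup-<environment_id>"
workgroup = f"workgroup-{environment_id}"
print(workgroup)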

Specify database and table names:

db_name = "your_database_name" # e.g. glue_db_ABCDEFGHIJKLMN
table = "your_table_name" # e.g. my-sales-record

Now comes the magic part. Using AWS SDK for pandas, we can read the entire table without row limits:

import awswrangler as wr

# Read the entire table through Athena; the 10,000-row notebook limit does not apply
df = wr.athena.read_sql_table(
    table=table,
    database=db_name,
    workgroup=f"workgroup-{environment_id}",
    ctas_approach=False,  # required, as glue:DeleteTable is not allowed
)
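
You are not limited to full-table reads: the same library runs arbitrary SQL through Athena and returns the complete result as a pandas DataFrame. Here is a minimal sketch, reusing the database and workgroup from above; the WHERE clause and its column are hypothetical placeholders:

# Run an arbitrary SQL query through Athena; the full result set is returned
df_filtered = wr.athena.read_sql_query(
    sql=f'SELECT * FROM "{db_name}"."{table}" WHERE year = 2024',  # hypothetical filter
    database=db_name,
    workgroup=f"workgroup-{environment_id}",
    ctas_approach=False,  # same permission constraint as above
)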

Using PySpark and Glue Interactive Sessions

SageMaker Unified Studio enhances JupyterLab notebooks by enabling direct execution of PySpark code through serverless Glue Interactive sessions. This integration provides a powerful way to process large-scale data transformations without managing infrastructure.

Figure 4 — Configuring PySpark backend to the notebook cell

To maintain secure data access aligned with our governance model, we first need to enable fine-grained access control through Lake Formation using this magic command:

%%configure --name project.spark
{
    "enable-lakeformation-fine-grained-access": "true"
}

Once configured, we can seamlessly access data through SageMaker Lakehouse and leverage PySpark’s powerful data transformation capabilities:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("connection") \
    .getOrCreate()

# Backticks are required here because these example names contain hyphens
df = spark.table("`your-database-2`.`your-table-name-2`")
df.show()

Figure 5 — Result of the execution of the PySpark cell
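
From here, you can push heavy transformations down to Spark and only bring small, aggregated results back into pandas when needed. A minimal sketch, assuming a hypothetical category column in the table above:

# Aggregate in Spark first, then collect the (now small) result into pandas
summary = (
    df.groupBy("category")  # hypothetical column name
      .count()
      .orderBy("count", ascending=False)
)
pdf = summary.toPandas()  # safe to collect once the data has been aggregated
pdf.head()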

Conclusions

SageMaker Unified Studio’s vision of bringing together analytics and AI capabilities in a single environment marks a significant step forward in streamlining data science workflows. By incorporating AWS SDK for pandas and serverless PySpark via Glue Interactive Sessions into your toolkit, you can further enhance this integrated experience, making data processing more efficient and scalable. These approaches not only simplify your data access patterns but also provide a robust foundation for seamless transitions between data exploration, analysis, and model training. The combination of SageMaker Unified Studio’s connected workspace with these powerful tools creates an optimal environment for data scientists and analysts to collaborate and innovate effectively.

Happy coding! 🚀 If this content has been useful, please leave a clap 👏 or a comment 🗯. This will let us know that our work has been appreciated! 😄

Written by Davide Gallitelli

A young Data Engineer, passionate about tech and a proud geek.
