Avoiding Rookie Mistakes when using AWS Glue

Why AWS Glue?

AWS Glue, a managed extract, transform, and load (ETL) service built on Apache Spark, has quickly gained popularity due to the breadth of functionality it offers. AWS Glue makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.

AWS Glue includes features such as the AWS Glue Data Catalog, which lets you catalog data assets and make them available across the AWS analytics services; the AWS Glue crawler, which performs data discovery on data sources; and AWS Glue jobs, which execute the ETL in your pipeline in either Scala or PySpark. Additionally, significant focus has been dedicated to enabling users of varying backgrounds to interact with AWS Glue through a graphical interface using services such as AWS Glue DataBrew and AWS Glue Studio. This variety of features and functionality can drastically reduce the time spent developing an ETL solution: from months down to minutes!

While working with AWS Glue can greatly accelerate the path to an operational ETL pipeline, there are some common mistakes that new AWS Glue users often make. Even engineers experienced with ETL, and even with Spark itself, can make these rookie mistakes if they are unaware of certain caveats specific to the AWS Glue service. This article discusses rookie mistakes made by AWS Glue users in three areas: source data, AWS Glue crawlers, and ETL jobs.

Source Data

Many of the issues encountered across a data pipeline could have been mitigated had extra attention been given to the source data being ingested. Source data issues manifest in many different forms and can be difficult to diagnose, so investing time to review these common mistakes will save you time and future headaches.

Using Incorrect File Formats

File formats are one of the most critical aspects of your data pipeline. A primary reason is a fundamental Spark constraint: the input data set either needs to be splittable or needs to fit into Spark executor memory. Splittable file types include formats like CSV, Parquet, and ORC, while XML and JSON are generally not splittable. Note that you should also consider the compression codec of the input dataset, as it influences whether a file is splittable: gzip-compressed text files, for example, cannot be split, while bzip2-compressed files can. Wherever possible, use columnar file formats like Parquet for maximum performance efficiency.
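As a rough illustration of how the compression codec interacts with splittability for row-based text formats, the small lookup below encodes the common cases. This is a sketch: actual behavior also depends on the file format itself, and columnar formats like Parquet remain splittable because compression is applied per block rather than over the whole file.

```python
# Splittability of common compression codecs when applied to row-based
# text formats (e.g. CSV) in Spark. A non-splittable file must be read
# by a single executor, so it must fit into that executor's memory.
CODEC_SPLITTABLE = {
    "none": True,    # uncompressed text splits on line boundaries
    "bzip2": True,   # block-based codec, splittable
    "gzip": False,   # stream codec, the whole file is one split
    "snappy": False, # standalone snappy on text files is not splittable
}

def is_splittable(codec: str) -> bool:
    """Return True if a text file compressed with this codec can be split."""
    return CODEC_SPLITTABLE.get(codec.lower(), False)
```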

Not Verifying Quality of Data

Failing to verify the quality of the data in your pipeline can lead to strange symptoms: AWS Glue crawlers misinterpreting data, values landing in the wrong columns after being processed by your ETL, and all sorts of other unexpected behavior. One key point is that the AWS Glue service will have issues if files are not UTF-8 encoded. A pro tip here is to use a service like AWS Glue DataBrew to explore the quality of the dataset you are working with. You will also want to incorporate data cleansing and verification into your ETL processes to handle these “bad records”. If you are having trouble identifying which files certain values came from, Amazon Athena offers a “$path” pseudocolumn that returns the source file alongside the fields. Once you have identified the file, further steps can be taken to cleanse and reconcile the data.
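As a minimal sketch of the kind of record-level validation worth building into an ETL job, the predicate below rejects records with missing or empty required fields. The required column names and sample records are hypothetical; the same idea applies to any dataset.

```python
REQUIRED_FIELDS = ["id", "event_date"]  # hypothetical required columns

def is_valid(record: dict) -> bool:
    """Reject records with missing, None, or empty required fields."""
    return all(record.get(f) not in (None, "") for f in REQUIRED_FIELDS)

# Hypothetical sample records illustrating clean and "bad" rows.
records = [
    {"id": "1", "event_date": "2021-01-01"},
    {"id": "", "event_date": "2021-01-02"},   # bad: empty id
    {"id": "3"},                              # bad: missing event_date
]
good = [r for r in records if is_valid(r)]
bad = [r for r in records if not is_valid(r)]
```

In a Glue script, the same predicate could be passed to the Filter transform (`Filter.apply(frame=dyf, f=is_valid)`) to split bad records out of a DynamicFrame for later reconciliation.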

Operating Over Small Files

When a DynamicFrame or DataFrame is created, the Spark driver builds an in-memory list of every file to be included, and only once that listing completes does it create the object; with a very large number of small files, this listing alone can exhaust driver memory. There are a few pro tips for overcoming this issue. Take advantage of the “useS3ListImplementation” feature in AWS Glue, which lazily loads the S3 object listing in batches, making the driver far less likely to run out of memory. Note that per AWS best practice, this feature should be enabled together with job bookmarks for maximum efficiency. Additionally, the grouping option in AWS Glue coalesces multiple files into a group when reading, so that tasks operate on an entire group instead of a single file. Fewer tasks to maintain mean less memory pressure on the driver, better performance, and a more robust ETL job.
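A sketch of how both options can be combined when reading a catalog table backed by many small S3 files. The database and table names are hypothetical, and the 128 MB group size is just an illustrative value (groupSize is specified in bytes).

```python
# Options for reading many small S3 files in an AWS Glue job:
# - useS3ListImplementation: list S3 objects lazily, in batches,
#   instead of materializing the full listing on the driver at once
# - groupFiles / groupSize: coalesce files into ~128 MB groups so each
#   Spark task processes a group rather than a single small file
S3_READ_OPTIONS = {
    "useS3ListImplementation": True,
    "groupFiles": "inPartition",
    "groupSize": "134217728",  # target group size in bytes (128 MB)
}

def read_small_files(glueContext, database, table_name):
    """Sketch: build a DynamicFrame with lazy S3 listing and file grouping."""
    return glueContext.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        additional_options=S3_READ_OPTIONS,
        transformation_ctx="read_small_files",
    )
```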

AWS Glue Crawler

AWS Glue crawlers are used by many customers to offload the work of determining and defining the schema of structured and semi-structured datasets. Optimally configuring the crawler plays a key role in the ETL pipeline: crawlers often run before or after ETL jobs, so the time it takes to crawl a dataset impacts the overall completion time of the pipeline. Additionally, the dataset metadata determined by the crawler is used across most of the AWS analytics services, so it is critical to ensure that the Data Catalog definition is correct.

Not Crawling a Subset of Data

When an AWS Glue crawler performs discovery over a dataset, it crawls the data store, keeps track of what has already been crawled and what has changed, and updates the Data Catalog with its findings. If you have not enabled incremental crawls via the “Crawl new folders only” option, your crawler will repeat data discovery on files it already crawled in past runs, resulting in very long-running crawls. Additionally, where possible, take advantage of include and exclude patterns to further narrow the crawl to only the data that needs it.

If you are dealing with a very large dataset, one of the most helpful things you can do to reduce the overall crawl time is to use several smaller crawlers, each pointing at a subset of the data, rather than one large crawler pointing at the parent location. Lastly, you can take advantage of one of the more recently released features: data sampling for databases and “sample size” for S3 data stores. With this configuration, only a sample of the data is crawled, rather than every file or record in a table.
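The scoping techniques above can be sketched as a single boto3 `create_crawler` call. The crawler name, IAM role, database, S3 path, and exclusion patterns below are all hypothetical placeholders; the structure of the parameters follows the AWS Glue API.

```python
# Sketch of a scoped crawler definition combining incremental crawls,
# exclude patterns, and S3 sampling, expressed as create_crawler parameters.
CRAWLER_PARAMS = {
    "Name": "sales-data-crawler",                              # hypothetical
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical
    "DatabaseName": "sales_db",                                # hypothetical
    "Targets": {
        "S3Targets": [{
            "Path": "s3://my-bucket/sales/",                   # hypothetical
            "Exclusions": ["**.tmp", "_metadata/**"],  # skip scratch files
            "SampleSize": 10,  # crawl only 10 files per leaf folder
        }]
    },
    # Only crawl folders added since the last successful crawl.
    "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
}

def create_scoped_crawler(glue_client):
    """Sketch: register the crawler with the AWS Glue service."""
    return glue_client.create_crawler(**CRAWLER_PARAMS)
```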

Crawler Discovery Not Optimally Configured

One of the most puzzling issues AWS Glue users come across is crawlers creating far more tables than expected. A pro tip to mitigate this issue is to make sure the option “Create a single schema for each S3 path” is selected. The crawler considers both data compatibility and schema similarity, meaning that if each S3 path meets those requirements, it will group the data into a single table. If the crawler is still creating more tables than expected, investigate the underlying data to determine why the schemas are being inferred as dissimilar.
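When creating crawlers through the API rather than the console, the “Create a single schema for each S3 path” option corresponds to the crawler Configuration JSON shown below; the crawler name is a hypothetical placeholder.

```python
import json

# Crawler Configuration JSON equivalent to the console option
# "Create a single schema for each S3 path".
SINGLE_SCHEMA_CONFIG = json.dumps({
    "Version": 1.0,
    "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
})

def apply_single_schema_policy(glue_client, crawler_name):
    """Sketch: apply the single-schema grouping policy to an existing crawler."""
    return glue_client.update_crawler(
        Name=crawler_name,
        Configuration=SINGLE_SCHEMA_CONFIG,
    )
```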

AWS Glue Jobs

AWS Glue jobs are the heart of the AWS Glue service and are responsible for executing the ETL. Although the AWS Glue console and services such as AWS Glue Studio will generate ETL scripts for you, many customers opt to write their own Spark code, and a lot can go wrong. I have highlighted a few of the most common rookie mistakes made in AWS Glue jobs, as well as some pro tips to overcome them.

Not Using DynamicFrames Correctly

A DynamicFrame is a feature specific to the AWS Glue service and offers several optimizations over the traditional Spark DataFrame. A DataFrame requires a schema to be specified, which means two passes are made over the data: one to infer the schema and a second to load the data. In contrast, a DynamicFrame is self-describing, so transformations on the dataset can begin immediately. DynamicFrames also make it easier to handle unexpected values in your dataset. A word of caution: some operations still require a DataFrame. Avoid unnecessary conversions between DynamicFrames and DataFrames, since each conversion is a costly operation. When a conversion is unavoidable, the pattern should be: start with a DynamicFrame, convert to a DataFrame only when more complex methods are required, and convert back to a DynamicFrame for the final write.
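The recommended pattern can be sketched as a single round trip; the `dropDuplicates` transform stands in for any DataFrame-only operation, and the awsglue import is local so the sketch reads standalone.

```python
def transform_with_dataframe(dyf, glueContext):
    """Sketch: DynamicFrame in, DynamicFrame out, with one round trip
    through a DataFrame for a Spark-only method."""
    from awsglue.dynamicframe import DynamicFrame

    df = dyf.toDF()           # costly conversion #1: to DataFrame
    df = df.dropDuplicates()  # a DataFrame-only operation
    # costly conversion #2: back to a DynamicFrame for the final write
    return DynamicFrame.fromDF(df, glueContext, "transformed")
```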

Not Using Job Bookmarks Correctly

Job bookmarks are a powerful AWS Glue feature that tracks which data has already been processed in each run of an ETL job against S3 or JDBC sources. Using a bookmark, you can "rewind" a job to reprocess a subset of data, or reset the bookmark in a backfilling scenario. Most customers run into issues with job bookmarks in jobs they have developed themselves, because bookmarks require several AWS Glue-specific elements in the script; if these critical components are missing, the bookmark will not function correctly. If you are struggling to understand how to use job bookmarks, the best tip is to let AWS Glue generate an ETL script and use it as a reference for your own implementation.
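The bookmark-specific boilerplate can be sketched as follows: the job must be run with the bookmark option enabled, `job.init()` must run before any reads, each source and sink needs a unique `transformation_ctx`, and `job.commit()` at the end records the bookmark state. The database and table names are hypothetical, and the awsglue import is local so the sketch reads standalone.

```python
# Job parameter that enables bookmarks, passed as a job argument or set
# as a default argument on the job definition.
BOOKMARK_ARGS = {"--job-bookmark-option": "job-bookmark-enable"}

def run_with_bookmarks(glueContext, args):
    """Sketch of the boilerplate job bookmarks require."""
    from awsglue.job import Job

    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)  # must run before any reads

    dyf = glueContext.create_dynamic_frame.from_catalog(
        database="my_db",              # hypothetical
        table_name="events",           # hypothetical
        transformation_ctx="source0",  # required for bookmark tracking
    )
    # ... transforms and writes, each with its own transformation_ctx ...
    job.commit()  # persists the bookmark state for the next run
```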

Not Partitioning Correctly

Correctly partitioning your data may be the single greatest contributor to performance across your pipeline, and it is an opportunity many customers miss. Note that if you are new to Spark and relying on the AWS Glue generated scripts from the console, by default they do not partition the job output when writing. When developing an AWS Glue job, make sure the output being written is partitioned so you can take advantage of powerful features such as partition pruning. The example code below demonstrates how this can be achieved:

datasink4 = glueContext.write_dynamic_frame.from_options(
    frame = dropnullfields3,
    connection_type = "s3",
    connection_options = {"path": "s3://dataoutput/", "partitionKeys": ["year", "month", "day"]},
    format = "parquet",
    transformation_ctx = "datasink4")

Another closely related partitioning feature that AWS Glue users often miss is the pushdown predicate. Pushdown predicates allow filtering on partitions without having to list and read every file in the dataset; the predicate is applied to the partition metadata in the Data Catalog before any S3 listing occurs. Below is a code sample showing how a pushdown predicate can be supplied when creating a DynamicFrame:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "mission_db",
    table_name = "data",
    transformation_ctx = "datasource0",
    push_down_predicate = "(partition_0 == '2021' and partition_1 == '01')")

AWS Glue is a powerful service; when best practices are followed, it can be one of the most useful, versatile, and robust tools in your AWS data ecosystem. In this article, I have offered suggestions and best practices for overcoming common mistakes customers encounter when using AWS Glue. For more pro tips on implementing your AWS Glue pipeline, follow the link below to watch a video where I walk through this content and more.

Written by
Data and Analytics
