Recapping re:Invent 2021 - Amazon EMR, Amazon Redshift, and AWS Lake Formation.
AWS recently released a succession of improvements to its data and analytics services, so expectations were high for most of us heading into re:Invent. Let me tell you, AWS did not disappoint.
AWS launched a series of game changers for the analytics world, chiefly serverless and on-demand options for several analytics services that have been notoriously difficult for customers to manage: Amazon EMR Serverless, Amazon Redshift Serverless, Amazon MSK Serverless, and Amazon Kinesis Data Streams On-Demand. There were also major improvements to AWS Lake Formation, AWS Glue, and Amazon Athena. In this update, I will focus on some of the AWSome updates from re:Invent, primarily on Amazon EMR, Amazon Redshift, and AWS Lake Formation.
One of the common pain points for users running workloads on Amazon EMR is that several EMR-specific settings require fine-tuning and testing to ensure they are optimal for the given workload. This can be a massive undertaking for customers who are new to EMR, whether they are migrating from other Spark platforms or just getting started with big data frameworks.
Examples of tasks that customers frequently struggle with include right-sizing the cluster: choosing the correct instance type, the number of nodes, and the autoscaling configuration. EMR Serverless aims to handle this and more by removing the burden of managing the EMR cluster from the customer. Currently in preview, it promises to let customers worry only about the job itself. How does it do this?
EMR Serverless will automatically add or remove workers throughout the job so that you are only billed for the workers you actually need at any given second. This is a big improvement over the long-battled issue of configuring the correct metrics to autoscale the EMR cluster. It is even an improvement over EMR Managed Scaling, which requires a minimum number of nodes at cluster launch. Allowing the EMR service to control the number of “workers” allocated to an application will hopefully result in cost savings and a more performant application for customers everywhere.
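To make the billing argument concrete, here is a minimal sketch comparing worker-seconds billed when capacity follows demand second by second against a cluster sized for peak demand. The demand profile and function names are made up for illustration, and the model assumes scaling is instantaneous, which a real service will not be.

```python
def worker_seconds_autoscaled(workers_needed_per_second):
    """Worker-seconds billed when capacity tracks demand every second."""
    return sum(workers_needed_per_second)


def worker_seconds_fixed(workers_needed_per_second):
    """Worker-seconds billed when the cluster is sized for peak demand."""
    peak = max(workers_needed_per_second)
    return peak * len(workers_needed_per_second)


# Hypothetical demand: workers needed during each second of a short job.
demand = [2, 2, 8, 8, 8, 3, 1]

autoscaled = worker_seconds_autoscaled(demand)
fixed = worker_seconds_fixed(demand)
print(autoscaled, fixed)  # 32 56
```

In this toy profile the peak-sized cluster pays for 56 worker-seconds while the demand-tracking one pays for 32; the gap grows as the workload gets spikier.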
Another common customer pattern is using EMR interactively so that personas such as data scientists can run their applications in a familiar setting, such as Amazon EMR Studio or other methods that allow fast response times. EMR Serverless provides functionality to pre-initialize workers so that when users begin to interact with EMR, they are immediately able to begin working, with additional workers added as needed. This goes a long way in improving the overall end-user experience, as waiting for EMR clusters to launch is one of the most common customer complaints.
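The effect of pre-initialized workers can be pictured with a tiny model: requests served by a warm worker start immediately, and the rest wait for a cold start. This is only a sketch; the function name and the numbers (5 users, 3 warm workers, a 120-second cold start) are assumptions, not measured EMR figures.

```python
def wait_times(num_requests, warm_workers, cold_start_seconds):
    """Return the startup wait for each request, in arrival order.

    The first `warm_workers` requests get a pre-initialized worker and
    wait 0 seconds; the remaining requests wait for a new worker.
    """
    return [
        0 if i < warm_workers else cold_start_seconds
        for i in range(num_requests)
    ]


# Hypothetical numbers for illustration only.
print(wait_times(num_requests=5, warm_workers=3, cold_start_seconds=120))
# [0, 0, 0, 120, 120]
```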
Aside from EMR Serverless, additional recent EMR enhancements include functionality such as:
- Improvements in cluster startup times resulting in clusters starting 35% faster
- EMR Managed Scaling awareness of nodes with active shuffle data to prevent shuffle data being lost when nodes are scaled down
- EMR Managed Scaling capacity awareness for instance groups, resulting in task groups being intelligent enough to know the pool depth for a given instance type
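The shuffle-awareness item in the list above can be sketched as a selection rule: when scaling down, only nodes without active shuffle data are candidates for removal. The `Node` class and `pick_nodes_to_remove` function are hypothetical names for this illustration and say nothing about how EMR implements the feature internally.

```python
from dataclasses import dataclass


@dataclass
class Node:
    node_id: str
    has_active_shuffle_data: bool


def pick_nodes_to_remove(nodes, count):
    """Choose up to `count` nodes to remove, skipping nodes with shuffle data."""
    idle = [n for n in nodes if not n.has_active_shuffle_data]
    return [n.node_id for n in idle[:count]]


cluster = [
    Node("task-1", True),
    Node("task-2", False),
    Node("task-3", False),
    Node("task-4", True),
]

# Three removals were requested, but only two nodes are safe to remove.
print(pick_nodes_to_remove(cluster, 3))  # ['task-2', 'task-3']
```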
A conversation about analytics services with difficult-to-manage clusters wouldn’t be complete without discussing Amazon Redshift. Many of the difficult configurations described in the EMR section above apply here as well. Data warehouses suffer from problems adjusting to increased workloads, maintenance configurations that can be difficult to manage, and tuning that requires data warehousing expertise and often Redshift-specific expertise. For quite some time, the edge that competitors have had over Redshift was the ability to scale down the data warehouse so that you are not billed if the cluster is idle. Amazon Redshift Serverless provides this and more!
Redshift Processing Units: Redshift Serverless utilizes a new compute unit called the Redshift Processing Unit (RPU), which is metered on a per-second basis and built on the idea of pay for use. So, no matter how many queries are executed against the Redshift Serverless endpoint, you are only billed for the RPUs consumed at that exact second. If there are no queries running, then no charges are incurred. To implement cost control, you can set thresholds for the RPU maximum to ensure that the system will automatically scale within the limits you have defined for users and query concurrency while delivering the same consistent performance at any scale.
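As a rough illustration of per-second RPU metering with a maximum, the sketch below sums RPUs second by second, capped at a configured limit, and converts the total to a charge. The price of 0.5 per RPU-hour, the workload shape, and the function names are all placeholders I made up; the model also ignores any minimum charge or other billing rules the service may apply.

```python
def rpu_seconds_billed(rpus_requested_per_second, max_rpus):
    """Sum the RPUs consumed each second, capped at the configured maximum."""
    return sum(min(r, max_rpus) for r in rpus_requested_per_second)


def cost(rpu_seconds, price_per_rpu_hour):
    """Convert RPU-seconds to a charge using an hourly RPU price."""
    return rpu_seconds / 3600 * price_per_rpu_hour


# Hypothetical workload: 60 s idle, 120 s at 32 RPUs, 60 s asking for 128 RPUs.
usage = [0] * 60 + [32] * 120 + [128] * 60

billed = rpu_seconds_billed(usage, max_rpus=64)
print(billed)  # 7680
print(round(cost(billed, price_per_rpu_hour=0.5), 2))  # 1.07
```

The idle minute contributes nothing to the bill, and the final burst is billed at the 64-RPU cap rather than the 128 RPUs the workload asked for.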
Data Sharing: The idea of data sharing with architecture patterns such as a data mesh is also gaining popularity in analytic architectures. Redshift Serverless utilizes the concept of a Serverless endpoint that points to the Redshift Serverless service specific to that account. Redshift Serverless facilitates the sharing of data seamlessly and instantly for both traditional provisioned Redshift clusters and Redshift Serverless deployments, even providing easy data sharing across AWS accounts and AWS regions.
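For readers who have not used Redshift data sharing, the sketch below generates the kind of SQL a producer and a consumer in the same account would run. The share, schema, table, and database names and the namespace IDs are placeholders, and the Python helpers are hypothetical; the statements follow the Redshift datashare syntax as I understand it, so check them against the documentation before running anything.

```python
def producer_statements(share, schema, table, consumer_namespace):
    """SQL a producer would run to share one table with a consumer namespace."""
    return [
        f"CREATE DATASHARE {share};",
        f"ALTER DATASHARE {share} ADD SCHEMA {schema};",
        f"ALTER DATASHARE {share} ADD TABLE {schema}.{table};",
        f"GRANT USAGE ON DATASHARE {share} "
        f"TO NAMESPACE '{consumer_namespace}';",
    ]


def consumer_statement(local_db, share, producer_namespace):
    """SQL a consumer would run to expose the share as a local database."""
    return (
        f"CREATE DATABASE {local_db} FROM DATASHARE {share} "
        f"OF NAMESPACE '{producer_namespace}';"
    )


# Placeholder identifiers; string formatting like this is for illustration
# only and is not safe for untrusted input.
for stmt in producer_statements(
    "sales_share", "public", "orders", "<consumer-namespace-id>"
):
    print(stmt)
print(consumer_statement("sales_db", "sales_share", "<producer-namespace-id>"))
```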
AWS Lake Formation
AWS Lake Formation has always been popular among customers due to its promises of simplified data governance and a security layer that makes managing your data lake far less of a headache. However, in the past, customers often found implementing Lake Formation to be difficult in practice. Additionally, there were a host of feature requests for Lake Formation that customers had long awaited, such as row-level access. AWS has invested a huge amount of effort into bringing Lake Formation forward as the premier solution for governance and security in data lakes, as well as adding functionality like governed tables that will revolutionize S3 data lakes. Please note that the features and enhancements discussed below have been gradually released in recent months, but re:Invent packaged all this functionality together to present the big picture of what Lake Formation is capable of.
Row and Cell-Level Access: Row-level access has been long-awaited by Lake Formation customers everywhere. Prior to this, customers had limited choices for providing row-level access, such as creating views or using Apache Ranger. Lake Formation has raised the bar in this respect by also providing the ability to restrict access even to the cell level. Cell-level security permissions build on row-level permissions and allow you to specify a reusable data filter that is based on familiar SQL syntax such as “SELECT * where country=US”.
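Conceptually, a cell-level filter is a row predicate combined with a list of visible columns. The toy function below applies both to in-memory rows; it is an illustration of the idea only, with made-up data and names, and is not the Lake Formation API.

```python
def apply_data_filter(rows, row_predicate, allowed_columns):
    """Keep rows matching the predicate, then keep only the allowed columns."""
    return [
        {col: row[col] for col in allowed_columns if col in row}
        for row in rows
        if row_predicate(row)
    ]


customers = [
    {"id": 1, "country": "US", "email": "a@example.com"},
    {"id": 2, "country": "DE", "email": "b@example.com"},
    {"id": 3, "country": "US", "email": "c@example.com"},
]

# Row filter: country = 'US'. Column filter: hide the email column.
visible = apply_data_filter(
    customers,
    row_predicate=lambda r: r["country"] == "US",
    allowed_columns=["id", "country"],
)
print(visible)
# [{'id': 1, 'country': 'US'}, {'id': 3, 'country': 'US'}]
```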
ACID Transactions: One of the big data architectural patterns that is becoming more common is a data lake with support for ACID transactions (INSERT, UPDATE, DELETE, etc.). As the demand for this functionality increases, frameworks such as Apache Hudi, Apache Iceberg, and Delta Lake continue to gain popularity. Lake Formation has released its own flavor of data lake ACID transaction support via governed tables. With governed tables, the highly sought-after time travel query becomes possible over S3 data without requiring frameworks such as Apache Hudi to facilitate this functionality. This is a huge win for customers who want to perform ACID transactions on an S3 data lake without having a compute-intensive engine such as EMR running over the data lake. Governed tables were released earlier this year and are now generally available (GA).
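If time travel is new to you, the toy class below shows the core idea: every committed transaction produces a new immutable snapshot, and a reader can ask for the latest snapshot or an earlier one. `VersionedTable` is a hypothetical in-memory model that uses version numbers instead of timestamps; it is not how governed tables are implemented.

```python
class VersionedTable:
    """Toy table that keeps one immutable snapshot per committed transaction."""

    def __init__(self):
        self._snapshots = [{}]  # version 0 is the empty table

    def commit(self, upserts=None, deletes=None):
        """Apply all changes to a copy, publish it, and return its version."""
        new = dict(self._snapshots[-1])
        new.update(upserts or {})
        for key in deletes or []:
            new.pop(key, None)
        self._snapshots.append(new)
        return len(self._snapshots) - 1

    def read(self, version=None):
        """Read the latest snapshot, or an earlier one for time travel."""
        index = -1 if version is None else version
        return dict(self._snapshots[index])


table = VersionedTable()
v1 = table.commit(upserts={"order-1": "pending", "order-2": "pending"})
v2 = table.commit(upserts={"order-1": "shipped"}, deletes=["order-2"])

print(table.read())            # {'order-1': 'shipped'}
print(table.read(version=v1))  # {'order-1': 'pending', 'order-2': 'pending'}
```

Because readers only ever see published snapshots, a query against `v1` keeps returning the same rows no matter what is committed afterwards.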
Storage Optimization: Anyone who has worked in big data has faced the issue of small files. Many services, such as streaming services, generate numerous small files that are not optimal to work with in the big data ecosystem. Solving the small file problem requires incorporating compactions or other configurations to coalesce these small files into an optimal file size. Lake Formation governed tables now automatically incorporate storage optimization into the framework by compacting small files on the underlying S3 data so that queries are consistent and performant.
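To show what compaction planning looks like in its simplest form, here is a greedy sketch that groups small files, in order, into batches that stay within a target size. The 128 MB target, the file sizes, and the `plan_compaction` name are assumptions for this example; Lake Formation's own compaction strategy is not described here.

```python
def plan_compaction(file_sizes_mb, target_mb):
    """Greedily group files, in order, into batches of up to `target_mb`.

    A single file larger than the target ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_mb:
            batches.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches


small_files = [5, 20, 40, 64, 10, 90, 30]
print(plan_compaction(small_files, target_mb=128))
# [[5, 20, 40], [64, 10], [90, 30]]
```

Seven small files become three larger ones in this plan, which means fewer objects for a query engine to open and list.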
These are just a few of the notable Lake Formation features presented at re:Invent; there were many more that merit consideration and implementation in your AWS data lake, such as tag-based access control.
In this blog, I offered a recap of just a few of the highlights from re:Invent on EMR Serverless, Redshift Serverless, and Lake Formation, but I encourage everyone to check out recent updates on AWS Glue and Amazon Athena as well. They will no doubt make your analytic workloads even more performant and robust. Why not sign up for the preview of some of these new features and find out for yourself just how you can benefit from serverless analytics?