Redefining S3 Data with Amazon Athena and AWS Glue
One of the things I really like about my job is that I get paid to stay on the cutting edge of new technology so that our clients get the most out of their AWS cloud infrastructure and spend. Over the last few months, I’ve been working with some features Amazon released for Amazon S3 that change the scope of what you can do with S3 for analytics. In the past, S3 was a straight data storage repository. Nothing wrong with that, but not particularly exciting, even for a Solutions Architect like me. These new features really change what we can do with and in S3.
Amazon Athena: S3 Data Queries without ETL
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries you run, which execute directly against the data in S3. It’s Amazon’s turnkey way to query a data lake, and it currently supports CSV files of up to one terabyte.
Before Athena was released, analyzing that amount of data would have required an ETL (extract, transform and load) process. You would have had to build a complicated (and expensive) pipeline to load the CSV into a relational database such as RDS or into a NoSQL solution. Now, though, Athena allows you to query that data directly in S3, saving you (or, in our case, our clients) a lot of money because there is no secondary database to maintain.
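To make "query the CSV directly in S3" a little more concrete, here is a minimal Python sketch of what that can look like. It assumes the boto3 Athena client; the table name, column list, bucket paths and helper names are all hypothetical and only there for illustration. The first helper builds the DDL that maps a table onto CSV files under an S3 prefix, and the second starts a query and polls until Athena reports a final state.

```python
import time


def build_csv_table_ddl(table, columns, s3_location):
    """Return Athena DDL that maps a table onto CSV files under an S3 prefix.

    `columns` is a list of (name, type) pairs. All names here are
    illustrative assumptions, not taken from a real dataset.
    """
    cols = ",\n".join(f"  `{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{s3_location}'\n"
        # Assumes each CSV file starts with one header row to skip.
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


def run_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start a query, wait for a final state, and return (query_id, state).

    `athena` is expected to behave like boto3.client("athena").
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)
```

With boto3 installed and AWS credentials configured, you would pass boto3.client("athena") as the athena argument, run the DDL once, and then run ordinary SELECT statements against the new table. Nothing is copied out of S3 at any point; the table definition is only metadata that points at the files.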
As an illustration of how big this is, consider this example. A customer of ours in the insurance industry was faced with a time-consuming and labor-intensive process every time they onboarded a new customer. The process required receiving terabytes of data and inserting it into their existing database system running on EC2 with EBS, and it was excruciatingly slow: each load took literally days. Thanks to Athena, we no longer need to take the time to insert the data into a relational database. Now, as soon as the file appears in S3, we can query the CSV directly using Athena. That not only saves us and our clients a ton of time, but it also eliminates a major source of data and validation errors: human error.
AWS Glue: Redefining Data Structures
If you still need to maintain an ETL process, whether for security reasons or for third parties, Amazon has introduced AWS Glue, which lets you give structure to your unstructured data without servers or an operating system to manage. Glue is a fully managed ETL (extract, transform and load) service from AWS that makes it a breeze to load and prepare data.
With a few clicks in the AWS console, you can create and run an ETL job on your data in S3 and automatically catalog that data so it is searchable, queryable and available. From there, you can load it into your analytics engine. AWS Glue is serverless on the customer’s end, so there is no infrastructure to buy, set up or manage. Instead, customers pay only for the compute resources their ETL jobs consume, and those resources are provisioned automatically as needed. This tool allows data to be available for analytics in minutes.
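The same catalog-then-transform flow can also be driven from code instead of the console. The Python sketch below is only an illustration under stated assumptions: it assumes the boto3 Glue client, an IAM role that Glue can assume, and an ETL job that already exists (for example, one created in the console). The crawler name, role ARN, database, S3 path, job name and helper names are hypothetical.

```python
import time


def build_crawler_request(name, role_arn, database, s3_path):
    """Return the keyword arguments for a Glue crawler over one S3 path."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def crawl_then_run_job(glue, crawler_request, job_name, job_arguments,
                       poll_seconds=5.0):
    """Catalog the data with a crawler, then start an existing ETL job.

    `glue` is expected to behave like boto3.client("glue").
    Returns the ID of the job run that was started.
    """
    name = crawler_request["Name"]
    glue.create_crawler(**crawler_request)
    glue.start_crawler(Name=name)

    # Wait until the crawler goes back to READY, which means the
    # Data Catalog tables for the S3 path have been written.
    while glue.get_crawler(Name=name)["Crawler"]["State"] != "READY":
        time.sleep(poll_seconds)

    # Job arguments are passed as strings whose keys start with "--".
    run = glue.start_job_run(JobName=job_name, Arguments=job_arguments)
    return run["JobRunId"]
```

In real use you would pass boto3.client("glue") as the glue argument. Waiting for the crawler before starting the job is a deliberate choice in this sketch: the job is assumed to read the tables that the crawler writes to the Data Catalog.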
Picking the Right Data Tool for Your AWS S3 Data Needs
At the end of the day, both Amazon Athena and AWS Glue have drastically changed the data game in AWS S3. These two tools ensure that almost any data you have in the cloud is available to be analyzed at a moment’s notice, without worrying about re-architecting or human error. Whether your specific needs call for querying data in place with Athena or for running a managed ETL process with Glue, Amazon Web Services has both bases covered.
Start reducing your AWS spend and optimizing your infrastructure by scheduling a 30-minute SA On-Demand, where you can talk to one of our engineers about the steps you need to take to start saving today!