Understanding the Basics of Data Lakes

With the rise of IoT devices (including your cell phone), standalone sensor packages like Wi-Fi-enabled cameras, and the rapid growth in vendors selling data via APIs (such as weather information, Shopify data, and marketing intelligence), companies are being inundated with data from both internal and external sources.

All of these data sources can drive business growth and success if you are able to turn them into insights. A well-organized dashboard that visualizes and contextualizes relevant analytics is a powerful tool for making informed decisions. Machine learning algorithms can be trained on your historical data to enhance your customer experience and to derive insights into how your business is running.

As companies undergo digital transformation, many have amassed a tremendous amount of structured, semi-structured, and unstructured raw data. Often, a company’s on-premises relational database management systems (RDBMSs) are already at capacity for the current volume of stored data, which prevents the company from connecting additional data sources and forces tough decisions about what data to save and what data to delete.

A cloud-based data lake provides a comprehensive foundation for a company’s short- and long-term data initiatives. It offers the ability to get started quickly and ingest an almost limitless amount of unstructured or structured data. Even when companies do not have the resources to organize and analyze all this data (or may not even have a specific use in mind), it is important to start storing historical data as soon as possible. If storage costs become a concern later, you can shift older, rarely accessed data to a lower-cost archive tier such as Amazon S3 Glacier.
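As a rough sketch of that archival step, an S3 lifecycle rule can move objects to a Glacier storage class once they reach a certain age. The helper below only builds the rule as a Python dictionary in the shape that boto3's put_bucket_lifecycle_configuration call accepts. The function name, the raw/ prefix, the 90-day threshold, and the bucket name in the comment are illustrative assumptions, not recommendations.

```python
import json


def glacier_lifecycle_rule(prefix: str, days: int) -> dict:
    """Build a lifecycle configuration that archives objects under `prefix`.

    Hypothetical helper: the rule ID format and parameters are illustrative.
    """
    if days < 0:
        raise ValueError("days must be non-negative")
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/') or 'all'}-after-{days}d",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move objects to the Glacier storage class once they reach this age.
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


config = glacier_lifecycle_rule("raw/", 90)
print(json.dumps(config, indent=2))

# With boto3 installed and AWS credentials configured, the configuration could
# be applied roughly like this (the bucket name is a placeholder):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake-bucket", LifecycleConfiguration=config)
```

Keeping the rule scoped to a prefix means only the raw historical data is archived, while frequently queried data elsewhere in the bucket stays in standard storage.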

Once collected, this data must be stored somewhere. That’s where data lakes come in. In this article, we’ll dive into the concept of data lakes and help you choose between buying one and building your own.

What is a Data Lake?

A data lake is a central repository (think of a very large hard disk) that stores all your raw data, both structured and unstructured. Data flows into the lake from various sources (ingestion), undergoes transformations from its raw state, and is then loaded in a new format into a data warehouse or data mart. The data can be queried in various ways, enabling analysis that helps you attract new customers, improve creativity and productivity, and make complex decisions.
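The ingest, transform, and load flow can be illustrated with a small in-memory sketch. Everything here is hypothetical: the sample records, the field names, and the RAW_ZONE list standing in for lake storage. A real pipeline would read from object storage and write to a warehouse, but the shape of the steps is similar.

```python
import json
from collections import defaultdict

# Hypothetical raw events, kept exactly as they arrived from two sources.
RAW_ZONE = [
    '{"source": "shop", "customer": "a1", "amount": "19.99"}',
    '{"source": "shop", "customer": "b2", "amount": "5.00"}',
    '{"source": "sensor", "device": "cam-7", "temp_c": 21.5}',
    "not valid json",
    '{"source": "shop", "customer": "a1", "amount": "30.01"}',
]


def transform(raw_lines):
    """Parse raw lines, keep only shop orders, and cast amounts to float."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip malformed input; the raw copy stays in the lake.
        if record.get("source") == "shop":
            yield {"customer": record["customer"], "amount": float(record["amount"])}


def load(orders):
    """Aggregate cleaned orders into a per-customer revenue table."""
    totals = defaultdict(float)
    for order in orders:
        totals[order["customer"]] += order["amount"]
    return {customer: round(total, 2) for customer, total in totals.items()}


revenue_mart = load(transform(RAW_ZONE))
print(revenue_mart)
```

Note that the raw list is never modified. Because the lake keeps the original records, a different transformation can be run later over the same data, for example one that uses the sensor readings this sketch ignores.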

It is important to note that data lakes are distinct from data warehouses and data marts. Data warehouses are pre-optimized for storing and querying processed, structured information. By contrast, data lakes often have multiple zones that include raw and processed data in various formats. Data marts are similar to data warehouses, but are usually created for specific purposes or departments.
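One common way to express zones in object storage is through key prefixes. The sketch below assumes two zones named raw and processed and a date-partitioned layout; the zone names, the source name, the file names, and the object_key helper are all illustrative choices rather than a fixed standard.

```python
from datetime import date

ZONES = ("raw", "processed")  # Assumed zone names; real lakes often define more.


def object_key(zone: str, source: str, day: date, filename: str) -> str:
    """Return a key partitioned by zone, source, and date."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    parts = [
        zone,
        source,
        f"year={day.year}",
        f"month={day.month:02d}",
        f"day={day.day:02d}",
        filename,
    ]
    return "/".join(parts)


raw_key = object_key("raw", "weather_api", date(2024, 3, 5), "readings.json")
processed_key = object_key("processed", "weather_api", date(2024, 3, 5), "readings.parquet")
print(raw_key)
print(processed_key)
```

Separating zones by prefix makes it straightforward to apply different access policies or retention rules to raw and processed data.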

Currently, there is a lot of buzz around “data lakehouses,” which (as the name suggests) combine the functionalities of a data warehouse and data lake. A data lakehouse provides the infrastructure and capability to ingest a large amount of raw, unstructured data. At the same time, it supports core warehouse features, such as transactions, strong schemas, and business intelligence.

Where Does a Data Lake Fit In?

You may wonder if your organization still needs a data lake if you already have a data warehouse and several online transactional processing (OLTP) systems. The answer is yes.

Your data lake can be pivotal in increasing sales and making results-driven decisions. It is much bigger than your data warehouse and OLTP systems because it stores all forms of data from various sources. Meanwhile, data warehouses and database management systems help run day-to-day operations, with tools to query and analyze data drawn from your data lake.

In the twenty-first century, data is power. Enterprises around the world use data lakes to produce valuable insights. Combining this information with artificial intelligence (AI) and machine learning (ML) tools enables predictive actions.

Currently, there are two ways to obtain a data lake: building one and buying one. Both options have their advantages and disadvantages, so the decision completely depends on your organization’s specific needs. Let’s explore both.

Building a Data Lake

How you build your data lake and which specific services you include depend on what you want to achieve. Typically, a central storage system cost-effectively stores a virtually unlimited amount of data, potentially hundreds of petabytes. A data pipeline transfers data quickly and securely into this central storage, in coordination with various security and monitoring services. A processing and analytics module then cleans the data and extracts meaningful information from it. Your data lake can in turn feed multiple OLTP systems or data warehouses.

To understand the architecture better, let’s look at how you would build a data lake on Amazon Web Services (AWS). You can use Amazon Simple Storage Service (Amazon S3), a centralized object storage service, to cost-effectively store as much data as you need. If you have massive amounts of on-premises data that you want to migrate to the cloud, you can use AWS Snowball, which ships a storage device to your location; you connect it to your on-premises systems and transfer the data onto it. You can also ingest data into S3 over AWS Direct Connect, a dedicated network connection that makes it easy to transfer on-premises data. If you are looking to use a hybrid cloud approach, AWS Database Migration Service (DMS) can act as a link between your cloud and on-premises solutions.

You can leverage AWS Glue, Amazon EMR (Elastic MapReduce), and Amazon Redshift for data transformation and processing. A service such as Amazon SageMaker can train machine learning models, while AWS Key Management Service (KMS) and Identity and Access Management (IAM) secure the data and Amazon CloudWatch monitors it.

Building a data lake is not a simple task, so why would you choose to build a data lake instead of buying a packaged solution?

Building your own data lake maximizes flexibility and control. When you own the solution end to end, you can mold the architecture to meet your business needs. When you choose a managed solution, you don’t control the architecture, which may not be suitable for organizations with strict compliance and regulatory requirements. A build approach enables organizations to take precise control over their data.

One disadvantage of building a data lake is the higher upfront cost. Instead of paying per usage or per month for a finished platform, you bear the entire cost of designing and setting up the infrastructure before you see any value from it.

Buying a Data Lake

If your business wants to dive into data lake capabilities, but lacks sufficient personnel to create and maintain a data lake, buying a packaged solution is a great option. Buying a data lake provides an immediate, cost- and resource-effective solution.

Buying a ready-made data lake is quick to implement. You don’t need to manually connect various sub-systems as the service provider does most of the hard work. Ready-made data lakes are also easy to use and maintain, with scalability and cutting-edge technologies so you don’t need to worry about your platform’s future operability. There is also a lower upfront cost since you pay for usage, not architecture.

However, buying a data lake does have its downsides. Although you have flexibility, you don’t have ownership or complete control over your architecture. The long-term cost can also be higher.
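To make the cost trade-off between the two options concrete, a simple break-even calculation can help. The sketch below assumes a fixed upfront cost and a flat monthly cost for each option; all figures are hypothetical placeholders, and real costs will vary with usage, staffing, and pricing changes.

```python
import math


def cumulative_cost(upfront: float, monthly: float, months: int) -> float:
    """Total spend after `months`, assuming a flat monthly cost."""
    return upfront + monthly * months


def break_even_month(build_upfront, build_monthly, buy_upfront, buy_monthly):
    """Return the first month at which building costs no more than buying.

    Returns None if building never catches up under these assumptions.
    """
    upfront_gap = build_upfront - buy_upfront
    monthly_saving = buy_monthly - build_monthly
    if upfront_gap <= 0:
        return 0  # Building is already no more expensive at the start.
    if monthly_saving <= 0:
        return None  # Building costs more upfront and every month after.
    return math.ceil(upfront_gap / monthly_saving)


# Hypothetical figures, for illustration only.
break_even = break_even_month(
    build_upfront=120_000, build_monthly=4_000,
    buy_upfront=10_000, buy_monthly=9_000,
)
print(f"Build matches buy at month {break_even}")
```

If your planning horizon is shorter than the break-even point, buying is likely cheaper under these assumptions; beyond it, the higher upfront cost of building starts to pay off.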

Amorphic, offered by Mission, is one example of a fully managed data lake platform for AWS. It integrates with various services and is designed to simplify, accelerate, and reduce the cost of data lake management.

Many other companies offer packaged solutions for individual pieces of the data lake. If you are looking to ingest data easily, Fivetran has built over 500 connectors to various data sources that you can set up to get data flowing quickly. Once your data is in the cloud, Upsolver is a great option: its automated ETL processes quickly get your data into the proper formats for analysis, machine learning, and visualization.

Next Steps

In this article, we looked at how data is gold, providing valuable insights and a new approach to attracting customers for business growth. Then, we learned what a data lake is and how it fits within your organization to power your data journey.

There are two ways to create a data lake: buying one from a managed provider and building it. Both options have their advantages and disadvantages, depending on your business needs. Working with an experienced cloud services provider can help you make the most performance- and cost-effective decision based on your cloud environment, organizational goals, and technical resources.

Additionally, should you decide to build, you don’t need to build a data lake from scratch. You can build a custom, semi-managed data lake with Mission, a premier cloud services provider. Mission Cloud partners with Fivetran to ingest data and Upsolver to extract, load, and transform data. Mission helps you harness the power of a customizable data lake without investing the significant time, resources, and focus required to develop it on your own.