Even the term “Data Lake” brings to mind certain analogies that would never fit with the term “Data Warehouse.” Some of these analogies can be seen in titles for blogs and articles such as:
- Ways to Keep Your Data Lake from Becoming a Data Swamp
- A Smarter Way to Jump into the Data Lake
- Everyone Wants to Drink from the Data Lake
- Fishing for Insights in the Data Lake
Although clever, these analogies may obscure the fact that a Data Lake is a simple but effective data management strategy that has surfaced because of advancements in distributed processing and storage. In this blog post, we’ll take a closer look at the advent of the Data Lake by exploring its differences from (and some of its similarities to) its spiritual predecessor, the Data Warehouse.
Let’s start with data, the common denominator between the Data Warehouse and the Data Lake. In a Data Lake, data is stored in its natural form. For example, if the data source is a table in an application database, the Data Lake might store delta files (new rows and changes) extracted from the table, with all of the columns included and delimited in the order they appear in the table. These files may be organized in a folder structure, or by following a naming convention that allows them to be grouped and sorted when they are needed for analytics.
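To make the folder-structure idea concrete, here is a minimal sketch in Python. The lake root, table name, partition layout, and file naming convention are all invented for illustration; real lakes typically use HDFS or object-store paths with a similar `year=`/`month=` partition scheme.

```python
import csv
import os
import tempfile
from datetime import date

def write_delta_file(lake_root, source_table, extract_date, rows, columns):
    """Write one day's delta extract into a date-partitioned folder layout."""
    partition = os.path.join(
        lake_root, "raw", source_table,
        f"year={extract_date.year}", f"month={extract_date.month:02d}",
    )
    os.makedirs(partition, exist_ok=True)
    path = os.path.join(partition, f"{source_table}_{extract_date.isoformat()}.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)  # keep the source column order intact
        writer.writerows(rows)
    return path

lake = tempfile.mkdtemp()  # stand-in for the lake's storage root
p = write_delta_file(
    lake, "orders", date(2021, 3, 15),
    rows=[(101, "widget", 2), (102, "gadget", 1)],
    columns=["order_id", "item", "qty"],
)
print(p)  # .../raw/orders/year=2021/month=03/orders_2021-03-15.csv
```

Because the date is encoded in both the folder path and the file name, a downstream job can group and sort extracts simply by listing paths.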
The Data Lake can also be configured to ingest unstructured data such as tweets and images. For structured data, as well as semi-structured data like JSON or XML, the schema is still relevant, but it is only required at the time of analytics, where it might undergo transformations as required by the analytical process. This approach is referred to as ELT (Extract, Load and Transform). Since the transformation happens later, it is very important that the metadata is indexed and catalogued during ingestion in a manner that simplifies finding and subscribing to data for analytics.
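As a sketch of cataloging at ingestion time, the snippet below records a hypothetical catalog entry per ingested file. The field names, and the in-memory list standing in for a catalog service, are assumptions; real deployments use dedicated, search-indexed metadata catalogs.

```python
import hashlib
import json
from datetime import datetime, timezone

catalog = []  # in practice a search-indexed catalog service, not a list

def register_ingested_file(path, source_system, schema_hint, tags):
    """Record enough metadata at ingestion time that analysts can find
    and subscribe to the data later, long before any transformation."""
    entry = {
        "id": hashlib.sha1(path.encode()).hexdigest()[:12],
        "path": path,
        "source_system": source_system,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "schema_hint": schema_hint,  # columns if known; None for unstructured data
        "tags": tags,
    }
    catalog.append(entry)
    return entry

entry = register_ingested_file(
    "raw/orders/year=2021/month=03/orders_2021-03-15.csv",
    source_system="order_db",
    schema_hint=["order_id", "item", "qty"],
    tags=["orders", "daily-delta"],
)
print(json.dumps(entry, indent=2))
```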
In contrast, in a Data Warehouse, data is highly curated and organized. Data is extracted from Systems of Record (SOR) by ETL (Extract, Transform and Load) processes that then transform and organize the data typically as facts and dimensions following standard schemas such as Star or Snowflake.
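To illustrate, here is a toy Star schema built in SQLite: one fact table joined to two dimensions, queried the way a BI tool would. The table and column names, and the figures, are invented.

```python
import sqlite3

# A toy star schema: one fact table keyed to two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    date_key INTEGER REFERENCES dim_date,
    product_key INTEGER REFERENCES dim_product,
    qty INTEGER, revenue REAL
);
INSERT INTO dim_date VALUES (20210301, 2021, 3), (20210401, 2021, 4);
INSERT INTO dim_product VALUES (1, 'widget', 'hardware'), (2, 'ebook', 'digital');
INSERT INTO fact_sales VALUES
    (20210301, 1, 2, 20.0), (20210301, 2, 1, 5.0), (20210401, 1, 3, 30.0);
""")

# A typical BI query: revenue by month and category.
rows = con.execute("""
    SELECT d.year, d.month, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, d.month, p.category
    ORDER BY d.year, d.month, p.category
""").fetchall()
print(rows)
```

The point of the shape is that every analytical question becomes the same pattern: join the fact table to whichever dimensions the question mentions, then aggregate.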
There are both advantages and disadvantages to the ETL approach of maintaining a Data Warehouse. The main advantage is that the data modeling / architecture process, a prerequisite to building a Data Warehouse, creates a clear representation of the enterprise. It does so by realigning and resolving the purpose-oriented nature and other quirks (cryptic column names that resemble license plates, for example) of proprietary application databases. It also makes Business Intelligence a transferable skill: experience gained in one domain can be applied in a completely different domain with minimal training, since the new business concepts are still built on familiar data structures.
However, many data modeling decisions are made without a complete view of current and future use cases. This could result in a data entity, or even a single attribute, not being available in the Data Warehouse at the moment in the future when you realize you need it. Conversely, it might also result in an overly conservative modeling exercise that includes and maintains entities that have no value.
In a Data Warehouse, BI analysts and ETL developers are typically the first point of contact for the data. BI analysts perform analytics with the help of reports and OLAP cubes. In some scenarios, the Data Warehouse might not solve a business need in a straightforward manner. In these cases, ETL developers may also be engaged to extract and transform the data from the Data Warehouse to load data marts that are purpose built and optimized for specific business functions. In both scenarios, either implicitly or explicitly, the data in the Data Warehouse is mostly accessed via SQL. It might then be further transformed by ETL tools such as Informatica and Talend or visualized in BI tools such as Tableau, MicroStrategy and Power BI.
Data scientists are the key players when it comes to extracting value out of Data Lake investments. They apply a schema to the data at read time and perform lightweight transformations as needed before proceeding with highly customized analytic activities that apply various statistical techniques. R is a popular programming language in the data science community, and in the Data Lake environment it is now often configured to use Apache Spark to take advantage of distributed processing. In addition to R, Spark is also used with Python, Java and Scala to build statistical and machine learning models. Depending on scale and complexity, some of the technical processes related to subscribing to and transforming data for analytics may be performed by Data Engineers.
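Schema-on-read can be sketched without Spark at all. In the stdlib-only example below, a raw, headerless delta file acquires column names and types only when the analyst reads it; the data, schema, and delimiter are invented. In Spark the same idea would be expressed by supplying an explicit schema to a DataFrame reader.

```python
import csv
import io

# Raw delta file as it sits in the lake: no header, source column order.
raw = "101|widget|2\n102|gadget|1\n103|widget|5\n"

# The schema is applied only now, at read time, by the analyst.
schema = ["order_id", "item", "qty"]

def read_with_schema(blob, schema, delimiter="|"):
    """Yield dict records, pairing raw fields with analyst-supplied names."""
    for row in csv.reader(io.StringIO(blob), delimiter=delimiter):
        record = dict(zip(schema, row))
        record["qty"] = int(record["qty"])  # lightweight in-flight transformation
        yield record

widgets = [r for r in read_with_schema(raw, schema) if r["item"] == "widget"]
print(sum(r["qty"] for r in widgets))  # 7
```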
Types of Analytics
The Data Warehouse is well suited for descriptive analytics where it aims to provide a clear answer to the question: What happened? Facts and Dimensions in the Data Warehouse can be combined to create multidimensional OLAP cubes. These OLAP cubes can then help visualize the data in a familiar manner and allow analysts to apply straightforward OLAP functions such as slicing, dicing, drilling down / up and roll-ups with good query performance.
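The snippet below sketches slicing and rolling up a tiny in-memory “cube” using only the standard library; real OLAP engines precompute these aggregates for query performance, and the dimensions and figures here are invented.

```python
from collections import defaultdict

# A tiny "cube": (region, product, month) -> sales
cells = [
    ("east", "widget", "jan", 10), ("east", "widget", "feb", 12),
    ("east", "gadget", "jan", 7),  ("west", "widget", "jan", 9),
    ("west", "gadget", "feb", 4),
]

def roll_up(cells, keep):
    """Aggregate away every dimension not listed in `keep`."""
    dims = ("region", "product", "month")
    out = defaultdict(int)
    for *key, value in cells:
        reduced = tuple(k for d, k in zip(dims, key) if d in keep)
        out[reduced] += value
    return dict(out)

# Slice: fix month = "jan", then roll up to region.
jan = [c for c in cells if c[2] == "jan"]
print(roll_up(jan, keep=("region",)))  # {('east',): 17, ('west',): 9}
```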
In a Data Lake, in addition to descriptive analytics, other types of advanced analytics such as diagnostic, predictive and prescriptive analytics are also performed. Diagnostic analytics tries to establish causality and answer the question: ‘Why did something happen?’ Predictive analytics deals with forecasting an event before it happens, and prescriptive analytics looks to provide recommendations that could shape future outcomes toward our expectations. Although faster execution times are preferred, the increased processing time is considered a good tradeoff for the ability to store and perform distributed analytics on exhaustive datasets at lower cost.
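As a minimal illustration of predictive analytics, the sketch below fits a least-squares trend line with the standard library and projects one period ahead. The series is invented, and real forecasting would use far richer models than a straight line.

```python
# Hypothetical monthly demand for the last six months.
xs = list(range(6))
ys = [10, 12, 13, 15, 18, 20]

# Ordinary least-squares fit of y = intercept + slope * x.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# Project the trend one period ahead (month index 6).
forecast = intercept + slope * 6
print(round(forecast, 1))  # 21.7
```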
While a Data Warehouse can also be used for these types of analytics, with potentially superior performance, it is limited to the data entities that have been modeled and maintained within it.
A Data Warehouse is typically optimized for faster query times. Storage I/O is usually the bottleneck in a Data Warehouse, and hence storage subsystems are optimized for performance by deploying high-performance SSDs and/or by configuring high-usage portions of the Data Warehouse as in-memory databases. In stark contrast, Data Lakes can run on commodity hardware, and they are optimized to reliably ingest large amounts of data and to perform distributed processing at a lower cost.
Both Data Warehouses and Data Lakes are now being deployed on the Cloud to take advantage of what the Cloud offers. However, many characteristics of a Data Lake make it the ideal candidate for the Cloud.
Storage scalability allows the Data Lake to add space on demand, instead of having to procure hardware ahead of time based on conservative estimates that increase expenses. Tiered storage options, as well as rule-based data movement between tiers, allow the bulk of the data to be kept in cheaper deep storage until it is needed. In addition, the availability of leading-edge analytical tools and machine learning platforms facilitates advanced analytics in the Cloud without requiring data movement across the network.
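Rule-based movement between tiers can be sketched as a simple age-based policy. The thresholds and tier names below are invented; cloud providers express the same idea as lifecycle rules on object storage.

```python
from datetime import date

# Hypothetical lifecycle rules: days since last access -> storage tier.
RULES = [(0, "hot"), (30, "warm"), (180, "cold-archive")]

def tier_for(last_accessed, today):
    """Return the cheapest tier whose age threshold the object has crossed."""
    age = (today - last_accessed).days
    tier = RULES[0][1]
    for threshold, name in RULES:
        if age >= threshold:
            tier = name
    return tier

today = date(2021, 6, 1)
print(tier_for(date(2021, 5, 28), today))  # hot
print(tier_for(date(2021, 3, 1), today))   # warm
print(tier_for(date(2020, 6, 1), today))   # cold-archive
```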
So Which is Better? The Data Lake or the Data Warehouse?
Both! Instead of a Data Lake vs. Data Warehouse decision, it might be worthwhile to consider a target state for your enterprise that includes both a Data Lake and a Data Warehouse. Just as advanced analytic processes apply statistical and machine learning techniques to the vast amounts of historical data in the lake, the Data Warehouse can also take advantage of the Data Lake.
Newly modeled facts and slowly changing dimensions can now be loaded with data going back to the time the Data Lake was built, instead of capturing only new changes. This also takes the pressure off the data architects to create each and every data entity that may or may not be used in the future. They can instead focus on building a Data Warehouse based exclusively on current reporting and analytical needs, allowing it to grow naturally. Some organizations envision the Data Warehouse as a tier within the Data Lake to which trustworthy data trickles down and is available for consumption – sort of like bottled water. (A quick web search tells me that I can’t take credit for this analogy either.) This strategy would also work for companies that already have an established traditional BI operation and are looking to expand into advanced analytics using a Data Lake. Here, the Data Lake would take the place of a staging environment, and any data being added to the Data Warehouse would arrive via the Data Lake.
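Loading a slowly changing dimension from the lake’s full history, rather than from new deltas only, can be sketched as follows. The Type-2 convention (versioning rows with valid_from / valid_to dates) is standard; the data and function name are invented.

```python
from datetime import date

def build_scd2(history):
    """Build a Type-2 slowly changing dimension from the full change
    history retained in the lake, rather than from new deltas only."""
    versions = []
    current = {}  # entity key -> currently open version row
    for change_date, key, attrs in sorted(history, key=lambda t: (t[0], t[1])):
        if key in current:
            current[key]["valid_to"] = change_date  # close the open version
        row = {"key": key, **attrs, "valid_from": change_date, "valid_to": None}
        versions.append(row)
        current[key] = row
    return versions

# Invented history: one customer whose segment changed once.
history = [
    (date(2019, 1, 1), "cust-1", {"segment": "retail"}),
    (date(2020, 6, 1), "cust-1", {"segment": "enterprise"}),
]
dim = build_scd2(history)
print(dim)
```

Because the lake retains every extract since it was built, a dimension modeled today can be backfilled with its complete version history on first load.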