Written by: Edward Yeh, Principal Big Data Consultant
Today’s Data Warehouse Problems:
What are the biggest problems with today’s Enterprise Data Warehouses?
– It typically takes six months to a year to add/implement a new data source.
– Scalability (vertical and horizontal) is a constant struggle, and query complexity and load keep rising because the number of “business questions” has grown exponentially.
– High initial capital investment and ongoing operations cost.
Benefits of a Data Lake:
How do we solve these issues? You have to love and embrace the “Hadoop-Based Data Lake”. Some of the advantages of the Data Lake include providing a landing zone for all data, allowing historical, archived cold data to be queried, and enabling an environment for ad hoc data discovery. Please keep in mind that a Data Lake is meant to complement your existing Enterprise Data Warehouse, not replace it.
Using a Hadoop-Based Data Lake:
Data-centric organizations should leverage Hadoop’s large-scale batch processing efficiencies to preprocess and transform data for the warehouse. Hadoop/HDFS is meant to simplify acquisition and storage of diverse data sets. Use the Data Lake to power both the enterprise, SLA-driven BI/DW environment and the Analytics/Sandbox environment.
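To make the batch preprocessing idea concrete, here is a minimal, hypothetical sketch in plain Python of the map-filter-reduce pattern that Hadoop parallelizes at scale. The record format, field names, and `parse`/`summarize` functions are illustrative assumptions, not part of any Hadoop API; a real job would run as MapReduce, Hive, or Spark over HDFS files.

```python
from collections import defaultdict

# Hypothetical raw feed landing in the lake: "user_id,event,amount" records.
RAW_RECORDS = [
    "u1,purchase,19.99",
    "u2,purchase,5.00",
    "u1,refund,-19.99",
    "bad record",            # malformed rows are common in raw landing zones
    "u2,purchase,12.50",
]

def parse(record):
    """Map step: parse one raw record, returning None for malformed input."""
    parts = record.split(",")
    if len(parts) != 3:
        return None
    user, event, amount = parts
    try:
        return user, event, float(amount)
    except ValueError:
        return None

def summarize(records):
    """Reduce step: net amount per user -- the kind of summary fed to the EDW."""
    totals = defaultdict(float)
    for rec in map(parse, records):
        if rec is None:
            continue             # filter step: drop rows that failed parsing
        user, _event, amount = rec
        totals[user] += amount
    return dict(totals)

print(summarize(RAW_RECORDS))    # {'u1': 0.0, 'u2': 17.5}
```

The same three steps (parse, filter, aggregate) map directly onto Hadoop’s mappers and reducers, which is why this style of preprocessing offloads so naturally to the Data Lake.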
Data Integration and Pre-Processing:
As far as data integration is concerned, offloading ETL routines to the Data Lake has many performance and cost benefits, and existing ETL routines can be dramatically accelerated by Hadoop’s natively parallel processing. The Hadoop-based Data Lake can offload some data processing work from an EDW and host new analytics applications. The filtered, cleansed, and pre-processed data sets, or summarized results, can then be sent to the data warehouse for further analysis by business users.
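The hand-off from lake to warehouse is typically a bulk load of a flat extract. Below is a small, hypothetical Python sketch of that last step: rendering summarized results as a delimited file that a warehouse bulk loader (such as Redshift’s COPY) could ingest. The column names and sample rows are assumptions for illustration only.

```python
import csv
import io

# Hypothetical summarized rows produced in the lake (columns assumed).
SUMMARY_ROWS = [
    {"region": "east", "orders": 120, "revenue": 3450.00},
    {"region": "west", "orders": 98,  "revenue": 2711.50},
]

def to_warehouse_csv(rows):
    """Render summarized results as a CSV extract suitable for a
    warehouse bulk-load utility (e.g. a COPY command)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["region", "orders", "revenue"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

extract = to_warehouse_csv(SUMMARY_ROWS)
print(extract)
```

In practice the heavy lifting (filtering and aggregation) has already happened in the lake; only this compact, load-ready summary crosses over to the warehouse.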
Use an MPP to Supplement Your DW:
While the Data Lake is an excellent way to stage and integrate your organization’s diverse data sets from their native sources, it might not be the best place for massive SQL joins, metadata standardization, and data enrichment (SQL UPDATEs, DELETEs, etc.). An MPP platform such as AWS Redshift, Greenplum, Vertica, or one of the other parallel relational databases should still be considered.
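To illustrate the kind of set-based SQL work that belongs on an MPP rather than in the lake, here is a small sketch using Python’s built-in sqlite3 as a stand-in for the MPP database. The tables, columns, and segment rules are invented for the example; only the SQL patterns (a set-based UPDATE for enrichment, then a join with aggregation) are the point.

```python
import sqlite3

# SQLite stands in for an MPP database (Redshift, Greenplum, Vertica);
# the SQL patterns -- set-based updates and large joins -- are what matters.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (sale_id INTEGER, cust_id INTEGER, amount REAL);
    CREATE TABLE customers (cust_id INTEGER, region TEXT, segment TEXT);
    INSERT INTO sales VALUES (1, 10, 100.0), (2, 11, 250.0), (3, 10, 75.0);
    INSERT INTO customers VALUES (10, 'east', NULL), (11, 'west', NULL);
""")

# Data enrichment: a set-based UPDATE tagging customer segments.
conn.execute("""
    UPDATE customers
    SET segment = CASE WHEN region = 'east' THEN 'core' ELSE 'growth' END
""")

# A join plus aggregation -- the workload an MPP parallelizes across nodes.
rows = conn.execute("""
    SELECT c.region, c.segment, SUM(s.amount)
    FROM sales s JOIN customers c ON s.cust_id = c.cust_id
    GROUP BY c.region, c.segment
    ORDER BY c.region
""").fetchall()
print(rows)    # [('east', 'core', 175.0), ('west', 'growth', 250.0)]
```

On an MPP platform these same statements run in parallel across many nodes, which is why joins and updates at warehouse scale are better left there than forced into the lake.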
A well-built Hadoop Data Lake can make scaling DW capacity, increasing the number of users, and growing data reports much simpler. Today 3-10TB is the norm for an average-sized organization; soon the norm will be 10-100TB, and it will likely grow into the petabytes. Even experienced Hadoop Data Lake users say that a successful implementation requires a robust architecture and disciplined data governance policies. Without these, as I have seen in many organizations, the Data Lake becomes an out-of-control dumping ground for data that will never see the light of day.
Edward Yeh is a Principal Big Data consultant and solution architect with Intersys. He has over 20 years of data technology experience with a variety of data systems.