Traditional data warehousing was pioneered by Ralph Kimball and Bill Inmon in the 1980s and 1990s to address the growing problem of accessing data for reporting and analytics across multiple disparate data sources. The key drivers and goals of data warehousing are:
- Collect data from multiple sources into a single repository
- Expose data for reporting and analytics in a performant way
- Keep full history of data changes
- Provide a consistent, single-source-of-truth view of the organization's data
- Manage and improve data quality
- Expose the data to business users in an intuitive way for ad-hoc reporting and analytics
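The goal of keeping full history is classically met with versioned dimension rows (Kimball's Type 2 slowly changing dimension). A minimal Python sketch of the idea, with hypothetical field names and data:

```python
from datetime import date

# Hypothetical customer dimension that preserves full history:
# each change closes the current row and appends a new version.
history = [
    {"customer_id": "c1", "city": "Boston",
     "valid_from": date(2020, 1, 1), "valid_to": None},
]

def apply_change(history, customer_id, new_city, change_date):
    """Record a change without losing the previous version (SCD Type 2)."""
    for row in history:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            row["valid_to"] = change_date  # close the current row
    history.append({"customer_id": customer_id, "city": new_city,
                    "valid_from": change_date, "valid_to": None})

apply_change(history, "c1", "Denver", date(2021, 6, 1))
# Both versions survive, so point-in-time queries remain possible.
```

Because every prior version is retained, the warehouse can answer "what did we believe on date X?" rather than only "what is true now".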
Kimball’s and Inmon’s data warehousing methodologies addressed these goals well, and data warehousing became a fundamental part of enterprise IT infrastructure in the 1990s. The global information explosion is starting to change that. Data warehousing has struggled to keep up with rapidly growing and evolving data sources, falling short in several categories:
- Integrating a new data source is a lengthy and expensive process that requires data modeling, custom development of ETL jobs, and full regression testing
- Rigid traditional data models are often predisposed to a fixed set of questions and cannot accommodate a dynamic, ad-hoc analytics process. Unstructured and semi-structured data cannot be easily integrated
- Traditional architectures based on relational database technologies become prohibitively expensive to scale
To address these deficiencies, organizations are increasingly shifting from data warehousing to Data Lakes. While the standard definition of a Data Lake is still evolving, the key concepts are:
- Store data from each source system in its original, native format
- Ingest all data from each source system
- Perform analytics by combining data from different sources on the fly
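Combining raw-format sources on the fly might look like the following minimal Python sketch, where the file contents and field names are hypothetical: a CSV export from an order system is joined in memory with a JSON dump from a CRM, with no pre-built data model.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical raw extracts, kept in their original, native formats.
orders_csv = """order_id,customer_id,amount
1,c1,120.50
2,c2,75.00
3,c1,33.25
"""
customers_json = '[{"id": "c1", "name": "Acme"}, {"id": "c2", "name": "Globex"}]'

# Combine the two sources on the fly: look up customer names,
# then total order amounts per customer.
customers = {c["id"]: c["name"] for c in json.loads(customers_json)}
totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(orders_csv)):
    totals[customers[row["customer_id"]]] += float(row["amount"])

print(dict(totals))  # {'Acme': 153.75, 'Globex': 75.0}
```

In a real lake the same join would run over files in distributed storage via Spark or a similar engine, but the principle is the same: schema and relationships are applied at read time, not at load time.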
This popular brute-force approach to analytics has been made possible by technological advances in distributed storage and processing frameworks, primarily Hadoop and Spark. It quickly gained popularity with organizations that had experienced the pains and limitations of traditional data warehousing and needed a new approach to support agile analytics. The Data Lake excels as a data exploration platform: it allows analysts to quickly bring together all relevant internal and external data assets, then profile and analyze the data. It scales out easily and inexpensively to accommodate very large volumes of structured and semi-structured data. A good way to think of the data lake is as a data sandbox: open to experimentation, flexible, and unrestricted. This openness helped it gain popularity, and some have even declared that data warehousing may be dying. Is data warehousing really dead?
It is probably safe to say that the traditional way of implementing the data warehouse as we know it is dead. Modern data-driven organizations simply cannot afford the long development cycles and high costs associated with traditional data warehousing. It has to evolve to become more agile and efficient.
Can the Data Lake fully replace the traditional data warehouse? It falls short in several critical areas:
- Regular data refresh is very difficult to implement (HDFS files are immutable and do not support in-place updates)
- Security is immature
- Data Governance (metadata, master data, data quality) is very difficult to implement due to lack of row-level processing capability
- Because data is stored in raw form, consuming it is difficult and often limited to the data science community
- Query performance is inadequate due to the batch-oriented nature of Hadoop
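The refresh problem above comes from immutable storage: on a file system like HDFS, an "update" typically means merging changed records into a complete rewrite of the affected partition (copy-on-write). A minimal Python sketch of that pattern, with hypothetical record layouts:

```python
# Current contents of one partition in immutable storage.
existing = [
    {"id": 1, "status": "open"},
    {"id": 2, "status": "open"},
    {"id": 3, "status": "closed"},
]
# Hypothetical change feed arriving from the source system.
changes = [{"id": 2, "status": "closed"}]

changed = {rec["id"]: rec for rec in changes}
# No in-place edits: write a brand-new version of the whole partition,
# substituting changed records, then atomically swap it for the old files.
new_partition = [changed.get(rec["id"], rec) for rec in existing]

print(new_partition)
```

Rewriting entire partitions for a handful of changed rows is expensive, which is why regular, fine-grained refresh is so much harder in a lake than in a relational warehouse that supports row-level UPDATEs.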
What would a future analytics platform look like?
It needs to retain the strong features of traditional data warehousing, namely:
- Data refresh and history
- Data governance
- Real-time query performance
It has to evolve past the limitations of the traditional data warehouse, specifically:
- Rigidity and fragility
- High costs
- Long implementation timelines
Most importantly, it needs to be lean and agile to support constantly growing and evolving analytics.
Interested in learning more? Stay tuned for the continuation of this blog, Building Lean Reporting and Analytics Data Platform.