In my previous post, The Holy Grail of the Data Lake, I offered a definition of a Data Lake and reviewed the key elements that make one up, drawing on an IBM RedBook on the topic. The most obvious element is a set of Data Repositories: you need a place to store the different types of data coming into the Data Lake from your Systems of Record (i.e. business-critical systems) and Systems of Engagement (i.e. mobile and web interfaces).
A viewpoint that Hadoop-only vendors promoted at one point, and one that I still encounter with clients, is that the only solution you need for a Data Lake is Hadoop. The problem is that no single storage format or processing engine is best for all workloads. Hadoop and its associated Apache projects themselves comprise multiple data formats and engines. It’s notable that one of the Hadoop players, Cloudera, now positions itself as a data management platform provider rather than a Hadoop provider and breaks its offerings into “Analytic DB, Operational DB, Data Science and Engineering, and Cloudera Essentials” (screenshot from their site below). The bottom line is that you should not try to jam a round peg into a square hole; instead, do your due diligence on which repository or engine, whether Open Source or proprietary, best addresses a given use case.
The diagram below, taken from the previously mentioned RedBook, summarizes the different data domains that Data Lake Repositories support. One interesting point called out here is that a Data Lake contains not just the data that analytics will be performed on but also metadata, i.e. data that describes the data (Descriptive Data). As I will discuss, this metadata is critical to making the Data Lake useful for an organization’s lines of business.
There are different ways to architect a Data Lake in terms of repositories and their uses, and the right design largely depends on an organization’s needs. Still, the diagram above gives an idea of the different uses for data and the corresponding repositories needed:
- Descriptive Data: metadata about the data assets in the Data Lake and a search index to allow users to easily find data for analytics use cases, supporting a “shop for data” experience. Information views refer to semantic or virtualized views of data providing a simplified view of some data sets for subsets of users.
- Deposited Data: an area for users to contribute their own data, or store intermediate data sets or analytic results they have developed.
- Historical Data:
  - Operational History – historical data from systems of record. It can be used for some reporting, as an archive for an active application, or maintained for compliance reasons for decommissioned applications. It can be considered a landing zone where data is in a format similar to that of the operational source.
  - Audit – a record of who is accessing what data in the reservoir.
- Harvested Data: data from outside the Data Lake that may have been cleansed, combined, or converted into a different form from that in the source applications in order to support analytics.
  - Deep Data – supports different types of data at high volume, storing data with or without a schema, and supporting analytics on structured and semi-structured/unstructured data.
  - Information Warehouse – consolidated historical view of structured data for high performance analytics.
- Context Data – the organization’s operational data: master data (e.g. customer records), reference data (e.g. country code tables), and business content and media (e.g. PDFs, audio files).
- Published Data – data refined and targeted at particular consumers.
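
To make the Descriptive Data idea a bit more concrete, here is a minimal Python sketch of a catalog with a keyword search index, the kind of thing that could sit behind a “shop for data” experience. It is an illustration only, not how the RedBook or any product implements it. The class names (`Zone`, `CatalogEntry`, `Catalog`), the fields, and the sample data assets are all hypothetical, and the audit log here simply records who searched for what as a stand-in for the Audit repository.

```python
from dataclasses import dataclass
from enum import Enum


class Zone(Enum):
    """Repository groupings taken from the list above."""
    DEPOSITED = "Deposited Data"
    HISTORICAL = "Historical Data"
    HARVESTED = "Harvested Data"
    CONTEXT = "Context Data"
    PUBLISHED = "Published Data"


@dataclass
class CatalogEntry:
    """Descriptive Data about one data asset (hypothetical fields)."""
    name: str
    zone: Zone
    description: str
    tags: tuple = ()


class Catalog:
    """Toy metadata catalog with a keyword index and an audit record."""

    def __init__(self):
        self._entries = {}    # asset name -> CatalogEntry
        self._index = {}      # lower-cased keyword -> set of asset names
        self.audit_log = []   # (user, keyword) pairs, one per search

    def register(self, entry):
        """Add an asset and index its description words and tags."""
        self._entries[entry.name] = entry
        words = set(entry.description.lower().split())
        words |= {tag.lower() for tag in entry.tags}
        for word in words:
            self._index.setdefault(word, set()).add(entry.name)

    def search(self, user, keyword):
        """Return matching entries sorted by name, and log the search."""
        self.audit_log.append((user, keyword))
        names = sorted(self._index.get(keyword.lower(), set()))
        return [self._entries[name] for name in names]


# Hypothetical assets, purely for illustration.
catalog = Catalog()
catalog.register(CatalogEntry(
    "orders_raw", Zone.HISTORICAL,
    "Order records landed from the billing system", ("orders",)))
catalog.register(CatalogEntry(
    "customer_360", Zone.PUBLISHED,
    "Curated customer view for marketing", ("customer", "orders")))

hits = catalog.search("analyst_1", "orders")
print([(entry.name, entry.zone.value) for entry in hits])
```

The point of the sketch is that the catalog stores only descriptions of the assets, not the assets themselves: a user finds data by searching the metadata, and each entry says which repository the underlying data lives in.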