Data Lake Repositories

In my previous post The Holy Grail of the Data Lake I offered a definition of a Data Lake and reviewed the key elements that make it up, drawing on an IBM RedBook on the topic. The most obvious element is a set of Data Repositories – you need a place to store the different types of data coming into the Data Lake from your Systems of Record, i.e. the business-critical systems, and your Systems of Engagement, i.e. mobile and web interfaces.

A viewpoint that was disseminated by Hadoop-only vendors at one point, and one that I still encounter with clients, is that the only solution you need for a Data Lake is Hadoop. The problem is that no single storage format or processing engine is best for all workloads. Hadoop and its associated Apache projects themselves comprise multiple data formats and engines. It’s notable that one of the Hadoop players, Cloudera, now positions itself as a data management platform provider rather than a Hadoop provider and breaks its offerings into “Analytic DB, Operational DB, Data Science and Engineering, and Cloudera Essentials” – screenshot from their site below. The bottom line is that you should not try to jam a round peg into a square hole; instead, do your due diligence on which repository/engine, whether Open Source or proprietary, best addresses a given use case.


(No longer just Hadoop – Cloudera website)

The diagram below, taken from the previously mentioned RedBook, summarizes the different data domains Data Lake Repositories support. Something interesting called out here is that a Data Lake contains not just the data analytics will be performed on, but also metadata, or descriptive data about the data (Descriptive Data) – this data, as I will discuss, is critical to making the Data Lake useful for an organization’s lines of business.


There are different ways to architect a Data Lake in terms of repositories and their uses, and this largely depends on an organization’s needs, but the above provides an idea of the different types of uses for data and corresponding repositories needed:

  1. Descriptive Data: metadata about the data assets in the Data Lake and a search index to allow users to easily find data for analytics use cases, supporting a “shop for data” experience. Information views refer to semantic or virtualized views of data providing a simplified view of some data sets for subsets of users.
  2. Deposited Data: an area for users to contribute their own data, or store intermediate data sets or analytic results they have developed.
  3. Historical Data:
    1. Operational History – historical data from systems of record, which could be used for some reporting, as an archive for an active application, or maintained for compliance reasons for decommissioned applications. Can be considered a landing zone where data is in a format similar to that in the operational source.
    2. Audit – record of who is accessing what data in the reservoir.
  4. Harvested Data: data from outside the Data Lake which may have been cleansed, combined, or converted into a different form than that in the source applications in order to support analytics.
    1. Deep Data – supports different types of data at high volume, storing data with or without a schema, and supporting analytics on structured and semi-structured/unstructured data.
    2. Information Warehouse – Consolidated historical view of structured data for high performance analytics.
  5. Context Data – organization’s operational data – master data e.g. customer records, reference data e.g. country code tables, and business content and media e.g. PDFs, audio files.
  6. Published Data – data refined and targeted at particular consumers.
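To make the “shop for data” idea behind the Descriptive Data repository a little more concrete, here is a minimal sketch of a catalog with a search index over data set metadata. Everything in it (the `CatalogEntry` and `Catalog` classes, the repository labels, the sample data sets) is hypothetical and invented for illustration; it is not taken from the RedBook or from any product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Descriptive data about one data set held in some repository."""
    name: str
    repository: str  # hypothetical labels, e.g. "deep_data"
    description: str
    tags: set = field(default_factory=set)


class Catalog:
    """A toy in-memory search index over catalog entries."""

    def __init__(self):
        self._entries = []

    def register(self, entry):
        self._entries.append(entry)

    def search(self, term):
        """Return entries whose name, description, or tags mention the term."""
        needle = term.lower()
        return [
            e for e in self._entries
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle in t.lower() for t in e.tags)
        ]


catalog = Catalog()
catalog.register(CatalogEntry(
    "customer_master", "context_data",
    "Golden customer records", {"customer", "master data"}))
catalog.register(CatalogEntry(
    "web_clickstream", "deep_data",
    "Raw click events from the web channel", {"clickstream", "semi-structured"}))
catalog.register(CatalogEntry(
    "sales_history", "information_warehouse",
    "Consolidated sales facts by month", {"sales", "customer"}))

# An analyst "shopping" for customer-related data sets.
hits = [e.name for e in catalog.search("customer")]
print(hits)
```

The point of the sketch is that the search runs over metadata only; the data itself stays in whichever repository suits it.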

The Holy Grail of the Data Lake

Many, many businesses today are striving to build a “Data Lake” (also called a Data Reservoir or Logical Data Warehouse) for their organization. In my experience they are all undertaking this with the goal of making more agile, self-service, IT-independent analytics available to the lines of business (LOBs). Often they also do not have a clear idea of what a successful Data Lake initiative really entails. Some simply deploy a Hadoop cluster and load in all their data with the expectation that this is all that is required, which leads to that other often-referenced concept, a “Data Swamp”.

A simple early definition of a Data Lake is “a storage repository that holds a vast amount of raw data in its native format until it is needed”.

IBM’s definition emphasizes the central role of a governance and metadata layer and that the Data Lake is a set of data repositories rather than a single store: “a group of repositories, managed, governed, protected, connected by metadata and providing self service access”.

So keep in mind:


As well as this high level view of the components:


Mandy Chessell is a Distinguished Engineer and Master Inventor in IBM’s Analytics CTO office who is a thought leader on the Data Lake and has worked with customers such as ING on implementations of it. Her RedGuide and RedBook on the topic provide a wealth of information. She defines the three key elements of a Data Lake as follows:

Data Lake Repositories – provide platforms both for storing data and running analytics as close to the data as possible.

Data Lake Services – provide the ability to locate, access, prepare, transform, process, and move data in and out of the data reservoir repositories.

Information Management and Governance Fabric – provides the engines and libraries to govern and manage the data in the data reservoir. This set of capabilities includes validating and enhancing the quality of the data, protecting the data from misuse, and ensuring it is refreshed, retained, and eventually removed at appropriate points in its life cycle.

In my view it’s the Data Lake Services that pose the greatest challenge to deliver for most customers. This is because:

a) being able to locate the right data requires commitment and ownership from the LOBs to continuously catalog/label their data via a data catalog and,

b) while there are many tools for enabling self-service data movement, data virtualization/federation, and metadata management, I don’t believe there is a single out-of-the-box silver bullet for all applications, and the right solution may vary depending on your data repositories and priorities.


Quantifying the Value of Data Governance

Data Governance is often viewed as a cost of doing business rather than a driver of business value and innovation. But it is in fact critical to the agility of an organization’s analytics initiatives and its ability to make informed business decisions. The explosion of available data, the reduced cost of analyzing it, and pressure from the business to leverage it make delivering access to quality data more important than ever. What good is all that data if it takes you weeks to find what you need, get access to it, and validate that it is appropriate to base critical decisions on? So how do you put a dollar value on an investment in data governance software?

A starting point for the business case is the amount of time employees spend searching for, understanding, and cleansing data via highly manual processes, and how much of an improvement a fit-for-purpose tool set can deliver. How much time do your business analysts or IT staff spend doing the following:

  • Understanding what data sets they need to gain certain analytic insights
  • Determining the quality and origin of each data set
  • Understanding/agreeing on what a certain term or formula in a report means
  • Cleansing data

The other side of the coin is the revenue lost to the business through poor data quality, or through the business lacking insight into what data its analytics are based on or what data sets are available to it. Some potential ways to approach this are looking at:

  • Delayed rollout or failure of analytics initiatives
  • Cost of false claims not identified (fraud)
  • Fines associated with bad data, ungoverned processes or data exposure
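The two sides of the business case above can be put into a rough back-of-the-envelope calculation. The sketch below is only an illustration: the function names and every input figure (head count, hours, hourly rate, expected reduction, avoided losses, tool cost) are placeholder assumptions, not numbers from any study, and should be replaced with your own estimates.

```python
def annual_labor_cost(analysts, hours_per_week, hourly_rate, weeks_per_year=48):
    """Yearly cost of time spent finding, understanding, and cleansing data."""
    return analysts * hours_per_week * hourly_rate * weeks_per_year


def governance_business_case(analysts, hours_per_week, hourly_rate,
                             time_reduction_pct, avoided_losses,
                             annual_tool_cost):
    """Combine labor savings and avoided losses against the tool cost."""
    current_cost = annual_labor_cost(analysts, hours_per_week, hourly_rate)
    labor_savings = current_cost * time_reduction_pct / 100
    net_value = labor_savings + avoided_losses - annual_tool_cost
    return {
        "current_labor_cost": current_cost,
        "labor_savings": labor_savings,
        "net_annual_value": net_value,
    }


# Placeholder inputs only; none of these figures come from real data.
case = governance_business_case(
    analysts=20,             # staff who regularly hunt for and clean data
    hours_per_week=10,       # hours each spends on those manual tasks
    hourly_rate=60,          # loaded cost per hour
    time_reduction_pct=30,   # assumed improvement from better tooling
    avoided_losses=100_000,  # assumed fines, fraud, or delays avoided
    annual_tool_cost=150_000,
)
for name, value in case.items():
    print(f"{name}: {value:,.0f}")
```

The value of writing it down this way is less the final number than the fact that each input becomes an explicit assumption the business can challenge.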

Data Integration on Hadoop – Why Reinvent the Wheel?

A common reason given for starting a new project on Hadoop that appears to duplicate existing capability is that the existing solution is just not built to scale to large data volumes. Often that’s a valid argument, but in the case of Data Integration/Data Quality, there are many mature solutions already on the market. Are they all hamstrung when it comes to big data integration?

IBM’s Information Server, a well-established Data Integration solution, initially featured some capability that allowed pushdown of its workload to Hadoop via MapReduce. Of course, MapReduce has over time been shown not to be the most performant tool and has essentially been superseded by Spark’s in-memory engine. But customers have been using the Information Server engine itself in its scale-out configuration for big data transformation for many years, in very large clusters. From this reality, I surmise, came the decision to unleash the Information Server engine directly as an application on YARN, as BigIntegrate and BigQuality. The diagram below shows how the engine runs on YARN; at the core of it is an Information Server Application Master, which negotiates resources for IS processes with the ResourceManager.


How have other integration vendors designed their Big Data solutions? Talend, which initially also pushed workload down into MapReduce, has switched over to converting its jobs to Spark. This is logical since Spark is much faster than MapReduce, but I expect it also involves some significant coding effort to get right. Informatica’s approach seems a bit more confused, or nuanced – they promote their “Blaze” Informatica engine, also running on YARN, but suggest that their solution “supports multiple processing paradigms, such as MapReduce, Hive on Tez, Informatica Blaze, and Spark to execute each workload on the best possible processing engine”. I think this is just because at the end of the day the Informatica engine wasn’t built to handle true big data volumes.

There’s always the option of doing data integration directly with Hadoop itself, but there’s not much in the way of a solution there. You can use Sqoop to bring data in or out, but you’ll still end up writing HiveQL and hundreds of scripts.




Master Data Management (Still) at the Core of a Customer-Centric Strategy

Being able to compile an accurate view of an existing or prospective customer is a tremendous competitive differentiator. Knowing your customer can allow you to make the right upsell offer at the right time to drive increased revenue, to take the right action at the right time to prevent losing a dissatisfied customer, and to provide a tailored experience which builds customer loyalty/reduces customer churn.

Knowing your customer involves understanding who they are across your entire business and in today’s social age, even outside of it. Different parts of the business may identify one customer slightly differently, and have different relevant pieces of information about that customer. How do you bring all this together? This is the domain of Master Data Management – building a single “golden” view of a customer from many different sources and allowing seamless access to this view to consuming applications. If you think enabling this is straightforward, and can be built from scratch in-house, you should give your head a good shake. You can only re-invent so many wheels with your IT department.

With the explosion of available data, Master Data Management is more important than ever. Now, in addition to your various internal sources of record from which to build a picture of your customer you have countless social and other third-party sources. Matching customer identity across all of these and viewing relationships between customers is more crucial than ever!
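To give a flavor of why matching customer identity across sources is harder than it sounds, here is a deliberately naive sketch using Python’s standard `difflib`. The matching rule, the threshold, the “keep the longest value” survivorship rule, and the sample records are all assumptions made up for illustration; real MDM products use far more sophisticated probabilistic matching, and this is not how any of them is implemented.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case and drop punctuation so trivial differences vanish."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def is_same_customer(a, b, name_threshold=0.85):
    """Assumed rule: same e-mail, or a similar name in the same postal code."""
    if a["email"] and a["email"].lower() == b["email"].lower():
        return True
    name_score = SequenceMatcher(
        None, normalize(a["name"]), normalize(b["name"])).ratio()
    return name_score >= name_threshold and a["postal_code"] == b["postal_code"]


def build_golden_records(records):
    """Greedily cluster matching records, then merge each cluster."""
    clusters = []
    for record in records:
        for cluster in clusters:
            if any(is_same_customer(record, member) for member in cluster):
                cluster.append(record)
                break
        else:
            clusters.append([record])

    golden = []
    for cluster in clusters:
        merged = {"sources": sorted(r["source"] for r in cluster)}
        for key in ("name", "email", "postal_code"):
            # Assumed survivorship rule: keep the longest value seen.
            merged[key] = max((r[key] for r in cluster), key=len)
        golden.append(merged)
    return golden


# Invented sample records from three hypothetical source systems.
records = [
    {"source": "crm", "name": "Jon A. Smith",
     "email": "jon.smith@example.com", "postal_code": "M5V 2T6"},
    {"source": "billing", "name": "Jon Smith",
     "email": "", "postal_code": "M5V 2T6"},
    {"source": "web", "name": "J. Smith",
     "email": "JON.SMITH@example.com", "postal_code": ""},
    {"source": "crm", "name": "Maria Gomez",
     "email": "maria.gomez@example.com", "postal_code": "H2X 1Y4"},
]

golden = build_golden_records(records)
for view in golden:
    print(view)
```

Even this toy version has to make decisions about normalization, thresholds, conflicting values, and the order in which records arrive, which is a hint of how much work sits behind a production-grade “golden” view.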



Speeding up Your Big Data Journey

Hadoop and now Spark can be made out to be the answer to every question, to be the cure for cancer – the Kool-Aid is very powerful. It’s important to remember that while these technologies are certainly game-changers, they do not solve every problem (yet!) and come with familiar challenges.


It seems that when it comes to Hadoop/Spark, considerations around ease of implementation and total cost of ownership can easily go out the window. Some businesses believe they will stand up a cluster and reap tremendous insights in a week. They may also proceed into a Hadoop initiative without having defined use cases for their business. The reality is that with just the core components of the platform you will probably need to hire an army of data scientists/developers to get value out of your data. Not an issue for companies named Google and Amazon but more so for most others.

Two things can make Hadoop a more palatable and realistic proposition for the masses:

  1. Analytics accelerators which make analytics on Hadoop accessible to people without a PhD in mathematics. IBM offers these kinds of “value-adds” – for example BigSheets, which allows analysis of data on Hadoop via a spreadsheet-like interface. SQL on Hadoop also provides a great easy win for the platform, allowing data discovery or the querying of archived data directly on Hadoop with SQL tools. IBM’s BigSQL has been shown to be a strong contender in this area.
  2. Cloud – eliminating the upfront setup of a cluster as well as ongoing administration is huge. Not many of the major Hadoop vendors have a true SaaS offering for Hadoop. Nearly every customer is considering which existing or new workloads to put on the cloud; Hadoop-as-a-service makes a lot of sense. IBM’s “eHaaS” offering is BigInsights on Cloud.


The In-Memory DB Hypewagon – Quick Thoughts

Recent years have seen many stories touting the impact of in-memory database management systems (IMDBMSes), which load up all data into memory, as the new direction for ultra-fast, real-time processing. With each week a new VC-funded entrant seems to arrive with claims of being the fastest database ever and accelerating processing by 100X, 1000X, 10000X! Gartner late last year put out a Market Guide for In-Memory DBMS which advises IT decision-makers to investigate this disruptive technology. Poor old relational database management systems are slagged as relics, dinosaurs optimized for on-disk processing.

There’s no doubt the performance of all DBMSes benefits from providing enough memory to fit all data in memory. The reason IMDBMSes are presented by their proponents as gazillions of times faster than traditional databases (even when they are given enough memory to load up all the data into the cache) is because they are designed from the ground up to optimize for in-memory operations vs. disk-based processing.

It seems to me however, that the optimizations that really account for the biggest performance gains can and in many cases have been successfully built into our aforementioned poor “traditional” RDBMSes. In the case of analytics, I would say it is the columnar data structure that is really the game-changer, facilitating efficient aggregation and calculation on individual columns. DB2 10.5 added BLU Acceleration – with this capability you are able to create columnar storage tables in the database. You can provision enough memory for all of them to be loaded up but BLU is also optimized for working on large data sets which do not fit into memory, minimizing I/O from disk. When SAP HANA, one of the supposed “pure” in-memory database players, first came out SAP said that all data should/will be in memory. They have since backtracked with SAP IQ for Near-Line Storage for HANA and more recently HANA Dynamic Tiering for Extended Storage (for “non-active” data). The customer performance results I have seen with BLU are no less impressive than those of HANA, and are typically accomplished with a ton less hardware. Microsoft SQL Server and Oracle have also come out with their own columnar store offerings, the Columnstore Index and Database In-Memory, respectively.
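The reason a columnar layout helps analytics can be shown with a toy example. The sketch below stores the same made-up orders both row-wise and column-wise and sums one column; it only illustrates the data layout idea and says nothing about how BLU Acceleration, HANA, or any other product is actually implemented. The table and column names are invented.

```python
from array import array

# Row-oriented layout: each record carries all of its columns together.
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 80.5},
    {"order_id": 3, "region": "east", "amount": 42.25},
    {"order_id": 4, "region": "north", "amount": 300.0},
]

# Column-oriented layout: one contiguous sequence per column.
columns = {
    "order_id": array("q", (r["order_id"] for r in rows)),
    "region": [r["region"] for r in rows],
    "amount": array("d", (r["amount"] for r in rows)),
}


def total_amount_row_store(table):
    """Every full record is visited just to read one field from it."""
    return sum(record["amount"] for record in table)


def total_amount_column_store(table):
    """Only the 'amount' column is touched; other columns are never read."""
    return sum(table["amount"])


row_total = total_amount_row_store(rows)
column_total = total_amount_column_store(columns)
print(row_total, column_total)
```

Both functions return the same total; the difference is how much data has to be read to get it. With wide tables on disk, an aggregate over the column layout reads a small fraction of the bytes, and values of a single type stored together also tend to compress well.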

On the OLTP side of the house, I have not seen any real evidence that in-memory optimizations have resulted in tremendous performance acceleration. The major difference maker, which is nothing new, is turning off logging so you avoid touching disk altogether but this is not applicable to all application uses. While grandiose claims of XXXX times speedup for OLTP are common by IMDBMS vendors, substantiated results are harder to find. I would refer you to a very funny old article on how these claims or “benchmarks” are often put together – MySQL is bazillion times faster than MemSQL, a response to a MemSQL “result”. SQL Server In-Memory OLTP, an offering by a traditional vendor, has received some plaudits. But from what I have seen, acceleration is not broadly applicable to an entire workload, is quite limited without the use of C stored procedures (not exactly an in-memory optimization or even a new one), and perhaps has something to do with addressing SQL Server locking issues. I absolutely understand that there are optimizations that can and have been done for more efficient in-memory processing for OLTP, I just don’t know that they have delivered a game-changing performance boost.


My Thoughts on Microsoft-sponsored Report on Analytics Platform System

Having had clients reach out to me for my perspective on Value Prism Consulting’s Microsoft-sponsored report Microsoft Analytics Platform System Delivers Best TCO – to – Performance, I thought I would take some time to share some of my thoughts on this piece of work. I will provide some comparisons to the IBM PureData System for Analytics, given my background in this offering.

1. First, the report makes the case that what I would call a yet unproven high-end data warehousing offering will deliver the best performance for $ invested because it has the most cores, storage, I/O bandwidth, and memory. I could go ahead and take any given free open-source DBMS with an MPP scale-out architecture, deploy it on loads of hardware and make the same argument, but performance for complex analytics on large data sets, without extensive tuning, is hard!

The Microsoft Analytics Platform System’s first iteration was available in 2010. Last I checked there were only 3 case studies available for customers running the latest SQL Server 2012 Parallel Data Warehouse Edition on a Microsoft appliance. I had also seen a figure of “20+” given in one Microsoft presentation as the total number of customers running the Analytics Platform System with Microsoft’s Clustered Columnstore Indexes (only available in the latest 2012 software). By contrast, the IBM PureData System for Analytics customer count is in the many hundreds, including data sizes of over a petabyte. $ per TB/core/GB of RAM is not equal to $ per unit of performance, especially when you don’t have that many proof points.

2. I see some issues with how Full Time Equivalent (FTE) per rack, which represents the labor costs for administering a system, is calculated for both the Analytics Platform System and the competition. As I said already, performance for complex analytics over large data sets is hard. For some offerings this kind of performance is achievable, but requires extensive tuning work. As the report itself says, many vendors brand their offering an appliance and claim leading Total Cost of Ownership.

In the case of the Analytics Platform System, Microsoft and Value Prism claim simplicity of use and hence lowest labor costs by suggesting that the product is just SQL Server, SQL Server skills are easily transferable, and so labor costs will be no higher than for administering SQL Server. This strikes me as nonsense. Vanilla SQL Server is SMP, not scale-out, and average deployments still tend to the smaller side. Administering SQL Server on a few cores is not the same as administering an MPP/scale-out system running deep analytic queries against tens or hundreds of TB of data. Scale-out introduces additional tasks like monitoring for data skew. Also, my understanding is that SQL Server Parallel Data Warehouse Edition is a hybrid of Microsoft’s DATAllegro acquisition and SQL Server, and it still lacks complete feature set/syntax compatibility with regular SQL Server (Microsoft recently delivered some improvements here in its AU3 update).

By comparison, the PureData System for Analytics has long been proven for ease of management, with many customer testimonials attesting to performance out of the box with no tuning as well as simplicity of administration (unlike the APS systems, which run on hardware from Dell, HP, or Quanta, it was built from the ground up as integrated hardware and software with unique hardware acceleration). Somehow the Value Prism report claims 4 FTEs are required per rack of PureData System for Analytics while referencing an ITG report which says something completely different.

“Among 21 PureData System for Analytics users, 18 reported that they employed less than one FTE administrator. The exceptions were an organization that declined to state the number of systems employed, but described the installation as over one petabyte (one FTE was employed); and others reporting more than 20 and more than 30 systems respectively (two FTEs were employed).”

For the 4 customers profiled in detail in the ITG report, the FTEs per rack works out to about 0.34 (2.4 FTEs/7 racks). Where on earth did the 4 FTE per rack number come from?
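The arithmetic is simple enough to check directly. The short sketch below uses only the two figures quoted above for the profiled customers (2.4 FTEs across 7 racks) and the 4 FTEs per rack claimed in the Value Prism report; the function name is just illustrative.

```python
def fte_per_rack(total_ftes, total_racks):
    """Administration effort per rack, in full-time equivalents."""
    if total_racks <= 0:
        raise ValueError("total_racks must be positive")
    return total_ftes / total_racks


itg_ratio = fte_per_rack(2.4, 7)  # figures quoted from the ITG report
claimed_ratio = 4.0               # figure used in the Value Prism report
overstatement = claimed_ratio / itg_ratio

print(f"ITG-derived FTEs per rack: {itg_ratio:.2f}")
print(f"Claimed figure is {overstatement:.1f}x higher")
```

On those inputs the claimed labor requirement is more than ten times what the referenced report supports.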


3. I don’t think I need to say much more, but the last thing I’ll add is that even the numbers for hardware resources per rack appear to be inaccurate. The number of cores in a rack of PureData System for Analytics is given as 112, whereas the number is in fact double that, with another 112 cores filtering data off the storage tier. This would make the total number of cores on the system significantly higher than on the Analytics Platform System, but these cores are conveniently excluded.

Part 2: Governance in the Big Data Age

The unstructured or semi-structured data an organization generates has tremendous value. It may be less valuable, say on a per-MB basis, than the information stored in your data warehouse, and it may be some time after beginning to collect the data before the value is realized, but there’s no doubt that the details of the minute interactions between customers and systems can be leveraged to transform a business. Moreover, Hadoop is increasingly positioned as a landing zone for ALL of an organization’s data, structured and unstructured, where exploratory analysis can be performed, as well as an archive for aged data from a data warehouse – see this blog post by a colleague to see where Hadoop fits in the IBM Watson Foundations vision. It is obvious then that an organization’s Hadoop store will generally contain sensitive data. It could be sensitive personal information governed by regulation or simply valuable and proprietary information, but it needs to be secured just the same as it would be in a traditional relational data store.

As I hinted in my last post, the importance of governance of Big Data initiatives was something that was considered early on in IBM’s BigInsights development. Fortunately, IBM already had leading capabilities in-house for security and data privacy and extended these capabilities to the Big Data space. InfoSphere Data Privacy for Hadoop allows an organization to secure their Hadoop environments by:

  1. Defining and sharing big data project blueprints, data definitions – define a big data glossary of terms, define sensitive data definitions and policies
  2. Discovering and classifying sensitive big data – discover sensitive data and classify it
  3. Masking and redacting sensitive data within and for Hadoop systems – de-identify sensitive data either at the source or within Hadoop, and obfuscate data whether structured or unstructured
  4. Monitoring Hadoop Data Activity – monitor big data sources and the entire Hadoop stack and issue alerts as necessary, gather audit information for reporting purposes
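To illustrate what steps 2 and 3 mean in practice, here is a minimal sketch of discovering and masking sensitive values in free text using Python’s standard `re` and `hashlib` modules. It is emphatically not the InfoSphere Data Privacy for Hadoop API: the pattern set, function names, salt, and sample record are assumptions invented for illustration, and real products use much richer classifiers and stronger de-identification techniques than a truncated salted hash.

```python
import hashlib
import re

# Hypothetical sensitive-data definitions (step 1 would normally govern these).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def discover(text):
    """Classify a piece of text by the sensitive-data types it contains."""
    return sorted(
        label for label, pattern in PATTERNS.items() if pattern.search(text))


def mask(text, salt="demo-salt"):
    """Replace each sensitive value with a deterministic token."""
    def make_replacer(label):
        def replace(match):
            raw = (salt + match.group(0)).encode("utf-8")
            digest = hashlib.sha256(raw).hexdigest()[:8]
            return f"<{label}:{digest}>"
        return replace

    for label, pattern in PATTERNS.items():
        text = pattern.sub(make_replacer(label), text)
    return text


# Invented sample record.
record = "Customer jane.doe@example.com called about SSN 123-45-6789."
found = discover(record)
masked = mask(record)
print(found)
print(masked)
```

Because the same input always yields the same token, masked values can still be joined or counted in analytics without exposing the original value, which is one reason deterministic masking is attractive for data landing in Hadoop.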


Part 1: Governance in the Big Data Age

The revelation earlier this year that hundreds of thousands of Facebook users were unknowingly subjects in a psychology experiment in 2012 caused a widespread negative reaction. According to this WSJ article, “Researchers from Facebook and Cornell University manipulated the news feed of nearly 700,000 Facebook users for a week in 2012 to gauge whether emotions spread on social media.” Another interesting read comes from Doug Henschen of InformationWeek, titled “Mining WiFi Data: Retail Privacy Pitfalls”. In this article Doug speaks to the value that retailers can realize by mining WiFi data but also the potential pitfalls of being able to track and store the minute behaviors of individuals.

So of course Facebook is not the only organization with a burgeoning wealth of personal customer data; every business looking to gain an edge in its industry is looking to store every piece of data it generates (including data on every single customer interaction) and at some point gain valuable insight from it. Every business with a Big Data initiative needs to carefully consider data privacy and security ramifications. And beyond the ethical decisions around use of data that must be considered is how technology supports governance of data – how is access to data limited and tracked, how do you know what personal data you are storing and how do you mask it?

The critical importance of governance for the success of a Big Data initiative is something IBM recognized very early and something it has invested heavily in for its BigInsights Hadoop offering. I wanted to take a few posts to take a closer look at capabilities for governance included in BigInsights – where they come from, how they work and the business problems they address.