Part 1: Governance in the Big Data Age

The revelation earlier this year that hundreds of thousands of Facebook users were unknowingly subjects in a 2012 psychology experiment caused a widespread negative reaction. According to this WSJ article, “Researchers from Facebook and Cornell University manipulated the news feed of nearly 700,000 Facebook users for a week in 2012 to gauge whether emotions spread on social media.” Another interesting read comes from Doug Henschen of InformationWeek, titled “Mining WiFi Data: Retail Privacy Pitfalls”. In it, Doug speaks to the value retailers can realize by mining WiFi data, but also to the potential pitfalls of being able to track and store the minute behaviors of individuals.

Of course, Facebook is not the only organization with a burgeoning wealth of personal customer data; every business looking to gain an edge in its industry wants to store every piece of data it generates (including data on every single customer interaction) and at some point gain valuable insight from it. Every business with a Big Data initiative needs to carefully consider the data privacy and security ramifications. And beyond the ethical decisions around use of data, there is the question of how technology supports governance of data – how is access to data limited and tracked? How do you know what personal data you are storing, and how do you mask it?
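On that last question, here is a toy sketch of what field-level masking looks like in principle – the field names and rules are invented for illustration and are not tied to any particular product's masking feature:

```python
import hashlib
import re

def mask_email(email: str) -> str:
    """Replace the local part of an email with a short hash, keeping the domain."""
    local, _, domain = email.partition("@")
    digest = hashlib.sha256(local.encode()).hexdigest()[:8]
    return f"{digest}@{domain}"

def mask_ssn(ssn: str) -> str:
    """Keep only the last four digits of a US SSN."""
    digits = re.sub(r"\D", "", ssn)
    return "***-**-" + digits[-4:]

record = {"name": "Jane Doe", "email": "jane.doe@example.com", "ssn": "123-45-6789"}
masked = {**record, "email": mask_email(record["email"]), "ssn": mask_ssn(record["ssn"])}
print(masked)  # analysts can still join/count on the hashed email without seeing it
```

The hard part in a real Big Data platform isn't the masking function itself – it's knowing which columns hold personal data in the first place and enforcing the rule everywhere that data lands, which is exactly where governance tooling comes in.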

The critical importance of governance for the success of a Big Data initiative is something IBM recognized very early and has invested heavily in for its BigInsights Hadoop offering. Over the next few posts, I want to take a closer look at the governance capabilities included in BigInsights – where they come from, how they work, and the business problems they address.


Part 4: SQL Server Parallel Data Warehouse – Best Thing Since Sliced Bread?

Read Part 1

Read Part 2

Read Part 3

I was going to keep this series to three parts, but Microsoft's announcement this month of the upcoming next iteration of its Parallel Data Warehouse-based offerings changed that. The Parallel Data Warehouse-based systems have been re-branded as the Analytics Platform System, and the next set of offerings will include an optional HDInsight component (Microsoft’s Windows-based Hadoop offering, based on the Hortonworks Data Platform). Quanta, which I had never heard of before listening in to the announcement, has been added as a PDW-based system provider alongside HP and Dell, though apparently its offering can only be sold in the US and China. Though it looks like the basic architecture of the solutions will not change in V3 beyond the HDInsight addition, I assume some hardware updates are possible (e.g., processor updates).

My main thought on the major part of this announcement, the HDInsight addition, is that adding Hadoop will not address the shortfalls of PDW as a high-end warehousing solution. It is certainly a nice checkbox to check off, but customers evaluating different solutions should not be blinded by a Hadoop bolt-on. Also, HDInsight first became generally available at the end of 2013, so the maturity of this piece is somewhat in question – I see it limited to use as a “warm archive” for warehouse data.


Part 3: SQL Server Parallel Data Warehouse – Best Thing Since Sliced Bread?

Read Part 1

Read Part 2

In-Database Operations – Analytics and Data Transformation

An important principle for working with big data sets is to minimize data movement – keep the processing close to the data. One aspect of this is leveraging in-database functions. PureData System for Analytics has far more pre-built in-database functions for data transformation, predictive and statistical analysis, and data mining tasks, among others, than other established DBMSes like Oracle and Teradata. Many third-party analytics vendors have also built their own functions in PDA to keep the processing within the database (the PDA SDK allows you to write your own functions in MapReduce, Java, Python, Lua, Perl, C, C++, Fortran, and PMML). I have not heard or seen much about these capabilities in PDW.
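To make the principle concrete, here is a minimal sketch using Python's built-in sqlite3 as a stand-in database (this is not PDA's actual SDK or function library): pushing the aggregation into the engine means only the small result set moves, not every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 250.0), ("west", 75.0)])

# Anti-pattern: pull every row to the client, then aggregate there.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount

# In-database: push the aggregation down so only the totals move.
pushed = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

assert totals == pushed
print(pushed)
```

On a laptop with three rows the difference is invisible; on an MPP appliance with billions of rows, the first pattern drags the whole table across the network while the second ships back a handful of totals.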

To keep up with the processing speed of an MPP system, you need a data transformation/integration process that can keep up. Many basic ETL tools like SSIS would have difficulty here, as parallelizing the workload is manual and complex. The best practice for PDW, as I understand it, is to land the data in the system first and then use SQL to work with it – an ELT approach, sketched below. How much time will your developers spend creating and modifying these functions if PDW lacks the equivalent pre-built data transformation functions of PDA?
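A minimal sketch of that land-then-transform (ELT) pattern, again using sqlite3 as a stand-in and with invented table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Step 1: land the raw data as-is in a staging table
# (on an MPP system this is a fast parallel load).
conn.execute("CREATE TABLE stg_orders (order_id TEXT, amount TEXT, order_date TEXT)")
conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)",
                 [("1001", " 19.99", "2014-04-01"),
                  ("1002", "250.00", "2014-04-02")])

# Step 2: transform with SQL inside the database (CTAS-style),
# so the engine parallelizes the work instead of an external ETL server.
conn.execute("""
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(TRIM(amount) AS REAL) AS amount,
           order_date
    FROM stg_orders
""")
print(conn.execute("SELECT * FROM orders").fetchall())
```

The pattern only pays off, of course, if the database has the functions to do the transformations in the first place – which is the point of the comparison above.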

Another option for PDA is data transformation tooling that lets you build data flows in a graphical interface and then automatically pushes their execution down to PDA. Both IBM Information Server and Informatica PowerCenter support this.

Cost

Microsoft sponsored a report by Value Prism Consulting, published in March 2013, comparing the cost of Parallel Data Warehouse to competitor offerings, including IBM PureData System for Analytics. The total estimated costs of a rack of PDA and a rack of PDW in this report are actually within $20K of each other. The report argues that based on cost per TB of user data, per CPU, and per GB of memory, PDW is the best choice for high-end warehousing. To me this is like evaluating a meal at a restaurant based solely on pounds of food per dollar: if you served me a giant bag of low-end dog food, I wouldn't recommend your place to my friends. Analytics is about whether or not you can get the deep insights you need from your data in a timely and cost-effective manner. Two additional notes: the core count used for IBM's PDA does not include the FPGA cores, which dramatically changes the number, and there is now an updated version of PureData System for Analytics with more processing resources overall.

Part 2: SQL Server Parallel Data Warehouse – Best Thing Since Sliced Bread?

Read Part 1

Appliance or Bunch of Packaged Components and Ease of Use

It’s easy to call something you can order with a single product number an appliance or integrated system. Certainly, prepackaging all the components together – and taking the work of designing a balanced architecture and initial configuration out of the equation – saves on time-to-value and deployment effort. The other side of the coin is the ongoing effort for administration and new development. PureData System for Analytics, or Netezza, was built from the ground up as an integrated hardware-plus-software solution for optimal analytic performance without extensive tuning work. It uses a unique Asymmetric Massively Parallel Processing architecture in which a dedicated FPGA layer filters out 95-98% of table data, keeping only the data needed to answer a specific query – you can find more detail on the architecture here.

PDW, as I mentioned, was born out of some combination of DATAllegro and SQL Server. Which elements of which technology are used where in PDW is not clear to me, though Microsoft papers speak of SQL Server running on each of the nodes. So what you have is an MPP variant of SQL Server running on HP or Dell hardware densely packed with processing resources. SQL Server, as I recall, does involve administration and performance tuning, and moving to a shared-nothing architecture can dramatically increase complexity. I understand that if you use PDW’s ColumnStore Indexes you don’t need to deal with traditional index creation, but I’m not convinced this translates into zero or minimal tuning. And you can’t get a good understanding of PDW administration or tuning tasks from the outside, because the PDW documentation and community are only accessible to PDW customers (wonder why…).

Performance, Concurrency, Workload Management 

Performance and concurrency go hand in hand. It’s great if you send a single complex query through a system and it runs in seconds – you often hear shiny statements like “this query ran in 5 seconds, 5 gazillion times faster than on the previous system”, where the previous system could be my two-year-old laptop or an abacus for all you know (an in-memory DB with a four-letter name comes to mind). But if a DBMS cannot effectively manage the execution of requests to make the most of its resources, performance can hit rock bottom.

Parallel Data Warehouse uses a fairly basic workload management scheme. A rack of PDW has 32 concurrency slots, or shares of CPU and memory. By default, an incoming query gets one of these – a 1/32 share of resources. After V1, PDW development must have realized that more complex queries might require more than this 1/32 share, and created resource classes (small, medium, large, and extra large) that you can map queries to in order to give them a larger share. PureData System for Analytics provides a number of approaches to managing workloads efficiently. Two of these are Guaranteed Resource Allocation (GRA) and Short Query Bias (SQB). GRA assigns a minimum and maximum share of resources to groups of queries, so that while a set of queries is guaranteed its minimum share, it can use up to its maximum if resources are available. SQB automatically lets queries with short expected run times execute even if larger queries are in progress.
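To illustrate the difference between the two PDA policies just described, here is a toy model in Python – the group names, percentages, and thresholds are all invented for illustration, and this is in no way Netezza's actual scheduler:

```python
# Toy model of Guaranteed Resource Allocation (GRA) and Short Query Bias (SQB).
# All numbers are invented; this is not Netezza's actual implementation.

SHORT_QUERY_SECS = 2  # SQB: queries expected to finish under this always run

# GRA: each group is guaranteed a minimum share of resources but may
# borrow up to its maximum when other groups are idle.
groups = {
    "etl":       {"min": 0.50, "max": 0.80, "active": True},
    "reporting": {"min": 0.30, "max": 0.60, "active": True},
    "adhoc":     {"min": 0.20, "max": 0.40, "active": False},
}

def effective_share(name: str) -> float:
    """Minimum share plus an equal split of what idle groups free up, capped at max."""
    idle = sum(g["min"] for g in groups.values() if not g["active"])
    active = [n for n, g in groups.items() if g["active"]]
    share = groups[name]["min"] + idle / len(active)
    return min(share, groups[name]["max"])

def admit(query_group: str, est_secs: float) -> bool:
    """SQB: short queries run immediately, regardless of what else is in flight."""
    if est_secs <= SHORT_QUERY_SECS:
        return True
    return groups[query_group]["active"]

print({n: round(effective_share(n), 2) for n, g in groups.items() if g["active"]})
print(admit("reporting", est_secs=1.5))  # short query admitted via SQB
```

Contrast this with a fixed-slot scheme: with 32 equal slots, a dashboard query and a monster ELT job compete for the same-sized slices unless an administrator has manually mapped them to the right resource class.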

Though I do not have detailed comparisons to share externally at this point, I will say that from accounts I have heard of head-to-head situations, PDW has struggled with a high number of concurrent queries running across a large data set compared to PDA/Netezza. Don’t take my word for it – put it to the test!


Part 1: SQL Server Parallel Data Warehouse – Best Thing Since Sliced Bread?

Update: As of an April 2014 announcement, Microsoft is calling the upcoming next iteration of its Parallel Data Warehouse-based offerings the Analytics Platform System, and relative unknown Quanta joins Dell and HP as a hardware provider.

With my first post, I wanted to take a look at the capabilities of Microsoft’s SQL Server Parallel Data Warehouse offering and contrast it with a more established offering, IBM’s PureData System for Analytics – still probably better known today as Netezza.

Parallel Data Warehouse (PDW) is an offering you can order from HP, called the AppSystem for Parallel Data Warehouse, or from Dell, called the Dell Parallel Data Warehouse, both running SQL Server 2012 Parallel Data Warehouse edition. Parallel Data Warehouse Edition combines capabilities from Microsoft’s SMP-only SQL Server and from its 2008 DATAllegro acquisition.

PDW Background/History

When general availability of PDW V1 was first announced in November 2010, the message seemed to me to be that the MPP (massively parallel processing), or shared-nothing, architecture of PDW was something new and revolutionary, rather than a technology leveraged by other vendors for two decades for very large databases. IBM introduced DB2 Parallel Edition, today called the DB2 Database Partitioning Feature (DPF), in 1995; Netezza, today called PureData System for Analytics, came out in 2003; Teradata had an offering in the 80s. While it is positive that Microsoft introduced an option for customers hitting the wall with BI on SQL Server (whereas Oracle, for example, persists with its shared-disk RAC architecture for everything), many customers long ago recognized that shared nothing was the right approach for working with large data sets and have leveraged shared-nothing platforms to gain insights from their data. The small number of PDW case studies highlighted by Microsoft up to two years after the V1 release suggests adoption has been slow.

In the first half of 2013, PDW V2 came out with a significantly different architecture: moving from deploying the software directly on the servers in a rack to using Hyper-V virtualization, using JBODs (just a bunch of disks) instead of a SAN, and a one-rack starting format (vs. two racks in V1) with more CPU, memory, and disk. There were also a few database-level enhancements, the most notable being the use of ColumnStore Indexes for query performance improvement.

Proof Points

Reading about a product offering’s benefits in a vendor’s solution brief is great; hearing from actual customers is much better. PureData System for Analytics is used by over 1,000 customers, and they echo the same key points. Customers see excellent Total Cost of Ownership, not only on the software and hardware cost side but in terms of ongoing management cost – big data volumes tend to create big data complexity. They also invariably see excellent out-of-the-box performance for their most demanding analytic workloads. According to a January 2014 InformationWeek article on Big Data Analytics platforms by Doug Henschen, “There’s no doubt that Microsoft is amassing all the pieces, but it’s early days for HDInsight, and we still don’t see many PDW deployments after three years in the market.” While proof points are not everything in a world of rapidly evolving technology, they are worth paying attention to.

Stay tuned for the following parts…