Read Part 1
Read Part 2
In-Database Operations – Analytics and Data Transformation
An important principle for working with big data sets is to minimize data movement and keep the processing close to the data. One aspect of this is leveraging in-database functions. PureData System for Analytics has far more pre-built in-database functions for data transformation, predictive and statistical analysis, and data mining, among other tasks, than other established DBMSes like Oracle and Teradata. Many third-party analytics vendors have also built their own functions in PDA to keep the processing within the database (the PDA SDK allows you to write your own functions in MapReduce, Java, Python, Lua, Perl, C, C++, Fortran, and PMML). I have not heard or seen much about these capabilities in PDW.
To get the full benefit of an MPP system's processing speed, you need a data transformation/integration process that can keep up with it. Many basic ETL tools like SSIS would struggle to do this, because parallelizing the workload is manual and complex. As I understand it, the best practice for PDW is to land the data in the system first and then use SQL to work with it. How much time will your developers spend creating and modifying their own transformation functions if PDW lacks equivalents of PDA's pre-built ones?
Another option for PDA is data transformation tooling that lets you build data flows in a graphical interface and then automatically pushes their execution down to PDA. Both IBM Information Server and Informatica PowerCenter support this.
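To make the "land the data first, then transform it with SQL" pattern concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in database so the example runs anywhere; it is not PDA or PDW code, and the table names, column names, and the load_and_transform function are hypothetical. The point is only that the cleanup and aggregation run as SQL inside the database rather than row by row in an external tool.

```python
import sqlite3


def load_and_transform(rows):
    """Land raw rows in a staging table, then transform them with SQL in the database.

    sqlite3 is only a stand-in here; table and column names are hypothetical.
    """
    conn = sqlite3.connect(":memory:")

    # Step 1: land the raw data as-is (everything as text, no cleanup yet).
    conn.execute("CREATE TABLE staging_sales (region TEXT, amount TEXT)")
    conn.executemany("INSERT INTO staging_sales VALUES (?, ?)", rows)

    # Step 2: do the cleanup and aggregation in SQL, inside the database,
    # so the data never leaves the system while it is being transformed.
    conn.execute(
        """
        CREATE TABLE sales_by_region AS
        SELECT UPPER(TRIM(region)) AS region,
               SUM(CAST(amount AS REAL)) AS total
        FROM staging_sales
        GROUP BY UPPER(TRIM(region))
        """
    )

    result = conn.execute(
        "SELECT region, total FROM sales_by_region ORDER BY region"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    raw = [(" east", "10.5"), ("East ", "4.5"), ("west", "3")]
    print(load_and_transform(raw))
```

On an MPP system the second step is where the parallelism pays off, since the database distributes the SQL across its nodes. Whether that SQL is written by hand or generated by a pushdown-capable tool, the data stays in place.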
Microsoft sponsored a report by Value Prism Consulting, published in March 2013, comparing the cost of Parallel Data Warehouse to competitor offerings including IBM PureData System for Analytics. The total estimated costs of a rack of PDA and a rack of PDW in this report are actually within $20K of each other. The report argues that, based on cost per TB of user data, per CPU, and per GB of memory, PDW is the best choice for high-end warehousing. To me, this is like evaluating a meal at a restaurant based solely on pounds of food per dollar. If you served me a giant bag of low-end dog food, I wouldn’t recommend your place to my friends. Analytics is about whether or not you can get the deep insights you need from your data in a timely and cost-effective manner. Note also that the core count for IBM’s PDA in the report does not include the FPGA cores, which changes the number dramatically, and that there is now an updated version of PureData System for Analytics with more processing resources overall.