Sunday, December 16, 2012

Binning 40 Million Rows on Greenplum, SAS HPBIN


Binning often happens once a model universe is built. A typical credit risk modeler can spend more than 20% of a project cycle on binning. Let me call this Type I binning. Another area where analysts often bin data is data exploration and data management, where the exercise is more ad hoc than analytically premeditated ("Can we just break these 2 billion rows into 10 bins and check out the distribution? I cannot read anything meaningful from the original curve"). Let me call this Type II.
  1. Whether Type I practitioners are going to build models on bigger data is anybody's guess. Even if they do decide to embark on, say, building a random forest model on 40 million rows by 100 input variables, binning may not be considered necessary. Some practitioners, regardless of what model they build, are against binning anyway. I believe that while domains like credit risk scorecards are predicated on binning, many analytical applications involving the rich detail of big data need to weigh the pros and cons of binning carefully: it is genuinely hard to say whether binning makes signals clearer or not, and "binning does not guarantee 'good' binning" is a strong argument. The choice hinges on the information value of individual data elements; it ought not to be a strategic decision, but more of a game-time decision.
  2. Type II areas, over the past 12 months or so, have been quietly orienting toward serious analytical practice. While the monitors used to display analytics keep getting sharper and sharper, advanced analytics is invading enterprise operations. While profiling remains a 'thing to do' on big data, methods like sequence alignment methods (SAM) are entering batting practice. To align successfully is, to a large extent, to bin successfully; analytics is not photography, after all. The tallest table I have heard of is from a SAS customer who wants to comb through ~20 billion rows several times a day.
SAS did not have a stand-alone procedure for binning until August 2012, when proc HPBIN was introduced as part of SAS 12.1 HPA (High-Performance Analytics). The focus of this writing is threefold: first, to show a syntax example of proc HPBIN; second, to show what it is like to bin a variable with 40 million rows on a Greenplum cluster; last, to touch on some new algorithms in the HPBIN procedure.
Part One: Syntax of HPBIN (this procedure has a lot in common with Enterprise Miner)
"
proc hpbin data=&indsn numbin=8 /*supports <=1000 bins*/ computequantile computestats
     bucket /*the other two methods are winsorized and pseudo_quantile; pseudo_quantile is a novel way to do quantile binning on big data. Details below*/
     output=_xin; /*you can write out binned data with or without replacing the original data*/
   performance host="&GRIDHOST" install="&GRIDINSTALLLOC";
   /*if you have a SAS HP grid installed, you can leverage parallel processing there; you can also run locally if you so choose*/
   var cr:;
   freq freq1;
   id acct;
   /*if the ID statement is used, only the ID variables and the binned results are included in the output data set; multiple IDs are supported. When the ID statement is not present, use REPLACE to overwrite the original variables*/
   /*code file=code;*/
run;
"
Part Two: Performance Impression

On a Unix server box where SAS Grid Manager governs 32 worker nodes with ~1.5 TB of RAM, a credit score variable with exactly 40 million rows resides in a data set of ~10 GB on a Greenplum cluster. Below are some SAS log details from applying HPBIN to that variable.
"
3921  proc hpbin data=&indsn numbin=8 /*pseudo_quantile*/ bucket
3921! out=_xin;
3922      var annual_profit /*cr: os: pur:*/;
3923      id acct_number;
3924  run;

NOTE: Binning methods: BUCKET BINNING .
NOTE: The number of bins is: 8
NOTE: The HPBIN procedure is executing in the distributed computing environment with 32 worker
      nodes.
NOTE: The data set _XIN has 40000000 observations and 2 variables.

NOTE: PROCEDURE HPBIN used (Total process time):
      real time           14.41 seconds
      cpu time            2.25 seconds

"
  1. If today's top big UNIX boxes are Ph.D.s, this box is more like an undergraduate freshman. Still, it is a decent big box. Unless your application is latency-sensitive, like fraud detection, this speed is acceptable.
  2. There were two concurrent jobs running at the same time.
  3. While I know performance does not scale exactly linearly with the number of rows, under the same conditions, if you bin 400 million rows of this variable, the total real time will likely be under 1 minute.
  4. The gap between CPU time and real time is significant, indicating that much of the elapsed time went to data movement rather than computation.
  5. I have not had a chance to test this on Hadoop, although I have run some SAS HPA work on Hadoop clusters. I suspect the performance there will be comparable to this box; you just need to reset your libname, as sketched below.
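Since item 5 mentions it, here is a minimal sketch of what "resetting your libname" might look like with SAS/ACCESS Interface to Hadoop against a Hive server. The host, port, schema, and table name are placeholders, not a tested configuration.
"
/* illustrative sketch: the same HPBIN call, with the source re-pointed at Hive */
libname hdplib hadoop server="hadoop-nn.mycorp.com"  /* placeholder host           */
                      port=10000 schema=default;     /* placeholder Hive settings  */

proc hpbin data=hdplib.score_universe numbin=8 bucket output=_xin;
   var annual_profit;
   id acct_number;
   performance host="&GRIDHOST" install="&GRIDINSTALLLOC";
run;
"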
Part Three: Something novel

In quantile binning, where sorting is essential, tall variables can pose enormous computational challenges, and tall variables in big data certainly get much taller. The procedure syntax appears very simple, but the sorting algorithm in the background is complex (winsorized binning does not fare much better). I don't want to turn this writing into manual reading or technical training, so just a brief quote: "The pseudo-quantile binning method in the HPBIN procedure can achieve a similar result with far less computation time...." You can indulge in more details when you get to read the SAS HPA documents. What matters is that this gives you a good glimpse of how SAS tackles big data computation: leverage its industry-dominant strength in algorithm research, innovation, and implementation.
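
If you want to try the pseudo-quantile method yourself, the switch is a single keyword on the PROC statement. A minimal sketch, reusing the variable and ID from Part Two:
"
/* pseudo-quantile binning: approximate quantile bins without a full global sort */
proc hpbin data=&indsn numbin=8 pseudo_quantile computequantile output=_xq;
   var annual_profit;
   id acct_number;
   performance host="&GRIDHOST" install="&GRIDINSTALLLOC";
run;
"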






