Sunday, December 1, 2013

Why Do So Many Feel the ROI on Big Data Is Not Paying Off? 12/2013

        In the past two years, I have seen Java programmers and computer scientists fire up Java and other fancy programming tools to do ... One example: building objective functions for a regression. With all due respect, for statistical software companies like SAS and SPSS this was cutting edge back when Clinton was in the White House. Some have recently asked SAS to open up its far more advanced big data analytics software and show them how its objective function is built, so that they can validate the one they are building in R. If you don't know David Ricardo's 'comparative advantage', Yahoo it now. Please don't tell me you should spend far more than ten weeks of your time (charging your client ~$300/hour?) to build a pair of shoes instead of spending $2000 to buy a pair off the shelf (very likely better built than your home-made version), because what? Because you are not a statistician?

    If your goal is to start a business in advanced analytics, hoping to go for an IPO and strike it big, that is fine, and starting from scratch is probably a necessary path, whether that means coding from ground zero or asking the open source community to contribute to your cause for free. For 99% of us in big data analytics, it is about enhancing the core business on hand. Why do more and more people today feel that the ROI on big data is not paying off? Forgetting your core competence and core business amid big data fever is one key reason: if you are not able to articulate "why not", this inability quickly becomes "why yes". Investing this way has obvious logic problems and is anti-analytics per se. What is the SIC (Standard Industrial Classification) code for analytics? There is none, because analytics permeates every SIC. Your job is to hold onto your SIC, and adopt and modernize your analytics.

    I see speeches and blogs where people toss out new terms and concepts for big data analytics. Often five minutes later I realize, "oh, is that just what statisticians call clustering? Kernel estimation? ....."  Not many read deeply into the past statistics literature these days. For some, if they cannot find it at, they start to think they have an innovation on their hands. One day I was asked to take a look at 'a design'. I suggested applying a KS (Kolmogorov-Smirnov) test. That test eventually eliminated ~750K lines of Java code the developer had been writing for more than three months. KS test? Isn't that what statisticians have been doing behind banks' firewalls for the past 15-plus years? Now you spend another three months coding KS in Java. You cannot match SAS. You switch to SPSS. Still nowhere close, while SAS and SPSS turn in consistent results, apart from cosmetic differences, on the same data set... My point? Integrity, whether on big data or small data, is far more important than scalability; scalability is actually the easier part.
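For readers who have not met the KS test, here is a minimal Python sketch of the two-sample KS statistic: the largest gap between the empirical distribution functions of two samples. The function name and the score values are made up for illustration; this is not the code from the story above, and it computes only the statistic, not the p-value a statistical package would also report.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical distribution functions of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    for x in set(a) | set(b):
        # Share of each sample that is <= x (the empirical CDF at x).
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical scores for two groups of accounts (made-up numbers).
good = [620, 640, 655, 670, 690, 710]
bad = [540, 560, 600, 625, 645, 660]
print(ks_statistic(good, bad))
```

A value near 0 means the two score distributions overlap almost completely; a value near 1 means they barely overlap.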

   Instead of checking out the fashion labels on each other's jackets, statisticians and non-statisticians should work together on big data analytics. Recently I had the honor of reviewing a friend's paper. I was very impressed by her creativity and her ability to use R. Then she asked, "When do you think SAS is going to implement it?" "Why do you ask?" I smiled into the webcam. A sheepish look came over her face: "You know, the authenticity part....." Creativity, nimbleness, and flexibility should meet and marry the 'king of algorithms'. The offspring should benefit all of us. If you want to surpass a giant, try standing on its head or shoulders to grow. If you choose to start afresh alongside it, chances are you will live in its shadow for a long time, if not for life.

   Another friend is a division chief at a big NYC hospital. Two days ago he told me his medical school is hiring computer scientists to work with biostatisticians. I am also seeing banks hire analysts with more diverse backgrounds, such as physics majors doing predictive modeling. This trend toward a multidisciplinary mix is healthy. Let us not dumb down or count out any major. Statistics is going to be a stalwart in big data for a long time to come. If you don't learn and adapt quickly, you become irrelevant, regardless of your major. In my experience, learning statistics is harder than learning machine learning. If you 'hate' statistics for that reason, I fully appreciate it and am with you, especially if the market does not appear to pay statisticians as much as it pays data scientists. On the other hand, if you take away the coding, programming, and system building, how much analytics is really left in much of data science? See for yourself.

Tuesday, May 21, 2013

Fitting Logistic Regression on 67 Million Rows Using SAS HPLOGISTIC

This post focuses on model processing performance using PROC HPLOGISTIC under SAS HPAS 12.1. Because the data set, the target, and the inputs are the same for every model, the AICC statistic is shown for each of them; when the data set, the target, or the inputs differ, it makes little sense to compare AICC, or performance statistics in general, across models.
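As a reminder of what AICC measures, here is a small Python sketch using the standard textbook definition (AIC plus a small-sample correction). The function name is mine, and I am assuming the textbook form; the HPLOGISTIC documentation states the exact variant the procedure prints.

```python
def aicc(log_likelihood, n_params, n_obs):
    """Corrected AIC: -2*logL + 2p + 2p(p + 1) / (n - p - 1)."""
    if n_obs - n_params - 1 <= 0:
        raise ValueError("need n_obs > n_params + 1")
    aic = -2.0 * log_likelihood + 2.0 * n_params
    return aic + 2.0 * n_params * (n_params + 1) / (n_obs - n_params - 1)
```

The log-likelihood is a sum over the rows of one particular data set for one particular target, which is why the number only ranks models fitted to the same rows and the same target.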

The jobs are processed on a Greenplum parallel system running SAS High Performance Analytics Server 12.1. The system has 32 worker nodes. Each node has 24 threads and 256 GB RAM. The data set has 67,164,440 (~67 million) rows. The event rate is 6.61%, or 4,437,160 events. The data set is >80 GB. After the EDA steps, 59 variables are entered to fit all the models.

One executive-summary-style observation: we are talking seconds and minutes.

Below is sample code for fitting a binary logistic regression using HPLOGISTIC

The following is SAS log for the job running inside SAS Enterprise Guide project flow

While much of HPLOGISTIC's model output is similar to that of PROC LOGISTIC from SAS STAT, some new output reflects a strong and renewed emphasis on computational efficiency in modeling. One good example is "Procedure Task Timing", which is now part of all SAS HP PROCs. Here are an example and details for three Newton-Raphson with Ridge (NRRIDG) models.

Finally, a juxtaposition of several technique and selection mixes, iteration, time spent and AICC

Some observations from this 'coarse' exercise (refined, depth work is planned for 2014)

1. The WHERE statement is supported. You can use PROC HPSAMPLE to insert a variable _partind_ into your data set, then use the WHERE statement to separate the training partition from the validation and test partitions. This way you don't have to populate separate data sets, each of which would now have a big footprint.

2. The RANUNI function is also supported. Unlike PROC HPREG, though, HPLOGISTIC does not yet support the PARTITION statement, which lets you leverage an external data set for variable selection and feature validation. The popular C statistic will be available in the June 2014 release of SAS HPAS 13.1.

3. It is becoming more and more obvious that the era of Fisher Scoring, in which most if not all modelers build one Exploratory Data Analysis (EDA) set to fit all techniques and selection methods, is passing. To max out the lift a big data set can afford, one may need to build one set per technique. The crash of the build using TECHNIQUE=NMSIMP in this exercise exemplifies this. NMSIMP is typically suitable for small problems; it is recommended that one sample down the universe first. The technique best suited to large data sets is Conjugate-Gradient Optimization, which is not shown in this post (I should have one tested on a Hadoop cluster ready for blogging soon).

4. Apparently more iterations do not necessarily mean better performance. This also seems to be the case with random forests and neural networks. In other words, there ought to be a saturation point for lift, for a given EDA on a given data set. One needs to reach that 'realization point' sooner rather than later to be more productive.

5. At data sizes such as 67 million rows, I doubt that HPSUMMARY, HPCORR, and the rest of the usual summary exercises are sufficient to understand the raw input at the EDA stage. Multicollinearity is another intriguing subject. The trend seems to be that it is being overridden by observation-wise swap tests for best subsets of variables. Kernel estimation and full visual data analysis, such as that offered by SAS Visual Analytics, are two other strong options.

6. Logistic regression, especially when armed with facilities such as SAS HPLOGISTIC, opens the door to innovative modeling schemes. More and more modelers are moving beyond "one row per account" to model directly on transactions. In the context of transactions, 67 million is really not considered big at all. So be bold. If your performance still suffers today with HPLOGISTIC, what is holding you back is the hardware, not the software. (Well, if you insist on using SAS 9.1 of 2008, refuse to move on to HPLOGISTIC of 2013, and keep declaring 'SAS is slow', there is nothing anyone can do about it.)

7. Analytically, if you build directly on transactions instead of rolling up and summing up, you are purely modeling behavior. However, if your action is required at the account or individual level, how to roll up your transaction model scores to that level is an interesting question.
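The partition-indicator idea in observation 1 can be sketched in a few lines of Python. This is only an analogy to what HPSAMPLE plus a WHERE statement do: the helper names, the 70/30 split, and the 1/0 coding of _partind_ are my assumptions, not SAS's behavior.

```python
import random


def add_partition_indicator(rows, train_share=0.7, seed=12345):
    """Tag each row in place with _partind_ (assumed coding: 1 = training,
    0 = validation) instead of writing out separate data sets."""
    rng = random.Random(seed)
    for row in rows:
        row["_partind_"] = 1 if rng.random() < train_share else 0
    return rows


def where(rows, partind):
    """Lazily yield the rows of one partition, like a WHERE filter.
    No row is copied; the generator just skips the others."""
    return (row for row in rows if row["_partind_"] == partind)


rows = add_partition_indicator([{"id": i} for i in range(10)])
print([row["id"] for row in where(rows, 1)])
```

The point is the footprint: one extra column on one table, rather than two or three physical copies of a data set that is already >80 GB.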

Sunday, May 19, 2013

Mining 108 Million Text Messages in 7 Minutes: SAS High Performance Text Mining HPTMINE

The job is processed on a Greenplum parallel system running SAS High Performance Analytics Server 12.1. The system has 32 worker nodes. Each node has 24 threads with 256GB RAM.

The text data is a text-type column in a SAS data set. The total file size is ~187 GB. About 108 million text cells (messages) are processed. Cell weights, document weights, and the SVD are computed.

The following picture shows detailed processing log of the SAS job

Below is detailed speed information for each computing step inside the whole job. Parsing takes ~70% of the time.

Finally, a snapshot of the frequency-term table

Saturday, April 13, 2013

Generalized Linear Model Structure and Nonlinear Model Structure in SAS STAT

The SAS STAT product has so many modeling tools to offer that one is sometimes confused about which covers what
cases and data structures. Below is a summary diagram I took from a training course SAS offers.

Again, a picture speaks volumes. This diagram is two years old. I believe 90% of it has stayed the same since.

Some, such as GENMOD and GLIMMIX, may be considered for a move to the HP platform. NLIN and MIXED already have their big data counterparts in SAS HPA's HPNLIN and HPMIXED.

Friday, April 12, 2013

SAS Clustering Solution Overview, just One Picture

More and more people I meet, friends included, have told me lately that they see many SAS procedures related to clustering but are not clear about the interrelations among them (which one does what). In a training course offered by SAS titled "Applied Clustering Techniques", I found a diagram that does a good job of explaining it.

As we often say, a picture is worth a thousand words. Take a look.

SAS High Performance Text Mining: SAS HPTMINE

Currently there is one text mining procedure in SAS HPA, HPTMINE (experimental), which actually works fairly well. This post presents one working example.
The text file contains ~216K news entries, with a total file size of ~384 MB. The example runs on a Windows client with 16 GB RAM.

proc HPTMINE data=doc2.news2;
   doc_id id2;   /*ID variable is required*/
   variable description; /*listing multiple variables may cause confusion*/
   parse outterms = doc2.out_terms_news reducef=2;

 /*frequency for term filtering: minimum frequency of occurrence by which a term is dropped*/
   /*nostemming entities= stop= start= multiterm= syn= termwgt= cellwgt= outchild= outterms=*/

  /*all these options can be turned on and off. Weighting is important in tweaking process*/
   svd k=10 outdocpro=doc2.docpro_news

  /*this is critical math part in the whole exercise. In some cases you act on direct frequency*/
   svdu=doc2.news_svdu  /*left singular vector*/
   svdv=doc2.news_svdv; /*right singular vector*/
   /*tol=  tolerance value for singular value*/
   /*resolution =low|med|high
   performance host="&GRIDHOST" install="&GRIDINSTALLLOC" details;*/
run ;


This procedure integrates several separate procedures available in regular SAS Text Miner, so as to reduce the I/O traffic caused by running them separately. The advantage of this integration is more pronounced when the input text file is huge. The integration also centralizes the logic before parallel computation is invoked to execute the job. This specific example is not executed on parallel nodes.
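To make the parse step a little more concrete, here is a toy Python sketch of frequency filtering and cell weighting. It is not what HPTMINE does internally: the whitespace tokenizer, the log2 cell weight, and the reading of the threshold as a minimum number of documents are all my assumptions, and the SVD step is not shown.

```python
import math
from collections import Counter


def parse_documents(docs, min_docs=2):
    """Return {doc_id: {term: log2(count + 1)}}, keeping only the terms
    that occur in at least min_docs documents (assumed filtering rule)."""
    counts = {doc_id: Counter(text.lower().split())
              for doc_id, text in docs.items()}
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for c in counts.values() for term in c)
    kept = {term for term, n in doc_freq.items() if n >= min_docs}
    return {
        doc_id: {t: math.log2(f + 1) for t, f in c.items() if t in kept}
        for doc_id, c in counts.items()
    }


# Made-up mini corpus keyed by a document ID, as the doc_id statement requires.
news = {1: "rates rise again", 2: "rates fall", 3: "rates rise rise"}
print(parse_documents(news))
```

The resulting sparse term-by-document weights are the kind of input an SVD step would then compress into a handful of dimensions (k=10 in the example above).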

Below are some log details, less than 2 minutes for the operation


Below are screen shots of term probability table and term-frequency matrix. The mechanics of the whole operation is very intuitive. To get desired outcome often requires time-consuming tweaking. The upside is using all defaults could very well


Wednesday, March 13, 2013

SAS Random Forest Modeling on Hadoop, Proc HPFOREST

The example is essentially the same as the random forest model I covered in my 11/23/2012 blog post, with some minor adjustments. The key difference is that this instance is implemented on a Cloudera CDH 4.0 Hadoop cluster, whereas the previous one was built on a Greenplum appliance. The SAS High Performance Analytics client runs on a Linux box. The grid computes with the Hadoop cluster: 16 worker nodes + a head node, with a total of 1.5 TB RAM.

This is what the client interface looks like. JBoss is not the best, but it works OK.

This random forest model uses 280 interval variables, only 3 categorical variables, against a binary target, ~1.6 million rows. A snapshot of SAS log is below

  1. About 22 minutes to finish a random forest model, with 5 other big jobs running concurrently
  2. I ran it 5 times. It gives the same result every time, very consistent. The quickest run takes 20 minutes 14 seconds; the longest takes >26 minutes. It does not vary much. I could reduce it to seconds, but real time is not always necessary
  3. I changed vars_to_try from 3 to 17: 17*17=289 is the closest square to 283, the total number of input variables. The model improves quite a bit in terms of misclassification rate. It costs ~5 more minutes on average
  4. The data set I have is small, so I ran it on a small Hadoop cluster as a test. For jobs involving bigger data sets, you need to maintain and expand your clusters and grid network
  5. This mode, to use a term stolen from the large-scale predictive learning community, is an in-memory model. It appears that SAS is getting ready to 'industrialize' random forest models on large-scale data.
  6. I plan to publish some practice notes on how to prepare data for random forest modeling. Many have the mindset of building random forest models like pushing iPhone buttons, to avoid the typically lengthy exploratory data analysis involved in building, say, a logistic regression model. GOOD random forest models, however, require data preparation and tuning, just like GOOD logistic regression. The difference, in terms of dollars and sense, can be heaven and earth in some cases
  7. SAS HPA already supports Apache Hadoop. Will SAS run on MapReduce? We will see.
Thanks for viewing.
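The 17 in observation 3 looks like the familiar square-root rule of thumb for the number of variables tried per split. A tiny Python check, with a function name of my own choosing:

```python
import math


def vars_to_try_sqrt(n_inputs):
    """Square-root rule of thumb for variables tried at each split."""
    return round(math.sqrt(n_inputs))


# 280 interval + 3 categorical inputs = 283; sqrt(283) is about 16.8
print(vars_to_try_sqrt(280 + 3))
```

Whether the square root is the best setting for a given data set is an empirical question; here it simply beat 3 at a cost of ~5 more minutes.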

Wednesday, March 6, 2013

SAS PROC HPNEURAL vs. PROC NEURAL: Statement List Side by Side

PROC HPNEURAL is the neural network modeling PROC available in the SAS High Performance Analytics release. PROC NEURAL has been the solid workhorse behind SAS Enterprise Miner's Neural Network node. Many have asked me what the differences are. Sadly, I have not found time to write about it.

Below is a side-by-side list of the statements available in each, the tip of an overview.


Sunday, February 17, 2013

TOBIT and Limited Dependent Variable Modeling on Big Data: Making the Case for SAS HPQLIM

Tobit models are regression models where the range of the target variable is constrained or
limited in some way. For example, a testing kit used to measure lead levels in households cannot detect levels below
10 parts per billion; the target takes on the value 0 for measures of 0 to 9 parts per billion.

Tobit models apply where the response variable has bounds and takes on limiting values for
a large percentage of the respondents. Tobit models take into account the (over)
concentration of the observations at the limiting values when estimating and testing hypotheses
about the relationship between the target and the explanatory variables.
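How a Tobit model "takes into account" the pile-up at the limit is easiest to see in its log-likelihood. Below is a minimal Python sketch for a target left-censored at a limit with normal errors; the function names are mine, and this is the textbook likelihood, not the code inside QLIM or HPQLIM.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_logpdf(z):
    """Log of the standard normal density."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def tobit_loglik(y, xb, sigma, limit=0.0):
    """Log-likelihood of a Tobit model left-censored at `limit`.

    y: observed targets; xb: linear predictors x'beta for the same rows;
    sigma: standard deviation of the latent normal error.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    total = 0.0
    for yi, mu in zip(y, xb):
        if yi <= limit:
            # Censored row: we only know the latent value is at or below the limit.
            total += math.log(normal_cdf((limit - mu) / sigma))
        else:
            # Uncensored row: ordinary normal density of the residual.
            total += normal_logpdf((yi - mu) / sigma) - math.log(sigma)
    return total
```

Ordinary least squares would treat every limiting value as an exact measurement; the censored term above is what stops those rows from dragging the coefficients toward the limit.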

In the SAS product family, limited dependent variable modeling can use PROC QLIM, NLMIXED, or PHREG. In some designs, other variants of survival models could be formulated. This post focuses on the newly released high-performance PROC HPQLIM from SAS Institute, and on how HPQLIM, compared with the traditional PROC QLIM in SAS ETS, can support estimation involving many more variables and carry out modeling many times faster.

Two computing environments are used for this comparison. The first is a Windows Server 2008 R2 machine running Intel Xeon X7560, 2.27 GHz, 2.26 GHz (2 processors), with 16 GB RAM and a 64-bit OS. Let us call it OC (old client). The second is a Greenplum appliance, with 1 node managing a grid of 32 worker nodes, 96 GB per node. Let us call it GA.

The data are from a training course and have 19,372 observations and 27 variables (10 used as explanatory variables). To compare, I run one model on this data set as it is (call it model 0). I then stack the data set 400 times to get ~7.7 million observations, still with the same 27 variables, and run three models: PROC QLIM on OC (model 1), PROC HPQLIM on OC (model 2), and PROC HPQLIM on GA (model 3). Models 1, 2, and 3 yield identical models, which are meaningless in themselves because the data are simply stacked copies. The computational performance provided by HPQLIM will hopefully interest you, if not impress you. By the standards of database marketing models, 7.7 million observations is still considered small. As far as I know, for QLIM models this size is exceptionally large. While there are other reasons why not many variables are used in limited dependent variable models, I am sure the computational constraint in PROC QLIM is a major one.

As of this release, HPQLIM does not support the CLASS statement. I therefore comment out the CLASS statement from all four models.

The target distribution is below, so you can see why PROC QLIM may be suitable.

 The log below shows process details for model 0 using 19K observations. Real time is 4.60 seconds.

The next log is from running model 1 on 7.7 million observations: 38 minutes and 7 seconds of real time.

When HPQLIM is engaged (still running on OC, the old client), performance improves dramatically (see below). This is because SAS High-Performance PROCs by default take advantage of symmetric multiprocessing (SMP) mode if SMP is enabled on the computer, which is the case with OC.

Then I ran HPDS2 to copy the big data set to the GA appliance (25.92 seconds) and repeated the model using HPQLIM. The model is identical. The real time is 1 minute 24.20 seconds, with 21.21 seconds of CPU time. (I replicated it in 29 seconds before, but tonight GA appears a bit busy.)
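The numbers quoted in the text can be put side by side with a few lines of Python. Only the timings stated above are used (the model 2 timing lives in the screenshot, so it is left out), and the variable names are mine.

```python
def seconds(minutes=0, secs=0.0):
    """Convert a minutes-and-seconds reading from the SAS log to seconds."""
    return 60 * minutes + secs


rows = 19_372 * 400               # stacked data set: 7,748,800 rows
qlim_oc = seconds(38, 7)          # model 1: PROC QLIM on OC
hpqlim_ga = seconds(1, 24.20)     # model 3: PROC HPQLIM on GA, busy night
hpqlim_ga_best = seconds(0, 29)   # model 3: the earlier 29-second run

print(rows)
print(round(qlim_oc / hpqlim_ga, 1))       # speedup on the busy night
print(round(qlim_oc / hpqlim_ga_best, 1))  # speedup on the quieter run
```

Keep in mind that the two environments differ in hardware as well as software, so the ratio mixes both effects.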

I don't necessarily believe one should engage a table as tall as 7.7 million rows for QLIM. I think the better potential lies in widening the table by including more input variables for consideration. Some of my friends have told me they cannot realistically launch sample selection models and stochastic frontier models under PROC QLIM if the data set is relatively wide. Hopefully HPQLIM changes that.

Saturday, February 9, 2013

Moving Logistic Regression toward Big, Complex Data: SAS HPLOGISTIC Optimization Techniques

In August 2012, SAS Institute released the first version of its high-performance analytics server (HPAS 12.1). Throughout the ~25 HP procedures released, new designs and changes are consistently made to help users better meet today's big data challenges in predictive modeling, in terms of more efficient and finer algorithms, among others.

PROC LOGISTIC is one of the most popular and widely used procedures in SAS products for building logistic regression models. Covering all the major big data facilities in the new PROC HPLOGISTIC would likely yield a big paper. This post focuses on one key aspect, the optimization techniques for maximum likelihood estimation, and includes some excerpts from the HPLOGISTIC user guide where its presentation and explanation are the best.

Under PROC LOGISTIC, the default optimization technique is Fisher Scoring. One can change it to Newton-Raphson (NR). One can also set the Ridge option to Absolute or Relative. All of this is done in the MODEL statement. Under PROC HPLOGISTIC, Fisher Scoring disappears entirely. The default optimization technique is Newton-Raphson with Ridging, or NRRIDG. The table lists all the options.

While in practice (and in theory) there is little consensus as to which option fits which data conditions, HPLOGISTIC's user guide provides an excellent guideline:

          "For many optimization problems, computing the gradient takes more computer time than computing the function value. Computing the Hessian sometimes takes much more computer time and memory than computing the gradient, especially when there are many decision variables. Unfortunately, optimization techniques that do not use some kind of Hessian approximation usually require many more iterations than techniques that do use a Hessian matrix, and, as a result the total run time of these techniques is often longer. Techniques that do not use the Hessian also tend to be less reliable. For example, they can terminate more ".

The time taken to compute the gradient, the function value, and the Hessian (where applicable), and the number of decision variables involved, are among the key factors in the choice
  1. Second-derivative methods include TRUREG, NEWRAP, and NRRIDG (best for small problems for which the Hessian matrix is not expensive to compute. This does not necessarily mean that calculating the Hessian matrix is inexpensive, whether or not the problem is small. 'Small problems' still vary a lot)
  2. If you want to replicate an old model where Fisher Scoring was used, you can use NRRIDG. Where your target is binary, you may get identical results. Otherwise results may be slightly different (mainly the coefficient estimates)
  3. First-derivative methods include QUANEW and DBLDOG (best for medium-sized problems for which the objective function and the gradient can be evaluated much faster than the Hessian). In general, the QUANEW and DBLDOG algorithms require more iterations than the second-derivative methods above, but each iteration can be much faster. The QUANEW and DBLDOG algorithms require only the gradient to update an approximate Hessian, and they require slightly less memory than TRUREG or NEWRAP.
  4. Because CONGRA requires only a factor of p double-word memory, many large applications can be solved only by CONGRA. However, I personally feel the computational beauty of CONGRA may actually be overstated a bit
  5. All these insights and guidelines are of course to be vetted and reconciled with other key aspects such as selection criteria (selection in HPLOGISTIC, by the way, has become a separate statement under the procedure, unlike PROC LOGISTIC, where Selection is a MODEL statement option)
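To give a feel for what a second-derivative method does, here is a toy Python sketch of Newton-Raphson with a ridge safeguard for a logistic regression with an intercept and one slope. It illustrates the general idea only: the ridge rule (add a growing constant to the Hessian diagonal until the step no longer lowers the log-likelihood), the function names, and the data are my assumptions, not the internals of NRRIDG.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def loglik(b0, b1, xs, ys):
    """Bernoulli log-likelihood: sum of y*z - log(1 + exp(z))."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = b0 + b1 * x
        total += y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total


def fit_logistic_newton_ridge(xs, ys, max_iter=50, tol=1e-8):
    """Return (intercept, slope, iterations used)."""
    b0 = b1 = 0.0
    current = loglik(b0, b1, xs, ys)
    for iteration in range(1, max_iter + 1):
        # Gradient (g0, g1) and 2x2 information matrix (negative Hessian).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        ridge = 0.0
        for _ in range(60):
            a, d = h00 + ridge, h11 + ridge
            det = a * d - h01 * h01
            if det > 1e-12:
                # Solve (H + ridge*I) s = g for the 2x2 case by hand.
                s0 = (d * g0 - h01 * g1) / det
                s1 = (a * g1 - h01 * g0) / det
                candidate = loglik(b0 + s0, b1 + s1, xs, ys)
                if candidate >= current:
                    break
            # Step rejected or matrix near-singular: strengthen the ridge.
            ridge = 1e-4 if ridge == 0.0 else ridge * 10.0
        else:
            return b0, b1, iteration  # no improving step was found
        b0, b1 = b0 + s0, b1 + s1
        if candidate - current < tol:
            return b0, b1, iteration
        current = candidate
    return b0, b1, max_iter


# Made-up, non-separable toy data.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 1, 0, 1]
print(fit_logistic_newton_ridge(xs, ys))
```

Every iteration here needs the full Hessian, which is cheap with two parameters and expensive with thousands. That cost is the trade-off behind the guidelines above: first-derivative methods such as QUANEW avoid forming the exact Hessian and pay for it with more iterations.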
While SAS remains the very best and strongest statistical powerhouse, the big data orientation embedded in its HPAS release, as I hope this post shows, demonstrates its leading position among commercial machine learning solutions for tackling big data; SAS is very computational today. New HP procedures like HPLOGISTIC require the modeler to be very sensitive to and conscious of data conditions, complexities, and residuals in the model universe on hand. The ultimate value of SAS HPAS, like many other SAS solutions and tools, lies in its productivity implication: you don't build anything from ground zero. You don't even write a line of code.

My next writing on SAS logistic regression will cover selection criteria.

Saturday, February 2, 2013

Turning Score into Probabilistic Grouping: The Fraction Option In Proc Rank

The SAS code below turns raw score into probability based groups

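%let rankme = crscore;

/* FRACTION divides each rank by the number of nonmissing values; */
/* TIES=MEAN gives tied scores the average of their ranks         */
proc rank data=indsn(keep=&rankme.) fraction ties=mean out=outdsn;
  var &rankme.;
  ranks &rankme._ranked;
run;

/* Compare the raw score with its fractional rank */
proc means data=outdsn n nmiss min mean median max range std;
run;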

Variable         N       N Miss   Minimum      Mean        Median     Maximum   Range      Std Dev
CrScore          39779   0        365          493.69683   495        610       245        28.819612
crscore_ranked   39779   0        2.5139E-05   0.5000126   0.506637   1         0.999975   0.288662

The Fraction option parallels the Group option, which has been used most often and for the longest time. The Fraction option allows for probability-based grouping: it normalizes the distribution and bounds it between 0 and 1. One variation of the Fraction option is NPLUS1, which yields similar results.
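For those who want to see the arithmetic, here is a small Python sketch of fractional ranking with mean ties: rank ascending, average the ranks of tied values, then divide by n (or by n + 1 for the NPLUS1 variant). The function name is mine and this mimics the idea only; it is not the PROC RANK implementation.

```python
def fraction_ranks(values, nplus1=False):
    """Ascending ranks with tied ranks averaged, divided by n (or n + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        # Find the run of tied values beginning at sorted position `start`.
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0  # average of the 1-based ranks
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    denom = n + 1 if nplus1 else n
    return [r / denom for r in ranks]


print(fraction_ranks([10, 20, 20, 40]))
```

This matches the output above: the smallest fractional rank is 1/39,779, about 2.5139E-05, and the largest is 39,779/39,779 = 1.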

In this case, the original 39,779 observations are collapsed to 257 groups. The following is a portion of the group distribution


Computation-wise, the Fraction and NPLUS1 options are among the PROC RANK options supported through SAS in-database technology. As of today, February 2nd, 2013, the supported databases include Oracle, Teradata, Netezza, and DB2. The probabilistic grouping can be executed inside supported database tables without having to query and move big data into the SAS environment.

Wednesday, January 23, 2013

Machine Learning Using SAS Enterprise Miner: A Basic Comparison Example

        This post shows how one can leverage SAS Enterprise Miner 12.1 (“EM”), released August 2012, to build a large number of leading machine learning models in a short amount of time, by point-and-click. The comparison shown is mainly meant to organize the built models, not to support any conclusion about the strength of the methods. The selected data set has ~40K observations, with 12 predictor variables. The binary target variable ATTRITE has ~16%=1 (the data set is from a published data mining book; I forget which one).

The following screenshot shows the 16 models built (3 logistic regressions, 2 neural nets, 2 random forests, 1 memory-based reasoning (K nearest neighbor), 1 decision tree, 2 stochastic gradient boosting, 1 LARS regression, and 4 SVM models)

Below are comparison details for the 16 models

The two random forest models stand above the rest in misclassification rate and KS. Notice
  1. These models are built without much EDA (exploratory data analysis) work.
  2. A traditional decision tree is not far behind
  3. None of MBR, Boosting, SVM, and NN does very well, because there are only a dozen input variables. However, random forest still outshines them using few variables
  4. Logistic regression models (the two HPREG models) perform poorly, probably due to the default cutoff selection as well
I like Enterprise Miner because I can load and set up a large number of models (sometimes >100) quickly, easily tweak and manage their subtle differences, and pick the one that best fits my business domain. Model lineage and knowledge sharing are two other reasons.