Sunday, February 17, 2013

TOBIT and Limited Dependent Variable Modeling on Big Data: Making the Case for SAS HPQLIM

Tobit models are regression models in which the range of the target variable is constrained or
limited in some way. For example, a testing kit used to measure household lead levels may not register
anything below 10 parts per billion; the target then takes on the value 0 for any true level between 0 and 9 parts per billion.

Tobit models apply where the response variable has bounds and takes on limiting values for
a large percentage of the respondents. Tobit models take into account the (over)
concentration of the observations at the limiting values when estimating and testing hypotheses
about the relationship between the target and the explanatory variables.
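
For readers who have not used PROC QLIM, here is a minimal sketch of how such a left-censored (Tobit type 1) regression is typically specified; the data set and variable names are hypothetical, not the course data used later in this post:

"
/* Minimal Tobit (type 1) sketch in PROC QLIM.                        */
/* work.leadtest, lead_ppb, and x1-x3 are hypothetical names.         */
proc qlim data=work.leadtest;
   model lead_ppb = x1 x2 x3;
   endogenous lead_ppb ~ censored(lb=0);   /* left-censored at zero */
run;
"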

In the SAS product family, limited dependent variable modeling can be done with PROC QLIM, NLMIXED, or PHREG, and in some designs other variants of survival models can be formulated. This writing focuses on the newly released high-performance PROC HPQLIM from SAS Institute, and specifically on how HPQLIM, compared with the traditional PROC QLIM in SAS/ETS, can support estimation involving many more variables and carry out the modeling several orders of magnitude faster.

Two computing environments are used for this comparison. The first is a Windows 2008 Server R2 machine running two Intel Xeon X7560 processors at 2.27 GHz, with 16 GB RAM and a 64-bit OS; let us call it OC (old client). The second is a Greenplum appliance with 1 worker node managing a grid of 32 worker nodes, with 96 GB of RAM per node; let us call it GA.

The data used are from a training course and have ~19,372 observations and 27 variables (10 of which are used as explanatory variables). To compare, I first run one model on this data set as it is (call it model 0). I then stack the data set 400 times to get ~7.7 million observations, still with the same 27 variables, and run three more models: PROC QLIM on OC (model 1), PROC HPQLIM on OC (model 2), and PROC HPQLIM on GA (model 3). Models 1, 2, and 3 yield identical, and substantively meaningless, models; it is the computational performance of HPQLIM that hopefully will interest you, if not impress you. By database marketing standards, 7.7 million observations is still considered a small modeling sample, but as far as I know it is exceptionally large for QLIM models. While there are other reasons why not many variables are used in limited dependent variable models, I am sure the computational constraint in PROC QLIM is a major one.
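
Stacking the table is nothing fancy; a minimal sketch of one way to do it (data set names are hypothetical) is below:

"
/* Replicate every observation 400 times: ~19,372 x 400 = ~7.7 million rows. */
data work.stacked;
   set work.course_data;
   do copy = 1 to 400;
      output;
   end;
   drop copy;
run;
"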

As of this release, HPQLIM does not support the CLASS statement. I therefore comment out the CLASS statement in all four models.
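
For reference, here is a minimal sketch of what the HPQLIM call looks like with the CLASS statement commented out. The data set and variable names are hypothetical stand-ins for the course data, and I am assuming HPQLIM accepts the same MODEL / ENDOGENOUS syntax as QLIM:

"
/* Sketch of the HPQLIM version of the model; names are hypothetical.    */
proc hpqlim data=work.stacked;
   /* class segment; */                   /* CLASS not supported in this release */
   model target = x1 x2 x3 x4 x5 x6 x7 x8 x9 x10;
   endogenous target ~ censored(lb=0);    /* same censoring as the QLIM run */
run;
"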

The distribution of the target is shown below, so you can see why a PROC QLIM type of model may be suitable.


The log below shows the processing details for model 0 using the 19K observations: real time 4.60 seconds.
 

The next log is from running model 1 on the 7.7 million observations: 38 minutes and 7 seconds real time.


When HPQLIM is engaged (still running on OC, the old computer), performance improves dramatically (see below). This is because SAS High-Performance PROCs default to symmetric multiprocessing (SMP) mode when SMP is enabled on the computer, which is the case with OC.
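
If you want to see how the threads are used, or to pin the thread count, the PERFORMANCE statement is the place to look. A minimal sketch, with an arbitrary NTHREADS value and the same hypothetical names as above:

"
proc hpqlim data=work.stacked;
   performance nthreads=8 details;        /* DETAILS prints a timing breakdown */
   model target = x1 x2 x3 x4 x5 x6 x7 x8 x9 x10;
   endogenous target ~ censored(lb=0);
run;
"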


Then I ran HPDS2 to copy the big data set to the GA appliance (25.92 seconds) and repeated the model using HPQLIM. The model is identical. The real time is 1 minute 24.20 seconds, with 21.21 seconds of CPU time. (I have replicated it in 29 seconds before, but tonight GA appears a bit busy.)
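
For readers who have not run the high-performance procedures alongside an appliance, the pattern looks roughly like the sketch below: copy the table over with HPDS2, then point HPQLIM at the appliance-side table and request the worker nodes through the PERFORMANCE statement. The libref gplib, the table names, and the grid connection details are hypothetical; the DS2GTF.in / DS2GTF.out names are the reserved HPDS2 input and output tables as I recall them, and the grid host and install location are assumed to be configured elsewhere for your site, so verify against the documentation:

"
/* gplib is a hypothetical libref pointing to the Greenplum appliance.  */
proc hpds2 data=work.stacked out=gplib.stacked;
   performance nodes=all;
   data DS2GTF.out;                       /* pass the rows through unchanged */
      method run();
         set DS2GTF.in;
      end;
   enddata;
run;

proc hpqlim data=gplib.stacked;
   performance nodes=all details;         /* run distributed on the appliance grid */
   model target = x1 x2 x3 x4 x5 x6 x7 x8 x9 x10;
   endogenous target ~ censored(lb=0);
run;
"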


I don't necessarily believe one should engage a table as tall as 7.7 million rows for QLIM. I think the better potential lies in widening the table by including more input variables for consideration. Some of my friends have told me they cannot realistically launch sample selection models and stochastic frontier models under PROC QLIM when the data set is relatively wide. Hopefully HPQLIM changes that.

Saturday, February 9, 2013

Moving Logistic Regression toward Big, Complex Data: SAS HPLOGISTIC Optimization Techniques

In August 2012, SAS Institute released the first version of its High-Performance Analytics Server (HPAS 12.1). Across the ~25 HP procedures released, new designs and changes have been made consistently, more efficient and finer algorithms among them, to help users better meet today's big data challenges in predictive modeling.

PROC LOGISTIC is one of the most popular and widely used procedures across SAS products for logistic regression model building. Covering all the major big data facilities in the new PROC HPLOGISTIC would likely yield a big paper. This writing focuses on one key aspect, the optimization techniques for maximum likelihood estimation, and includes some excerpts from the HPLOGISTIC user guide where its presentation / explanation is the best.

Under PROC LOGISTIC, the default optimization technique is Fisher scoring. One can change it to Newton-Raphson (NR), and one can also set the ridging option to absolute or relative; all of this is done on the MODEL statement. Under PROC HPLOGISTIC, Fisher scoring disappears entirely, and the default optimization technique is Newton-Raphson with ridging, or NRRIDG. The table lists all the options.
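
A minimal sketch of the two specifications side by side. The data set work.train, the binary target bad, and the predictors are hypothetical, and I am writing the HPLOGISTIC TECHNIQUE= option on the PROC statement from memory, so check the user guide for your release:

"
/* PROC LOGISTIC: technique and ridging are MODEL statement options.     */
proc logistic data=work.train;
   model bad(event='1') = x1 x2 x3 / technique=newton ridging=absolute;
run;

/* PROC HPLOGISTIC: NRRIDG is the default; TECHNIQUE= is assumed here    */
/* to be a PROC statement option (hypothetical data and variable names). */
proc hplogistic data=work.train technique=nrridg;
   model bad(event='1') = x1 x2 x3;
run;
"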

While in practice (and in theory) there is little consensus as to which option fits which data conditions, the HPLOGISTIC user guide provides excellent guidelines:

          "For many optimization problems, computing the gradient takes more computer time than computing the function value. Computing the Hessian sometimes takes much more computer time and memory than computing the gradient, especially when there are many decision variables. Unfortunately, optimization techniques that do not use some kind of Hessian approximation usually require many more iterations than techniques that do use a Hessian matrix, and, as a result the total run time of these techniques is often longer. Techniques that do not use the Hessian also tend to be less reliable. For example, they can terminate more ".

The time taken to compute the gradient, the function value, and the Hessian (where applicable), along with the number of decision variables involved, are among the key factors in the choice:
  1. Second-derivative methods include TRUREG, NEWRAP, and NRRIDG (best for small problems for which the Hessian matrix is not expensive to compute. This does not necessarily mean that calculating the Hessian matrix, for small problems or not, is cheap; 'small problems' still vary a lot.)
  2. If you want to replicate an old model where Fisher scoring was used, you can use NRRIDG. Where your target is binary, you may get identical results; otherwise the results may be slightly different (mainly the estimated coefficients).
  3. First-derivative methods include QUANEW and DBLDOG (best for medium-sized problems for which the objective function and the gradient can be evaluated much faster than the Hessian). In general, the QUANEW and DBLDOG algorithms require more iterations than the second-derivative methods above, but each iteration can be much faster. The QUANEW and DBLDOG algorithms require only the gradient to update an approximate Hessian, and they require slightly less memory than TRUREG or NEWRAP.
  4. Because CONGRA requires only a factor of p double-word memory (where p is the number of parameters), many large applications can be solved only by CONGRA. However, I personally feel the computational beauty of CONGRA may be a bit overstated.
  5. All these insights and guidelines are of course to be vetted and reckoned against other key aspects such as selection criteria (selection in HPLOGISTIC, by the way, has become a separate SELECTION statement under the procedure, unlike PROC LOGISTIC where SELECTION= is a MODEL statement option; see the sketch after this list).
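
A minimal sketch of that syntactic difference, reusing the same hypothetical data set and variables, and writing the SELECTION statement from memory:

"
/* PROC LOGISTIC: selection is a MODEL statement option.                  */
proc logistic data=work.train;
   model bad(event='1') = x1 x2 x3 x4 / selection=stepwise;
run;

/* PROC HPLOGISTIC: selection is its own statement (syntax as I recall).  */
proc hplogistic data=work.train;
   model bad(event='1') = x1 x2 x3 x4;
   selection method=stepwise;
run;
"
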
While SAS remains the strongest statistical powerhouse, the big data orientation embedded in its HPAS release, hopefully made evident by this writing, demonstrates its leading position among commercial machine learning solutions for tackling big data; SAS is very computational today. New HP procedures like HPLOGISTIC require the modeler to be very sensitive to, and conscious of, the data conditions, complexities, and residuals in the model universe at hand. The ultimate value of SAS HPAS, like many other SAS solutions and tools, lies in its productivity implication: you don't build anything from ground zero, and you don't even write a line of code.

My next writing on SAS logistic regression will cover selection criteria.

Saturday, February 2, 2013

Turning Score into Probabilistic Grouping: The Fraction Option In Proc Rank

The SAS code below turns a raw score into probability-based groups.

"
%let rankme = crscore;

/* FRACTION converts each rank to a fraction of the total count (0 to 1); */
/* TIES=MEAN assigns tied scores the mean of their ranks.                 */
proc rank data=indsn(keep=&rankme.) fraction ties=mean out=outdsn;
  var &rankme.;
  ranks &rankme._ranked;
run;

/* Quick check of the original and the ranked variable */
proc means data=outdsn n nmiss min mean median max range std;
run;
"

Variable         N       N Miss   Minimum       Mean        Median     Maximum   Range      Std Dev
CrScore          39779   0        365           493.69683   495        610       245        28.819612
crscore_ranked   39779   0        2.5139E-05    0.5000126   0.506637   1         0.999975   0.288662

The FRACTION option is a parallel to the GROUPS option, which is the most often and longest used. FRACTION allows for probability-based grouping: it normalizes the distribution and caps it between 0 and 1. One variation of FRACTION is NPLUS1, which yields similar results (see the sketch below).
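
For comparison, here is a small sketch of the three flavors side by side, using the same input data set and score variable as the code above. FRACTION divides the rank by n, NPLUS1 divides it by n+1, and GROUPS= buckets the scores into a fixed number of groups instead:

"
/* FRACTION: rank / n                                                   */
proc rank data=indsn fraction ties=mean out=r_frac;
  var crscore;  ranks crscore_frac;
run;

/* NPLUS1: rank / (n + 1), so the top value stays strictly below 1      */
proc rank data=indsn nplus1 ties=mean out=r_np1;
  var crscore;  ranks crscore_np1;
run;

/* GROUPS=: the more familiar bucketing, here into deciles              */
proc rank data=indsn groups=10 ties=mean out=r_grp;
  var crscore;  ranks crscore_decile;
run;
"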

In this case, the original 39,779 observations are collapsed into 257 groups. The following is a portion of the group distribution:

crscore_ranked    Frequency
0.952097841       162
0.955805827       133
0.959099022       129
0.962191106       117
0.965119787       116
0.967809648        98
0.970310968       101
0.97277458         95
0.974936524        77
0.976947636        83
0.978757636        61
0.980316247        63
0.982101109        79
0.983785414        55
0.98501722         43
0.986110762        44
0.987242012        46
0.988297846        38

Computation-wise, the FRACTION and NPLUS1 options are among the PROC RANK options supported through SAS in-database technology. As of today, February 2nd, 2013, the supported databases include Oracle, Teradata, Netezza, and DB2. The probabilistic grouping can be executed inside the supported database tables without having to query and move big data into the SAS environment.
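
A rough sketch of what pushing this down to the database can look like. The libref, connection details, and table names are all hypothetical placeholders, and SQLGENERATION=DBMS is my recollection of how in-database processing is requested, so check the in-database processing documentation for your DBMS:

"
/* Hypothetical Teradata libref; fill in your own connection details.   */
libname tdw teradata server="tdserver" user=xxxx password=xxxx database=scores;

/* Ask SAS to generate SQL for in-database processing where supported.  */
options sqlgeneration=dbms;

/* Both input and output live in the database, so the ranking runs      */
/* inside the DBMS rather than after pulling the table into SAS.        */
proc rank data=tdw.crscore_tbl fraction ties=mean out=tdw.crscore_tbl_ranked;
  var crscore;
  ranks crscore_ranked;
run;
"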