Sunday, February 17, 2013

TOBIT and Limited Dependent Variable Modeling on Big Data: Making the Case for SAS HPQLIM

Tobit models are regression models in which the range of the target variable is constrained or
limited in some way. For example, a testing kit used to measure lead levels in households may not
register levels below 10 parts per billion; the target takes on the value 0 for measures of 0 to 9 parts per billion.

Tobit models apply where the response variable has bounds and takes on limiting values for
a large percentage of the respondents. Tobit models take into account the (over)
concentration of the observations at the limiting values when estimating and testing hypotheses
about the relationship between the target and the explanatory variables.
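To make the role of the limiting values concrete, here is a minimal sketch of the log-likelihood for the standard (Type 1) Tobit model with left censoring. It assumes a latent response y* = x'beta + e with e ~ N(0, sigma^2) and an observed response y = max(y*, lower). The function and variable names are my own illustration, not anything taken from Proc QLIM or HPQLIM, and the data in the usage lines are made up.

```python
import math


def normal_logpdf(z):
    """Log density of the standard normal distribution at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def normal_cdf(z):
    """Cumulative distribution function of the standard normal at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_loglik(beta, sigma, X, y, lower=0.0):
    """Log-likelihood of a Type 1 Tobit model left-censored at `lower`.

    Assumed model (illustrative): y* = x'beta + e, e ~ N(0, sigma^2),
    and the observed response is y = max(y*, lower).
    """
    total = 0.0
    for row, yi in zip(X, y):
        xb = sum(b * x for b, x in zip(beta, row))
        if yi <= lower:
            # Observation sits at the limit: it contributes the probability
            # that the latent value falls at or below the limit.
            # (A production version would guard against log(0) here.)
            total += math.log(normal_cdf((lower - xb) / sigma))
        else:
            # Uncensored observation: ordinary normal density of the residual.
            total += normal_logpdf((yi - xb) / sigma) - math.log(sigma)
    return total


# Hypothetical toy data: an intercept column plus one explanatory variable.
X = [[1.0, 0.5], [1.0, 2.0], [1.0, 3.5]]
y = [0.0, 1.2, 3.1]
print(tobit_loglik([0.1, 0.8], 1.0, X, y))
```

The censored branch is what separates Tobit from ordinary least squares: observations piled up at the limit contribute a probability mass rather than a density, so they are neither dropped nor treated as if 0 were the true value.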

In the SAS product family, limited dependent variable models can be fit with PROCs QLIM, NLMIXED, or PHREG. Under some designs, other variants of survival models can be formulated as well. This post focuses on the newly released high-performance Proc HPQLIM from SAS Institute, and on how HPQLIM, compared with the traditional Proc QLIM in SAS/ETS, can support estimation involving many more variables and complete the modeling many times faster.

Two computing environments are used for this comparison. The first is a Windows Server 2008 R2 machine with Intel Xeon X7560 processors (2.27 GHz, 2.26 GHz; 2 processors), 16 GB RAM, and a 64-bit OS. Let us call it OC (old client). The second is a GreenPlum appliance, with 1 node managing a grid of 32 worker nodes, 96 GB per node. Let us call it GA.

The data are from a training course: 19,372 observations and 27 variables (10 used as explanatory variables). To compare, I run one model on this data set as it is (call it model 0). I then stack the data set 400 times to get ~7.7 million observations, still with the same 27 variables, and run three more models: Proc QLIM on OC (model 1), Proc HPQLIM on OC (model 2), and Proc HPQLIM on GA (model 3). Models 1, 2, and 3 yield identical, meaningless models. The computational performance of HPQLIM will hopefully interest you, if not impress you. By the standards of database marketing models, 7.7 million observations is still considered small. As far as I know, for QLIM models this size is exceptionally large. While there are other reasons why few variables are used in limited dependent variable models, I am sure the computational constraint in Proc QLIM is a major one.

As of this release, HPQLIM does not support the CLASS statement. I therefore comment out the CLASS statement in all four models.

The target distribution is shown below, so you can see why Proc QLIM may be suitable.

The log below shows process details for model 0, using 19K observations. Real time: 4.60 seconds.

The next log is from running model 1 on 7.7 million observations: 38 minutes and 7 seconds real time.

When HPQLIM is engaged (still running on OC, the old client), performance improves dramatically (see below). This is because SAS High-Performance PROCs by default take advantage of symmetric multiprocessing (SMP) mode if SMP is enabled on the computer, which is the case with OC.

Then I ran HPDS2 to copy the big data set to GA (25.92 seconds) and repeated the model using HPQLIM. The model is identical. The real time is 1 minute 24.20 seconds, with 21.21 seconds of CPU time. (I ran it in 29 seconds before, but tonight GA appears a bit busy.)
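For readers who want the numbers side by side, the short sketch below just does the arithmetic on the figures quoted in this post: the stacked row count and the wall-clock ratio between model 1 and model 3. The variable names are mine. Keep in mind that the two runs are on different hardware and the ratio leaves out the 25.92-second copy step, so treat it as a rough end-to-end comparison rather than a like-for-like benchmark.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds pair into total seconds."""
    return 60.0 * minutes + seconds


# Row count after stacking the 19,372-row training data set 400 times.
rows_stacked = 19_372 * 400

# Real (wall-clock) times quoted in the post.
model1_oc_qlim = to_seconds(38, 7)        # Proc QLIM on OC
model3_ga_hpqlim = to_seconds(1, 24.20)   # Proc HPQLIM on GA, tonight's run
model3_ga_earlier = 29.0                  # Proc HPQLIM on GA, earlier run

speedup_tonight = model1_oc_qlim / model3_ga_hpqlim
speedup_earlier = model1_oc_qlim / model3_ga_earlier

print(f"stacked rows: {rows_stacked:,}")                      # stacked rows: 7,748,800
print(f"model 1 real time: {model1_oc_qlim:.0f} s")           # model 1 real time: 2287 s
print(f"speedup vs tonight's GA run: {speedup_tonight:.0f}x")  # speedup vs tonight's GA run: 27x
print(f"speedup vs earlier GA run: {speedup_earlier:.0f}x")    # speedup vs earlier GA run: 79x
```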

I don't necessarily believe one should use a table as tall as 7.7 million rows for QLIM. I think the better potential lies in widening the table by including more input variables for consideration. Some of my friends have told me that, under Proc QLIM, they cannot realistically run sample selection models and stochastic frontier models if the data set is relatively wide. Hopefully HPQLIM changes that.


  1. Jason, glad to know SAS is rolling out more High Performance analytic procedures. But the Type 1 Tobit model (the one you used) can be estimated efficiently using PROC LIFEREG, 60 times more efficiently. I think you should use Heckman's selection model or a Type 4 or Type 5 Tobit model to demonstrate the full power of HPQLIM.

  2. Thank you for the comment.

    As of Feb 2013, SAS has about 25 HP PROCs. Some are brand new, like HPFOREST. Some come from STAT, ETS, and Enterprise Miner. There is one from OR as well, HPLSO, local search optimization (this one is actually my favorite, since my major was OR). They are all solid, production grade. Many are not as rich as their counterparts in STAT and ETS; they are more computation oriented. HPQLIM is one of these. I have a 104-page SAS training course document on TOBIT that shows exactly what you suggested, using Proc QLIM. Some capabilities covered there are yet to be added to HPQLIM.

    My plan for this 'first round' is to cover as many HP PROCs as possible. Due to the diverse nature of the HP PROCs, this first round is going to be a 'coarse and computation' round. Hopefully, in the second round, I can offer finer and more 'exquisite' examples.

    One variable is the file system. I have received emails asking if I can show examples on Hadoop. Currently I am working on a random forest on Apache Hadoop using SAS HPFOREST, with 22 million records and 150 variables (later in the year MapR is a possibility). Another is HPNEURAL. Next is SVM in Enterprise Miner, and after that text mining... Roughly 4 posts a month. Some will be on a Teradata appliance as well.