limited in some way. For example, a testing kit used to measure household lead levels may not detect concentrations below 10 parts per billion; the target then takes on the value 0 for true levels of 0 to 9 parts per billion. Tobit models apply where the response variable is bounded and takes on a limiting value for a large percentage of the respondents. Tobit models take into account the (over)concentration of observations at the limiting values when estimating, and testing hypotheses about, the relationship between the target and the explanatory variables.
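For reference, a standard Type I Tobit with a lower limit at zero can be written as follows (a textbook formulation; the notation is mine, not from the course material):

   y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)
   y_i = \max(0,\ y_i^*)

The latent variable y_i^* is the uncensored propensity; only the censored value y_i is observed, which is why ordinary least squares on y_i is biased and a likelihood-based procedure is needed.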
In the SAS product family, limited dependent variable models can be fit with PROC QLIM, PROC NLMIXED, or PROC PHREG; in some designs, variants of survival models can also be formulated. This article focuses on the newly released high-performance PROC HPQLIM from SAS Institute: specifically, on how HPQLIM, compared with the traditional PROC QLIM in SAS/ETS, can support estimation with many more variables and complete the modeling orders of magnitude faster.
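As a minimal sketch of what such a model looks like in PROC QLIM (the data set and variable names here are hypothetical, not the actual course data):

   proc qlim data=work.train;
      model y = x1-x10;                /* 10 explanatory variables   */
      endogenous y ~ censored(lb=0);   /* Tobit: left-censored at 0  */
   run;

The CENSORED(LB=0) option on the ENDOGENOUS statement tells QLIM to fit the Tobit likelihood rather than an ordinary regression.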
Two computing environments are used for this comparison. The first is a Windows Server 2008 R2 machine with Intel Xeon X7560 CPUs at 2.27 GHz (2 processors), 16 GB RAM, and a 64-bit OS; call it OC (old client). The second is a Greenplum appliance with one head node managing a grid of 32 worker nodes, each with 96 GB of RAM; call it GA.
The data are from a training course: 19,372 observations and 27 variables, 10 of which are used as explanatory variables. For comparison, I first run one model on this data set as is (model 0). I then stack the data set 400 times to get roughly 7.7 million observations, still with the same 27 variables, and run three more models: PROC QLIM on OC (model 1), PROC HPQLIM on OC (model 2), and PROC HPQLIM on GA (model 3). Models 1, 2, and 3 yield identical (and substantively meaningless) estimates; the point is the computational performance HPQLIM provides, which I hope will interest you, if not impress you. By database marketing standards, 7.7 million observations is still a small modeling data set; for QLIM models, as far as I know, it is exceptionally large. While there are other reasons not many variables are used in limited dependent variable models, I am sure the computational constraint in PROC QLIM is a major one.
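A simple way to stack the data is a DATA step loop (a sketch; work.train and work.big are hypothetical names for the course data set and its 400-copy version):

   data work.big;
      set work.train;              /* 19,372 obs, 27 variables */
      do copy = 1 to 400;          /* write each row 400 times */
         output;
      end;
      drop copy;
   run;                            /* ~7.7 million observations */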
As of this release, HPQLIM does not support the CLASS statement, so I comment out the CLASS statement in all four models.
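In code, that amounts to something like this (again a sketch with hypothetical variable names):

   proc hpqlim data=work.big;
      /* class region; */             /* not supported by HPQLIM yet */
      model y = x1-x10;
      endogenous y ~ censored(lb=0);
   run;

Any categorical inputs would have to be recoded into dummy variables by hand before modeling.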
The distribution of the target is shown below; you can see why PROC QLIM may be suitable.
The log below shows process details for model 0, using the 19K observations: 4.60 seconds real time.
The next log is from running model 1 on the 7.7 million observations: 38 minutes 7 seconds real time.
When HPQLIM is engaged (still running on OC, the old computer), performance improves dramatically (see below). This is because SAS high-performance PROCs by default take advantage of symmetric multiprocessing (SMP) when it is available on the machine, as it is on OC.
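The PERFORMANCE statement makes the threading explicit and prints a timing breakdown (a sketch; the NTHREADS value is illustrative, and by default the procedure uses all available cores):

   proc hpqlim data=work.big;
      performance nthreads=8 details;  /* SMP threads + timing table */
      model y = x1-x10;
      endogenous y ~ censored(lb=0);
   run;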
I then ran PROC HPDS2 to copy the big data set to the GA appliance (25.92 seconds) and repeated the model with HPQLIM. The model is identical; real time is 1 minute 24.20 seconds, with 21.21 seconds CPU time. (I had replicated it in 29 seconds before, but tonight GA appears a bit busy.)
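The copy-and-run step looks roughly like this (a sketch: gplib is a hypothetical libref pointing at the appliance, declared with a LIBNAME statement for the Greenplum engine, and the PERFORMANCE options follow the usual high-performance PROC conventions):

   proc hpds2 data=work.big out=gplib.big;
      performance nodes=all details;
      data DS2GTF.out;                /* HPDS2's output table alias  */
         method run();
            set DS2GTF.in;            /* pass rows through unchanged */
         end;
      enddata;
   run;

   proc hpqlim data=gplib.big;        /* data already on the grid   */
      performance nodes=all details;  /* run alongside the database */
      model y = x1-x10;
      endogenous y ~ censored(lb=0);
   run;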
I don't necessarily believe one should feed a table as tall as 7.7 million rows to QLIM. I think the better potential lies in widening the table to include more input variables for consideration. Some of my friends have told me they cannot realistically launch sample selection models or stochastic frontier models under PROC QLIM when the data set is relatively wide. Hopefully HPQLIM changes that.