Wednesday, March 13, 2013

SAS Random Forest Modeling on Hadoop, Proc HPFOREST

The example is essentially the same as the random forest model I covered in my 11/23/2012 blog post, with some minor adjustments. The key difference is that this instance is implemented on a Cloudera CDH 4.0 Hadoop cluster, whereas the previous one was built on a Greenplum appliance. The SAS High Performance Analytics client runs on a Linux box. The grid computes on the Hadoop cluster: 16 worker nodes plus a head node, with a total of 1.5 TB of RAM.

This is what the client interface looks like. JBoss is not the best, but it works OK.



This random forest model uses 280 interval variables and only 3 categorical variables, against a binary target, on ~1.6 million rows. A snapshot of the SAS log is below.


  1. It takes about 22 minutes to finish a random forest model, with 5 other big jobs running concurrently.
  2. I ran it 5 times. It gives the same result every time, very consistent. The quickest run takes 20 minutes 14 seconds. The longest is just over 26 minutes, so the run time does not vary much. I can reduce it to seconds, but real-time performance is not always necessary.
  3. I changed vars_to_try from 3 to 17. With 283 input variables in total (280 interval plus 3 categorical), 17 is roughly the square root: 17*17=289 is the closest perfect square to 283. The model improves quite a bit in terms of misclassification rate. It costs on average ~5 more minutes.
  4. The data set I have is small, so I ran it on a small Hadoop cluster as a test. For jobs involving bigger data sets, you need to maintain and expand your clusters and grid network.
  5. This mode, to use a term stolen from the large-scale predictive learning community, is an in-memory model. It appears that SAS is getting ready to 'industrialize' random forest models on large-scale data.
  6. I plan to publish some practice notes on how to prepare data for random forest modeling. Many have the mindset of building random forest models like pushing iPhone buttons, to avoid the typically lengthy exploratory data analysis that goes into building, say, a logistic regression model. GOOD random forest models, however, require data preparation and tuning, just like GOOD logistic regression. The difference, in terms of dollars and sense, can be heaven and earth in some cases.
  7. SAS HPA already supports Apache Hadoop. Will SAS run on MapReduce? We will see.
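
The choice of 17 in item 3 can be checked with a quick calculation. The sketch below is in Python rather than SAS, and the function name `sqrt_vars_to_try` is made up for illustration; it is not a SAS option or an HPFOREST default. It simply assumes the common square-root rule of thumb for the number of variables tried per split, which is what the 17*17=289 reasoning appears to follow.

```python
import math


def sqrt_vars_to_try(n_inputs: int) -> int:
    """Return the integer closest to the square root of the input count.

    Hypothetical helper: a rule-of-thumb calculation, not part of SAS.
    """
    return round(math.sqrt(n_inputs))


# 280 interval inputs + 3 categorical inputs, as in the model above
n_inputs = 280 + 3
k = sqrt_vars_to_try(n_inputs)

# Distance from 283 to the neighbouring perfect squares
gap_below = n_inputs - (k - 1) ** 2
gap_above = k ** 2 - n_inputs

print(n_inputs, k, k * k, gap_below, gap_above)
```

Whether the square root is the best setting for a given data set is an empirical question; here it is only the starting point that the misclassification rate was then checked against.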
Thanks for viewing.

Wednesday, March 6, 2013

SAS PROC HPNEURAL vs. PROC NEURAL: Statement List Side by Side

PROC HPNEURAL is the neural network modeling procedure available in the SAS High Performance Analytics release. PROC NEURAL has been the solid workhorse behind SAS Enterprise Miner's Neural Network node. Many have asked me what the differences are. Sadly, I have not found time to write about them.

Below is a side-by-side list of the statements available in each, as the tip of an overview.

PROC NEURAL, 9.2/9.3    COMMENTS    PROC HPNEURAL 12.2 (MARCH 2013)
ARCHITECTURE                        ARCHITECTURE
CODE                                CODE
CONNECT
CUT
DECISION
DELETE
FREEZE
FREQ
HIDDEN                              HIDDEN
                                    ID
INITIAL
INPUT                               INPUT
NETOPTIONS
NLOPTIONS
PERFORMANCE                         PERFORMANCE
                                    PARTITION
PERTURB
PRELIM
RANOPTIONS
SAVE
SCORE                               SCORE
SET
SHOW
TARGET                              TARGET
THAW
TRAIN                               TRAIN
USE
                                    WEIGHT
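
To summarize the list above, here is a small Python sketch (not SAS code) that counts the overlap. The statement names are transcribed from the table, using NLOPTIONS and THAW as the PROC NEURAL spellings; the variable names are my own for illustration.

```python
# Statement names transcribed from the side-by-side list above
proc_neural = {
    "ARCHITECTURE", "CODE", "CONNECT", "CUT", "DECISION", "DELETE",
    "FREEZE", "FREQ", "HIDDEN", "INITIAL", "INPUT", "NETOPTIONS",
    "NLOPTIONS", "PERFORMANCE", "PERTURB", "PRELIM", "RANOPTIONS",
    "SAVE", "SCORE", "SET", "SHOW", "TARGET", "THAW", "TRAIN", "USE",
}
proc_hpneural = {
    "ARCHITECTURE", "CODE", "HIDDEN", "ID", "INPUT", "PERFORMANCE",
    "PARTITION", "SCORE", "TARGET", "TRAIN", "WEIGHT",
}

# Set operations give the three groups in the table
shared = sorted(proc_neural & proc_hpneural)
only_neural = sorted(proc_neural - proc_hpneural)
only_hpneural = sorted(proc_hpneural - proc_neural)

print("In both:", ", ".join(shared))
print("PROC NEURAL only:", ", ".join(only_neural))
print("PROC HPNEURAL only:", ", ".join(only_hpneural))
```

By this count the two procedures share 8 statements, PROC NEURAL has 17 that HPNEURAL lacks, and HPNEURAL adds 3 of its own (ID, PARTITION, WEIGHT).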