Sunday, December 9, 2012

Stochastic Gradient Boosting Modeling in SAS: A Procedure Example

Some of my friends are pretty experienced modelers. They have asked me whether SAS has stochastic gradient boosting (SGB) capabilities. I told them SAS Enterprise Miner has had it for roughly eight years, but they said using Enterprise Miner makes them appear ...junior level. This post shows one procedure-syntax example of SGB.

Here is the SAS code (if any data elements look like yours, that is pure coincidence):

%macro treeboost2;
proc treeboost data=&indsn.

  /*INMODEL: use data= to build a model, or inmodel= to read in an existing one*/

  /*this idea is similar to coarse binning in credit scorecards*/

  EXHAUSTIVE=100 /*could be bigger*/
  INTERVALDECIMALS=MAX /*trees are decimal sensitive*/
  Iterations=1000 /*maxbranch=, maxdepth=, and several other options also apply*/
  missing=useinsearch /*or distribute*/
  SHRINKAGE=0.1 /*If you are grounded in SGB, you know its role. Default = 0.2. Be gradual, be very gradual, if you want to boost well*/
  SPLITSIZE=20; /*split a node only when it contains at least this many observations. The default value is twice the value specified in LEAFSIZE=*/

  Input &select_s. /*If you don't differentiate, it treats numeric variables as interval and character variables as categorical (class) variables*/
    / MISSING=useinsearch /*distribute, bigbranch*/
    order=descending; /*sorting order of ordinals*/

  Target &targetx. / level=binary;
  save FIT=fit IMPORTANCE=imp MODEL=mdl RULES=rules;
  /*score data=score out=out outfit=outfit prediction*/
  SUBSERIES longest; /*alternatives: BEST, ITERATION=100*/
  /*The SUBSERIES statement specifies how many iterations in the series to use in the model. For a binary or interval target, the number of iterations is the number of trees in the series. For a nominal target with k categories, k > 2, each iteration contains k trees. The SUBSERIES options are mutually exclusive.*/
run;
%mend treeboost2;
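The SHRINKAGE= option above is the learning rate of the boosting series: each new tree's prediction is added only partially, so small values like 0.1 trade more iterations for a smoother, better-generalizing fit. Here is a toy pure-Python sketch of that mechanic (regression stumps on pseudo-residuals under squared loss) — an illustration of the idea, not PROC TREEBOOST's actual implementation; all names and data are made up:

```python
import random

def fit_stump(x, resid):
    """Best single-threshold split minimizing squared error on residuals."""
    best = None
    for t in sorted(set(x))[:-1]:               # exclude max so both sides are nonempty
        left = [r for xi, r in zip(x, resid) if xi <= t]
        right = [r for xi, r in zip(x, resid) if xi > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - (lmean if xi <= t else rmean)) ** 2
                  for xi, r in zip(x, resid))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    return best[1:]

def boost(x, y, n_iter=150, shrinkage=0.1):
    pred = [sum(y) / len(y)] * len(y)           # start from the target mean
    for _ in range(n_iter):
        resid = [yi - pi for yi, pi in zip(y, pred)]   # pseudo-residuals
        t, lmean, rmean = fit_stump(x, resid)
        # add only a fraction of the new stump -- this is the shrinkage step
        pred = [pi + shrinkage * (lmean if xi <= t else rmean)
                for xi, pi in zip(x, pred)]
    return pred

random.seed(1)
x = [random.random() for _ in range(100)]
y = [(1.0 if xi > 0.5 else 0.0) + random.gauss(0, 0.1) for xi in x]
pred = boost(x, y)
mse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
print(f"training MSE: {mse:.3f}")
```

With a smaller shrinkage you would need more iterations to reach the same training fit, which is exactly why the ITERATIONS= and SHRINKAGE= options are tuned together.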


I am not going to visualize and compile the performance results the way Enterprise Miner does. Here are some details from the SAS log:

NOTE: Assuming numeric variables have INTERVAL measurement level, and character variables have NOMINAL.

NOTE: 1671263 kilobytes of physical memory.

NOTE: Will use 168252 out of 168252 training cases.

NOTE: Using memory pool with 410163200 bytes.

NOTE: Passed training data 5000 times.

NOTE: Current TREEBOOST model contains 1000 trees.

NOTE: Training used 15327976 bytes of work memory.

NOTE: The data set WORK.FIT has 1000 observations and 11 variables.

NOTE: The data set WORK.IMP has 31 observations and 4 variables.

NOTE: The data set WORK.MDL has 141996 observations and 4 variables.

NOTE: The data set WORK.RULES has 20805 observations and 7 variables.

NOTE: Current TREEBOOST model contains 1000 trees.

NOTE: There were 168252 observations read from the data set LOCALX.DLQ_ACT_MODELDEV.

WHERE RANUNI(987282)<=0.1;

NOTE: PROCEDURE TREEBOOST used (Total process time):

real time 12:22.92

cpu time 11:55.98

Roughly another third of the statements and options are not shown; this is a quick view. Enjoy. My local PC has 32 GB of RAM, but since this post does not focus on processing speed, I just took a 10% random sample.
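The WHERE RANUNI(987282)<=0.1 clause in the log is how that sample was drawn: a seeded uniform draw per row, keeping each row independently with probability 0.1. The same idea in a short Python sketch (the row count and names here are hypothetical, not from the actual dataset):

```python
import random

random.seed(987282)                  # a fixed seed makes the sample reproducible
n_rows = 100_000                     # hypothetical table size
# keep each row independently with probability 0.1, like RANUNI(seed) <= 0.1
sample = [i for i in range(n_rows) if random.random() <= 0.1]
print(len(sample))                   # roughly, not exactly, 10% of n_rows
```

Note that a Bernoulli sample like this yields approximately 10% of the rows, not an exact count; that is usually fine for a quick modeling pass.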
