Here is the SAS code.

**(If the data elements look like yours, that is pure coincidence.)**

%macro treeboost2;

proc treeboost data=&indsn

/*Use DATA= to build a new model, or INMODEL= to read in an existing one*/

CATEGORICALBINS=6

INTERVALBINS=35

/*this idea is similar to coarse binning in a credit scorecard*/

EXHAUSTIVE=100 /*could be bigger*/

INTERVALDECIMALS= MAX /*Trees are decimal sensitive*/

leafsize=2

Iterations=1000 /*maxbranch= maxdepth= + several other options*/

Mincatsize=2

missing=useinsearch /*distribute,*/

seed=989795

SHRINKAGE=0.1 /*If you are grounded in SGB (stochastic gradient boosting), you know its role. Default is 0.2. Be gradual, very gradual, if you want to boost well*/

SPLITSIZE=20 /*split a node only when it contains at least this many observations. The default is twice the value specified in LEAFSIZE= */;

Input &select_s /*Unless you say otherwise, numeric variables are treated as interval and character variables as nominal (class) variables*/

/ INTERVALDECIMALS=5 /*MAXBRANCH=*/

MINCATSIZE=2

MISSING=useinsearch /*distribute bigbranch*/

order=descending /*sorting order of ordinal inputs*/;

Target &targetx. /level=binary;

save FIT=fit IMPORTANCE=imp MODEL=mdl RULES=rules;

/*score data=score out= outfit=outfit prediction*/

SUBSERIES longest; /*Best iteration=100*/

/*The SUBSERIES statement specifies how many iterations in the series to use in the model. For a binary or interval target, the number of iterations is the number of trees in the series. For a nominal target with k categories, k > 2, each iteration contains k trees. The SUBSERIES options are mutually exclusive.*/

run;

quit;

%mend treeboost2;
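The macro takes no parameters and reads three global macro variables, `&indsn`, `&select_s` and `&targetx`, so they have to be set before the call. Here is a minimal sketch of how that might look. The data set name and the 10% `RANUNI` filter are taken from the log further down; the input list and the target name are placeholders I made up for illustration, so swap in your own.

```sas
/* Assumes the LOCALX libref is already assigned. */

/* Training data: 10% random sample, as in the WHERE clause echoed in the log */
%let indsn    = localx.dlq_act_modeldev(where=(ranuni(987282)<=0.1));

/* Hypothetical input variable list (placeholder names) */
%let select_s = x1 x2 x3;

/* Hypothetical binary target (placeholder name) */
%let targetx  = bad_flag;

/* Run the macro defined above */
%treeboost2;
```

Putting the `WHERE=` data set option inside `&indsn` keeps the sampling out of the macro body, so the same macro can be pointed at the full table later by changing one `%let`.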

I am not going to visualize and compile the performance results the way Enterprise Miner does. Here are some details from the SAS log:

"

NOTE: Assuming numeric variables have INTERVAL measurement level, and character variables have NOMINAL.

NOTE: 1671263 kilobytes of physical memory.

NOTE: Will use 168252 out of 168252 training cases.

NOTE: Using memory pool with 410163200 bytes.

NOTE: Passed training data 5000 times.

NOTE: Current TREEBOOST model contains 1000 trees.

NOTE: Training used 15327976 bytes of work memory.

NOTE: The data set WORK.FIT has 1000 observations and 11 variables.

NOTE: The data set WORK.IMP has 31 observations and 4 variables.

NOTE: The data set WORK.MDL has 141996 observations and 4 variables.

NOTE: The data set WORK.RULES has 20805 observations and 7 variables.

NOTE: Current TREEBOOST model contains 1000 trees.

NOTE: There were 168252 observations read from the data set LOCALX.DLQ_ACT_MODELDEV.

WHERE RANUNI(987282)<=0.1;

NOTE: PROCEDURE TREEBOOST used (Total process time):

real time 12:22.92

cpu time 11:55.98"
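The log confirms that the SAVE statement wrote WORK.FIT, WORK.IMP, WORK.MDL and WORK.RULES. A quick way to look at the two small ones, sketched with plain Base SAS procedures only. I am not assuming any column names here, which is why `PROC CONTENTS` comes first.

```sas
/* List the columns of the fit-statistics and importance tables */
proc contents data=work.fit; run;
proc contents data=work.imp; run;

/* IMP has only 31 rows per the log, so print it in full */
proc print data=work.imp; run;

/* FIT has one row per iteration (1000 here), so print just the first 20 */
proc print data=work.fit(obs=20); run;
```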

Roughly another third of the statements and options are not shown. This is a quick view. Enjoy. My local PC has 32 GB of RAM. This post does not focus on processing speed, so I just took a 10% random sample.
