Friday, November 23, 2012

Random Forest Modeling in SAS, Proc HPFOREST, Example 1

In August 2012, SAS Institute shipped Release 12.1. One major modeling facility added to its machine learning and data science portfolio is the random forest. In SAS High-Performance Analytics Server 12.1, proc HPFOREST does the job on the procedure side; in SAS Enterprise Miner, the HP Forest node is where random forests can be built.

This post focuses on proc HPFOREST. It omits computational details (which may be covered later) and concentrates on procedure syntax, to show how the number of variables considered at each split affects the misclassification rate of a random forest (RF) model.

Here is the SAS code (if any data elements look like yours, that is pure coincidence):

"
libname localx "d:\sas\";
%include "d:\sas\intervar.txt";   /*interval +categorical vars*/
%include "d:\sas\select80.txt";  /*pick top 80 variables. See proc REDUCE below*/
%let indsn           =localx.dlq_modeldev;
%let dropx          =suppress_;
%let targetx        =d_chargoff;
proc hpreduce data=&indsn(drop=&dropx. ) technique=DiscriminantAnalysis ;
    /*Other techniques are supported*/
    class &targetx. &catex.; /*like proc glmselect*/
    reduce supervised &targetx. = &varx. &catex./maxeffects=80 ;
    /*Can do both supervised or unsupervised. Full control over how many variables to pick*/
    performance host="&GRIDHOST" install="&GRIDINSTALLLOC" details;
    /*Can run on large-scale commodity hardware based parallel system. Now supports HDFS, Teradata and Greenplum. You can have as many worker nodes on the grid as you like*/
run;
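
/*A hedged sketch of how select80.txt might be produced: capture HPREDUCE's
  selection results via ODS and write the chosen variables out as a %let list.
  The ODS table name SelectionSummary and its Variable column are assumptions;
  run with "ods trace on" first to confirm the names in your release.*/
/*  ods output SelectionSummary=work.selected;   <-- place inside the step above */
data _null_;
   set work.selected end=last;
   file "d:\sas\select80.txt";
   if _n_ = 1 then put '%let selectx =';
   put Variable;               /*one selected variable name per line*/
   if last then put ';';       /*close the %let statement*/
run;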
%global vars;
%macro hpforest(Vars=);
proc hpforest data=&indsn maxtrees=200 vars_to_try =&Vars. trainfraction=0.6;
   /*vars_to_try: among the 80 variables, how many you want to randomly select for spliting trees*/
  /*performance host="&GRIDHOST" install="&GRIDINSTALLLOC";*/
                      /*You can run it on local PC*/
  target &targetx./level=binary;
  input &selectx./level=interval;   /*There is binning option you can turn on */
  input s_loanstatus s_LoanType s_ratecode s_loan_vintage /level=nominal;
      ods output FitStatistics = fitstats_vars&Vars.(rename=(Miscoob=VarsToTry&Vars.));
run;
%mend;  /*This is the tip of the iceberg; there are many more options to customize the RF, and you can also score it (see the sketch below)*/
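
/*Scoring sketch, following the comment above: HPFOREST can save the trained
  forest to a binary file with a SAVE statement, and PROC HP4SCORE can then
  score new data against that file. The file path and the holdout/scored data
  set names are made up for illustration; verify that both statements are
  available in your release.*/
proc hpforest data=&indsn maxtrees=200 vars_to_try=17 trainfraction=0.6;
   target &targetx. / level=binary;
   input &selectx. / level=interval;
   input s_loanstatus s_LoanType s_ratecode s_loan_vintage / level=nominal;
   save file="d:\sas\rf_model.bin";   /*persist the forest for scoring*/
run;

proc hp4score data=localx.dlq_holdout;   /*hypothetical holdout set*/
   score file="d:\sas\rf_model.bin" out=localx.dlq_scored;
run;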

/*Try 13-24 variables as split candidates*/
%hpforest(vars=13); %hpforest(vars=14); %hpforest(vars=15); %hpforest(vars=16);
%hpforest(vars=17); %hpforest(vars=18); %hpforest(vars=19); %hpforest(vars=20);
%hpforest(vars=21); %hpforest(vars=22); %hpforest(vars=23); %hpforest(vars=24);
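
/*The twelve calls above can also be generated with a small macro loop;
  a sketch (the 13-24 bounds simply mirror the grid used here):*/
%macro try_range(from=13, to=24);
   %local i;
   %do i = &from. %to &to.;
      %hpforest(vars=&i.);
   %end;
%mend;
/* %try_range(); */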

/*Some quick-and-dirty steps to plot misclassification rates. All performance is measured on out-of-bag (OOB) data*/
data fitstats;
   merge fitstats_vars13 fitstats_vars14 fitstats_vars15 fitstats_vars16 fitstats_vars17 fitstats_vars18
           fitstats_vars19 fitstats_vars20 fitstats_vars21 fitstats_vars22 fitstats_vars23 fitstats_vars24 ;
   rename Ntrees=Trees;
   label VarsToTry13 = "Vars=13"; label VarsToTry14 = "Vars=14"; label VarsToTry15 = "Vars=15";
   label VarsToTry16 = "Vars=16"; label VarsToTry17 = "Vars=17"; label VarsToTry18 = "Vars=18";
   label VarsToTry19 = "Vars=19"; label VarsToTry20 = "Vars=20"; label VarsToTry21 = "Vars=21";
   label VarsToTry22 = "Vars=22"; label VarsToTry23 = "Vars=23"; label VarsToTry24 = "Vars=24";
run;

proc sgplot data=fitstats;
   title "Misclassification Rate for Various VarsToTry Values";
   series x=Trees y=VarsToTry13 / lineattrs=(Pattern=ShortDash Thickness=5 color=purple);
   series x=Trees y=VarsToTry14 / lineattrs=(Pattern=ShortDash Thickness=5 color=green);
   series x=Trees y=VarsToTry15 / lineattrs=(Pattern=ShortDash Thickness=5 color=black);
   series x=Trees y=VarsToTry16 / lineattrs=(Pattern=MediumDashDotDot Thickness=4 color=yellow);
   series x=Trees y=VarsToTry17 / lineattrs=(Pattern=MediumDashDotDot Thickness=4 color=brown);
   series x=Trees y=VarsToTry18 / lineattrs=(Pattern=MediumDashDotDot Thickness=4 color=blue);
   series x=Trees y=VarsToTry19 / lineattrs=(Pattern=LongDash Thickness=3 color=red);
   series x=Trees y=VarsToTry20 / lineattrs=(Pattern=MediumDashDotDot Thickness=3 color=pink);
   series x=Trees y=VarsToTry21 / lineattrs=(Pattern=MediumDashDotDot Thickness=2 color=grey);
   series x=Trees y=VarsToTry22 / lineattrs=(Pattern=MediumDashDotDot Thickness=2 color=red);
   series x=Trees y=VarsToTry23 / lineattrs=(Pattern=MediumDashDotDot Thickness=2 color=orange);
   series x=Trees y=VarsToTry24 / lineattrs=(Pattern=LongDash Thickness=3 color=green);
yaxis label='OOB Misclassification Rate';
run;
title;    /*I am not trying to write beautiful code here*/
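
/*A numeric companion to the plot (a sketch): rank the runs by their
  final-tree OOB rate. Assumes every run reaches 200 trees; COL1 is
  PROC TRANSPOSE's default name for the transposed value.*/
proc transpose data=fitstats(where=(Trees=200)) out=final_oob name=Setting;
   var VarsToTry13-VarsToTry24;
run;
proc sort data=final_oob;
   by col1;                     /*smallest OOB misclassification rate first*/
run;
proc print data=final_oob noobs label;
   label col1 = 'OOB Misclassification Rate at 200 Trees';
run;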

"
Attached is the misclassification rate plot. The rule of thumb for the starting value is sqrt(80) ~ 9. But using 9 (I ran it; not shown here), ceteris paribus, ends up with an OOB misclassification rate above 0.01, worse than most of the runs shown here.
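
For reference, that rule-of-thumb baseline is just another call to the same macro; the value 9 below is sqrt(80) rounded up:

%hpforest(vars=9);   /*sqrt(80) ~ 8.9; the baseline run mentioned above*/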