Monday, December 31, 2012

Random Forest Modeling in SAS, Several Key Aspects

In August 2012, SAS Institute shipped Release 12.1. One major modeling facility added to its machine learning and data science portfolio is random forest. In SAS High-Performance Analytics Server 12.1, on the procedure side, proc HPFOREST does the job. In SAS Enterprise Miner, the HP Forest node is where random forests can be built.

This post illustrates several key aspects of random forest modeling with proc HPFOREST.

Here is the SAS code (if the data elements look like yours, that is pure coincidence):

%macro hpforest(Vars=);
/* Fit a 200-tree forest; &indsn, &targetx and &input1-&input3 are macro variables set elsewhere */
proc hpforest data=&indsn maxtrees=200 vars_to_try=&Vars. trainfraction=0.6;
  target &targetx./level=binary;
  input &input1/level=interval;
  input &input2/level=nominal;
  input &input3/level=ordinal;
  /* Save the fit statistics and name the OOB misclassification column after the vars_to_try value */
  ods output FitStatistics = fitstats_vars&Vars.(rename=(Miscoob=VarsToTry&Vars.));
run;
%mend hpforest;

%hpforest(Vars=8);

data fitstats;
   set fitstats_vars8;
   rename Ntrees=Trees;
   label VarsToTry8 = "Vars=8";
run;

proc sgplot data=fitstats;
   title "Misclassification Rate for 200 Trees";
   series x=Trees y=VarsToTry8/lineattrs=(Pattern=MediumDashDotDot Thickness=4 Color=brown);
   yaxis label='OOB Misclassification Rate';
run;


Subject 1: Do more trees improve the classification rate? The plot above shows that the OOB misclassification rate starts to flatten out at around 50 trees. After 100 trees, it no longer improves.
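
To check the flattening numerically rather than by eye, one option is to list the OOB misclassification rate at a few tree counts together with its change over the preceding 10 trees. This is only a sketch: it assumes the fitstats data set built above has one row per tree count in ascending order, and the data set name oob_change, the column change10 and the tree counts listed are made up for illustration.

```sas
/* Sketch only: assumes fitstats has one row per tree count, sorted by Trees */
data oob_change;
   set fitstats;
   /* change in OOB misclassification rate versus 10 rows (trees) earlier */
   change10 = VarsToTry8 - lag10(VarsToTry8);
run;

proc print data=oob_change noobs label;
   where Trees in (20, 50, 100, 150, 200);   /* illustrative tree counts */
   var Trees VarsToTry8 change10;
run;
```

If change10 stays close to zero from some tree count onward, adding trees beyond that point is buying little.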

Subject 2: The "Loss Reduction Variable Importance Report" from random forest often does NOT tell the same story about variable importance that you get from other methods.
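
To set that report side by side with importance rankings from other methods, it helps to have it as a data set. A minimal sketch follows, reusing the call from above with vars_to_try fixed at 8. It assumes the ODS table behind the report is named VariableImportance, which is worth confirming in your release by running ods trace on; before the procedure and reading the table names in the log. The output data set name varimp is made up for illustration.

```sas
/* Sketch only: the ODS table name VariableImportance is an assumption to verify with ODS TRACE */
proc hpforest data=&indsn maxtrees=200 vars_to_try=8 trainfraction=0.6;
   target &targetx./level=binary;
   input &input1/level=interval;
   input &input2/level=nominal;
   input &input3/level=ordinal;
   ods output VariableImportance=varimp;   /* varimp is a hypothetical data set name */
run;

/* List the saved importance table so it can be compared with other rankings */
proc print data=varimp noobs;
run;
```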

Subject 3: Random Forest Fit Statistics, the Out-of-Bag tree steps

1 comment:

  1. Hi, it's very useful code.

    How do I score the test dataset?