This post focuses on PROC HPFOREST. It sets aside computational complexity (which may be covered later) and concentrates on procedure syntax, to show how the number of variables considered at each split affects the misclassification rate of a random forest (RF) model.
Here is the SAS code (if the data elements look like yours, that is pure coincidence):
"
libname localx "d:\sas\";
%include "d:\sas\intervar.txt"; /*interval +categorical vars*/
%include "d:\sas\select80.txt"; /*pick top 80 variables. See proc HPREDUCE below*/
%let indsn =localx.dlq_modeldev;
%let dropx =suppress_;
%let targetx =d_chargoff;
proc hpreduce data=&indsn(drop=&dropx. ) technique=DiscriminantAnalysis ;
/*Other techniques are supported*/
class &targetx. &catex.; /*like proc glmselect*/
reduce supervised &targetx. = &varx. &catex./maxeffects=80 ;
/*Can run either supervised or unsupervised. Full control over how many variables to pick*/
performance host="&GRIDHOST" install="&GRIDINSTALLLOC" details;
/*Can run on a large-scale parallel system built on commodity hardware. Now supports HDFS, Teradata and Greenplum. You can have as many worker nodes on the grid as you like*/
run;
%global vars;
%macro hpforest(Vars=);
proc hpforest data=&indsn maxtrees=200 vars_to_try =&Vars. trainfraction=0.6;
/*vars_to_try: among the 80 variables, how many to randomly select as candidates when splitting a node*/
/*performance host="&GRIDHOST" install="&GRIDINSTALLLOC";*/
/*You can run it on local PC*/
target &targetx./level=binary;
input &selectx./level=interval; /*There is a binning option you can turn on*/
input s_loanstatus s_LoanType s_ratecode s_loan_vintage /level=nominal;
ods output FitStatistics = fitstats_vars&Vars.(rename=(Miscoob=VarsToTry&Vars.));
run;
%mend; /*This is tip of the iceberg. A lot more options to customize RF. You can also score it*/
/*Pick 13-24 variables to split trees*/
%hpforest(vars=13); %hpforest(vars=14); %hpforest(vars=15); %hpforest(vars=16);
%hpforest(vars=17); %hpforest(vars=18); %hpforest(vars=19); %hpforest(vars=20);
%hpforest(vars=21); %hpforest(vars=22); %hpforest(vars=23); %hpforest(vars=24);
/*Some junk steps to plot misclassification rates. All performance uses OOB data*/
data fitstats;
merge fitstats_vars13 fitstats_vars14 fitstats_vars15 fitstats_vars16 fitstats_vars17 fitstats_vars18
fitstats_vars19 fitstats_vars20 fitstats_vars21 fitstats_vars22 fitstats_vars23 fitstats_vars24 ;
rename Ntrees=Trees;
label VarsToTry13 = "Vars=13"; label VarsToTry14 = "Vars=14"; label VarsToTry15 = "Vars=15";
label VarsToTry16 = "Vars=16"; label VarsToTry17 = "Vars=17"; label VarsToTry18 = "Vars=18";
label VarsToTry19 = "Vars=19"; label VarsToTry20 = "Vars=20"; label VarsToTry24 = "Vars=24";
label VarsToTry23 = "Vars=23"; label VarsToTry22 = "Vars=22"; label VarsToTry21= "Vars=21";
run;
proc sgplot data=fitstats;
title "Misclassification Rate for Various VarsToTry Values";
series x=Trees y=VarsToTry13/lineattrs=(Pattern=ShortDash Thickness=5 color=purple);
series x=Trees y=VarsToTry14/lineattrs=(Pattern=ShortDash Thickness=5 color=green);
series x=Trees y=VarsToTry15/lineattrs=(Pattern=ShortDash Thickness=5 color=black);
series x=Trees y=VarsToTry16/lineattrs=(Pattern=MediumDashDotDot Thickness=4 color=yellow);
series x=Trees y=VarsToTry17/lineattrs=(Pattern=MediumDashDotDot Thickness=4 color=brown);
series x=Trees y=VarsToTry18/lineattrs=(Pattern=MediumDashDotDot Thickness=4 color=blue);
series x=Trees y=VarsToTry19/lineattrs=(Pattern=LongDash Thickness=3 color=red);
series x=Trees y=VarsToTry20/lineattrs=(Pattern=MediumDashDotDot Thickness=3 color=pink);
series x=Trees y=VarsToTry24/lineattrs=(Pattern=LongDash Thickness=3 color=green);
series x=Trees y=VarsToTry23/lineattrs=(Pattern=MediumDashDotDot Thickness=2 color=orange);
series x=Trees y=VarsToTry22/lineattrs=(Pattern=MediumDashDotDot Thickness=2 color=red);
series x=Trees y=VarsToTry21/lineattrs=(Pattern=MediumDashDotDot Thickness=2 color=grey);
yaxis label='OOB Misclassification Rate';
run;
title; /*I am not trying to write beautiful code here*/
"
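The twelve macro calls and the hand-written MERGE and LABEL lists above could also be generated with %DO loops. This is only a minimal sketch: it assumes the %hpforest macro and the fitstats_vars&Vars. / VarsToTry&Vars. naming from the code above, and the macro names %run_grid and %merge_grid and their from=/to= parameters are hypothetical, not part of the original program.

```sas
/* Hypothetical helper: call %hpforest once per vars_to_try value */
%macro run_grid(from=13, to=24);
  %local v;
  %do v = &from. %to &to.;
    %hpforest(Vars=&v.);
  %end;
%mend run_grid;

/* Hypothetical helper: merge the per-run fit statistics side by side.
   Like the original step, this is a one-to-one merge with no BY statement,
   so it relies on every run producing one row per tree in the same order. */
%macro merge_grid(from=13, to=24);
  %local v;
  data fitstats;
    merge
    %do v = &from. %to &to.;
      fitstats_vars&v.
    %end;
    ;
    rename Ntrees=Trees;
    %do v = &from. %to &to.;
      label VarsToTry&v. = "Vars=&v.";
    %end;
  run;
%mend merge_grid;

%run_grid(from=13, to=24);
%merge_grid(from=13, to=24);
```

Changing the range then only means changing the from= and to= arguments; the SERIES statements in PROC SGPLOT would still need to match the values that were run.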
Attached is the misclassification rate plot. The rule-of-thumb starting value for the number of variables to try is SQRT(80) ≈ 9. But using 9 (I ran it; not shown here), ceteris paribus, gives a misclassification rate (MR) > 0.01, worse than most of the values shown here.
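For reference, the rule-of-thumb starting value can be computed in macro code and passed straight to the macro. This is a rough sketch only: it assumes the %hpforest macro defined above, takes 80 as the number of candidate inputs as stated in the post, and uses the hypothetical macro variables n_inputs and start_vars.

```sas
/* Assumption: 80 candidate inputs, as stated in the post */
%let n_inputs = 80;

/* Square root of the input count, rounded to the nearest integer (9 for 80 inputs) */
%let start_vars = %sysfunc(round(%sysfunc(sqrt(&n_inputs.))));
%put NOTE: rule-of-thumb vars_to_try = &start_vars.;

/* Fit a forest at the rule-of-thumb value for comparison with the 13-24 runs */
%hpforest(Vars=&start_vars.);
```

The resulting fit statistics land in fitstats_vars&start_vars. and can be compared with the other runs on the same OOB misclassification column.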
Hi,
This is superb dude. I loved it.
Can you tell me how to score the test data?