Saturday, February 2, 2013

Turning Score into Probabilistic Grouping: The Fraction Option In Proc Rank

The SAS code below turns a raw score into probability-based groups:

"
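/* Variable to rank */
%let rankme = crscore;

/* FRACTION divides each rank by the number of nonmissing values; */
/* TIES=MEAN gives tied scores the mean of their ranks            */
proc rank data=indsn(keep=&rankme.) fraction ties=mean out=outdsn;
  var &rankme.;
  ranks &rankme._ranked;
run;

/* Summary statistics for the raw score and its fractional rank */
proc means data=outdsn n nmiss min mean median max range std;
run;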
"

Variable            N  N Miss     Minimum       Mean    Median  Maximum     Range    Std Dev
CrScore         39779       0         365  493.69683       495      610       245  28.819612
crscore_ranked  39779       0  2.5139E-05  0.5000126  0.506637        1  0.999975   0.288662

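The Fraction option is an alternative to the Groups option, which is the most widely used and longest established way of grouping in Proc Rank. Fraction divides each rank by the number of nonmissing observations, so the result is a probability-like value greater than 0 and at most 1. In the output above, for example, the minimum of 2.5139E-05 is 1/39,779. Observations with tied scores receive the same value, which is what produces the grouping. One variation of the Fraction option is NPLUS1, which divides by n + 1 instead of n and yields similar results.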
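To make the difference between Fraction and NPLUS1 concrete, here is a minimal sketch on a four-row toy data set. The data set names (toy, toy_f, toy_both), the variable names and the score values are made up for illustration and are not from the data used in this post.

```sas
/* Toy data: four scores, two of them tied (hypothetical values) */
data toy;
  input score @@;
  datalines;
10 20 20 30
;
run;

/* FRACTION with TIES=MEAN: mean rank divided by n (here n = 4) */
/* mean ranks 1, 2.5, 2.5, 4  ->  0.25, 0.625, 0.625, 1         */
proc rank data=toy fraction ties=mean out=toy_f;
  var score;
  ranks score_f;
run;

/* NPLUS1 with TIES=MEAN: same mean ranks divided by n + 1 (here 5) */
/* mean ranks 1, 2.5, 2.5, 4  ->  0.2, 0.5, 0.5, 0.8                */
proc rank data=toy_f nplus1 ties=mean out=toy_both;
  var score;
  ranks score_n1;
run;

proc print data=toy_both noobs;
run;
```

The two tied scores end up in one group under either option. The practical difference is that Fraction can reach exactly 1 for the highest score, while NPLUS1 always stays strictly below 1.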

In this case, the original 39,779 observations collapse into 257 groups, one for each distinct score value. The following is a portion of the group distribution:

crscore_ranked  Frequency
0.952097841           162
0.955805827           133
0.959099022           129
0.962191106           117
0.965119787           116
0.967809648            98
0.970310968           101
0.97277458             95
0.974936524            77
0.976947636            83
0.978757636            61
0.980316247            63
0.982101109            79
0.983785414            55
0.98501722             43
0.986110762            44
0.987242012            46
0.988297846            38
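The post does not show how this distribution was produced. One plausible way, assuming the Proc Rank output data set is named outdsn and the ranked variable is crscore_ranked, is a one-way Proc Freq:

```sas
/* NLEVELS reports the number of distinct values of each table variable, */
/* which gives the number of groups without counting rows by hand        */
proc freq data=outdsn nlevels;
  /* One row per distinct fractional rank, with its frequency only */
  tables crscore_ranked / nocum nopercent;
run;
```

Because TIES=MEAN assigns tied scores the mean of their ranks, the gap between two adjacent values in this table equals the average of their two frequencies divided by the number of observations. For the first two rows, (162 + 133) / 2 / 39,779 is about 0.003708, which matches 0.955805827 - 0.952097841.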

Computationally, the Fraction and NPLUS1 options are among the Proc Rank options supported by SAS in-database technology. As of today, February 2nd, 2013, the supported databases include Oracle, Teradata, Netezza and DB2. The probabilistic grouping can therefore be executed inside tables in a supported database, without having to query and move big data into the SAS environment.
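As a rough sketch of what an in-database run might look like, the code below points a libref at a Teradata database and asks SAS to generate SQL for procedures that support it. The libref (tdlib), server, credentials, database and table names are all placeholders, and whether a given step actually runs inside the database depends on the SAS release, the licensed SAS/ACCESS engine and the options used, so check the log rather than assume it.

```sas
/* Hypothetical connection details; replace with your own */
libname tdlib teradata server="myserver" user=myuser password="XXXX" database=mydb;

/* Let in-database-enabled procedures generate SQL for the DBMS */
options sqlgeneration=dbms;

/* Optional: write the SQL passed to the database to the SAS log */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* Same ranking step as before, now reading and writing DBMS tables */
/* (tdlib.scores and tdlib.scores_ranked are assumed table names)   */
proc rank data=tdlib.scores(keep=crscore) fraction ties=mean
          out=tdlib.scores_ranked;
  var crscore;
  ranks crscore_ranked;
run;
```

If the trace output in the log shows the ranking SQL being sent to the database, the step ran there. If it shows only a plain SELECT that fetches rows, the data was pulled back into SAS for processing.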
