Wednesday, December 26, 2012

K Nearest Neighbor Modeling Using SAS: Proc PMBR

K nearest neighbor (KNN) modeling essentially says: “if you are very similar to your k nearest entities with respect to a list of variables or dimensions, you are likely to make the same decision (as reflected in the target variable) that those k nearest entities make.”
SAS implements KNN in Enterprise Miner. This post gives a code example of the procedure that runs behind Enterprise Miner’s Memory-Based Reasoning (MBR) node, proc PMBR. It is "memory based" because it simply trains on a training data set and, in the same procedure, lets you score another data set. It does not produce a parametric model or rule set that can be generalized or deployed on its own.
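To make the voting idea concrete outside SAS, here is a minimal Python sketch of plain KNN classification. It assumes Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy data are made up for illustration and say nothing about how PMBR is implemented internally.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Euclidean distance from the query to every training row (an assumption)
    dists = [(math.dist(row, query), label)
             for row, label in zip(train_rows, train_labels)]
    dists.sort(key=lambda pair: pair[0])
    # The k closest rows each cast one vote for their own target value
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Made-up toy data: two numeric inputs and a binary target
train_rows = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
              (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = [0, 0, 0, 1, 1, 1]

near_zero = knn_predict(train_rows, train_labels, (1.1, 1.0), k=3)
near_one = knn_predict(train_rows, train_labels, (5.1, 5.0), k=3)
```

Nothing is "fitted" here: the training rows themselves are the model, which is the sense in which the approach is memory based.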
proc dmdb batch data=&indsn(keep=&targetx. &donerx.) dmdbcat=catCR;
   var &donerx. &targetx.;    /*class ;*/
   target &targetx.;
run;
/*You need to run this procedure to create the SAS catalog for later processing. It runs very fast even on big data sets with many variables. You may save the catalog to a permanent library if you like*/ 
proc pmbr              /*I think the implied distance is Euclidean*/ 
   data=&indsn.(keep=_numeric_ &targetx.)  dmdbcat=catCR
   THREADS=8  /*the default option is THREADS. You can specify NOTHREADS*/
   /*
  • Specify this option if you want the data to decide the number of neighbors. Conceptually it is similar to the Cubic Clustering Criterion (CCC)
  • You can explicitly specify K=. Adding WEIGHTED weights the influence of the nearest neighbors according to their relative distance to the subject being classified
  • A large number of neighbors should be balanced against how similarity graduates among the neighbors. You may consider target profiling during the graduation.
  • If your target is binary, selecting an even number for K may result in more ties*/ 
    EPSILON=0 /*
  • Minimum allowable distance for a scoring observation to a training observation.
  • If you are not sure how big it should be, leave it at 0.00
  • Depending on the collective makeup of the variables listed in the VAR statement, a big EPSILON may leave you with no neighbors near enough to support classification */ 
   Method=rdtree /*
  • This option determines the data representation that is used to store the training data set and determine the nearest neighbors
  • RDTREE is the default
  • Another, more data-intensive option is SCAN */ 
   out=outx outest=oust;
/*This ends the discussion of options at the procedure level. Several others are left out*/ 
   target &targetx.; /*This is a statement, not a procedure option*/
   var &donerx.;
   /*decision cost= costvar= decisiondata= decvar= priorvar=*/
   score data=temp out=&outdsn.;
run;
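
The notes on K= and WEIGHTED in the comments above can be illustrated with a small Python sketch. It is only an illustration: the inverse-distance weighting used here is one common scheme and is an assumption, not necessarily the weighting PMBR applies, and the function name `vote` and the data are made up.

```python
import math
from collections import defaultdict

def vote(train_rows, train_labels, query, k, weighted=False):
    """Return {label: vote total} over the k nearest training rows."""
    nearest = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )[:k]
    totals = defaultdict(float)
    for dist, label in nearest:
        # Assumed weighting: inverse distance. The small constant avoids
        # division by zero when a neighbor coincides with the query.
        totals[label] += 1.0 / (dist + 1e-9) if weighted else 1.0
    return dict(totals)

# Made-up binary-target data laid out along one axis
train_rows = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
train_labels = [0, 0, 1, 1]
query = (1.5, 0.0)

# With an even K the plain vote can split evenly between the two classes
unweighted = vote(train_rows, train_labels, query, k=4)
# Weighting by closeness lets the nearer neighbors break that tie
weighted = vote(train_rows, train_labels, query, k=4, weighted=True)
```

With K=4 the unweighted vote splits two against two, while the weighted vote favors class 0 because the closest neighbor belongs to it.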

General notes

  • As with any distance-based classification, the scale of the variables and their being orthogonal to each other matter to the usefulness of your results. Many practitioners standardize the inputs and run PCA to prepare the data. 'Specialty' treatment may be called for if your inputs are things like purchase baskets, sequences, or preference/subjective data. Using multi-dimensional scaling is not uncommon. Weighting, however, generally should not be overdone
  • You may consider variable clustering for selection. Some argue that the presence of a TARGET should require the usual 'variable selection', as when building a logistic regression model. The target variable here is little more than voting chips. MBR is NOT meant to maximize separation between 0 and 1
  • Training under this procedure typically does not take very long. Scoring does, because that is when the distances between observations in the training set and the scoring data set are calculated. This is reminiscent of how proc discrim works when a non-parametric model is built and scored
  • You should settle issues such as missing values (surrogates, used-in-distance) and sparsity before engaging the two procedures
  • The scored data set has a predicted score (the voting result from the nearest neighbors) for each observation. You can conduct performance analysis from there (actual vs. predicted…)
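
The first note, on scale, is easy to see with a toy example. The Python sketch below is only an illustration under assumptions: the income and age figures are made up, z-score standardization stands in for whatever preparation you actually use, and the helper names `standardize` and `nearest_index` are hypothetical.

```python
import math
import statistics

def standardize(rows):
    """Z-score each column: subtract the column mean, divide by the column std."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    scaled = [tuple((v - m) / s for v, m, s in zip(row, means, stds))
              for row in rows]
    return scaled, means, stds

def nearest_index(rows, query):
    """Index of the row with the smallest Euclidean distance to the query."""
    return min(range(len(rows)), key=lambda i: math.dist(rows[i], query))

# Made-up training rows: (income in dollars, age in years)
rows = [(50000.0, 25.0), (50500.0, 60.0), (58000.0, 26.0), (42000.0, 58.0)]
query = (50400.0, 27.0)

# On the raw scale, income differences swamp age differences
raw_nearest = nearest_index(rows, query)

# Standardize the training rows, then scale the query with the same parameters
scaled_rows, means, stds = standardize(rows)
scaled_query = tuple((v - m) / s for v, m, s in zip(query, means, stds))
scaled_nearest = nearest_index(scaled_rows, scaled_query)
```

On the raw data the nearest neighbor of the 27-year-old is the 60-year-old with a nearly identical income; after standardization it is the 25-year-old, which is closer on both inputs once they are on comparable scales.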
