K-nearest-neighbor modeling (KNN) essentially says:
“if you are very similar to your k nearest entities, with respect to a list of
variables or dimensions, it is more likely you will make the same decision
(as reflected in the target variable) that those k nearest entities make.”
SAS has a KNN implementation in its Enterprise Miner.
This blog provides a code example of the procedure that runs behind Enterprise
Miner’s Memory-Based Reasoning (MBR) node,
proc PMBR. It is “memory based” because it simply trains on a training data set
and, in the same procedure, lets you score another data set. It does not produce a parametric model or rule set that generalizes for deployment.
proc dmdb batch data=&indsn.(keep=&targetx. &donerx.) dmdbcat=catCR;
   var &donerx. &targetx.; /*class ;*/
   target &targetx.;
run;
/* You need to run this procedure to create the SAS catalog used in the
   later processing. It runs very fast even on big data sets with many
   variables. You may save it to a permanent location if you like. */
proc pmbr /* the implied distance, I think, is Euclidean */
   data=&indsn.(keep=_numeric_ &targetx.) dmdbcat=catCR
   THREADS=8 /* THREADS is the default; you can specify NOTHREADS */
   OPTIMIZEK /* specify this option if you want the data to decide the
      number of neighbors; conceptually it is similar to the Cubic
      Clustering Criterion (CCC). Alternatively, you can specify K=
      explicitly. Adding WEIGHTED weights the influence of the nearest
      neighbors by their relative distance to the subject being
      classified. A large K should be balanced against how similarity
      tapers off among the neighbors; you may consider profiling the
      target across that gradient. If your target is binary, an even K
      may produce more ties. */
   EPSILON=0 /* minimum allowable distance from a scoring observation to
      a training observation. If you are not sure how big it should be,
      leave it at 0.00. Depending on the collective complexion of the
      variables on the VAR statement, a big EPSILON may leave no
      neighbors near enough to support classification. */
   Method=rdtree /* search method used to find the nearest neighbors */
   out=outx outest=oust;
/* This ends the discussion of options at the procedure level; several
   others are left out. */
   target &targetx.; /* the TARGET statement names the variable to classify */
var &donerx.;
   /* other statements available: decision cost= costvar= decisiondata=
      decvar= priorvar= */
score data=temp out=&outdsn.;
run;
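For reference, here is an illustrative setup of the macro variables used above. The data set and variable names are assumptions, not part of the original code:

%let indsn   = work.train;        /* training data set */
%let targetx = bad;               /* target variable */
%let donerx  = x1 x2 x3 x4 x5;    /* input variables */
%let outdsn  = work.scored;       /* scored output data set */
/* the SCORE statement above reads the data set to be scored from
   a data set named TEMP */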
General notes
As you should heed in any distance-based classification, the scale of the
variables and their orthogonality to each other are important to the
usefulness of your results. Many practitioners standardize the inputs and run PCA to prepare the data. 'Specialty' treatment may be called for if your inputs are things like purchase baskets, sequences, or preference/subjective ratings; using multidimensional scaling is not uncommon. Weighting, however, generally should not be overdone.
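A minimal preparation sketch, assuming numeric inputs x1-x10 in a training data set named TRAIN (both names are illustrative):

proc stdize data=train out=train_std method=std;
   var x1-x10;   /* center to mean 0, scale to standard deviation 1 */
run;

proc princomp data=train_std out=train_pca n=5 noprint;
   var x1-x10;   /* keep the first 5 orthogonal principal components */
run;
/* Prin1-Prin5 in TRAIN_PCA can then feed proc dmdb and proc pmbr in
   place of the raw inputs */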
You may consider variable clustering for selection. Some argue that the presence of a target should call for the usual supervised variable selection, as when building a logistic regression model. The target variable here, however, is little more than voting chips; MBR is NOT trying to maximize separation between 0 and 1.
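A hedged sketch of variable clustering with proc varclus; the variable names and the MAXEIGEN cutoff are illustrative:

proc varclus data=train_std maxeigen=0.7 short;
   var x1-x10;
run;
/* from each cluster, a common practice is to keep the variable with the
   lowest 1-R**2 ratio and drop the rest */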
Training under this procedure typically does not take very long. Scoring does,
because the distance between observations in the training set and the scoring
data set must be calculated. This is reminiscent of how proc discrim works
when a non-parametric model is built and scored.
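For comparison, a k-NN classification via proc discrim's non-parametric method; the data set names, the target BAD, and the inputs x1-x10 are assumptions:

proc discrim data=train test=score_ds testout=scored
             method=npar k=5;
   class bad;      /* the target levels act as the classes */
   var x1-x10;
run;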
You should settle issues such as missing values (surrogates, or whether missingness is used in the distance) and sparsity before engaging the two procedures.
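One hedged option for the missing-value side is simple replacement before modeling; again the names are illustrative:

proc stdize data=train out=train_imp reponly method=mean;
   var x1-x10;   /* REPONLY replaces missing values only, here with the
                    mean, and leaves non-missing values unchanged */
run;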
The scored data set has a predicted score (the voting result from the nearest
neighbors) for each observation. You can conduct performance analysis from
there (actual vs. predicted, and so on).
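A minimal performance check, assuming the scored data set carries the actual target BAD and a posterior named P_BAD1; both variable names are assumptions about what the scoring output contains:

data eval;
   set &outdsn.;
   pred = (p_bad1 >= 0.5);   /* cut the posterior at 0.5 for a class call */
run;

proc freq data=eval;
   tables bad*pred / norow nocol nopercent;   /* confusion matrix */
run;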