Saturday, April 13, 2013

Generalized Linear Model Structure and Nonlinear Model Structure in SAS STAT

The SAS/STAT product offers so many modeling tools that it is sometimes unclear which one covers
which cases and data structures. Below is a summary diagram I took from a training course SAS offers.

Again, a picture speaks volumes. This diagram is two years old, but I believe about 90% of it still holds.

Some procedures, such as GENMOD and GLIMMIX, may be candidates for a move to the HP platform. NLIN and MIXED already have big data counterparts in SAS HPA: HPNLIN and HPMIXED.

Friday, April 12, 2013

SAS Clustering Solution Overview, just One Picture

Lately, more and more contacts and friends have told me that they see many SAS procedures related
to clustering but are not clear about how they relate to one another (which one does what). In a training
course offered by SAS titled "Applied Clustering Techniques", I found a diagram that does a good
job of explaining it.

As we often say, a picture is worth a thousand words. Take a look.

SAS High Performance Text Mining: SAS HPTMINE

Currently there is one text mining procedure in SAS HPA, HPTMINE (experimental), which actually works fairly well. This post presents one working example.
The text file contains ~216K news entries, with a total file size of ~384MB. The example runs on a Windows client with 16GB RAM.

proc HPTMINE data=doc2.news2;
   doc_id id2;             /* an ID variable is required */
   variable description;   /* listing multiple variables may cause confusion */
   parse outterms=doc2.out_terms_news reducef=2;

   /* reducef= is the frequency threshold for term filtering: terms occurring less often are dropped */
   /* other parse options: nostemming entities= stop= start= multiterm= syn= termwgt= cellwgt= outchild= outterms= */

   /* all of these options can be turned on and off; weighting is important in the tweaking process */
   svd k=10 outdocpro=doc2.docpro_news

   /* the SVD is the critical math step of the whole exercise; in some cases you act on the raw frequencies directly */
       svdu=doc2.news_svdu   /* left singular vectors */
       svdv=doc2.news_svdv;  /* right singular vectors */
   /* tol=  tolerance value for singular values */
   /* resolution=low|med|high
   performance host="&GRIDHOST" install="&GRIDINSTALLLOC" details; */
run;
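
To make the term-filtering idea behind reducef=2 concrete, here is a minimal Python sketch. It is not how HPTMINE is implemented, only an illustration. It assumes the threshold counts the number of documents a term appears in (it may be defined differently in the procedure), and it splits text on whitespace, which is far cruder than real parsing. The function names filter_terms and term_document_counts and the three sample documents are made up for this example.

```python
from collections import Counter

def filter_terms(docs, min_docs=2):
    """Return the terms that appear in at least `min_docs` documents.

    Assumption: the threshold is a document count, mimicking the idea of
    a reducef-style filter. Tokenization is a plain whitespace split.
    """
    doc_freq = Counter()
    for text in docs.values():
        # set() so that a term counts once per document
        doc_freq.update(set(text.lower().split()))
    return {term for term, n in doc_freq.items() if n >= min_docs}

def term_document_counts(docs, kept_terms):
    """Build a sparse term-by-document frequency table as a dict
    keyed by (term, doc_id), keeping only the surviving terms."""
    counts = {}
    for doc_id, text in docs.items():
        tf = Counter(t for t in text.lower().split() if t in kept_terms)
        for term, n in tf.items():
            counts[(term, doc_id)] = n
    return counts

# Hypothetical toy corpus keyed by document ID
docs = {
    1: "stocks rally as markets open",
    2: "markets fall as oil prices rise",
    3: "oil prices steady",
}

kept = filter_terms(docs, min_docs=2)
matrix = term_document_counts(docs, kept)
print(sorted(kept))
print(sorted(matrix.items()))
```

Terms that occur in only one document, such as "stocks" or "steady", drop out before the frequency table is built. That is why a filtering threshold like this shrinks the matrix passed on to the SVD step.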


This procedure integrates several procedures that are separate in regular SAS Text Miner, which reduces the I/O traffic caused by that separation. The advantage of this integration is more pronounced when the input text file is huge. The integration also centralizes the logic before parallel computation is invoked to execute the job. This specific example was not executed on parallel nodes.

Below are some log details; the operation took less than 2 minutes.


Below are screenshots of the term probability table and the term-frequency matrix. The mechanics of the whole operation are very intuitive. Getting the desired outcome often requires time-consuming tweaking. The upside is that using all defaults could very well