Friday, April 12, 2013

SAS High Performance Text Mining: SAS HPTMINE

Currently there is one text mining procedure in SAS HPA, HPTMINE (experimental), which actually works fairly well. This post presents one working example.
The text file contains ~216K news entries, with a total file size of ~384MB. The example runs on a Windows client with 16GB RAM.

proc HPTMINE data=doc2.news2;
   doc_id id2;            /* ID variable is required */
   variable description;  /* text variable to parse; listing multiple variables may cause confusion */

   /* reducef=2: frequency for term filtering, i.e. the minimum number of documents
      a term must appear in to be kept */
   parse outterms=doc2.out_terms_news reducef=2;
   /* other parse options: nostemming entities= stop= start= multiterm= syn= termwgt= cellwgt= outchild= outterms= */
   /* all of these options can be turned on and off; weighting is important in the tweaking process */

   /* the SVD is the critical math part of the whole exercise; in some cases you act on the direct frequencies instead */
   svd k=10
       outdocpro=doc2.docpro_news  /* document projections */
       svdu=doc2.news_svdu         /* left singular vectors */
       svdv=doc2.news_svdv;        /* right singular vectors */
   /* other svd options: tol= (tolerance value for singular values), resolution=low|med|high */

   /* not used in this example; uncomment to run on a grid:
   performance host="&GRIDHOST" install="&GRIDINSTALLLOC" details; */
run;
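
To make the reducef=2 setting on the parse statement more concrete, here is a minimal sketch in Python (not SAS) of document-count filtering. It assumes that reducef keeps only terms appearing in at least that many documents, and it uses plain whitespace splitting instead of the procedure's real parsing (no stemming, tagging or entity handling). The function name and the three sample documents are made up for illustration.

```python
from collections import Counter


def filter_terms_by_doc_count(docs, min_docs=2):
    """Return {term: document count} for terms found in at least `min_docs` documents."""
    doc_counts = Counter()
    for text in docs:
        # Count each term once per document: this is a document count,
        # not a raw occurrence count.
        doc_counts.update(set(text.lower().split()))
    return {term: n for term, n in doc_counts.items() if n >= min_docs}


# Hypothetical toy corpus standing in for the news entries.
docs = [
    "markets rally on earnings",
    "earnings season lifts markets",
    "storm closes airports",
]

kept = filter_terms_by_doc_count(docs, min_docs=2)
print(sorted(kept))  # ['earnings', 'markets']
```

Raising the threshold shrinks the term table quickly, which is one reason this setting is worth tweaking before touching the weighting options.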


This procedure integrates several procedures that are separate in regular SAS Text Miner, which reduces the I/O traffic caused by passing data between them. The advantage of this integration is more pronounced when the input text file is huge. The integration also centralizes the logic before parallel computation is invoked to execute the job. This specific example is not executed on parallel nodes.
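
For readers who want to see what the svd statement's outputs mean, here is a small Python sketch (again not SAS) that finds the leading singular triplet of a tiny term-by-document matrix with power iteration. It is only an illustration under assumptions: the matrix is invented, power iteration is a swapped-in textbook method rather than whatever HPTMINE uses internally, and the procedure may scale or normalize its document projections differently.

```python
import math


def matvec(rows, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in rows]


def transpose(rows):
    return [list(col) for col in zip(*rows)]


def normalize(vec):
    """Return the unit-length vector and the original length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec], norm


def leading_singular_triplet(a, iterations=200):
    """Power iteration on A^T A: returns (u, sigma, v) for the largest singular value."""
    a_t = transpose(a)
    v, _ = normalize([1.0] * len(a[0]))
    for _ in range(iterations):
        v, _ = normalize(matvec(a_t, matvec(a, v)))
    # A v = sigma * u, so the length of A v is sigma and its direction is u.
    u, sigma = normalize(matvec(a, v))
    return u, sigma, v


# Hypothetical counts: rows are terms, columns are documents.
term_doc = [
    [2.0, 1.0, 0.0],
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 1.0],
]

u, sigma, v = leading_singular_triplet(term_doc)

# Projecting every document onto u gives u^T A, which equals sigma * v.
doc_projection = [sigma * x for x in v]
```

Here u plays the role of one column of the left singular vectors (one weight per term), v one column of the right singular vectors (one weight per document), and doc_projection one dimension of the document projection table. With k=10 the procedure keeps ten such dimensions.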

Below are some details from the log. The operation took less than 2 minutes.


Below are screenshots of the term probability table and the term-frequency matrix. The mechanics of the whole operation are very intuitive, but getting the desired outcome often requires time-consuming tweaking. The upside is using all defaults could very well

