Currently SAS HPA has one text mining procedure, HPTMINE (experimental), which actually works fairly well. This post presents one working example.
The text file contains ~216K news entries, with a total file size of ~384MB. The example runs on a Windows client with 16GB RAM.
proc HPTMINE data=doc2.news2;
doc_id id2; /*ID variable is required*/
variable description; /*listing multiple variables may cause confusion*/
parse outterms = doc2.out_terms_news reducef=2;
/*reducef= sets the minimum frequency of occurrence for term filtering; terms that fall below it are dropped*/
/*nostemming entities= stop= start= multiterm= syn= termwgt= cellwgt= outchild= outterms=*/
/*all of these options can be turned on or off. Weighting is important in the tweaking process*/
svd k=10 outdocpro=doc2.docpro_news
/*this is the critical math part of the whole exercise; in some cases you act on the raw frequencies directly instead*/
svdu=doc2.news_svdu /*left singular vectors*/
svdv=doc2.news_svdv; /*right singular vectors*/
/*tol= tolerance value for singular value*/
/*performance host="&GRIDHOST" install="&GRIDINSTALLLOC" details;*/
run;
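For readers who want the math behind the svd statement, here is a minimal sketch. It assumes the weighted term-by-document matrix $A$ has terms as rows and documents as columns. The symbols $A$, $U_k$, $\Sigma_k$, $V_k$, $a_j$ and $p_j$ are notation introduced here for illustration, not names used by the procedure, and any normalization the procedure may apply to the projections is not shown.

```latex
\[
A \approx U_k \Sigma_k V_k^{\top}, \qquad k = 10,
\]
\[
p_j = U_k^{\top} a_j \in \mathbb{R}^{k},
\]
% a_j is the j-th column of A (one document); p_j is its k-dimensional projection.
```

Under that assumption, svdu= would hold the columns of $U_k$ (one row per term), svdv= would hold the columns of $V_k$ (one row per document), and outdocpro= would hold the projections $p_j$.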
This procedure integrates several separate procedures available in regular SAS Text Miner, which reduces the I/O traffic caused by running them separately. The advantage of this integration is more pronounced when the input text file is huge. The integration also centralizes the logic in one place before parallel computation is invoked to execute the job. This specific example is not executed on parallel nodes.
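Once the procedure has run, the output tables it wrote can be checked with ordinary Base SAS steps. The following is only a sketch: the data set names are the ones used in the example above, the obs=20 limit is an arbitrary choice, and no particular column names are assumed, which is why PROC CONTENTS comes first.

```sas
/* List the columns of each output table before relying on any variable name */
proc contents data=doc2.out_terms_news; run;
proc contents data=doc2.docpro_news; run;

/* Print the first 20 rows of the term table and of the document projections */
proc print data=doc2.out_terms_news(obs=20); run;
proc print data=doc2.docpro_news(obs=20); run;
```

The same two steps can be pointed at doc2.news_svdu and doc2.news_svdv to look at the singular vectors.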
Below are some log details; the operation took less than 2 minutes.
Below are screenshots of the term probability table and the term-frequency matrix. The mechanics of the whole operation are very intuitive, but getting the desired outcome often requires time-consuming tweaking. The upside is using all defaults could very well