Sunday, May 19, 2013

Mining 108 Million Text Messages in 7 Minutes: SAS High Performance Text Mining HPTMINE

The job ran on a Greenplum parallel system running SAS High Performance Analytics Server 12.1. The system has 32 worker nodes, each with 24 threads and 256 GB of RAM.
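For a sense of scale, the per-node figures can be multiplied out to cluster totals. This is plain arithmetic on the numbers quoted above; the variable names are only illustrative.

```python
# Per-node figures quoted in the post
worker_nodes = 32
threads_per_node = 24
ram_gb_per_node = 256

# Cluster-wide totals
total_threads = worker_nodes * threads_per_node
total_ram_gb = worker_nodes * ram_gb_per_node

print(f"{total_threads} threads, {total_ram_gb} GB RAM in total")
```

That works out to 768 threads and 8,192 GB (8 TB) of RAM across the worker nodes.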

The text data is a text-type column in a SAS data set. The total file size is ~187 GB, and about 108 million text cells (messages) are processed. Cell weights, document weights, and the SVD are computed.
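To make the weighting step concrete, here is a minimal sketch in plain Python on a toy corpus. It is not the HPTMINE implementation, and the post does not say which weighting formulas were used. The sketch assumes a log cell weight, log2(f + 1), and reads the "document weight" as a weight derived from each term's document frequency (an IDF-style weight). Both choices, and the three sample messages, are assumptions made for illustration.

```python
import math

# Hypothetical sample messages (not from the real data)
docs = [
    "call me when you land",
    "running late call you soon",
    "see you soon",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Raw counts: counts[term][j] = occurrences of term in document j
counts = {t: [doc.count(t) for doc in tokenized] for t in vocab}

n_docs = len(docs)
weighted = {}
for term, row in counts.items():
    # Number of documents that contain the term
    doc_freq = sum(1 for f in row if f > 0)
    # Assumed IDF-style weight: rarer terms get a larger weight
    idf = math.log2(n_docs / doc_freq) + 1.0
    # Assumed log cell weight, scaled by the term's IDF weight
    weighted[term] = [math.log2(f + 1.0) * idf for f in row]

for term in vocab:
    print(term, [round(w, 3) for w in weighted[term]])
```

An SVD would then be applied to this weighted term-by-document matrix to get a low-dimensional representation of each message; that step is left out of the sketch.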

The following picture shows the detailed processing log of the SAS job:

Below is the detailed timing of each computing step within the job. Parsing takes ~70% of the total time.
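The headline numbers translate into rough throughput figures. The calculation below takes the 7 minutes from the title as the total run time and uses the approximate sizes and the ~70% parsing share quoted in the post, so the results are estimates only.

```python
# Approximate figures quoted in the post
messages = 108_000_000
data_gb = 187
total_minutes = 7
parsing_share = 0.70

total_seconds = total_minutes * 60

msgs_per_sec = messages / total_seconds
gb_per_sec = data_gb / total_seconds
parsing_minutes = parsing_share * total_minutes

print(f"{msgs_per_sec:,.0f} messages/s")
print(f"{gb_per_sec:.3f} GB/s")
print(f"{parsing_minutes:.1f} minutes spent parsing")
```

On these assumptions the job handles roughly 257,000 messages per second, and parsing accounts for about 4.9 of the 7 minutes.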

Finally, here is a snapshot of the frequency-term table:

