Tuesday, July 15, 2014

SAS In-Memory Statistics for Hadoop: Key Exercises to Jump-Start Long Time SAS Users, Part One


SAS® In-Memory Statistics for Hadoop ("SASIMSH") is a single interactive programming environment for analytics on Hadoop that integrates analytical data preparation, exploration, modeling and deployment. It contains PROC IMSTAT and the SAS LASR Analytic Engine (SASIOLA). SASIOLA essentially is 'BASE' for SASIMSH, whereas PROC IMSTAT covers the full analytical cycle from BASE, through modeling, to deployment. This duality continues SAS's long-standing tradition of 'getting the same job done in different ways', to accommodate users' different styles, constraints and preferences.
This post is the first of several I plan to publish on mapping key analytical data exercises from traditional SAS programming to SASIMSH. Today's post covers sorting, the related BY processing, and ordering in SASIMSH.
Almost every modeler or analyst who has prepared data for modeling using SAS tools is familiar with PROC SORT, SET ... BY on the sorted data set, and further FIRST. and LAST. processing.
In SASIMSH, PROC SORT is no longer supported as an explicit syntax invocation. The act of sorting, however, is definitely supported; the invocation has changed to PARTITION. Below are two code examples.

Example 1 : partition under SASIOLA engine

" libname dlq SASIOLA START TAG=data2;
       data dlq.in(partition=(key2) orderby=(month) replace=yes);
          set trans.dlq_month1
                trans.dlq_month2;

       run; "
  1. At the LIBNAME statement, the SASIOLA specification continues the spirit of SAS/ACCESS engines; you have probably specified SAS/ACCESS to Oracle, or older SAS versions like V8, at LIBNAME statements before. The option START is unique to SASIOLA (IOLA stands for input-output LASR). It simply tells SAS to launch the SASIOLA in-memory server (you can release the library or shut down the server later). TAG= is critical and required. One reason is to reference the data properly once it is loaded into memory. Another, associated reason is to avoid potential 'collisions' when multiple actions are happening in the memory space. Also, when loading data from, say, Hadoop, where the source has many layers of locations, the two-level naming restriction embedded in traditional BASE is no longer sufficient. TAG= allows for long specifications.
  2. The SET statement can still be used to stack data sets. As before, there is no limit on how many data sets you can stack; the salient concern is sizing: how much memory space is needed to accommodate the input data sets combined, a concern you care far less about when running a SET statement in traditional BASE code. Also noteworthy: multiple SET statements are no longer supported in the SASIOLA data step, although you can SET multiple input sets with a single SET statement. An interesting question is: how much do you still need multiple SET statements in this new in-memory computation context?
  3. Under the SASIOLA engine, sorting happens as the data set option PARTITION=: partition=(key2) is, logically, the same as "proc sort; by key2; run;". However, this evolution is far more than a syntax or naming switch. It reflects a fundamental difference between analytical computing centered on Hadoop (SASIMSH) and traditional SAS BASE. Hadoop is parallel computing by default. If you are running SASIOLA on a 32-node Hadoop environment, partitioning naturally tries to load different partitions across the 32 nodes, instead of jamming all the partitions into one single location (sequentially), as is the case with PROC SORT. PARTITION= puts records pertaining to the same partition on the same node (there is indeed an optimal padding/block size to consider). Accessing the partitions later, by design, happens in parallel fashion; some of us call it bursting through the memory pipes. This is very different from SAS BASE, where you grind through observations one by one.
  4. As we should have learned from PROC SORT, the first variable listed is typically there to group, not to order; if you list only one variable for PROC SORT, you usually care only about grouping. For example, if the variable is an account number or segment label, analytically speaking you rarely need to order by the variable values in addition to grouping by them. But PROC SORT in most cases orders the observations by the sorting variable anyway. This is no longer the case with partitioning in SASIOLA, or SASIMSH in general.
  5. Similar to PROC SORT, with SASIOLA: 1) you can list as many variables as you see fit at PARTITION=; 2) the order of the variables listed still matters; 3) the same sensibility still applies that the more variables you list, the less analytical sense it makes.
  6. You can engage PARTITION= as a data set option on the input data set as well. My preference is to use it as a 'summary' on the output data set. In some cases partitions rendered at the input are automatically preserved into the output; in other cases the preservation does not happen.
  7. ORDERBY= is interesting. If you specify ORDERBY=, the ordering happens within the partitions. When you apply "proc sort; by key2 month; run;" and you have multiple entries of month=JAN, for example, using first.key2 later does not pin down a single record for you, unless you add at least one more variable at the BY statement. This remains the case with "partition=(key2) orderby=(month)" under SASIOLA. If, however, the later action is a summary by the two variables, running "proc means; by key2 month; run;" will yield different results from running a summary under SASIMSH (PROC IMSTAT, to be specific), because in PROC IMSTAT only the variable key2 is effectively used and the ORDERBY variable month is ignored.

  8. REPLACE=YES: a concise way to effect "proc delete data=dlq.in; run;" or "proc datasets lib=dlq; delete in; run;". This carries an obvious Hadoop flavor.


Example 2: partition using PROC IMSTAT under SASIMSH

" PROC IMSTAT;
     table dlq.in;
     partition key2 / orderby=month;
  run;

     table dlq.&_templast_;
     summary x1 / partition;
  run; "

  • This example produces pretty much the same end result as Example 1 above, as far as partitioning is concerned.
  • The key difference is in their 'way of life'. While both examples represent genuine in-memory computation, Example 1 resembles traditional SAS BASE batch action, and Example 2 is true interactive programming. In Example 2, within one single invocation of PROC IMSTAT, the analyst can use RUN statements to scope the whole stream into different sub-scopes, where the 'slice and dice', exploration (like the SUMMARY action), and modeling (not shown here) happen WHILE the data tables are 'floating' in memory.
  • In both examples, none of the resulting data sets are saved to disk; they are all parked in memory. There is a SAVE statement that allows the data to be 'downloaded' to disk.
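As a sketch of that last point — the PATH= location below is hypothetical, and the exact SAVE options depend on your release — 'downloading' an in-memory table to disk from PROC IMSTAT looks roughly like:

" PROC IMSTAT;
     table dlq.in;
     /* write the in-memory table back to disk; the path is hypothetical */
     save path="/user/dlq/archive" / replace;
  run; "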
In upcoming posts, I will cover transpose and retain actions. Let me know what you think. Thanks.

Thursday, January 23, 2014

Using SAS Forecast Server for CCAR Macroeconomic Forecasting


Comprehensive Capital Analysis and Review, or CCAR, is one of the Federal Reserve's capital planning initiatives for big banks. The Federal Reserve has published broad contours of the baseline scenario, the adverse scenario, and the severely adverse scenario. Forecasting macroeconomic variables under these different scenarios serves as the starting point for loan loss prediction, financial portfolio modeling and comprehensive reporting under CCAR. The Federal Reserve also provided updated historical time series accompanying the scenarios (see www.federalreserve.gov/bankinforeg/stress-tests-capital-planning.htm).

This post shows how to use SAS Forecast Server (FS) to build 28 time series forecasts from the link above. It took me a little over 2 hours to get the 28 models set up and run with FS, including data prep. The SAS program "fedmodeifed2.sas" in the appendix of this post converts the original character data types into numeric, and creates the quarterly time ID necessary for modeling with FS.
 
  • After creating a new project, point to the data set that has all 28 time series

 
  • Decide whether to run a 'pure' time series forecast or to include explanatory variables

 

  • Great flexibility in deciding which subset of the series to use for forecasting. This is a huge productivity boost, since the modeler does not need to toggle back and forth between the design window and data steps, or carry many different conditioning flags in the data set



  • Point and click to access 47 model selection criteria. Slide the validation data window to suit your needs


  • A set of quality models is built in 2 minutes. You can copy the built models and make changes to vary the criteria, validation, and measurement combinations for quick comparison. The model building process produces diagnostics, residual plots, whitening capability and transformations, among others. You can design event variables to adjust forecasts for business cycles and external shocks/treatments. FS also supports custom overrides (although in many veins manual overrides are not recommended)





 

SAS Forecast Server essentially is the GUI counterpart of the renowned SAS/ETS software. It is built and optimized toward the latest trend in forecasting theory and practice: forecast data mining. The productivity boost and model documentation are the top reasons I fell in love with FS. I spent about 2 hours and 20 minutes building all 28 series. They are baselines; you certainly can and should spend more time polishing them. Below is a summary of the 28 models. My favorite method is UCM.

 

 

Enjoy!!

Appendix: SAS code to clean up the original data for FS modeling, "fedmodeifed2.sas"

 
     "
     libname ccar "c:\sasdata";
    %let indsn =ccar.ccartest2; /*use Import facility to read in FED spreadsheet directly*/
    %let outdsn =sashelp.fedmodified2;
    data &outdsn.;
         set &indsn. ;
        length Q2 $2  y2 $4 ;
        Q2 =substr(compress(OBS), 2,1);
        Y2 =substr(compress(OBS), 3,4);
        YYQ =yyQ(y2,Q2);
        format yyq yyq6.;
        if _n_<=151;

        rename _0_year_Treasury_yield=Ten_year_Tre_yield
                     __month_Treasury_rate =Three_mon_Tre_rate
                     __year_Treasury_yield=Five_year_tre_yield;

                   /*the three raw names above cause problems with FS without renaming*/

                    DJTSMI=Dow_Jones_Total_Stock_Market_Ind*1;
                    Developing_Asia_Inflation2 =Developing_Asia_Inflation*1;
            Developing_Asia_Real_GDP_Growth2=Developing_Asia_Real_GDP_Growth*1;
            Euro_Area_Bilateral_Dollar_Exch2=Euro_Area_Bilateral_Dollar_Excha*1;
            Euro_Area_Inflation2=Euro_Area_Inflation*1;
            Developing_Asia_Bilateral_Dolla2=Developing_Asia_Bilateral_Dollar*1;
            Market_Volatility_Index__VIX2=Market_Volatility_Index__VIX_*1;

            /*the original data are very clean and regularized. There is no need to write fancy    code to do the type conversion. Just get it done*/

       drop Developing_Asia_Inflation
       Developing_Asia_Real_GDP_Growth
       Euro_Area_Bilateral_Dollar_Excha
       Euro_Area_Inflation
       Market_Volatility_Index__VIX_
      Dow_Jones_Total_Stock_Market_Ind
      Developing_Asia_Bilateral_Dollar
      Q2 y2;
run ;

"


    Sunday, December 1, 2013

    Why do many have the sentiment that ROI on big data is not paying off? 12/2013

            In the past 2 years, I have seen Java/computer scientists firing up Java and fancy programming tools to do ... One example: to build objective functions for a regression. With all due respect, for statistical software companies like SAS and SPSS this was cutting edge back when Clinton was in the White House. Some today ask (SAS) to break open its much more advanced big data analytics software, to show them how the objective function is built, so as to validate their own build using R. If you don't know David Ricardo's 'comparative advantage', Yahoo it now. Please don't tell me you should spend more than ten weeks of your time (charging your client ~$300/hour?) to build a pair of shoes, instead of spending $2000 to just buy a pair off the shelf (very likely better built than your cookout), because of what? Because you are not a statistician?

        If your goal is to start up a business in advanced analytics, hoping to go for an IPO and strike it big, that is fine, and probably a necessary path when starting from scratch, if not coding from ground zero (if not asking the open source community to contribute to your cause for free). For 99% of us in big data analytics, though, it is about enhancing the core business on hand. Why do more and more today seem to have the sentiment that ROI on big data is not paying off? Forgetting your core competence/specialty and core business amid the big data fever is one key reason: if you are not able to articulate "why not", this inability becomes "why yes" quickly. This way of investing has obvious logic problems, and is anti-analytics per se. What is the SIC (standard industry code) for analytics? None, because it permeates every SIC. Your job is to hold onto your SIC, and adopt and modernize your analytics.

        I see speeches and blogs where people toss up new terms and concepts on big data analytics. Often 5 minutes later I realize "oh, is that just what statisticians call clustering? Kernel estimation? ....." Not many read deeply into the (statistics) literature these days. For some, if they cannot find it at Google.com, they start to think they have an innovation on their hands. One day I was asked to take a look at 'a design'. I suggested applying a KS test. That test eventually eliminated ~750K lines of Java code the developer had been writing for more than 3 months. KS test? Is that what statisticians have been doing behind banks' firewalls for the past 15+ years? Now you spend another 3 months to code KS using Java. You could not match SAS. You switched to SPSS. Still nowhere close, while SAS and SPSS have turned in consistent results (plus cosmetic differences) on the same data set... My point? Integrity, regardless of big data or small data, is way more important than scalability; scalability actually is the easier part.

       Instead of checking out the fashion labels on our jackets, statisticians and non-statisticians should work together on big data analytics. Recently I had the honor of reviewing a friend's paper. I was very impressed by her creativity and ability to use R. Then she asked, "when do you think SAS is going to implement it?" "Why do you ask?" I smiled on the webcam. A sheepish look crossed her face: "you know, the authenticity part....." Creativity, nimbleness and flexibility shall meet and marry the 'king of algorithms'. The offspring should benefit all of us. If you want to exceed a giant, try to stand on its head or shoulders to grow. If you choose to start afresh alongside it, the chances are you will live in its shadow for a long time, if not for life.

       Another friend is a division chief at a big NYC hospital. Two days ago he told me his medical school is hiring computer scientists to work with biostatisticians. I am also seeing banks hiring analysts with more diverse backgrounds, like physics majors doing predictive modeling. This trend toward a multidisciplinary mix is healthy. Let us not dumb down and out any major. Statistics is going to be a stalwart in big data for a long time to come. If you don't learn and adapt quickly, you become irrelevant, regardless of which major you are in. My experience is that learning statistics is harder than learning the machine learning stuff. If you 'hate' statistics for that reason, I fully appreciate it and am with you, especially if the market does not appear to pay statisticians as much as it pays data scientists. On the other hand, if you take away the coding/programming/system building, how much analytics really is left in much of data science? See for yourself.

    Tuesday, May 21, 2013

    Fitting Logistic Regression on 67 Million Rows Using SAS HPLOGISTIC

    This post focuses on model processing performance details using PROC HPLOGISTIC under SAS HPAS 12.1. Given that the data set is the same and the target and inputs are also the same, the AICC statistic is shown for the models; when the data set is different, or the target or the inputs are different, it makes little sense to compare AICC, or performance statistics in general, across the models constructed.

    The jobs are processed on a Greenplum parallel system running SAS High Performance Analytics Server 12.1. The system has 32 worker nodes; each node has 24 threads and 256GB RAM. The data set has 67,164,440 (~67 million) rows. The event rate is 6.61%, or 4,437,160 events. The data set is >80 GB. After EDA steps, 59 variables are entered to fit all the models.

    One executive-summary style observation: we are talking seconds and minutes.

    Below is sample code for fitting a binary logistic regression using HPLOGISTIC.
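    A minimal sketch of such a call — the data set, target and input names below are hypothetical, not the benchmark's actual variables:

    " proc hplogistic data=gplib.loans;
         class region;                              /* categorical input */
         model bad_flag(event='1') = x1-x58 region; /* 59 inputs after EDA */
         selection method=backward;                 /* variable selection */
         performance nodes=32 details;              /* use the 32 worker nodes */
      run; "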


    The following is the SAS log for the job running inside a SAS Enterprise Guide project flow


    While much of HPLOGISTIC's model output is similar to PROC LOGISTIC from SAS/STAT, some new output reflects a strong and renewed emphasis on computational efficiency in modeling. One good example, "Procedure Task Timing", is now part of all SAS HP PROCs. Here is an example with details for three Newton-Raphson with Ridging (NRRIDG) models.

     
    Finally, a juxtaposition of several technique and selection mixes, with iterations, time spent and AICC
     



    Some observations from this 'coarse' exercise (refined, in-depth work is planned for 2014):

    1. The WHERE statement is supported. You can use PROC HPSAMPLE to insert a variable _partind_ into your data set, then use WHERE statements to separate the training, validation and test partitions. In this way, you don't have to populate separate data sets, which at this size have a big footprint

    2. The RANUNI function is also supported. Unlike PROC HPREG, though, as of today the PARTITION statement, where you can leverage an external data set for variable selection and feature validation, is not yet supported in HPLOGISTIC. The popular c statistic will be available in the June 2014 release of SAS HPAS 13.1.

    3. It becomes more and more obvious that the era where most, if not all, modelers build one Exploratory Data Analysis (EDA) set to fit all techniques and selections is moving on. To max out the lift the big data set can afford, one may need to build one set per technique. That the build using TECHNIQUE=NMSIMP crashed in this exercise exemplifies this: NMSIMP is typically suitable for small problems, and it is recommended that one sample down the universe. The technique best suited for large data sets is conjugate-gradient optimization, which is not shown in this post (I should have one tested on a Hadoop cluster ready for blogging soon)

    4. Apparently it is not the case that more iterations mean better performance. This seems also to be the case with random forests or neural networks. In other words, there ought to be a saturation lift point for a given EDA on a given data set. One needs to reach that 'realization point' sooner rather than later to be more productive

    5. At data sizes such as 67 million rows, I doubt that HPSUMMARY, HPCORR and all the usual summary exercises are sufficient to understand the raw input at the EDA stage. Multicollinearity is another intriguing subject; the trend seems to be that it is being overridden by observation-wise swap tests for best subsets of variables. Kernel estimation and full visual data analysis, such as offered by SAS Visual Analytics, are two other strong options

    6. Logistic regression, especially armed with facilities such as SAS HPLOGISTIC, opens the door to innovative modeling schemes. More and more are moving beyond "one row per account" to model directly on transactions. In the context of transactions, 67 million is really not considered big at all. So be bold. If your performance still suffers today with HPLOGISTIC, what is holding you back is hardware, not software. (Well, if you insist on using SAS 9.1 of 2008, refuse to move onto HPLOGISTIC of 2013, and keep declaring 'SAS is slow', there is nothing anyone can do about it.)

    7. Analytically, if you build directly on transactions, instead of rolling up and summing up, you are purely modeling behavior. However, if your action requires the account or individual level, how to roll up your transaction model scores to that level is an interesting question.
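    Observation 1 above can be sketched as follows; the data set names, percentages and the meaning of the _partind_ values are hypothetical and may vary by release:

    " proc hpsample data=gplib.loans out=gplib.loans_part
                    partition samppct=30 samppct2=10 seed=12345;
         class bad_flag;
         target bad_flag;      /* stratify the split on the event flag */
         performance nodes=32;
      run;

      /* fit on one partition via WHERE -- no separate physical data sets */
      proc hplogistic data=gplib.loans_part;
         where _partind_ = 0;  /* assumed code for the training partition */
         model bad_flag(event='1') = x1-x58;
      run; "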

    Sunday, May 19, 2013

    Mining 108 Million Text Messages in 7 Minutes: SAS High Performance Text Mining HPTMINE

    The job is processed on a Greenplum parallel system running SAS High Performance Analytics Server 12.1. The system has 32 worker nodes; each node has 24 threads and 256GB RAM.

    The text data is a text-type column in a SAS data set. The total file size is ~187 GB. The total number of text cells/messages processed is ~108 million. Cell weight, document weight and SVD are computed.
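    A rough sketch of such a run — the data set, column names and SVD dimension below are hypothetical, and the exact statement syntax varies by release:

    " proc hptmine data=gplib.messages;
         doc_id msg_id;                        /* document identifier */
         var msg_text;                         /* text column to parse */
         parse outterms=gplib.terms;           /* term/frequency table */
         svd k=100 outdocpro=gplib.docpro;     /* per-document SVD projections */
         performance nodes=32 details;
      run; "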

    The following picture shows the detailed processing log of the SAS job

    Below is detailed speed info for each computing step inside the whole job. Parsing takes ~70% of the time
     
     


    Finally, a snapshot of the term-frequency table
     
     
    

    Saturday, April 13, 2013

    Generalized Linear Model Structure and Nonlinear Model Structure in SAS STAT

    The SAS/STAT product has so many modeling tools to offer that sometimes one is confused about which covers what cases and data structures. Below is a summary diagram I took from a training course SAS offers.

    Again, a picture speaks volumes. This diagram is two years old; I believe 90% of it stays the same since then.

    Some procedures, such as GENMOD and GLIMMIX, may be considered for a move to the HP platform. And NLIN and MIXED already have their big data counterparts in SAS HPA's HPNLIN and HPMIXED.

    Friday, April 12, 2013

    SAS Clustering Solution Overview, just One Picture

    More and more encounters and friends lately have told me they see many SAS procedures related to clustering, but are not clear about the interrelations among them (which one does what). From a training course offered by SAS titled "Applied Clustering Techniques", I found a diagram that does a good job of explaining it.

    As we often say, a picture is better than a thousand words. Take a look.