Tuesday, July 15, 2014

SAS In-Memory Statistics for Hadoop: Key Exercises to Jump-Start Long Time SAS Users, Part One


SAS® In-Memory Statistics for Hadoop ("SASIMSH") is a single interactive programming environment for analytics on Hadoop that integrates analytical data preparation, exploration, modeling and deployment. It contains PROC IMSTAT and the SAS LASR Analytic Engine (SASIOLA). SASIOLA is essentially 'BASE' for SASIMSH, whereas PROC IMSTAT covers the full analytical cycle, from BASE through modeling to deployment. This duality continues SAS's long-standing tradition of 'getting the same job done in different ways' to accommodate users' different styles, constraints and preferences.
This post is the first of several I plan to publish soon on mapping the code for key analytical data exercises from traditional SAS programming to SASIMSH. Today's post covers sorting, sort-related BY processing, and ordering in SASIMSH.
Almost every modeler or analyst who has prepared data for modeling with SAS tools is familiar with PROC SORT, with SET and BY on the sorted data set, and with first. and last. processing.
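For reference, the traditional pattern looks roughly like the sketch below. The data set work.dlq and its variables key2, month and balance are made up for illustration; they do not come from any real table.

```sas
/* Hypothetical sample data: two accounts, a few monthly records each */
data work.dlq;
   input key2 $ month balance;
   datalines;
A 2 150
B 1 300
A 1 100
B 2 320
A 3 175
;
run;

/* Traditional BASE: physically sort before any BY-group processing */
proc sort data=work.dlq;
   by key2 month;
run;

/* first. and last. flags mark the boundaries of each key2 group */
data work.dlq_flags;
   set work.dlq;
   by key2 month;
   first_rec = first.key2;   /* 1 on the earliest month of each account */
   last_rec  = last.key2;    /* 1 on the latest month of each account   */
run;
```

The sort has to finish before the BY statement can be used, and the DATA step then reads the observations one at a time. That is the habit the rest of this post maps into SASIMSH.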
In SASIMSH, "PROC SORT" is NO LONGER explicitly supported as a syntax invocation. The act of sorting, however, is definitely supported; the syntax is now PARTITION. Below are two code examples.

Example 1: partition under the SASIOLA engine

" libname dlq SASIOLA START TAG=data2;
       data dlq.in(partition=(key2) orderby=(month) replace=yes);
          set trans.dlq_month1
              trans.dlq_month2;
       run; "
  1. At the LIBNAME statement, specifying SASIOLA continues the spirit of the SAS/ACCESS drivers; you have probably specified SAS/ACCESS to Oracle, or an older SAS version engine such as V8, at a LIBNAME statement before. The START option is unique to SASIOLA (IOLA stands for input-output LASR). It simply tells SAS to launch the SASIOLA in-memory server (you can release the library or shut down the server later). TAG= is critical and required. One reason is to reference the data properly once it is loaded into memory. Another, related reason is to avoid potential 'collisions' when multiple actions are happening in the memory space. Also, when loading data from, say, Hadoop, where the source can sit many directory levels deep, the two-level naming restriction of traditional BASE is no longer sufficient. The tag allows for a longer specification.
  2. The SET statement can still be used to stack data sets, and there is still no limit on how many data sets you can stack. The salient concern is sizing: how much memory is needed to hold the combined input data sets, a concern you care about far less when running a SET statement in traditional BASE code. Also noteworthy: multiple SET statements are no longer supported in the SASIOLA DATA step, although you can list multiple input data sets in a single SET statement. An interesting question is how much you still need multiple SET statements in this new in-memory computation context.
  3. Under the SASIOLA engine, sorting happens through the data set option PARTITION=: partition=(key2) is logically the same as "proc sort; by key2; run;". However, this evolution is more than a switch of syntax or naming. It reflects a fundamental difference between analytical computing centered on Hadoop (SASIMSH) and traditional SAS BASE. Hadoop is parallel computing by default. If you are running SASIOLA on a 32-node Hadoop environment, partitioning naturally tries to load different partitions across the 32 nodes, instead of jamming all of them sequentially into one single place, as is the case with PROC SORT. PARTITION= puts records that belong to the same partition on the same node (there is indeed an optimal padding/block size to consider). Accessing the partitions later happens, by design, in parallel; some of us call it bursting through the memory pipes. This is very different from SAS BASE, where you grind through observations one by one.
  4. As we should have learned from PROC SORT, the first variable listed in PROC SORT is typically there to group, not to order; if you list only one variable, you usually care only about grouping. For example, if the variable is account_number or a segment label, analytically speaking you rarely need the groups ordered by the variable's values in addition to being grouped by them. PROC SORT nevertheless orders the observations by the sorting variable in most cases. This is no longer the case with partitioning under SASIOLA, or SASIMSH in general.
  5. Similar to PROC SORT, with SASIOLA 1) you can list as many variables as you see fit at PARTITION=; 2) the order of the variables listed still matters; and 3) the same good sense still applies: the more variables you list, the less analytical sense the partitioning makes.
  6. You can also use PARTITION= as a data set option on the input data set. My preference is to use it as a 'summary' on the output data set. In some cases, partitions defined at the input are 'automatically' or 'implicitly' preserved in the output; in other cases they are not.
  7. ORDERBY= is interesting. If you specify ORDERBY=, the ordering happens within the partitions. When you apply "PROC SORT; by key2 month; run;" and you have multiple entries of month=JAN, for example, using first.key2 later does not pin down one specific record unless you add at least one more variable to the BY statement. This remains the case with "partition=(key2) orderby=(month)" under SASIOLA. If, however, the later action is a summary by the two variables, running "proc means; by key2 month; run;" will yield different results from running the summary under SASIMSH (PROC IMSTAT, to be specific), because in PROC IMSTAT only the variable key2 is effectively used and the ORDERBY variable month is ignored.

  8. REPLACE=YES: a concise way to effect "proc delete data=dlq.in; run;" or "proc datasets lib=dlq; delete in; run;". This carries an obvious Hadoop flavor.
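Note 7 can be seen in traditional BASE terms with a small sketch. The data set work.dlq_sorted and its values are hypothetical, and the code is plain BASE SAS rather than SASIMSH syntax; the statement about PROC IMSTAT in the last comment simply restates note 7.

```sas
/* Hypothetical data, already in key2/month order;
   note the two records for key2=A, month=1 */
data work.dlq_sorted;
   input key2 $ month x1;
   datalines;
A 1 10
A 1 20
A 2 30
B 1 40
B 2 50
;
run;

/* first.key2 keeps one record per key2, but which of the two
   A/month=1 rows it keeps depends only on their incoming order,
   not on anything in the BY statement */
data work.first_per_key;
   set work.dlq_sorted;
   by key2 month;
   if first.key2;
run;

/* One summary per key2 and month combination (4 groups here) */
proc means data=work.dlq_sorted sum;
   by key2 month;
   var x1;
run;

/* One summary per key2 only (2 groups here); per note 7, this is
   the grouping a partitioned PROC IMSTAT summary corresponds to */
proc means data=work.dlq_sorted sum;
   by key2;
   var x1;
run;
```

If you need the month-level breakdown in memory, the implication of note 7 is that month would have to be part of the partition key rather than the ORDERBY= list.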


Example 2: partition using PROC IMSTAT under SASIMSH
"PROC IMSTAT ;
      table dlq.in;
          partition key2 / orderby=month;
      run;
      table dlq.&_templast_;
          summary x1 / partition;
      run; "

  • This example has pretty much the same end result as Example 1 above, as far as partitioning is concerned.
  • The key difference is in their 'way of life'. While both examples represent genuine in-memory computation, Example 1 resembles traditional SAS BASE batch action and Example 2 is true interactive programming. In Example 2, within one single invocation of PROC IMSTAT, the analyst can use RUN statements to divide the whole stream into sub-scopes, where the 'slice and dice', exploration (like the summary action), and modeling (not shown here) happen WHILE the data tables are 'floating in memory'.
  • In both examples, none of the resulting data sets are saved to disk. They all end up parked in memory. There is a SAVE statement that allows the data to be 'downloaded' to disk.
In upcoming posts, I will cover transpose and retain actions. Let me know what you think. Thanks.