Tuesday, July 15, 2014

SAS In-Memory Statistics for Hadoop: Key Exercises to Jump-Start Long Time SAS Users, Part One

SAS® In-Memory Statistics for Hadoop ("SASIMSH")  is a single interactive programming environment for analytics on Hadoop that  integrates analytical data preparation, exploration, modeling and deployment. It contains PROC IMSTAT and  SAS LASR Analytic Engine (SASIOLA). SASIOLA essentially is 'BASE' for SASIMSH, whereas PROC IMSTAT covers the full analytical cycle from BASE, through modeling, towards deployment. This duality continues SAS's long standing tradition of 'getting the same job done in different ways', to accommodate users' different style, constraints  and preferences.
This post is one of several upcoming posts I plan to publish soon that discuss code mapping of key analytical data exercises from traditional SAS programming to SASIMSH. This post today covers sorting, sorting related BY and ordering in SASIMSH.
Almost every modeler/analyst who has ever prepared data for modeling using SAS tools is familiar with PROC SORT, Set By on the sorted data set and/or further engaging first. and last. processing.
In SASIMSH, "Proc Sort" is NO LONGER explicitly supported as syntax invocation. However, the act of sorting is definitely supported. The naming/syntax invocation is now changed to PARTITION. Below are two code examples

Example 1 : partition under SASIOLA engine

" libname dlq SASIOLA START TAG=data2;
       data dlq.in(partition=(key2) orderby=(month) replace=yes);
          set trans.dlq_month1

       run; "
  1.  At the LIBNAME statement, specification of SASIOLA continues the spirit of SAS Access drivers; you probably have run SAS Access to Oracle or older versions of SAS like V8 at LIBNAME statements before.  The option START is unique with SASIOLA (IOLA standing for input-output LASR). It simply tells SAS to launch the SASIOLA in-memory server (you can release the library or shut down the server later) . TAG= is critical and is required. One reason is to reference the data properly once it is loaded into the memory.  Another,  associated, reason is to avoid potential 'collision' when multiple actions are happening in the memory space. Also, when loading data from, say, Hadoop where the source has many layers of locations, the two-level restriction embedded with traditional BASE is no longer sufficient. Tag will allow for long specification.
  2. SET statement can still be used to stack data sets . Still, there is no limit as to how many data sets you can stack; salient concern is sizing: how much memory space  is needed to accommodate the input data sets combined, a concern you care far less when running SET statement in your traditional BASE code. Also noteworthy is that multiple SET statements are no longer supported with the SASIOLA data step, although you can SET multiple input sets with a single SET statement. Interesting question is: how much do you still need to engage multiple SET statements, in this new in-memory computation context?
  3. Now under SASIOLA engine, sorting happens as a data set option PARTITION=: partition=key2 is, logically, the same as "proc sort; by key2; run;". However, this evolution is >> than just syntax or naming/action switch. It reflects fundamental difference between analytical computing centering around Hadoop (SASIMSH) and traditional SAS BASE. Hadoop is parallel computing by default. If you are running SAS IOLA on a 32-node Hadoop environment, partitioning naturally tries to     load different partitions cross the 32 nodes, instead of jamming all the partitions  into one single partition (sequentially) as is the case with PROC SORT. PARTITION= is to put records pertaining to the same partition on the same node (there is indeed optimal padding/block size to consider) . Accessing the partitions later, by design, is to happen in parallel fashion; some of us call it bursting through the memory pipes . This is very different from SAS BASE where you grind through observations one by one.     
  4. As we should have learned from PROC SORT, the first variable listed at PROC SORT typically is to group, not to order; if you list only one variable for PROC SORT, you should care only to group. For example, if the variable is account_number or segment label, analytically speaking you rarely need to   order by the variable values, in addition to sorting by it. But PROC SORT in most cases orders the observations by the sorting variable anyway. This is no longer the case with partitioning with SASIOLA or SASIMSH in general.
  5. Similar to PROC SORT, with SASIOLA, 1) you can list as many variables as you see fit at PARTITION=. 2) order of the variables listed still matters  3) same sense and sensibility that the more variables you list, the less analytical sense it makes, still necessary albeit.
  6. You can engage PARTITION= as data set option for input data set as well. My preference is to use it as 'summary' at the output data set. There are cases where partitions rendered at the input are 'automatically'/'implicitly' preserved into the output. There are cases where the preservation does not happen.
  7. Orderby = is interesting. If you specify orderby=, the ordering happens within the partitions.
            When you apply "PROC SORT; by key2 months; run;" and you have multiple entries of  month=JAN,
            for  example, using first.key2 later does not pin down the record for you, unless you add at least one more 
           variable at the BY statement. This remains the case with "partition=(key2) orderby=(month)" under
           SASIOLA. If, however,  the later action is to do summary by the two variables, running "proc means; by key2
           month; run;" will yield different results from running summary under SASIMSH (PROC IMSTAT, to be
           specific), because in PROC IMSTAT only the variable key2 is effectively used and the orderby variable month
           is ignored.

     8. Reeplace =YES: a concise way to effect "proc delete data=dlq.in;" or "proc dataset lib=dlq; delete in;

          run;". This carries obvious Hadoop flavor.

Example 2: partition using PROC IMSTAT under SASIMSH,
       table dlq.in;
           partition key2/orderby=month;
      run ;
       table dlq.&_templast_;
           summary x1 /partition;
      run; "

  • This example has pretty much the same end result as example 1 above, as far as partitioning is concerned.
  • The key difference is in their 'way of life'. While both examples represent genuine in-memory computation, example 1 resembles traditional SAS BASE batch action and example 2 is  true interactive programming. In example 2, within one single invocation of PROC IMSTAT, the analyst can use RUN statements to scope the whole stream into different sub-scopes, where the 'slice and dice', exploration (like the summery action), and modeling (not shown here) is happening WHILE the data tables are 'floating in memory'
  • In both examples, none of the resulting data sets are saved onto disk. They are all eventually parked in memory space. There is SAVE statement that allows the data to be 'downloaded' to the disk.
In next posts, I will cover transpose and retain actions. Let me know what you think. Thanks.





  1. Thank for sharing this great Hadoop tutorials Blog post. I will use your command when upgrade hadoop.I get a lot of great information here and this is what I am searching for. Thank you for your sharing. I have bookmark this page for my future reference.
    Hadoop Training in hyderabad

  2. You put really very helpful information. Keep it up.

    Big Data Training in Chennai

  3. Thanks for sharing this valuable information to our vision. You have posted a trust worthy blog keep sharing.
    AWS Training in chennai | AWS Training chennai | AWS course in chennai

  4. I am following your blog from the beginning, it was so distinct & I had a chance to collect conglomeration of information that helps me a lot to improvise myself.
    ccna courses in Chennai|ccna institutes in Chennai

  5. Hi admin thanks for sharing informative article on hadoop technology. In coming years, hadoop and big data handling is going to be future of computing world. This field offer huge career prospects for talented professionals. Thus, taking Hadoop Training in Chennai will help you to enter big data technology.

  6. Best SQL Query Tuning Training Center In Chennai This information is impressive; I am inspired with your post writing style & how continuously you describe this topic. After reading your post, thanks for taking the time to discuss this, I feel happy about it and I love learning more about this topic..


  7. This information is impressive; I am inspired with your post writing style & how continuously you describe this topic. After reading your post, thanks for taking the time to discuss this, I feel happy about it and I love learning more about this topic.

    Online Reputation Management

  8. This blog having the details of Processes running. The way of running is explained clearly. The content quality is really great. The full document is entirely amazing. Thank you very much for this blog.
    SEO Company in India
    Digital Marketing Company in India

  9. Great and useful article. Creating content regularly is very tough. Your points are motivated me to move on
    SEO Company in Chennai

  10. Thank you for sharing such a nice and interesting blog with us. I have seen that all will say the same thing repeatedly. But in your blog, I had a chance to get some useful and unique information. I would like to suggest your blog in my dude circle.
    Jobs in Chennai
    Jobs in Bangalore
    Jobs in Delhi
    Jobs in Hyderabad
    Jobs in Kolkata
    Jobs in Mumbai
    Jobs in Noida
    Jobs in Pune