Analytics in Writing: SAS In-Memory Statistics for Hadoop: Key Exercises to Jump-Start Long Time SAS Users, Part One

SAS® In-Memory Statistics for Hadoop ("SASIMSH") is a single interactive programming environment for analytics on Hadoop that integrates analytical data preparation, exploration, modeling and deployment. It contains PROC IMSTAT and the SAS LASR Analytic Engine (SASIOLA). SASIOLA is essentially 'BASE' for SASIMSH, whereas PROC IMSTAT covers the full analytical cycle from BASE through modeling to deployment. This duality continues SAS's long-standing tradition of 'getting the same job done in different ways' to accommodate users' different styles, constraints and preferences.

This is the first of several posts I plan to publish that map key analytical data exercises from traditional SAS programming to SASIMSH. Today's post covers sorting, sorting-related BY processing, and ordering in SASIMSH.

Almost every modeler/analyst who has ever prepared data for modeling with SAS tools is familiar with PROC SORT, SET with BY on the sorted data set, and further first. and last. processing.
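For readers who want the traditional pattern in front of them before the mapping starts, here is a minimal sketch in BASE SAS. The table work.trans, its variables and its values are made up for illustration; only PROC SORT, SET with BY, and the first./last. flags come from the discussion above.

```sas
/* Toy data; the table name, variables and values are hypothetical */
data work.trans;
   input key2 $ month amount;
   datalines;
A 2 10
B 1 5
A 1 7
B 2 3
;
run;

/* Sort by the grouping key, then by month within each key */
proc sort data=work.trans out=work.trans_sorted;
   by key2 month;
run;

/* SET with BY exposes first. and last. flags for each BY variable.
   The flags are temporary, so copy them into ordinary variables to keep them. */
data work.flags;
   set work.trans_sorted;
   by key2 month;
   first_in_key = first.key2;   /* 1 on the first row of each key2 group */
   last_in_key  = last.key2;    /* 1 on the last row of each key2 group  */
run;
```

The rest of this post is about what happens to the PROC SORT step of this pattern once the data lives in memory.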

In SASIMSH, PROC SORT is no longer explicitly supported as a syntax invocation. The act of sorting, however, is definitely supported; its name and syntax invocation are now PARTITION. Below are two code examples.

Example 1: partition under the SASIOLA engine
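
       /* START launches the in-memory server; TAG= identifies the in-memory tables */
       libname dlq SASIOLA START TAG=data2;

       /* Stack two monthly tables, partition by key2, order by month within each partition */
       data dlq.in(partition=(key2) orderby=(month) replace=yes);
          set trans.dlq_month1
              trans.dlq_month2;
       run;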

1. In the LIBNAME statement, specifying SASIOLA continues the spirit of SAS/ACCESS engines; you have probably named SAS/ACCESS to Oracle, or an older SAS version engine such as V8, in a LIBNAME statement before. The START option is unique to SASIOLA (IOLA stands for input-output LASR). It simply tells SAS to launch the SASIOLA in-memory server (you can release the library or shut down the server later). TAG= is critical and is required. One reason is to reference the data properly once it is loaded into memory. Another, related reason is to avoid potential 'collisions' when multiple actions are happening in the memory space. Also, when loading data from, say, Hadoop, where the source has many layers of locations, the two-level naming of traditional BASE is no longer sufficient; the tag allows a longer specification.

2. The SET statement can still be used to stack data sets, and there is still no limit on how many data sets you can stack. The salient concern is sizing: how much memory is needed to accommodate the combined input data sets, a concern you care far less about when running a SET statement in traditional BASE code. Also noteworthy: multiple SET statements are no longer supported in the SASIOLA DATA step, although you can list multiple input data sets in a single SET statement. An interesting question is how often you still need multiple SET statements in this new in-memory computation context.

3. Under the SASIOLA engine, sorting happens through the data set option PARTITION=: partition=(key2) is, logically, the same as "proc sort; by key2; run;". However, this evolution is much more than a switch in syntax or naming. It reflects a fundamental difference between analytical computing centered on Hadoop (SASIMSH) and traditional SAS BASE. Hadoop is parallel computing by default. If you are running SASIOLA on a 32-node Hadoop environment, partitioning naturally tries to spread the partitions across the 32 nodes, instead of jamming all the groups sequentially into one single data set as PROC SORT does. PARTITION= puts records belonging to the same partition on the same node (there is indeed optimal padding/block size to consider). Accessing the partitions later is, by design, done in parallel; some of us call it bursting through the memory pipes. This is very different from SAS BASE, where you grind through observations one by one.

4. As we should have learned from PROC SORT, the first variable listed in PROC SORT is typically there to group, not to order; if you list only one variable, you should care only about grouping. For example, if the variable is account_number or a segment label, analytically speaking you rarely need the observations ordered by its values in addition to being grouped by it. But PROC SORT in most cases orders the observations by the sorting variable anyway. This is no longer the case with partitioning under SASIOLA, or SASIMSH in general.

5. Similar to PROC SORT, with SASIOLA 1) you can list as many variables as you see fit in PARTITION=; 2) the order of the variables listed still matters; 3) the same sense and sensibility applies: the more variables you list, the less analytical sense it makes, even if it is sometimes necessary.

6. You can use PARTITION= as a data set option on the input data set as well. My preference is to use it as a 'summary' on the output data set. In some cases partitions defined on the input are 'automatically' or 'implicitly' preserved in the output; in other cases they are not.

7. ORDERBY= is interesting. If you specify ORDERBY=, the ordering happens within the partitions. When you apply "proc sort; by key2 month; run;" and you have multiple entries of month=JAN, for example, using first.key2 later does not pin down the record for you unless you add at least one more variable to the BY statement. This remains the case with "partition=(key2) orderby=(month)" under SASIOLA. If, however, the later action is a summary by the two variables, running "proc means; by key2 month; run;" will yield different results from running the summary under SASIMSH (PROC IMSTAT, to be specific), because in PROC IMSTAT only the partition variable key2 is effectively used and the ORDERBY variable month is ignored.

8. REPLACE=YES: a concise way to effect "proc delete data=dlq.in; run;" or "proc datasets lib=dlq; delete in; run;". This carries an obvious Hadoop flavor.
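The point about first.key2 and repeated month values can be reproduced in traditional BASE SAS. The sketch below covers the BASE side only; it does not reproduce SASIOLA or PROC IMSTAT behavior. The table work.dlq, the numeric month coding, and the tie-breaker txn_id are hypothetical, introduced only to show the mechanics.

```sas
/* Toy data: key2=A has two rows for month 1 (all names and values hypothetical) */
data work.dlq;
   input key2 $ month txn_id balance;
   datalines;
A 1 102 50
A 1 101 40
A 2 103 60
B 1 104 20
;
run;

proc sort data=work.dlq out=work.dlq_sorted;
   by key2 month;
run;

/* Keeps one row per key2, but nothing in the BY list decides between the
   two month=1 rows for A; the survivor depends on the incoming row order. */
data work.first_ambiguous;
   set work.dlq_sorted;
   by key2 month;
   if first.key2;
run;

/* Adding one more BY variable (here the assumed tie-breaker txn_id)
   makes the choice of the first row explicit. */
proc sort data=work.dlq out=work.dlq_pinned;
   by key2 month txn_id;
run;

data work.first_pinned;
   set work.dlq_pinned;
   by key2 month txn_id;
   if first.key2;
run;

/* Traditional summary by both variables: one result per key2-month pair */
proc means data=work.dlq_sorted sum;
   by key2 month;
   var balance;
run;
```

With the default PROC SORT behavior, tied rows keep their input order, so work.first_ambiguous keeps txn_id 102 for A while work.first_pinned keeps txn_id 101. The PROC MEANS step is the two-variable summary that, per the discussion above, PROC IMSTAT would effectively compute by key2 alone.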

Example 2: partition using PROC IMSTAT under SASIMSH
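
       proc imstat;
          /* Partition the in-memory table by key2, ordering by month within partitions */
          table dlq.in;
          partition key2 / orderby=month;
       run;
          /* &_templast_ resolves to the temporary table created by the previous action */
          table dlq.&_templast_;
          summary x1 / partition;
       run;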

This example has pretty much the same end result as example 1 above, as far as partitioning is concerned.
The key difference is in their 'way of life'. While both examples represent genuine in-memory computation, example 1 resembles traditional SAS BASE batch processing and example 2 is true interactive programming. In example 2, within a single invocation of PROC IMSTAT, the analyst can use RUN statements to divide the whole stream into sub-scopes, where slicing and dicing, exploration (like the summary action), and modeling (not shown here) happen WHILE the data tables are 'floating in memory'.
In both examples, none of the resulting data sets are saved to disk. They all end up parked in memory. There is a SAVE statement that allows the data to be 'downloaded' to disk.

In the next posts, I will cover transpose and retain actions. Let me know what you think. Thanks.

Tuesday, July 15, 2014
