SAS® In-Memory Statistics for Hadoop ("SASIMSH")
is a single interactive programming environment for analytics on Hadoop
that integrates analytical data preparation, exploration, modeling and
deployment. It contains PROC IMSTAT and SAS LASR Analytic Engine
(SASIOLA). SASIOLA essentially is 'BASE' for SASIMSH, whereas PROC IMSTAT
covers the full analytical cycle from BASE, through modeling, towards
deployment. This duality continues SAS's long-standing tradition of 'getting
the same job done in different ways', to accommodate users' different styles,
constraints and preferences.
This post is one of several I plan to publish that discuss
how to map code for key analytical data exercises from
traditional SAS programming to SASIMSH. Today's post covers
sorting, sort-related BY processing, and ordering in SASIMSH.
Almost every modeler/analyst who has ever prepared data for
modeling using SAS tools is familiar with PROC SORT, SET with BY on the
sorted data set, and/or further FIRST. and LAST. processing.
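For reference, the traditional pattern being mapped looks roughly like the sketch below. It assumes the TRANS libref is already assigned and reuses the variable names key2 and month from the examples later in this post; the output data set names work.dlq_sorted and work.dlq_flags are hypothetical.

```sas
/* Traditional BASE pattern: sort first, then BY-group processing. */
/* Assumes the TRANS libref is already assigned.                   */
proc sort data=trans.dlq_month1 out=work.dlq_sorted;
  by key2 month;
run;

data work.dlq_flags;
  set work.dlq_sorted;
  by key2 month;
  first_in_key = first.key2;  /* 1 on the first record of each key2 group */
  last_in_key  = last.key2;   /* 1 on the last record of each key2 group  */
run;
```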
In SASIMSH, "Proc Sort" is NO LONGER explicitly
supported as a syntax invocation. However, the act of sorting is definitely
supported. The naming/syntax invocation is now changed to PARTITION. Below are
two code examples.
Example 1: partition under the SASIOLA engine
" libname dlq SASIOLA START TAG=data2;
data dlq.in(partition=(key2) orderby=(month) replace=yes);
set trans.dlq_month1
trans.dlq_month2;
run; "
- At the LIBNAME statement, specifying SASIOLA continues the spirit of SAS/ACCESS engines; you have probably named an engine such as Oracle (through SAS/ACCESS) or an older SAS version such as V8 at a LIBNAME statement before. The START option is unique to SASIOLA (IOLA stands for input-output LASR). It simply tells SAS to launch the SASIOLA in-memory server (you can release the library or shut down the server later). TAG= is critical and is required. One reason is to reference the data properly once it is loaded into memory. Another, related reason is to avoid potential 'collisions' when multiple actions are happening in the memory space. Also, when loading data from, say, Hadoop, where the source has many layers of locations, the two-level naming restriction of traditional BASE is no longer sufficient; TAG= allows for a longer specification.
- The SET statement can still be used to stack data sets. There is still no limit on how many data sets you can stack; the salient concern is sizing: how much memory is needed to accommodate the combined input data sets, a concern you care far less about when running a SET statement in traditional BASE code. Also noteworthy is that multiple SET statements are no longer supported in the SASIOLA data step, although you can stack multiple input data sets with a single SET statement. An interesting question is: how much do you still need multiple SET statements in this new in-memory computation context?
- Under the SASIOLA engine, sorting happens through the data set option PARTITION=: partition=(key2) is, logically, the same as "proc sort; by key2; run;". However, this evolution is more than just a switch in syntax or naming. It reflects a fundamental difference between analytical computing centered on Hadoop (SASIMSH) and traditional SAS BASE. Hadoop is parallel computing by default. If you are running SASIOLA on a 32-node Hadoop environment, partitioning naturally tries to load different partitions across the 32 nodes, instead of jamming all the partitions sequentially into one single data set, as is the case with PROC SORT. PARTITION= puts records pertaining to the same partition on the same node (there is indeed optimal padding/block size to consider). Accessing the partitions later is, by design, meant to happen in parallel; some of us call it bursting through the memory pipes. This is very different from SAS BASE, where you grind through observations one by one.
- As we should have learned from PROC SORT, the first variable listed at PROC SORT is typically there to group, not to order; if you list only one variable for PROC SORT, you should care only about grouping. For example, if the variable is account_number or a segment label, analytically speaking you rarely need to order by the variable values in addition to grouping by them. But PROC SORT in most cases orders the observations by the sorting variable anyway. This is no longer the case with partitioning under SASIOLA, or SASIMSH in general.
- Similar to PROC SORT, with SASIOLA 1) you can list as many variables as you see fit at PARTITION=; 2) the order of the variables listed still matters; 3) the same common sense applies: the more variables you list, the less analytical sense it makes, necessary as it may sometimes be.
- You can use PARTITION= as a data set option on the input data set as well. My preference is to use it as a 'summary' at the output data set. There are cases where partitions rendered at the input are 'automatically'/'implicitly' preserved in the output, and cases where the preservation does not happen.
- ORDERBY= is interesting. If you specify ORDERBY=, the ordering happens within the partitions. When you apply "PROC SORT; by key2 month; run;" and you have multiple entries of month=JAN, for example, using first.key2 later does not pin down the record for you unless you add at least one more variable to the BY statement. This remains the case with "partition=(key2) orderby=(month)" under SASIOLA. If, however, the later action is a summary by the two variables, running "proc means; by key2 month; run;" will yield different results from running the summary under SASIMSH (PROC IMSTAT, to be specific), because in PROC IMSTAT only the variable key2 is effectively used and the ORDERBY variable month is ignored.
- REPLACE=YES: a concise way to effect "proc delete data=dlq.in;" or "proc datasets lib=dlq; delete in; run;". This carries an obvious Hadoop flavor.
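The point about ties under ORDERBY= can be seen in traditional BASE terms. The sketch below is only an illustration: the data set names and the values in work.dlq_demo are made up, and month is coded as a number for simplicity. Group key2=A has two records with the same month, so sorting by key2 and month does not by itself decide which of them first.key2 will pick.

```sas
/* Hypothetical toy data: key2=A has two records with month=1 (a tie). */
data work.dlq_demo;
  input key2 $ month x1;
  datalines;
A 1 10
A 1 20
A 2 30
B 1 40
;
run;

proc sort data=work.dlq_demo out=work.dlq_sorted;
  by key2 month;
run;

/* first.key2 keeps one record per key2 group, but which of the two  */
/* tied month=1 records comes first is not decided by the BY list.   */
data work.first_per_key;
  set work.dlq_sorted;
  by key2 month;
  if first.key2;
run;

proc print data=work.first_per_key;
run;
```

Adding one more variable to the BY statement (x1 here, for instance) would make the choice of first record explicit.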
Example 2: partition using PROC IMSTAT under SASIMSH
"PROC IMSTAT ;
table dlq.in;
partition key2/orderby=month;
run ;
table dlq.&_templast_;
summary x1 /partition;
run; "
- This example has pretty much the same end result as example 1 above, as far as partitioning is concerned.
- The key difference is in their 'way of life'. While both examples represent genuine in-memory computation, example 1 resembles traditional SAS BASE batch action and example 2 is true interactive programming. In example 2, within one single invocation of PROC IMSTAT, the analyst can use RUN statements to divide the whole stream into different sub-scopes, where the 'slice and dice', exploration (like the summary action), and modeling (not shown here) happen WHILE the data tables are 'floating in memory'.
- In both examples, none of the resulting data sets are saved to disk. They all end up parked in memory. There is a SAVE statement that allows the data to be 'downloaded' to disk.