SAS® In-Memory Statistics for Hadoop ("SASIMSH")
is a single interactive programming environment for analytics on Hadoop
that integrates analytical data preparation, exploration, modeling and
deployment. It contains PROC IMSTAT and the SAS LASR Analytic Engine
(SASIOLA). SASIOLA is essentially 'BASE' for SASIMSH, whereas PROC IMSTAT
covers the full analytical cycle from BASE, through modeling, towards
deployment. This duality continues SAS's long-standing tradition of 'getting
the same job done in different ways', to accommodate users' different styles,
constraints and preferences.
This post is one of several I plan to publish that map key analytical
data exercises from traditional SAS programming to SASIMSH. Today's post
covers sorting, sort-related BY processing, and ordering in SASIMSH.
Almost every modeler/analyst who has ever prepared data for
modeling using SAS tools is familiar with PROC SORT, SET with BY on the
sorted data set, and first. and last. processing.
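For reference, the traditional pattern looks roughly like the sketch below. The data set work.dlq is a hypothetical input, and the variables key2 and month are borrowed from the examples later in this post; adjust the names to your own data.

```sas
/* Assumed input: work.dlq with variables key2 and month (illustrative names). */
proc sort data=work.dlq out=work.dlq_sorted;
  by key2 month;
run;

/* SET with BY on the sorted data makes first. and last. available. */
data work.dlq_flagged;
  set work.dlq_sorted;
  by key2 month;
  first_in_key = first.key2;  /* 1 on the first row of each key2 group */
  last_in_key  = last.key2;   /* 1 on the last row of each key2 group  */
run;
```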
In SASIMSH, PROC SORT is no longer explicitly supported as a syntax
invocation. However, the act of sorting is definitely supported; the
syntax is now PARTITION. Below are two code examples.
Example 1: partition under the SASIOLA engine
" libname dlq SASIOLA START TAG=data2;
data dlq.in(partition=(key2) orderby=(month) replace=yes);
set trans.dlq_month1
trans.dlq_month2;
run; "
- At the LIBNAME statement, specifying SASIOLA continues the spirit of SAS/ACCESS drivers; you have probably specified SAS/ACCESS to Oracle, or an engine for an older version of SAS such as V8, at a LIBNAME statement before. The option START is unique to SASIOLA (IOLA standing for input-output LASR). It simply tells SAS to launch the SASIOLA in-memory server (you can release the library or shut down the server later). TAG= is critical and is required. One reason is to reference the data properly once it is loaded into memory. Another, associated reason is to avoid potential 'collisions' when multiple actions are happening in the memory space. Also, when loading data from, say, Hadoop, where the source has many layers of locations, the two-level naming restriction embedded in traditional BASE is no longer sufficient. TAG= allows for a longer specification.
- The SET statement can still be used to stack data sets, and there is still no limit on how many data sets you can stack. The salient concern is sizing: how much memory is needed to accommodate the combined input data sets, a concern you care about far less when running a SET statement in traditional BASE code. Also noteworthy is that multiple SET statements are no longer supported in the SASIOLA DATA step, although you can list multiple input data sets in a single SET statement. An interesting question is how much you still need multiple SET statements in this new in-memory computation context.
- Under the SASIOLA engine, sorting now happens through the data set option PARTITION=: partition=(key2) is, logically, the same as "proc sort; by key2; run;". However, this evolution is more than just a switch of syntax or naming. It reflects a fundamental difference between analytical computing centered on Hadoop (SASIMSH) and traditional SAS BASE. Hadoop is parallel computing by default. If you are running SASIOLA on a 32-node Hadoop environment, partitioning naturally tries to load different partitions across the 32 nodes, instead of jamming all the partitions sequentially into one single data set, as is the case with PROC SORT. PARTITION= puts records pertaining to the same partition on the same node (there is indeed an optimal padding/block size to consider). Accessing the partitions later is, by design, to happen in parallel; some of us call it bursting through the memory pipes. This is very different from SAS BASE, where you grind through observations one by one.
- As we should have learned from PROC SORT, the first variable listed at PROC SORT is typically there to group, not to order; if you list only one variable for PROC SORT, you should care only about grouping. For example, if the variable is account_number or a segment label, analytically speaking you rarely need to order by the variable values in addition to grouping by them. But PROC SORT in most cases orders the observations by the sorting variable anyway. This is no longer the case with partitioning under SASIOLA, or SASIMSH in general.
- Similar to PROC SORT, with SASIOLA 1) you can list as many variables as you see fit at PARTITION=; 2) the order of the variables listed still matters; and 3) the same sense and sensibility applies: the more variables you list, the less analytical sense it makes, although doing so is sometimes still necessary.
- You can engage PARTITION= as a data set option for an input data set as well. My preference is to use it as a 'summary' at the output data set. There are cases where partitions rendered at the input are 'automatically'/'implicitly' preserved into the output, and cases where the preservation does not happen.
- ORDERBY= is interesting. If you specify ORDERBY=, the ordering happens within the partitions. When you apply "PROC SORT; by key2 month; run;" and you have multiple entries of month=JAN, for example, using first.key2 later does not pin down the record for you unless you add at least one more variable to the BY statement. This remains the case with "partition=(key2) orderby=(month)" under SASIOLA. If, however, the later action is a summary by the two variables, running "proc means; by key2 month; run;" will yield different results from running the summary under SASIMSH (PROC IMSTAT, to be specific), because in PROC IMSTAT only the partition variable key2 is effectively used and the ORDERBY variable month is ignored.
- REPLACE=YES: a concise way to effect "proc delete data=dlq.in; run;" or "proc datasets lib=dlq; delete in;
run;". This carries obvious Hadoop flavor.
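For comparison, a traditional BASE counterpart of Example 1 might look like the sketch below. This is only an illustration: the output name work.in and the analysis variable x1 are assumptions made here, while the trans libref, key2 and month come from Example 1.

```sas
/* Stack the two monthly tables (same inputs as Example 1). */
data work.in;
  set trans.dlq_month1
      trans.dlq_month2;
run;

/* Group by key2 and order by month within each key2 group. */
proc sort data=work.in;
  by key2 month;
run;

/* BY-group summary: one set of statistics per key2/month combination. */
/* x1 is an assumed numeric analysis variable.                         */
proc means data=work.in;
  by key2 month;
  var x1;
run;
```

Note that the PROC MEANS step here summarizes by both key2 and month, which is exactly where the results can differ from a partition-based summary that effectively uses key2 only.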
Example 2: partition using PROC IMSTAT under SASIMSH
"PROC IMSTAT ;
table dlq.in;
partition key2/orderby=month;
run ;
table dlq.&_templast_;
summary x1 /partition;
run; "
- This example has
pretty much the same end result as example 1 above, as far as partitioning
is concerned.
- The key difference is in their 'way of life'. While both examples represent genuine in-memory computation, example 1 resembles traditional SAS BASE batch action and example 2 is true interactive programming. In example 2, within one single invocation of PROC IMSTAT, the analyst can use RUN statements to scope the whole stream into different sub-scopes, where the 'slice and dice', exploration (like the summary action), and modeling (not shown here) happen WHILE the data tables are 'floating in memory'.
- In both examples, none of the resulting data sets is saved to disk. They are all eventually parked in memory. There is a SAVE statement that allows the data to be 'downloaded' to disk.