Analytics in Writing: 2014

Monday, December 15, 2014

SAS High Performance Analytics and In-Memory Statistics for Hadoop: : Two Genuine in-Memory Math Blades Working Together

SAS In-Memory Statistics for Hadoop ("SASIMSH" ) and SAS High Performance Analytics Server (“SAS-HPA”) are two key in-memory advanced analytics /modeling products SAS Institute (“SI”) offers today. Both support Hadoop as data source server: SASIMSH is Hadoop centric while SAS-HPA supports Hadoop, Teradata and others. While both have its own in-memory data management capabilities, there are applications and efficiency scenarios where one is engaged to build out data sets to share between the two.

This post shows how to integrate (modeling) data sets from local SAS clients, SASIMSH’s LASR server and HPA, with a Hadoop cluster (“the cluster”) serving as central data repository. Some performance details are also shown.

1. The cluster and other system features

The cluster has 192 nodes, with 96GB RAM each (CPU and other details on the individual nodes are unknown, but are of ‘2014 standard level’). Only 14 nodes are engaged for the exercises; all the performance details are conditional upon this many nodes
All the performance details are with other users’ concurrent jobs running at the cluster
The local SAS client is a virtual machine /cloud running 64-bit Windows 2008 R2 standard server with 32 GB RAM, Intel Xeon(R) CPU, X7560 @2.27 GHz, 2.26 GHz (4 processors). Not that these details are critical. Just so you know. It has some relevance when you load data from traditional SAS sources onto LASR servers. Jobs running on the client have no other concurrent jobs running while reporting the performance details.

2. The action on LASR server

In the picture ("Pic 1") above,

2.1 This is SAS Studio, not SAS Editor or SAS EG

2.1 "LIBNAME bank2":
2.11. SASHDAT is system reserved engine label, specifically indicating that the library being established and pointing to is a Hadoop file co-location. Explaining 'co-location' in great detail is beyond the scope of this post. For now, imagine Hadoop is where big data chunk is stored. SASIMSH and SAS-HPA are like (math) blades sitting along side Hadoop. Stick the blade into Hadoop, lift the data into memory, get the math job done and put results back to Hadoop if any (sometimes you just take the insights or score code without writing back to Hadoop)
2.12. SERVER= is just your host. Path= CAN supports as many slashes/directory levels as you like.

2.2 "LIBNAME bank3": just your regular SAS local directory.

2.3 "goport=1054": you pick a port number to ask for allocation of a slice of memory space (which in this case is collective, 96GB*14, -/~) for your action. As of today, this number cannot exceed 65535 and must not have been reserved: if you just ran this port to create a LASR server with this port number, you (or somebody else) need to terminate that server to release the port number (and, YES, destroy everything you did within that previous in-memory session. You will see the benefits of doing so later) if you want to use the same number again. Of course you can use a different number, if it is available. A good (tactical) habit (with strategic implication for having a good life with in-memory analytics) is to use a limited set of numbers as ports. One obvious reason is that in memory like this is not to use the memory space to mainly store (the huge) data chunk. One logical, associated question therefore is how fast it is to load/reload (big) data chunk into LASR server from the client, or from the Hadoop co-location. ("if it takes forever to load this much, I have to park it". Sounds familiar?). You will see how fast it takes to load in both ways, shortly.

2.4 "outdsn=...; ": I declare a file location the library of which is yet to be set. That is not problem, as long as you set the library before eventually USING it. You can put everything between = sign and ; and it will not fail you.

2.5 "PROC LASR Create": This is to create a new LASR server process, with the port number. Path=/tmp is similar to temp space in regular SAS session.

2.6 "Lifttime=7200": I want the server to cease in two hours.
2.7 "Node=all": all the 14 nodes available. Not the 192 nodes physically installed. Either your administrator caps you, or you can manually set Node=14.

2.8 "PROC LASR Add": this is to load the data from Hadoop library BANK2 (the co-location) into the LASR server defined by the port number.

The data set: 300 million rows, ~45 variables. 12 monthly transactions combined, 25 million rows/month. Total file ~125 GB.

The WHERE statement conditions out the last 7 months while loading the input into memory. Note: there is no SAS Workspace in LASR operation.

The picture below shows that the loading takes 6.54 seconds, most of which is CPU time. Which is desirable. Which implies 'data piping' and trafficking takes little time.

3. Load local data onto the LASR server

In the pic ("Pic 2") above,

3.1 "Libname bank22": sasiola is system reserved label indicating that library BANK22 is a SAS LASR input-output (IO) LASR library. Plain English: use this engine/library to input SAS data sets from outside LASR library. TAG: this short-name option is key to handle file system like Hadoop where directory can easily breach the two-level limitation regular SAS library-data set can support. Within the quotes, the directory can have as many slashes or levels as needed.

3.2 "Data...": syntax wise, this is just regular SAS data steps. I am stacking the last 7 month data sets from two different local SAS libraries, Bank3 and Bank4. T_ is least minimum action you should take to make additional analytical use of any date data you have in your data set. Most (not all) data steps in SAS BASE are supported.

As shown in the picture below, the log shows ~19 minutes of the loading act

The CPU time is not much, indicating piping and trafficking takes time. Considering this is loading from local C drive and D drive of the virtual machine Windows client, this is expected. Notice: the resulting 5 month file now resides in the same in-memory library Bank22.

SASIOLA engine apparently is key gateway connecting traditional SAS data sourcing with in-memory LASR servers. However you have been building your data sets, you can load it to LASR server via SASIOLA. You can definitely originate your data sources directly within LASR.

4. Append /Stack data set in SASIMSH.

The picture below ("Pic 3") shows how the two data sets, one directly loaded from Hadoop cluster, one loaded off local SAS, are stacked using LASR server in-memory.

4.1 Unlike SAS BASE steps, the table listed at the TABLE has to be one of existing tables; it is called ACTIVE table. The SET action here is defined as 'to modify the active table'.

4.2 The table listed at SET statement to modify inherits the active table's library; as this version of SASIMSH stands, one has to load both tables into the same library to perform the act; spelling out library name in front of the data set name will cause error.

4.3 Drop option at SET statement is very nice considering you may very well be setting gigantic data onto the active table: YES, space management remains essential during in-memory analytics.

And it takes, how long? 247 seconds.

4.4 "Save...": after the appending act is over, write the resulting 12 month data set back to Hadoop cluster. Copy and Replace are ordinary Hadoop options. (1 copy and replace existing file with the same name). Recall: the Hadoop cluster directory is where SAS-HPA picks up data for its operations. And the writing time is included within the 247 seconds elapsed.

5. Terminate LASR server. After the work is done, the LASR server is terminated, as shown in the picture below.

Now you can launch HPA procedures such as HPSUMMARY to run additional analytics.

Final notes:

SAS-HPA has its own data management facilities such as HPDS2 that has strong OOP (Object Oriented Programming) orientation and flavor. Which is more native to traditional formats such as ANSI-SQL
Both HPDS2 and SASIMSH support traditional SAS BASE step syntax, with a few exceptions.
There are use cases where you prefer parking large data set in ported LASR servers. For example, if, instead of stacking 7 monthly files, you are stacking 24 monthly files. That may take >>20, 18 minutes. You can conceivably stack the 24 data sets, share the port with others who want to consume the resulting one big data set. They just set proper library to access it.
The central philosophy of SASIMSH, by way of fast loading and clean termination, is that against the reality of data changing instantly and much, one may have desire to build more and more analytics in high pace. Be flashy. Classic such example is digital marketing analytics where input data are often live, highly interactive social event streams.
SASIMSH is resource sensitive. In another separate environment where 47 nodes with 96GB/ node, all steps were finished with a minute. 96GB RAM per node, by 2014 standard , is under par. Today the standard RAM equipped with each node is 512 GB, or at least 256 GB. If you want to read 2TB data from Hadoop, of course you need stronger and more node machines. As practical matter, not a general statement, I have observed that virtual machines as node member on the cluster, run obviously more slowly than if the nodes are real, physical machines.

Friday, October 17, 2014

SAS In-Memory Statistics for Hadoop: Using PROC IMSTAT to Transpose Data

SAS® In-Memory Statistics for Hadoop ("SASIMSH") is a single interactive programming environment for analytics on Hadoop that integrates analytical data preparation, exploration, modeling and deployment. It contains PROC IMSTAT and SAS LASR Analytic Engine (SASIOLA). SASIOLA essentially is 'BASE' for SASIMSH, whereas PROC IMSTAT covers the full analytical cycle from BASE, through modeling, towards deployment. This duality continues SAS's long standing tradition of 'getting the same job done in different ways', to accommodate users' different style, constraints and preferences.

This post is one of several upcoming posts I plan to publish soon that discuss code mapping of key analytical data exercises from traditional SAS programming to SASIMSH. This post today covers using PROC IMSTAT to transpose data.

Prior to IMSTAT being available, SAS users typically either write data step code or engage PROC transpose to transpose data. In PROC IMSTAT, PROC transpose is not supported (of course, because there is no way another procedure is supported to run underneath and within a procedure like PROC IMSTAT, or any SAS procedure. There is no statement in PROC IMSTAT to explicitly do what PROC transpose does either). However, PROC IMSTAT CAN be programmed to transpose. This post provides one such example.

Code part 1: Set up in-memory analytics SAS libraries for Hadoop + loading

Below is code part 1 screen shot

The first LIBNAME statement sets up library reference to a Hadoop cluster by the IP address of sas.com (I cleaned up confidential address info), which I am using 14 parallel nodes out of >180 nodes in total.
At this design interface, the analyst (me) does not care if the underlying Hadoop is Cloudera or Horton Works or somebody else. IMSTAT is designed NOT to program differently simply because the underlying Hadoop is 'from a different flavor of Hadoop'. Bank2 is the name for the SAS library, as if you are just setting one up for regular SAS work or access to Oracle, Green Plum DB or Teradata DB. The naming convention is the same.
SASHDAT is the quintessential differentiator for SAS in-memory advanced analytics. It appears just like regular SAS Access driver. It actually is, but much more in the interest of enabling genuine in-memory analytics; somebody can write a book about this 'driver' alone. For now, it simply allows us to enable SAS in-memory data access onto Hadoop.
Path= is a directory management structure, little new.
From the DATA statement,

I am simulating 12 months of variable X data for 1 million IDs. Simple.
I am asking SAS to write the 12 million rows directly to Hadoop in the SASHDAT format. In doing so, I also request that the resulting data set &indsn be sorted by ID (the partition= data set option) and further sorted by MONTH (orderby=). The fact with 'sorting' in a parallel system like this, though, is that sorting (with the partition= option) is actually grouping: sorting by ID actually is just placing records with the same ID values together; they do not collate any more (meaning sequencing groups in descending or ascending value of the variable ID like BASE SAS is doing with PROC SORT) . Since later access to the grouped records will be parallel, why spending more time and resources ( a lot if your data set is huge) asking them to further collate after initial grouping? Orderby= option is to add collation WITHIN grouped groups. The notion of using partition= AND orderby= is the same with "PROC SORT; by ID month;", but the physiology and mechanism of the act is different, moving from a 'single server' mode to parallel world (or from SMP to MPP)
Also, the partition and orderby options are supported as data set options (in a typical MPP Hadoop operation as such, likely these two data set options are only supported at output data sets, not at input data set. Why?), whereas in regular BASE SAS operation, the analyst has to call up separate PROC SORT to do it. This luxury is affordable now because this is IN-MEMORY, this is no longer disk swapping (Imagine more and you will enjoy this more, I promise)
The Replace=: this is Hadoop finger print. If the destination library is set up to point to other DB such as ORACLE or Teradata, or any DBMS-like, this option does not work. As Hadoop is towards more 'managed' (not necessarily towards DBMS, but Yarn, Kerberos...) this may or may not change. Not shown here, but I recall another option is Copy=. The Copy option simply tells Hadoop how many copies it should make for the data set you are dumping into its laps.

At "PROC LASR",

This block is creating one in-memory working session by requesting allocation of port 10002 (if available at the time I submit the request) as "access, passing lane" to the server, by requesting directory "/tmp" as "operating/temp parking ground'.
Lifetime =: tells IMSTAT that after 72000 seconds, trash the whole session and everything within it. Noclass=: not required, has something to do with categorical variable loading
Performance nodes=ALL: means all Hadoop nodes that are available and policy-permitted by whoever is in charge of the 'box'
The second "PROC LASR" has ADD option asking that the table I wrote to the target Hadoop cluster be loaded into the memory session I just created. As you may learn more and more about 'loading data into memory', this is only one of many loading methods. The analyst, for example, certainly can load many SAS7BDAT data sets sourced by all kinds of traditional venues and practices (PROC SQL, BASE...) by using another in-memory engine SASIOLA. There are also data loader products from SAS as well. Once the table is loaded into memory, it parks there until the server is terminated.
Notice: this session implicitly inherits the server from previous lines, if not directed or specified otherwise. The port, however, must be explicit while trying to load data into it. The session is port specific and the port is generic: port number 10002 means the same to whoever is using it.

Code Piece 2: Reference the loaded file in-memory, build code to transpose

Below is code part 2 screen shot

The LIBNAME TRANS engages SASIOLA engine to call up the file just loaded into the memory space by pointing to the directory where the in-memory session is set up. This LIBNAME statement uses tag= option to represent this reference for later processing. This tagging act is to respond to the fact Hadoop system typically has system of directories, which is worsen by the fact Hadoop systems often run off Linux-like OS which per se has limitless directories. Traditionally SAS products support two levels such as mysas.in. Tagging therefore is in place.
The "PROC FCMP" portion is entirely my doing that does not much to generalize. I show how one can generate code in this way. You can certainly type or copy to make your own reference code file.
The ARRAY details in the middle of FCMP should be straightforward. I am sure you can make it more sophisticated, implicit (or mysterious). The point is to show one basic approach to transpose data with IMSTAT. Noteworthy is the __first_in_partition and __last_in_partition. This is nothing but your familiar first.ID and last.ID. Their invocation certainly depends on the data set being partitioned/sorted (where did it happen in this post? Did it already happen?)

Code Piece 3: Using PROC IMSTAT to tranpose/score the data

Below is code part 2 screen shot

As many SAS users often say, the last piece is the easiest. Once the code is built, you can run it through the SCORE statement. To use the __first_in_partition and __last_in_partition, you MUST specify partition option value at the SCORE statement. In this way, IMSTAT will go search for the partition table created while partition= and orderby= options were effected (this certainly is not the only way to partition and order). FETCH is similar "PROC PRINT".

The last "PROC LASR" with TERM is to terminate the in-memory session once the job is done. This is one important habit to have with SAS LASR centered in-memory analytics, although not a technical requirement. Lengthy discussion on this subject belongs to another post.

Here are some log screen shots

Generating 12 Million Records, 3 Variables, 14 Seconds

Load 12 Million Records, 3 Variable to Memory, 2.6 Seconds

Transposing takes 5 Seconds

TRANSPOSED !

October 2014, from Brookline, Massachusetts

Thursday, October 16, 2014

SAS In-Memory Statistics (IMSTAT) for Hadoop Overview

This post provides overview of IMSTAT, with a little associated coverage of the SASIOLA facilities.

You certainly read through IMSTAT details here at sas.com. Below is a summary picture many have liked better, to capture features and functions of PROC IMSTAT. It covers the latest as of Q1 of 2014, but the spirit and gist remain the same since.

Some comments about the 'total concept' first
1. If you are familiar with SAS products and solutions, you are used to seeing BASE (programming), STAT (statistical sciences), ETS (Econometrics), OR (operations research), EM (enterprise data mining including machine learning, text mining and statistics), EG (enterprise guide) and MM (model management). Another line of SAS in-memory products still largely follow this set of convention. For example, HP Statistics (high performance counterpart of STAT), HPDM (high performance counterpart of EM) and so on. You are used to seeing long list of procedures under each product or package.

Now, conceptual 'shock' #1 is all these features listed in this IMSTAT picture are grouped under ONE procedure. Yes, IMSTAT is one procedure and one procedure only, with so many features

2. Why this change?

If you use any of the traditional SAS products mentioned above, you know to get the work on hand done, you likely engage a very small set of procedures, functions and statements afforded by a specific product that you have license for. For example, I myself have been using STAT since ~1991, but still ~ half of the procedures under STAT remain stranger to me. I don't recall having known anybody who uses all of the BASE capabilities either. On the contrary, I know friends who have held SAS jobs for many years. They are experienced, but have only known just half a dozen procedures.
The reality though is it is not cost possible for a software developer to build just a few procedures for one company and build another small set for another company.

One way SAS has to address this (price and value) gap is software on demand offerings, in-depth discussion of which is beyond the scope of this post. Another way is to redesign package in such a way that all the essential features and functions to get analytical jobs done are built in and integrated. The next immediate question is: which to pick and chose from which existing packages? Apparently, from elementary 'can do' perspective, it is hard to imagine many things that SAS cannot do with its existing offerings. In many cases, the challenge is how, not if. Still, a coherent organizing theme is needed to build the new piece. Good news is such piece has existed for many years.

The BLUE spoke in the center of the picture presents a diagram of modern analytical life cycle, from problem definition, data preparation and exploration, through modeling, to deployment and presentation. PROC IMSTAT features and functions are organized and developed by this framework. In other words, PROC IMSTAT has collected core functions and features from SAS software families, to optimize against needs and challenges confronting analytical users in Hadoop world. Many 'pre-existing' features have been distilled and streamlined while being moved to the in-memory platform.

PROC IMSTAT is expanding rapidly to accommodate ever changing Hadoop world.

Some comments about PROC IMSTAT
1. Major pure feature and function addition actually are on the right side, the recommender engine. Everything else essentially has been in existence in SAS software family in some format or style.
2. All the entries listed under a box header label (such as Data Management to the top left corner) are IMSTAT statements. To access the statements, the user must invoke "PROC IMSTAT" first. Unlike many traditional SAS procedures where one has to invoke procedures many times, once PROC IMSTAT is invoked once, the user can invoke the same statement again and again, as the job deems necessary
3. User who are very familiar and deep on some SAS procedures may find that some features, reduced from a regular SAS procedure (for example, CORR statement stems from PROC CORR) to IMSTAT statement, no longer have that many options as their counterpart procedures have. This, in part, is because the procedure has been reduced to a statement. Another reason is more strategic design; the reduction or left-out is intentional: do you really think it makes sense to run all those distance options under the cluster statement on now much bigger data sets?
4. Some statements actually are mini-solution. GroupBy statement, for example, is "in-memory cube builder " or a genuine OLAP killer, while it appears like a small statement

I plan to publish specific use case to help better understand how IMSTAT works. Thanks.

October 2014, from Wellesley, Massachusetts

Sunday, October 5, 2014

SAS High Performance Finite Mixture Modeling, HPFMM: Introduction by One Example

SAS Institute released HPFMM procedure with its High Performance Statistics 13.1, after the regular FMM procedure was made available through regular STAT package for years.

This post is not intended to show or discuss how to do FMM. A solid, yet simple discussion of FMM with excellent SAS concentration can be found in professor Malthouse's book. There is also a excellent blog on regular FMM procedure usage. A SAS developer of the FMM procedure also did a video on FMM.

This post is to showcase computational power of the new HPFMM procedure over the regular FMM procedure. For those who have little exposure to FMM practice, but need to compute on bigger data sets with the method, this post covers essential aspects to get on with FMM, since regular PROC FMM and HPFMM overlap a lot.

This is the first time I have used the new SAS Studio to build blogs here. I will cover some SAS Studio as well.

Introduction

What is FMM: in plain English, "a variable is distributed in such fashion you believe it is mixture of components"
How FMM varies, to name a few ways,

By component type: categorical, interval... leading to very different governing/underlying distributions. One sub-direction in this vein is observed or unobserved where you can find intellectual connections to some canonical analysis
By interrelationship among components: same variance, mean?
By if and how mixing interacts with components: share covariates? determine / estimate simultaneously?
how 'weird' the distorted distribution is: you can fit and match to several major, if not popular 'skewed' distribution class such as zero inflation, Poisson distribution.

Description of data set used for the post

5 million observations, 204 variables. The response variable is binary Y/N, with 'response rate' ~26.76% or 1,338,111=Y. Response is B2C Q & A: "Do you have your desired product now?"
The 204 variables are mostly RFM variables dating back to the past 72 months, + some attributes.

The environment the SAS HPFMM job is running on:

A 192-node Cloudera Hadoop Cluster with SAS HP STAT 13.1 loaded. The specific job is capped at 14 nodes, with each node having >96 GB ? (this number is not important, given this size of the data). FMM rarely has real time or near real time requirement; running faster than what is shown in the post likely does not provide much incremental value.
The job is conducted using SAS HP STAT 13.2, the latest version of the software
The coding is through a virtual Windows client machine with 32 GB local RAM. The client connects to the Hadoop cluster. No FMM processing happens locally on the client.

The code and the editing interface, the new SAS Studio

Running HPFMM at SAS Studio

This is the look of SAS Studio which essentially is SAS enhanced editor, but represents a revolutionary leap-forward from its ancestors in classic SAS Enhanced Editor, Enterprise Guide, or Enterprise Miner. The Studio is web-native, built for collaboration and portability and 'flashy'. It is much better at output rendering, log, message and error handling, and version management + the modern touch and feel. In current analytics world when look does matter sometimes, the Studio is a true contender in stability, integrity, consistency and support.
If you are familiar with SAS procedure programming, the program does not present much surprises; if you are familiar with PROC FMM, you can plug in your existing program, add the Performance statement and start tweaking. If you are just learning how to use SAS FMM facility, starting with PROC FMM and PROC HPFMM should provide similar learning return in speed, functionality and feature, with some exceptions (which you probably don't care)

Still some notes,

The default Output statement setting is no longer copying all the variables over from the input data set, especially any input variable and BY variables, unlike the regular PROC FMM where all are moved over to output data set listed at OUTPUT statement, if ID statement is not engaged. This is rather a contrast between all the HP procedures and their non-HP counterparts, not just with the FMM procedures.
The default maximum iteration is set at 200 or maximum function calls at 2000. I raised them to 5000 and 250000 respectively, just to test, not to suggest the simplex optimization method requires that. As some of you may have experienced, another factor in tweaking is link function. The point here is: if called for, more iterations and function calls now can be handled with the HPFMM procedure fairly quickly
What does not converge under PROC FMM probably still does not converge under HPFMM, eventually. So, exploratory data analysis, design, business domain, prior or posterior knowledge about the model project remain quintessential for success.
Rules of thumb such as "identity link runs much faster than other links" probably still hold, everything else being equal, as far as I have tested.

The HPFMM log,

You can set message level to get more details.
Three traffic lights on top left corner indicate summarily how the program goes.
Carrying ~200 variables into the process with 5 million records (yes, I deliberately decided not to write Keep= or Drop = statement to shrink input variable list, because the time spent on writing that code likely exceeds the time taken to finish the entire job without dropping them, the 27-28 second mark), it finished below 28 seconds.
The gap between CPU time and Real time is still large, percentage wise, but I care less since it is in seconds, unlike regular SAS session where the gap may be in minutes or hours
This is genuine in-memory computation in which the data set is loaded into memory residing with the 14 nodes on the target Hadoop cluster. Once the job is done, data are dropped off the memory space. The resulting output data set, as I wished and specified, is written to the cluster
When using PROC FMM, tackling 5 million rows with 200 + variables is harder to manage; on the 32 GB Windows machine where this blog is being typed right now, I cannot have PROC FMM to finish the job. I have to cut the data set down to the short list of variables to make it through. And it took > 16 minutes.
Overall, you still need to be data efficient. If you have, say, >600K variables (a near reality where a lot of text/unstructured elements are involved in modeling <not necessarily FMM>), you may still consider Keep/Drop variables.

FMM output, the administrative section,

Count of Hadoop nodes involved in the job is first reported
You then see link function details. Some modelers disrespect identity. Well, in this specific case, I have tested, in a total of 10 minutes, that three other link functions do NOT provide better results any way. Respect what the data have to tell you, because IDENTITY is the fastest
All the estimation / convergence is now ML
The class variable details should be exciting simply because now you are running on full sample in so little time.

HPFMM output, the estimation section,

The fit statistics and Z value, pretty straight forward. You can read all the books to appreciate underpinning definitions and rules; I am not sure I am qualified to profess you on the technicalities here (I don't have Ph.D title)
One cautionary note: keep in mind these are computed statistic. Always interpret and use them with the specific data condition in mind. They may not be as 'portable' as you think, which blunt the motivation many decide to leverage mixture modeling, to begin with.

FMM output, the performance section (sort of),

The probability estimate serves as a direction as to how 'well' the model is. Data some time defy what numbers you should put in as K value (or range thereof). In some cases, you have to 'insist' or 'hold on' to your K value even if data support 'better' mixing probabilities otherwise
Professor Malthouse's book, aforementioned, provides great examples using PROC KDE and PROC SGPLOT to assist with K value insights. KDE on large data set can also be practiced with another SAS in-memory product IMSTAT (the KDE statement there) and SAS Visual Analytics (VA).

Thank you.

Autumn of 2014, from Chestnut Hill, Massachusetts

Tuesday, July 15, 2014

SAS In-Memory Statistics for Hadoop: Key Exercises to Jump-Start Long Time SAS Users, Part One

Almost every modeler/analyst who has ever prepared data for modeling using SAS tools is familiar with PROC SORT, Set By on the sorted data set and/or further engaging first. and last. processing.

In SASIMSH, "Proc Sort" is NO LONGER explicitly supported as syntax invocation. However, the act of sorting is definitely supported. The naming/syntax invocation is now changed to PARTITION. Below are two code examples

Example 1 : partition under SASIOLA engine

" libname dlq SASIOLA START TAG=data2;
       data dlq.in(partition=(key2) orderby=(month) replace=yes);
          set trans.dlq_month1
                trans.dlq_month2;
       run; "

At the LIBNAME statement, specification of SASIOLA continues the spirit of SAS Access drivers; you probably have run SAS Access to Oracle or older versions of SAS like V8 at LIBNAME statements before. The option START is unique with SASIOLA (IOLA standing for input-output LASR). It simply tells SAS to launch the SASIOLA in-memory server (you can release the library or shut down the server later) . TAG= is critical and is required. One reason is to reference the data properly once it is loaded into the memory. Another, associated, reason is to avoid potential 'collision' when multiple actions are happening in the memory space. Also, when loading data from, say, Hadoop where the source has many layers of locations, the two-level restriction embedded with traditional BASE is no longer sufficient. Tag will allow for long specification.
SET statement can still be used to stack data sets . Still, there is no limit as to how many data sets you can stack; salient concern is sizing: how much memory space is needed to accommodate the input data sets combined, a concern you care far less when running SET statement in your traditional BASE code. Also noteworthy is that multiple SET statements are no longer supported with the SASIOLA data step, although you can SET multiple input sets with a single SET statement. Interesting question is: how much do you still need to engage multiple SET statements, in this new in-memory computation context?
Now under SASIOLA engine, sorting happens as a data set option PARTITION=: partition=key2 is, logically, the same as "proc sort; by key2; run;". However, this evolution is >> than just syntax or naming/action switch. It reflects fundamental difference between analytical computing centering around Hadoop (SASIMSH) and traditional SAS BASE. Hadoop is parallel computing by default. If you are running SAS IOLA on a 32-node Hadoop environment, partitioning naturally tries to load different partitions cross the 32 nodes, instead of jamming all the partitions into one single partition (sequentially) as is the case with PROC SORT. PARTITION= is to put records pertaining to the same partition on the same node (there is indeed optimal padding/block size to consider) . Accessing the partitions later, by design, is to happen in parallel fashion; some of us call it bursting through the memory pipes . This is very different from SAS BASE where you grind through observations one by one.
As we should have learned from PROC SORT, the first variable listed at PROC SORT typically is to group, not to order; if you list only one variable for PROC SORT, you should care only to group. For example, if the variable is account_number or segment label, analytically speaking you rarely need to order by the variable values, in addition to sorting by it. But PROC SORT in most cases orders the observations by the sorting variable anyway. This is no longer the case with partitioning with SASIOLA or SASIMSH in general.
Similar to PROC SORT, with SASIOLA, 1) you can list as many variables as you see fit at PARTITION=. 2) order of the variables listed still matters 3) same sense and sensibility that the more variables you list, the less analytical sense it makes, still necessary albeit.
You can engage PARTITION= as data set option for input data set as well. My preference is to use it as 'summary' at the output data set. There are cases where partitions rendered at the input are 'automatically'/'implicitly' preserved into the output. There are cases where the preservation does not happen.
Orderby = is interesting. If you specify orderby=, the ordering happens within the partitions.

When you apply "PROC SORT; by key2 months; run;" and you have multiple entries of month=JAN,

for example, using first.key2 later does not pin down the record for you, unless you add at least one more

variable at the BY statement. This remains the case with "partition=(key2) orderby=(month)" under

SASIOLA. If, however, the later action is to do summary by the two variables, running "proc means; by key2

month; run;" will yield different results from running summary under SASIMSH (PROC IMSTAT, to be

specific), because in PROC IMSTAT only the variable key2 is effectively used and the orderby variable month

is ignored.

8. Reeplace =YES: a concise way to effect "proc delete data=dlq.in;" or "proc dataset lib=dlq; delete in;

run;". This carries obvious Hadoop flavor.

Example 2: partition using PROC IMSTAT under SASIMSH,

"PROC IMSTAT ;
       table dlq.in;
           partition key2/orderby=month;
      run ;
       table dlq.&_templast_;
           summary x1 /partition;
      run; "

This example has pretty much the same end result as example 1 above, as far as partitioning is concerned.
The key difference is in their 'way of life'. While both examples represent genuine in-memory computation, example 1 resembles traditional SAS BASE batch action and example 2 is true interactive programming. In example 2, within one single invocation of PROC IMSTAT, the analyst can use RUN statements to scope the whole stream into different sub-scopes, where the 'slice and dice', exploration (like the summery action), and modeling (not shown here) is happening WHILE the data tables are 'floating in memory'
In both examples, none of the resulting data sets are saved onto disk. They are all eventually parked in memory space. There is SAVE statement that allows the data to be 'downloaded' to the disk.

In next posts, I will cover transpose and retain actions. Let me know what you think. Thanks.

Thursday, January 23, 2014

Using SAS Forecasting Server for CCAR Macroeconomic Forecasting

Comprehensive Capital Analysis and Review, or CCAR, is one of Federal Reserve’s capital planning initiatives for big banks. The Federal Reserve has published broad contours of the baseline scenario, the adverse scenario, and the severely adverse scenario. Forecasting macroeconomic variables under these different scenarios serve as starting points for loan loss prediction, financial portfolio modeling and comprehensive reporting under CCAR. The Fed Reserve also provided updated historical time series accompanying the scenarios (see www.federalreserve.gov/bankinforeg/stress-tests-capital-planning.htm).

This post shows how to use SAS Forecasting Server (FS) to build 28 time series forecasting from the link above. It took me a little over 2 hours to get the 28 models set up and run with FS including data prep. The SAS code program "fedmodeifed2.sas" in the appendix of this post converts original charter data types into numeric data type, and creates quarterly time ID necessary to model with FS.

After creating new project, point to the data set that has all 28 time series

Decide to run 'pure' time series forecast or to include explanatory variables

Great flexibility to decide which subset of the series to be used for forecast. This is a huge productivity boost since the modeler does not need to toggle back and forth between the design window and data steps, or carry many different conditioning flags in the data set

Point and click to access 47 model selection criteria. Slide validation data window to suite your needs

A set of quality models are built in 2 minutes. You can copy the built models, make changes to vary into different criteria, validation, measurement combinations for quick comparison. The model building process produces diagnostics, residual plots, whitening capability and transformations, among others. You can design event variables to adjust forecast by business cycles and external shocks/treatment. FS also supports custom overrides (although in many veins manual overrides are not recommended)

SAS Forecast Server essentially is GUI counterpart of the renown SAS ETS software. It is built and optimized towards the latest trend in forecast theory and practice towards forecast data mining. The productivity boost and model documentation are the top reasons I fall in love with FS. I spent about 2 hours and 20 minutes to build up all 28 series. They are baseline and you certainly can and should spend more time polishing them. Below is a summary of the 28 models. My favorite method is UCM

Enjoy!!

Appendix: SAS code to clean up original data for FS modeling, fedmodeifed2.sas"

     "
     libname ccar "c:\sasdata";
    %let indsn =ccar.ccartest2; /*use Import facility to read in FED spreadsheet directly*/
    %let outdsn =sashelp.fedmodified2;
    data &outdsn.;
         set &indsn. ;
        length Q2 $2 y2 $4 ;
        Q2 =substr(compress(OBS), 2,1);
        Y2 =substr(compress(OBS), 3,4);
        YYQ =yyQ(y2,Q2);
        format yyq yyq6.;
        if _n_<=151;

        rename _0_year_Treasury_yield=Ten_year_Tre_yield
                   __month_Treasury_rate =Three_mon_Tre_rate
                     __year_Treasury_yield=Five_year_tre_yield;

                   /*the three raw names above cause problems with FS without renaming*/

                    DJTSMI=Dow_Jones_Total_Stock_Market_Ind*1;
                    Developing_Asia_Inflation2 =Developing_Asia_Inflation*1;
            Developing_Asia_Real_GDP_Growth2=Developing_Asia_Real_GDP_Growth*1;
            Euro_Area_Bilateral_Dollar_Exch2=Euro_Area_Bilateral_Dollar_Excha*1;
            Euro_Area_Inflation2=Euro_Area_Inflation*1;
            Developing_Asia_Bilateral_Dolla2=Developing_Asia_Bilateral_Dollar*1;
            Market_Volatility_Index__VIX2=Market_Volatility_Index__VIX_*1;

            /*the original data are very clean and regularized. There is no need to write fancy    code to do the type conversion. Just get it done*/

       drop Developing_Asia_Inflation
       Developing_Asia_Real_GDP_Growth
       Euro_Area_Bilateral_Dollar_Excha
       Euro_Area_Inflation
       Market_Volatility_Index__VIX_
      Dow_Jones_Total_Stock_Market_Ind
      Developing_Asia_Bilateral_Dollar
      Q2 y2;
run ;

"