Friday, October 17, 2014

SAS In-Memory Statistics for Hadoop: Using PROC IMSTAT to Transpose Data

SAS® In-Memory Statistics for Hadoop ("SASIMSH") is a single interactive programming environment for analytics on Hadoop that integrates analytical data preparation, exploration, modeling, and deployment. It contains PROC IMSTAT and the SAS LASR Analytic Engine (SASIOLA). SASIOLA essentially is the 'BASE' for SASIMSH, whereas PROC IMSTAT covers the full analytical cycle from BASE, through modeling, to deployment. This duality continues SAS's long-standing tradition of 'getting the same job done in different ways' to accommodate users' different styles, constraints, and preferences.

This post is one of several I plan to publish that map key analytical data exercises from traditional SAS programming to SASIMSH. Today's post covers using PROC IMSTAT to transpose data.

Before IMSTAT became available, SAS users typically either wrote DATA step code or used PROC TRANSPOSE to transpose data. PROC TRANSPOSE is not supported inside PROC IMSTAT; no procedure can run underneath and within another procedure, and there is no IMSTAT statement that explicitly does what PROC TRANSPOSE does. However, PROC IMSTAT CAN be programmed to transpose data. This post provides one such example.

Code part 1: Set up in-memory analytics SAS libraries for Hadoop + loading
Below is the code part 1 screen shot; a sketch of the equivalent code follows the notes.

  • The first LIBNAME statement sets up a library reference to a Hadoop cluster by its address, shown here as sas.com (I cleaned up the confidential address information); I am using 14 parallel nodes out of more than 180 in total.
  • At this interface, the analyst (me) does not care whether the underlying Hadoop is Cloudera, Hortonworks, or somebody else. IMSTAT is designed so that you do NOT program differently simply because the underlying cluster is a different flavor of Hadoop. Bank2 is the name of the SAS library, just as if you were setting one up for regular SAS work or for access to an Oracle, Greenplum, or Teradata database. The naming convention is the same.
  • SASHDAT is the quintessential differentiator for SAS in-memory advanced analytics. It appears just like a regular SAS/ACCESS driver, and it largely is one, but it does much more in the interest of enabling genuine in-memory analytics; somebody could write a book about this 'driver' alone. For now, it simply enables SAS in-memory data access to Hadoop.
  • Path= is a directory-management structure; little is new here.
  • From the DATA statement, 
    • I am simulating 12 months of variable X data for 1 million IDs. Simple. 
    • I am asking SAS to write the 12 million rows directly to Hadoop in the SASHDAT format. In doing so, I also request that the resulting data set &indsn be grouped by ID (the partition= data set option) and further sorted by MONTH within each group (orderby=). The catch with 'sorting' in a parallel system like this is that sorting with the partition= option is really grouping: partitioning by ID just places records with the same ID value together; the groups themselves no longer collate (that is, they are not sequenced in ascending or descending order of ID the way BASE SAS PROC SORT would do). Since later access to the grouped records will be parallel, why spend more time and resources (a lot, if your data set is huge) collating the groups after the initial grouping? The orderby= option adds collation WITHIN each group. The intent of using partition= and orderby= together is the same as "PROC SORT; BY ID MONTH;", but the physiology and mechanics are different, moving from a single-server mode to the parallel world (or from SMP to MPP).
    • Also, partition= and orderby= are supported as data set options (in a typical MPP Hadoop operation like this, these two data set options are likely supported only on output data sets, not on input data sets; why do you think that is?), whereas in a regular BASE SAS operation the analyst has to call up a separate PROC SORT. This luxury is affordable now because the work is IN-MEMORY; there is no more disk swapping (imagine more of this and you will enjoy it more, I promise).
    • Replace=: this is a Hadoop fingerprint. If the destination library points to another database such as Oracle or Teradata, or anything DBMS-like, this option does not apply. As Hadoop becomes more 'managed' (not necessarily more DBMS-like, but with YARN, Kerberos, and so on), this may or may not change. Not shown here, but I recall another option, Copy=, which simply tells Hadoop how many copies it should make of the data set you are dropping into its lap.
  • At "PROC LASR", 
    • This block creates one in-memory working session by requesting the allocation of port 10002 (if available at the time I submit the request) as the 'access lane' to the server, and by requesting the directory "/tmp" as the 'operating/temporary parking ground'.
    • Lifetime= tells IMSTAT that after 72,000 seconds it should trash the whole session and everything within it. Noclass is not required; it has to do with how categorical variables are loaded.
    • Performance nodes=ALL means all Hadoop nodes that are available and policy-permitted by whoever is in charge of the 'box'.
    • The second "PROC LASR" has ADD option asking that the table I wrote to the target Hadoop cluster be loaded into the memory session I just created. As you may learn more and more about 'loading data into memory', this is only one of many loading methods.  The analyst, for example, certainly can load many SAS7BDAT data sets sourced by all kinds of traditional venues and practices (PROC SQL, BASE...) by using another in-memory engine SASIOLA. There are also data loader products from SAS as well. Once the table is loaded into memory, it parks there until the server is terminated. 
    • Notice: this session implicitly inherits the server from the previous lines if not directed or specified otherwise. The port, however, must be explicit when loading data into it. The session is port-specific and the port is generic: port number 10002 means the same thing to whoever is using it.
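The screen shot is not reproduced here, so below is a minimal sketch of the flow code part 1 describes. The host name, HDFS path, and data set name are placeholders, and your site may require additional SASHDAT engine options (for example, a grid install location); treat this as an illustration rather than a drop-in program.

%let indsn = xsim;                                   /* placeholder data set name */

/* library reference to the Hadoop cluster via the SASHDAT engine */
libname bank2 sashdat host="sas.com" path="/user/myid";

/* simulate 12 months of X for 1 million IDs and write the 12 million rows
   straight to Hadoop, grouped by ID (partition=) and ordered by MONTH
   within each group (orderby=)                                             */
data bank2.&indsn (partition=(id) orderby=(month) replace=yes);
   do id = 1 to 1000000;
      do month = 1 to 12;
         x = ranuni(1);
         output;
      end;
   end;
run;

/* create the in-memory session: port 10002, /tmp as the working directory,
   a 72,000-second lifetime, and all available nodes                        */
proc lasr create port=10002 path="/tmp" lifetime=72000 noclass;
   performance nodes=all;
run;

/* load the SASHDAT table into the session just created */
proc lasr add data=bank2.&indsn port=10002;
run;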
Code part 2: Reference the loaded file in memory and build the code to transpose
Below is the code part 2 screen shot.

  • The LIBNAME TRANS statement engages the SASIOLA engine to call up the file just loaded into the memory space, pointing to where the in-memory session is set up. This LIBNAME statement uses the tag= option to represent this reference for later processing. Tagging responds to the fact that a Hadoop system typically has a deep system of directories, compounded by the fact that Hadoop clusters often run on a Linux-like OS, which itself allows limitless directory depth, whereas SAS traditionally supports only two-level names such as mysas.in. Tagging is therefore in place.
  • The "PROC FCMP" portion is entirely my doing that does not much to generalize. I show how one can generate code in this way. You can certainly type or copy to make your own reference code file. 
  • The ARRAY details in the middle of the FCMP block should be straightforward; I am sure you can make them more sophisticated, implicit (or mysterious). The point is to show one basic approach to transposing data with IMSTAT. Noteworthy are __first_in_partition and __last_in_partition. These are nothing but your familiar first.ID and last.ID. Their use certainly depends on the data set being partitioned and ordered, which already happened in code part 1 via the partition= and orderby= data set options. A BASE SAS equivalent of the logic is sketched after this list.
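The FCMP-generated code in the screen shot is not reproduced here, but the BASE SAS equivalent below captures the same logic and makes the mapping explicit: first.id and last.id play the roles of __first_in_partition and __last_in_partition. Data set names are placeholders, and the long table is assumed to hold id, month, and x, sorted by id and month.

/* traditional DATA step transpose: reset an array at the first record of each
   id, fill one slot per month, and output one wide row at the last record     */
data work.wide;
   set work.long;
   by id;
   array xmon{12} x1-x12;
   retain x1-x12;
   if first.id then call missing(of xmon{*});   /* ~ __first_in_partition */
   xmon{month} = x;
   if last.id then output;                      /* ~ __last_in_partition  */
   keep id x1-x12;
run;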
Code part 3: Using PROC IMSTAT to transpose/score the data
Below is the code part 3 screen shot.


As many SAS users often say, the last piece is the easiest. Once the code is built, you run it through the SCORE statement. To use __first_in_partition and __last_in_partition, you MUST specify the partition option value on the SCORE statement. IMSTAT then goes looking for the partitioning created when the partition= and orderby= options took effect (this certainly is not the only way to partition and order). FETCH is similar to "PROC PRINT".

The last "PROC LASR" with TERM is to terminate the in-memory session once the job is done. This is one important habit to have with SAS LASR centered in-memory analytics, although not a technical requirement. Lengthy discussion on this subject belongs to another post. 

Here are some log screen shots

Generating 12 Million Records, 3 Variables, 14 Seconds
Load 12 Million Records, 3 Variable to Memory, 2.6 Seconds
 
Transposing takes 5 Seconds
TRANSPOSED !

October 2014, from Brookline, Massachusetts 

Thursday, October 16, 2014

SAS In-Memory Statistics (IMSTAT) for Hadoop Overview

SAS® In-Memory Statistics for Hadoop ("SASIMSH") is a single interactive programming environment for analytics on Hadoop that integrates analytical data preparation, exploration, modeling, and deployment. It contains PROC IMSTAT and the SAS LASR Analytic Engine (SASIOLA). SASIOLA essentially is the 'BASE' for SASIMSH, whereas PROC IMSTAT covers the full analytical cycle from BASE, through modeling, to deployment. This duality continues SAS's long-standing tradition of 'getting the same job done in different ways' to accommodate users' different styles, constraints, and preferences.
This post provides an overview of IMSTAT, with a little associated coverage of the SASIOLA facilities.

You can certainly read through the IMSTAT details at sas.com. Below is a summary picture many have liked better for capturing the features and functions of PROC IMSTAT. It covers the latest as of Q1 2014, but the spirit and gist remain the same since then.




Some comments about the 'total concept' first
1. If you are familiar with SAS products and solutions, you are used to seeing BASE (programming), STAT (statistical sciences), ETS (econometrics), OR (operations research), EM (enterprise data mining, including machine learning, text mining, and statistics), EG (Enterprise Guide), and MM (model management). Another line of SAS in-memory products still largely follows this convention: for example, HP Statistics (the high-performance counterpart of STAT), HPDM (the high-performance counterpart of EM), and so on. You are used to seeing a long list of procedures under each product or package.

Now, conceptual 'shock' #1: all the features listed in this IMSTAT picture are grouped under ONE procedure. Yes, IMSTAT is one procedure and one procedure only, with this many features.

2. Why this change?

If you use any of the traditional SAS products mentioned above, you know that to get the work at hand done you likely engage a very small set of the procedures, functions, and statements afforded by the specific product you have a license for. For example, I have been using STAT since about 1991, yet roughly half of the procedures under STAT remain strangers to me. I don't recall knowing anybody who uses all of the BASE capabilities either. On the contrary, I know friends who have held SAS jobs for many years; they are experienced, but have known only half a dozen procedures.
The reality, though, is that it is not cost-feasible for a software developer to build just a few procedures for one company and another small set for another company.

One way SAS addresses this (price and value) gap is software-on-demand offerings, an in-depth discussion of which is beyond the scope of this post. Another way is to redesign the package so that all the essential features and functions needed to get analytical jobs done are built in and integrated. The next immediate question is: which features to pick and choose, and from which existing packages? From an elementary 'can do' perspective, it is hard to imagine many things SAS cannot do with its existing offerings; in many cases the challenge is how, not whether. Still, a coherent organizing theme is needed to build the new piece. The good news is that such a theme has existed for many years.


The BLUE spoke in the center of the picture presents a diagram of the modern analytical life cycle, from problem definition, data preparation, and exploration, through modeling, to deployment and presentation. PROC IMSTAT features and functions are organized and developed around this framework. In other words, PROC IMSTAT has collected core functions and features from the SAS software families and optimized them against the needs and challenges confronting analytical users in the Hadoop world. Many pre-existing features have been distilled and streamlined while being moved to the in-memory platform.


PROC IMSTAT is expanding rapidly to accommodate the ever-changing Hadoop world.


Some comments about PROC IMSTAT

1. The major purely new feature and function addition actually is on the right side: the recommender engine. Everything else has essentially existed in the SAS software family in some format or style.
2. All the entries listed under a box header label (such as Data Management in the top left corner) are IMSTAT statements. To access the statements, the user must invoke "PROC IMSTAT" first. Unlike many traditional SAS workflows, where one has to invoke procedures over and over, once PROC IMSTAT is invoked the user can issue the same statement again and again, as the job requires (see the sketch after this list).
3. Users who are very familiar with and deep into some SAS procedures may find that certain features, reduced from a regular SAS procedure to an IMSTAT statement (for example, the CORR statement stems from PROC CORR), no longer offer as many options as their counterpart procedures do. In part this is because a whole procedure has been reduced to a statement. Another reason is strategic design; the reduction, or leaving things out, is intentional: do you really think it makes sense to run all those distance options under the CLUSTER statement on now much bigger data sets?
4. Some statements actually are mini-solutions. The GROUPBY statement, for example, is an in-memory cube builder, a genuine OLAP killer, even though it looks like a small statement.
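To make point 2 concrete, here is a schematic sketch of one PROC IMSTAT invocation with several statements issued and re-issued. The table and variable names are placeholders, and while the statement names (TABLE, FETCH, CORR, GROUPBY) come from the feature picture, the exact option syntax should be checked against the IMSTAT documentation for your release.

proc imstat;
   table lasr.sales;            /* point the run group at one in-memory table      */
   fetch;                       /* quick PROC PRINT-style peek                     */
   corr revenue cost units;     /* correlation, reduced from PROC CORR             */
   groupby region product;      /* in-memory cube-style aggregation                */
   table lasr.customers;        /* switch to another loaded table, same invocation */
   groupby segment;             /* ...and issue the same statement again           */
quit;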

I plan to publish specific use cases to help you better understand how IMSTAT works. Thanks.

October 2014, from Wellesley, Massachusetts 

Sunday, October 5, 2014

SAS High Performance Finite Mixture Modeling, HPFMM: Introduction by One Example

SAS Institute released the HPFMM procedure with its High-Performance Statistics 13.1, after the regular FMM procedure had been available through the regular STAT package for years.

This post is not intended to show or discuss how to do FMM. A solid yet simple discussion of FMM, with an excellent SAS focus, can be found in Professor Malthouse's book. There is also an excellent blog post on using the regular FMM procedure, and a SAS developer of the FMM procedure has done a video on FMM.

This post showcases the computational power of the new HPFMM procedure over the regular FMM procedure. For those who have had little exposure to FMM practice but need to run the method on bigger data sets, it also covers the essential aspects of getting started, since regular PROC FMM and HPFMM overlap a great deal.

This is the first time I have used the new SAS Studio to build a post here, so I will cover some of SAS Studio as well.

Introduction

  1. What is FMM: in plain English, "a variable is distributed in such a fashion that you believe it is a mixture of components" (the generic mixture density is written out after this list)
  2. How FMM varies, to name a few ways,
    1. By component type: categorical, interval, and so on, leading to very different governing/underlying distributions. One sub-direction in this vein is observed versus unobserved components, where you can find intellectual connections to some canonical analyses.
    2. By the interrelationship among the components: do they share the same variance or mean?
    3. By whether and how the mixing interacts with the components: do they share covariates? Are they determined and estimated simultaneously?
    4. By how 'weird' the distorted distribution is: you can fit and match several major, if not popular, 'skewed' distribution classes, such as zero inflation on a Poisson distribution.
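For concreteness, all of these variations sit on top of the same generic finite mixture density, written here in standard textbook notation (not PROC FMM's own notation):

f(y \mid x) = \sum_{k=1}^{K} \pi_k \, f_k(y \mid x;\, \theta_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0,

where each f_k is a component distribution with parameters \theta_k, and the mixing probabilities \pi_k may themselves depend on covariates.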
Description of data set used for the post
  1. 5 million observations, 204 variables. The response variable is binary Y/N, with a 'response rate' of ~26.76%, or 1,338,111 = Y. The response is a B2C question: "Do you have your desired product now?"
  2. The 204 variables are mostly RFM variables covering the past 72 months, plus some attributes.
The environment the SAS HPFMM job is running on:
  1. A 192-node Cloudera Hadoop cluster with SAS HP STAT 13.1 loaded. The specific job is capped at 14 nodes, with each node having more than 96 GB of RAM (the exact number is not important, given the size of this data). FMM rarely has a real-time or near-real-time requirement; running faster than what is shown in this post likely does not provide much incremental value.
  2. The job is conducted using SAS HP STAT 13.2, the latest version of the software
  3. The coding is done through a virtual Windows client machine with 32 GB of local RAM. The client connects to the Hadoop cluster; no FMM processing happens locally on the client.
The code and the editing interface, the new SAS Studio

Running HPFMM in SAS Studio
  • This is the look of SAS Studio, which essentially is a SAS enhanced editor but represents a revolutionary leap forward from its ancestors: the classic SAS Enhanced Editor, Enterprise Guide, and Enterprise Miner. The Studio is web-native, built for collaboration and portability, and 'flashy'. It is much better at output rendering; at log, message, and error handling; and at version management, plus it has a modern look and feel. In the current analytics world, where looks do matter sometimes, the Studio is a true contender in stability, integrity, consistency, and support.
  • If you are familiar with SAS procedure programming, the program does not present many surprises; if you are familiar with PROC FMM, you can plug in your existing program, add the PERFORMANCE statement, and start tweaking. If you are just learning how to use the SAS FMM facility, starting with PROC FMM or PROC HPFMM should provide a similar learning return in speed, functionality, and features, with some exceptions (which you probably don't care about). A sketch in this spirit appears after the notes below.
Still, some notes:
    • The default OUTPUT statement no longer copies all the variables over from the input data set, in particular the input variables and BY variables, unlike regular PROC FMM, where they are all moved to the output data set listed on the OUTPUT statement if the ID statement is not engaged. This is a general contrast between the HP procedures and their non-HP counterparts, not just the FMM procedures.
    • The default maximum number of iterations is 200, and the default maximum number of function calls is 2,000. I raised them to 5,000 and 250,000, respectively, just to test, not to suggest the simplex optimization method requires that. As some of you may have experienced, another factor to tweak is the link function. The point here is that, if called for, more iterations and function calls can now be handled by the HPFMM procedure fairly quickly.
    • What does not converge under PROC FMM probably still does not converge under HPFMM, eventually. So exploratory data analysis, design, business domain knowledge, and prior or posterior knowledge about the modeling project remain quintessential for success.
    • Rules of thumb such as "the identity link runs much faster than other links" probably still hold, everything else being equal, as far as I have tested.
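For orientation, here is a minimal sketch in the spirit of the program in the screen shot, not the exact program itself: the library, data set, and variable names are hypothetical, the raised iteration and function-call limits mentioned above are omitted, and option details should be verified against the HPFMM documentation for your release.

proc hpfmm data=bank.resp5m;
   class region;
   /* binary response, two components; the link can be switched on this
      statement (the identity link was the fastest on these data)        */
   model respond = recency frequency monetary region / dist=binary k=2;
   performance nodes=14 details;   /* cap the job at 14 of the grid nodes */
   id cust_id;                     /* HP procedures copy fewer input variables
                                      to the output by default, so list what
                                      you need                               */
   output out=bank.resp5m_scored;
run;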
The HPFMM log,



  1. You can set message level to get more details. 
  2. Three traffic lights in the top left corner summarize how the program went.
  3. Carrying ~200 variables into the process with 5 million records (yes, I deliberately decided not to write a KEEP= or DROP= statement to shrink the input variable list, because the time spent writing that code would likely exceed the time the entire job takes without dropping them, around the 27-28 second mark), the job finished in under 28 seconds.
  4. The gap between CPU time and real time is still large, percentage-wise, but I care less because it is measured in seconds, unlike a regular SAS session where the gap may be minutes or hours.
  5. This is genuine in-memory computation: the data set is loaded into memory residing on the 14 nodes of the target Hadoop cluster. Once the job is done, the data are dropped from the memory space. The resulting output data set, as I wished and specified, is written to the cluster.
  6. When using PROC FMM, tackling 5 million rows with 200+ variables is harder to manage; on the 32 GB Windows machine where this post is being typed right now, I could not get PROC FMM to finish the job. I had to cut the data set down to a short list of variables to make it through, and it took more than 16 minutes.
  7. Overall, you still need to be data-efficient. If you have, say, more than 600K variables (a near reality when a lot of text/unstructured elements are involved in modeling, not necessarily FMM), you may still want to KEEP/DROP variables.
HPFMM output, the administrative section,

  1. The count of Hadoop nodes involved in the job is reported first.
  2. You then see the link function details. Some modelers disrespect identity. Well, in this specific case I tested, in a total of 10 minutes, that the three other link functions do NOT provide better results anyway. Respect what the data have to tell you, because IDENTITY is the fastest.
  3. All the estimation and convergence here is ML (maximum likelihood).
  4. The class variable details should be exciting, simply because you are now running on the full sample in so little time.
HPFMM output, the estimation section, 


  1. The fit statistics and Z values are pretty straightforward. You can read all the books to appreciate the underpinning definitions and rules; I am not sure I am qualified to lecture you on the technicalities here (I don't hold a Ph.D. title).
  2. One cautionary note: keep in mind these are computed statistics. Always interpret and use them with the specific data conditions in mind. They may not be as 'portable' as you think, which blunts the motivation that leads many to take up mixture modeling in the first place.
HPFMM output, the performance section (sort of),

  1. The probability estimates serve as a direction for how 'well' the model does. Data sometimes defy whatever numbers you put in as the K value (or range thereof). In some cases you have to 'insist' on, or 'hold on to', your K value even if the data would otherwise support 'better' mixing probabilities.
  2. Professor Malthouse's book, mentioned above, provides great examples of using PROC KDE and PROC SGPLOT to assist with K value insights; a rough sketch of that workflow follows this list. KDE on large data sets can also be practiced with another SAS in-memory product, IMSTAT (the KDE statement there), and with SAS Visual Analytics (VA).
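As a rough sketch of that workflow, with a hypothetical data set work.sample and response y, one can eyeball the number of bumps in a kernel density estimate before settling on a K value:

/* estimate the kernel density and write it to a data set */
proc kde data=work.sample;
   univar y / out=kde_out;
run;

/* plot it; visible bumps hint at candidate components for K */
proc sgplot data=kde_out;
   series x=value y=density;
run;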
Thank you.

Autumn of 2014, from Chestnut Hill, Massachusetts