Thursday, October 16, 2014

SAS In-Memory Statistics (IMSTAT) for Hadoop Overview

SAS® In-Memory Statistics for Hadoop ("SASIMSH") is a single interactive programming environment for analytics on Hadoop that integrates analytical data preparation, exploration, modeling and deployment. It contains PROC IMSTAT and the SAS LASR Analytic Engine (SASIOLA). SASIOLA is essentially the 'BASE' for SASIMSH, whereas PROC IMSTAT covers the full analytical cycle, from BASE through modeling to deployment. This duality continues SAS's long-standing tradition of 'getting the same job done in different ways' to accommodate users' different styles, constraints and preferences.
This post provides an overview of IMSTAT, with a little associated coverage of the SASIOLA facilities.

You can certainly read through the IMSTAT details at sas.com. Below is a summary picture that many have found more helpful for capturing the features and functions of PROC IMSTAT. It reflects the product as of Q1 2014, but the spirit and gist have remained the same since.




Some comments about the 'total concept' first
1. If you are familiar with SAS products and solutions, you are used to seeing BASE (programming), STAT (statistical sciences), ETS (econometrics), OR (operations research), EM (Enterprise Miner: data mining including machine learning, text mining and statistics), EG (Enterprise Guide) and MM (model management). Another line of SAS in-memory products still largely follows this set of conventions: for example, HP Statistics (the high-performance counterpart of STAT), HPDM (the high-performance counterpart of EM) and so on. You are used to seeing a long list of procedures under each product or package.

Now, conceptual 'shock' #1 is that all the features listed in this IMSTAT picture are grouped under ONE procedure. Yes, IMSTAT is one procedure and one procedure only, with all of these features.

2. Why this change?

If you use any of the traditional SAS products mentioned above, you know that to get the work at hand done, you likely engage a very small set of the procedures, functions and statements afforded by the specific product you have licensed. For example, I myself have been using STAT since ~1991, but about half of the procedures under STAT are still strangers to me. I don't recall knowing anybody who uses all of the BASE capabilities either. In fact, I have friends who have held SAS jobs for many years. They are experienced, yet they know only half a dozen procedures.
The reality, though, is that it is not economically feasible for a software vendor to build just a few procedures for one company and another small set for another company.

One way for SAS to address this gap between price and value is software-on-demand offerings, an in-depth discussion of which is beyond the scope of this post. Another way is to redesign the package so that all the essential features and functions needed to get analytical jobs done are built in and integrated. The next immediate question is which features to pick, and from which existing packages. From an elementary 'can do' perspective, it is hard to imagine many things that SAS cannot do with its existing offerings. In many cases, the challenge is how, not if. Still, a coherent organizing theme is needed to build the new piece. The good news is that such a theme has existed for many years.


The blue spoke in the center of the picture presents a diagram of the modern analytical life cycle, from problem definition, data preparation and exploration, through modeling, to deployment and presentation. PROC IMSTAT features and functions are organized and developed according to this framework. In other words, PROC IMSTAT has collected core functions and features from the SAS software families to address the needs and challenges confronting analytical users in the Hadoop world. Many 'pre-existing' features have been distilled and streamlined while being moved to the in-memory platform.


PROC IMSTAT is expanding rapidly to accommodate the ever-changing Hadoop world.


Some comments about PROC IMSTAT

1. The major purely new feature is on the right side of the picture: the recommender engine. Everything else has essentially been in existence in the SAS software family in some format or style.
2. All the entries listed under a box header label (such as Data Management in the top left corner) are IMSTAT statements. To access the statements, the user must invoke "PROC IMSTAT" first. Unlike many traditional SAS procedures, which have to be invoked many times, once PROC IMSTAT is invoked the user can issue the same statement again and again, as the job requires.
3. Users who know certain SAS procedures in depth may find that some features, reduced from a regular SAS procedure to an IMSTAT statement (for example, the CORR statement stems from PROC CORR), no longer have as many options as their counterpart procedures. This is, in part, because the procedure has been reduced to a statement. Another reason is more strategic: the omission is intentional. Do you really think it makes sense to run all those distance options under the CLUSTER statement on data sets that are now much bigger?
4. Some statements are actually mini-solutions. The GROUPBY statement, for example, looks like a small statement, but it is an "in-memory cube builder", or a genuine OLAP killer.
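
To make the "invoke once, issue statements repeatedly" pattern in comment 2 concrete, here is a minimal sketch. It assumes a LASR server is already running and that a table has already been loaded into memory; the host name, port, tag, table name and variable names are all placeholders of mine, not taken from any real system, and the statements are shown without options.

```sas
/* Hypothetical connection details: host, port and tag are placeholders. */
libname lasr sasiola host="lasr.example.com" port=10010 tag="hps";

proc imstat;
   /* Point the session at an in-memory table (table name is assumed). */
   table lasr.cars;

   /* Each statement executes when its RUN group is submitted. */
   summary msrp;
   run;

   corr msrp horsepower;
   run;

   /* The same statement can be issued again without re-invoking the procedure. */
   summary horsepower;
   run;
quit;
```

The procedure stays active between RUN groups until QUIT is submitted, which is what lets one invocation carry a whole interactive session.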

I plan to publish specific use cases to help readers better understand how IMSTAT works. Thanks.

October 2014, from Wellesley, Massachusetts 
