Monday, December 15, 2014

SAS High Performance Analytics and In-Memory Statistics for Hadoop: Two Genuine In-Memory Math Blades Working Together

SAS In-Memory Statistics for Hadoop ("SASIMSH") and SAS High Performance Analytics Server ("SAS-HPA") are two key in-memory advanced analytics/modeling products SAS Institute ("SI") offers today. Both support Hadoop as a data source: SASIMSH is Hadoop centric, while SAS-HPA supports Hadoop, Teradata and others. While each has its own in-memory data management capabilities, there are applications and efficiency scenarios where one is used to build data sets that the two then share.

This post shows how to integrate (modeling) data sets from local SAS clients, SASIMSH's LASR server and HPA, with a Hadoop cluster ("the cluster") serving as the central data repository. Some performance details are also shown.

1. The cluster and other system features
  • The cluster has 192 nodes with 96 GB RAM each (CPU and other details of the individual nodes are unknown, but are of '2014 standard level'). Only 14 nodes are engaged for the exercises; all the performance details are conditional upon that node count.
  • All the performance details were recorded while other users' jobs were running concurrently on the cluster.
  • The local SAS client is a virtual machine (cloud) running 64-bit Windows Server 2008 R2 Standard with 32 GB RAM and an Intel Xeon(R) X7560 CPU @ 2.27 GHz, 2.26 GHz (4 processors). Not that these details are critical; they matter mostly when you load data from traditional SAS sources onto LASR servers. No other jobs were running on the client while the performance details were recorded.
2. The action on LASR server 

In the picture ("Pic 1") above,

2.0 This is SAS Studio, not SAS Editor or SAS EG.

2.1 "LIBNAME bank2": 
    2.11. SASHDAT is a system-reserved engine name, indicating that the library being established points to a Hadoop file co-location. Explaining 'co-location' in great detail is beyond the scope of this post. For now, imagine Hadoop is where the big data chunk is stored. SASIMSH and SAS-HPA are like (math) blades sitting alongside Hadoop. Stick the blade into Hadoop, lift the data into memory, get the math job done and put the results, if any, back into Hadoop (sometimes you just take the insights or score code without writing back to Hadoop).
    2.12. SERVER= is just your host. PATH= can support as many slashes/directory levels as you like.

2.2 "LIBNAME bank3": just your regular SAS local directory.

2.3 "goport=1054": you pick a port number to request a slice of memory (which in this case is collective, roughly 96 GB * 14) for your work. As of today, this number cannot exceed 65535 and must not already be taken: if you just created a LASR server on this port, you (or somebody else) must terminate that server to release the port before the same number can be used again (and, YES, that destroys everything you did within that in-memory session; you will see the benefits of doing so later). Of course you can use a different number, if it is available.

A good tactical habit (with strategic implications for having a good life with in-memory analytics) is to use a limited set of port numbers. One obvious reason is that memory like this is not meant mainly for storing huge data chunks. The logical follow-up question is how fast a (big) data chunk can be loaded or reloaded into the LASR server from the client or from the Hadoop co-location ("if it takes forever to load this much, I have to park it". Sounds familiar?). You will see how long both routes take shortly.
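The port rule and the collective-memory arithmetic above can be sketched in a few lines of Python. This is only an illustration, not SAS code: the function names and the set of "ports in use" are hypothetical, and the lower bound of 1 is the usual TCP convention rather than something stated in this post.

```python
# Illustrative sketch only; none of these names are SAS APIs.
MAX_PORT = 65535        # upper bound stated in the text
NODES_ENGAGED = 14      # nodes used in this exercise
RAM_PER_NODE_GB = 96    # RAM per node, from section 1


def port_is_usable(port, ports_in_use):
    """Return True if the port is in range and no running server holds it."""
    # Assumption: ports start at 1, as with ordinary TCP ports.
    return 1 <= port <= MAX_PORT and port not in ports_in_use


def collective_memory_gb(nodes=NODES_ENGAGED, ram_per_node_gb=RAM_PER_NODE_GB):
    """Upper bound on the memory pool; the usable amount is lower in practice."""
    return nodes * ram_per_node_gb


if __name__ == "__main__":
    running = {1054}  # hypothetical: a server already holds port 1054
    print("1054 usable:", port_is_usable(1054, running))
    print("1055 usable:", port_is_usable(1055, running))
    print("collective memory (GB):", collective_memory_gb())
```

The point of the second function is simply that 14 nodes at 96 GB each give an upper bound of 1,344 GB, comfortably more than the ~125 GB file used in this post, before the operating system and other users take their share.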

2.4 "outdsn=...; ": I declare a file location whose library is yet to be set. That is not a problem, as long as you set the library before eventually USING it. You can put anything between the = sign and the ; and it will not fail you.

2.5 "PROC LASR Create": This is to create a new LASR server process, with the port number. Path=/tmp is similar to temp space in regular SAS session. 

2.6 "Lifetime=7200": I want the server to cease after 7,200 seconds, that is, two hours.

2.7 "Node=all": all of the 14 nodes available to me, not the 192 nodes physically installed. Either your administrator caps you, or you can manually set Node=14.

2.8 "PROC LASR Add": this is to load the data from Hadoop library BANK2 (the co-location) into the LASR server defined by the port number. 

The data set: 300 million rows, ~45 variables. 12 monthly transactions combined, 25 million rows/month. Total file ~125 GB. 

The WHERE statement filters out the last 7 months while loading the input into memory. Note: there is no SAS Workspace in LASR operation.

The picture below shows that the loading takes 6.54 seconds, most of which is CPU time. That is desirable: it implies that 'data piping' and trafficking take little time.
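To put the 6.54 seconds in perspective, here is a back-of-the-envelope throughput estimate in Python. It rests on two assumptions of mine, not on anything measured: that filtering out the last 7 of 12 months leaves 5 months in memory, and that file size is proportional to row count. The server may well have scanned more than it kept, so treat the result as a rough lower bound on read speed.

```python
# Figures from the post; MONTHS_LOADED is an assumption (12 minus 7 filtered out).
TOTAL_ROWS = 300_000_000
TOTAL_SIZE_GB = 125.0
MONTHS_TOTAL = 12
MONTHS_LOADED = 5
LOAD_SECONDS = 6.54
NODES = 14


def hadoop_load_estimate():
    """Estimate rows, volume and throughput of the Hadoop-to-LASR load."""
    rows_loaded = TOTAL_ROWS * MONTHS_LOADED // MONTHS_TOTAL
    # Assumption: size scales linearly with the number of rows kept.
    gb_loaded = TOTAL_SIZE_GB * MONTHS_LOADED / MONTHS_TOTAL
    gb_per_second = gb_loaded / LOAD_SECONDS
    return {
        "rows_loaded": rows_loaded,
        "gb_loaded": gb_loaded,
        "gb_per_second": gb_per_second,
        "gb_per_second_per_node": gb_per_second / NODES,
    }


if __name__ == "__main__":
    for name, value in hadoop_load_estimate().items():
        print(f"{name}: {value:,.2f}")
```

Under those assumptions the load moves about 52 GB, or roughly 8 GB per second across the 14 nodes.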

3. Load local data onto the LASR server

In the pic ("Pic 2") above,

3.1 "Libname bank22": sasiola is a system-reserved engine name indicating that library BANK22 is a SAS LASR input-output (IO) library. In plain English: use this engine/library to bring SAS data sets in from outside the LASR server. TAG: this short-name option is key to handling a file system like Hadoop, where a directory can easily exceed the two-level limit that a regular SAS library.data-set name supports. Within the quotes, the directory can have as many slashes or levels as needed.

3.2 "Data...": syntax wise, this is just a regular SAS DATA step. I am stacking the last 7 monthly data sets from two different local SAS libraries, Bank3 and Bank4. T_ is the bare minimum action you should take to make additional analytical use of any date data you have in your data set. Most (not all) Base SAS DATA step features are supported.

As shown in the picture below, the log reports ~19 minutes for the load.

The CPU time is not much, indicating that piping and trafficking take most of the time. Considering this is loading from the local C and D drives of the virtual Windows client, this is expected. Notice: the resulting 5 month file now resides in the same in-memory library, Bank22.
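The gap between the two load routes is easier to appreciate as rows per second. The Python sketch below compares them. It assumes 25 million rows per month for both routes, 5 months loaded from Hadoop and 7 months loaded from the local client, which is my reading of the text rather than figures taken from the logs.

```python
# Assumed inputs (see the lead-in): 25M rows per month on both routes.
ROWS_PER_MONTH = 25_000_000
HADOOP_MONTHS, HADOOP_SECONDS = 5, 6.54      # Hadoop co-location to LASR
LOCAL_MONTHS, LOCAL_SECONDS = 7, 19 * 60     # local client to LASR, ~19 minutes


def rows_per_second(months, seconds):
    """Average load rate for a given number of monthly files."""
    return months * ROWS_PER_MONTH / seconds


def hadoop_vs_local_ratio():
    """How many times faster the Hadoop route is, in rows per second."""
    hadoop_rate = rows_per_second(HADOOP_MONTHS, HADOOP_SECONDS)
    local_rate = rows_per_second(LOCAL_MONTHS, LOCAL_SECONDS)
    return hadoop_rate / local_rate


if __name__ == "__main__":
    print(f"Hadoop route: {rows_per_second(HADOOP_MONTHS, HADOOP_SECONDS):,.0f} rows/s")
    print(f"Local route:  {rows_per_second(LOCAL_MONTHS, LOCAL_SECONDS):,.0f} rows/s")
    print(f"Ratio:        {hadoop_vs_local_ratio():.1f}x")
```

With these inputs the Hadoop route comes out two orders of magnitude faster, which is the practical argument for keeping the bulk of the data in the co-location and loading only the remainder from the client.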

The SASIOLA engine is apparently the key gateway connecting traditional SAS data sourcing with in-memory LASR servers. However you have been building your data sets, you can load them onto a LASR server via SASIOLA. You can also originate your data sources directly within LASR.

4. Append/stack data sets in SASIMSH

The picture below ("Pic 3") shows how the two data sets, one loaded directly from the Hadoop cluster and one loaded from local SAS, are stacked in memory on the LASR server.

4.1 Unlike in Base SAS steps, the table listed in the TABLE statement has to be an existing table; it is called the ACTIVE table. The SET action here is defined as 'to modify the active table'.

4.2 The table listed in the SET statement inherits the active table's library. As this version of SASIMSH stands, both tables have to be loaded into the same library to perform the append; spelling out the library name in front of the data set name will cause an error.

4.3 The Drop option in the SET statement is very nice, considering you may very well be setting gigantic data onto the active table: YES, space management remains essential during in-memory analytics.

And it takes, how long? 247 seconds.

4.4 "Save...": after the append is over, write the resulting 12 month data set back to the Hadoop cluster. Copy and Replace are ordinary Hadoop options (keep 1 copy, and replace any existing file with the same name). Recall: the Hadoop cluster directory is where SAS-HPA picks up data for its operations. The writing time is included in the 247 seconds elapsed.

5. Terminate LASR server. After the work is done, the LASR server is terminated, as shown in the picture below.

Now you can launch HPA procedures such as HPSUMMARY to run additional analytics. 

Final notes:

  • SAS-HPA has its own data management facilities such as HPDS2, which has a strong OOP (Object Oriented Programming) orientation and flavor, and which is more native to traditional formats such as ANSI SQL.
  • Both HPDS2 and SASIMSH support traditional Base SAS step syntax, with a few exceptions.
  • There are use cases where you prefer parking a large data set in a ported LASR server. For example, suppose that instead of stacking 7 monthly files you are stacking 24 monthly files. That may take much longer than the 18 to 20 minutes seen here. You can conceivably stack the 24 data sets once and share the port with others who want to consume the resulting one big data set. They just set the proper library to access it.
  • The central philosophy of SASIMSH, by way of fast loading and clean termination, is this: when data change quickly and substantially, you may want to build more and more analytics at a high pace. Be flashy. The classic example is digital marketing analytics, where input data are often live, highly interactive social event streams.
  • SASIMSH is resource sensitive. In a separate environment with 47 nodes at 96 GB per node, all steps finished within a minute. 96 GB RAM per node is under par by 2014 standards; today the standard RAM per node is 512 GB, or at least 256 GB. If you want to read 2 TB of data from Hadoop, of course you need more, and more powerful, nodes. As a practical matter, not a general statement, I have observed that virtual machines serving as cluster nodes run noticeably more slowly than real, physical machines.
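Two of the notes above lend themselves to quick arithmetic, sketched here in Python. Both functions are rough planning aids of my own, not SAS sizing guidance: the first assumes local load time grows linearly with the number of monthly files (7 files in ~19 minutes), and the second assumes a hypothetical headroom factor of 2, meaning the cluster should hold twice the raw data volume in RAM.

```python
import math


def linear_stack_minutes(files, baseline_files=7, baseline_minutes=19.0):
    """Extrapolate local load time, assuming it scales linearly with file count."""
    return baseline_minutes * files / baseline_files


def min_nodes_for(data_gb, ram_per_node_gb, headroom=2.0):
    """Smallest node count whose total RAM covers data_gb times headroom.

    The headroom factor is a hypothetical allowance for working space and
    other users; it is not a figure from SAS documentation.
    """
    return math.ceil(data_gb * headroom / ram_per_node_gb)


if __name__ == "__main__":
    print(f"24 monthly files: ~{linear_stack_minutes(24):.0f} minutes")
    two_tb_in_gb = 2048  # 2 TB expressed in GB (binary convention)
    for ram in (96, 256, 512):
        nodes = min_nodes_for(two_tb_in_gb, ram)
        print(f"{ram} GB per node: at least {nodes} nodes")
```

Under the linear assumption, 24 monthly files take a bit over an hour to load from the client, which is when parking the result on a shared port starts to look attractive.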