This post is not intended to show or discuss how to do FMM (finite mixture modeling). A solid yet simple discussion of FMM with a strong SAS focus can be found in Professor Malthouse's book. There is also an excellent blog on regular FMM procedure usage, and a SAS developer of the FMM procedure has done a video on FMM.
This post showcases the computational power of the new HPFMM procedure over the regular FMM procedure. For those who have little exposure to FMM practice but need to apply the method to bigger data sets, this post covers the essentials for getting started, since regular PROC FMM and PROC HPFMM overlap a lot.
This is the first time I have used the new SAS Studio to write a blog post here, so I will cover some SAS Studio as well.
- What is FMM: in plain English, "a variable is distributed in such a fashion that you believe it is a mixture of components".
- How FMM varies, to name a few ways:
  - By component type: categorical, interval, and so on, leading to very different underlying distributions. One sub-direction in this vein is whether the components are observed or unobserved, where you can find intellectual connections to some canonical analysis.
  - By the interrelationship among components: do they share the same variance or mean?
  - By whether and how mixing interacts with the components: do they share covariates? Are they determined or estimated simultaneously?
  - By how 'weird' the distorted distribution is: you can fit and match it to several major, if not popular, 'skewed' distribution classes, such as zero-inflated or Poisson distributions.
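To make the "mixture of components" idea concrete, here is a minimal sketch in plain Python (not SAS, and not what PROC HPFMM runs internally). It writes out a two-component normal mixture density and a zero-inflated Poisson probability function. The function names and all parameter values are hypothetical, chosen only for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a single normal component."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_pdf(x, weights, means, sds):
    """Density of a K-component normal mixture: sum over k of w_k * N(x; mean_k, sd_k)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))

def zip_pmf(y, pi_zero, lam):
    """Zero-inflated Poisson: a point mass at 0 (weight pi_zero) mixed with Poisson(lam)."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    point_mass = 1.0 if y == 0 else 0.0
    return pi_zero * point_mass + (1.0 - pi_zero) * poisson

# Hypothetical parameters: mixing weights must sum to 1.
weights, means, sds = [0.3, 0.7], [0.0, 4.0], [1.0, 1.5]
density_at_2 = mixture_pdf(2.0, weights, means, sds)

# The zero-inflated probabilities still sum to (essentially) 1.
zip_total = sum(zip_pmf(y, 0.4, 2.5) for y in range(60))
```

The zero-inflated case shows why such data look 'weird': the probability of a zero is higher than a plain Poisson with the same rate would allow, because a second component contributes only zeros.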
Description of data set used for the post
- 5 million observations, 204 variables. The response variable is binary (Y/N), with a 'response rate' of ~26.76%, or 1,338,111 = Y. The response is the answer to a B2C question: "Do you have your desired product now?"
- The 204 variables are mostly RFM (recency, frequency, monetary) variables dating back over the past 72 months, plus some attributes.
The environment the SAS HPFMM job is running on:
- A 192-node Cloudera Hadoop cluster with SAS HP STAT 13.1 loaded. This specific job is capped at 14 nodes, each with more than 96 GB of memory, I believe (the exact number is not important given the size of this data). FMM rarely has a real-time or near-real-time requirement; running faster than what is shown in this post likely does not provide much incremental value.
- The job is conducted using SAS HP STAT 13.2, the latest version of the software
- The coding is done on a virtual Windows client machine with 32 GB of local RAM. The client connects to the Hadoop cluster. No FMM processing happens locally on the client.
- SAS Studio is essentially a SAS enhanced editor, but it represents a revolutionary leap forward from its ancestors: the classic SAS Enhanced Editor, Enterprise Guide, and Enterprise Miner. The Studio is web-native, built for collaboration and portability, and 'flashy'. It is much better at output rendering, at log, message, and error handling, and at version management, and it has a modern look and feel. In today's analytics world, where looks sometimes do matter, the Studio is a true contender in stability, integrity, consistency, and support.
- If you are familiar with SAS procedure programming, the program does not present many surprises. If you are familiar with PROC FMM, you can plug in your existing program, add the PERFORMANCE statement, and start tweaking. If you are just learning the SAS FMM facility, starting with either PROC FMM or PROC HPFMM should provide a similar learning return in speed, functionality, and features, with some exceptions (which you probably don't care about).
- By default, the OUTPUT statement no longer copies all the variables over from the input data set, in particular the input variables and BY variables. This is unlike regular PROC FMM, where all of them are carried over to the output data set named in the OUTPUT statement when no ID statement is used. This is a contrast between all the HP procedures and their non-HP counterparts, not just the FMM procedures.
- The default maximum number of iterations is 200, and the default maximum number of function calls is 2000. I raised them to 5000 and 250000 respectively, just to test, not to suggest that the simplex optimization method requires it. As some of you may have experienced, another factor in tweaking is the link function. The point here is: if called for, more iterations and function calls can now be handled by the HPFMM procedure fairly quickly.
- What does not converge under PROC FMM probably still will not converge under HPFMM in the end. So exploratory data analysis, design, business domain knowledge, and prior or posterior knowledge about the modeling project remain essential for success.
- Rules of thumb such as "identity link runs much faster than other links" probably still hold, everything else being equal, as far as I have tested.
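For readers who have not watched an iteration cap in action, here is a minimal sketch of how a cap and a convergence test interact. It is plain Python using EM on a two-component normal mixture, which is a swapped-in textbook technique and not the optimizer PROC HPFMM uses; the data are synthetic, and the function name, starting values, and tolerance are all assumptions made for the illustration.

```python
import math
import random

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def fit_two_normals(data, max_iter=200, tol=1e-6):
    """EM for a 2-component normal mixture. Returns (params, iterations, converged)."""
    n = len(data)
    lo, hi = min(data), max(data)
    # Crude starting values: means at the quarter points of the data range.
    w = 0.5
    m1, m2 = lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)
    s1 = s2 = (hi - lo) / 4.0
    prev_ll = -math.inf
    for iteration in range(1, max_iter + 1):
        # E-step: responsibility of component 1 for each point, plus log-likelihood.
        resp, ll = [], 0.0
        for x in data:
            a = w * normal_pdf(x, m1, s1)
            b = (1.0 - w) * normal_pdf(x, m2, s2)
            ll += math.log(a + b)
            resp.append(a / (a + b))
        # M-step: weighted means, standard deviations, and mixing weight.
        n1 = sum(resp)
        n2 = n - n1
        m1 = sum(r * x for r, x in zip(resp, data)) / n1
        m2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n2
        s1 = max(math.sqrt(sum(r * (x - m1) ** 2 for r, x in zip(resp, data)) / n1), 1e-6)
        s2 = max(math.sqrt(sum((1.0 - r) * (x - m2) ** 2 for r, x in zip(resp, data)) / n2), 1e-6)
        w = n1 / n
        params = {"w": w, "m1": m1, "s1": s1, "m2": m2, "s2": s2}
        # Stop when the log-likelihood has stopped moving.
        if abs(ll - prev_ll) < tol:
            return params, iteration, True
        prev_ll = ll
    # The cap was hit before the convergence test passed.
    return params, max_iter, False

# Synthetic data: 30% from N(0, 1), 70% from N(5, 1).
random.seed(2014)
data = [random.gauss(0.0, 1.0) for _ in range(600)] + [random.gauss(5.0, 1.0) for _ in range(1400)]

full_params, full_iters, full_ok = fit_two_normals(data)
capped_params, capped_iters, capped_ok = fit_two_normals(data, max_iter=2)
```

Raising the cap only helps when the fit is heading somewhere; on well-separated data like this, the default cap is plenty, and a model that wanders will keep wandering however many iterations you allow.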
The HPFMM log,
- You can set the message level to get more details.
- Three traffic lights in the top-left corner summarize how the program went.
- Carrying ~200 variables and 5 million records through the process, the job finished in under 28 seconds. (Yes, I deliberately did not write KEEP= or DROP= options to shrink the input variable list, because writing that code would likely take longer than the 27-28 seconds the entire job takes without dropping anything.)
- The gap between CPU time and real time is still large, percentage-wise, but I care less since it is measured in seconds, unlike in a regular SAS session, where the gap may be minutes or hours.
- This is genuine in-memory computation, in which the data set is loaded into memory across the 14 nodes on the target Hadoop cluster. Once the job is done, the data are released from memory. The resulting output data set, as I wished and specified, is written to the cluster.
- With PROC FMM, tackling 5 million rows with 200+ variables is harder to manage. On the 32 GB Windows machine where this blog is being typed right now, I could not get PROC FMM to finish the job. I had to cut the data set down to a short list of variables to make it through, and even then it took more than 16 minutes.
- Overall, you still need to be data efficient. If you have, say, >600K variables (a near reality when a lot of text or unstructured elements are involved in modeling, not necessarily FMM), you may still want to consider keeping or dropping variables.
FMM output, the administrative section,
- Count of Hadoop nodes involved in the job is first reported
- You then see the link function details. Some modelers disrespect identity. Well, in this specific case I tested, in a total of 10 minutes, that three other link functions do NOT provide better results anyway. Respect what the data have to tell you, all the more so because IDENTITY is the fastest.
- All the estimation / convergence is now ML
- The class variable details should be exciting, simply because you are now running on the full sample in so little time.
HPFMM output, the estimation section,
- The fit statistics and Z values are pretty straightforward. You can read all the books to appreciate the underpinning definitions and rules; I am not sure I am qualified to lecture you on the technicalities here (I don't have a Ph.D.).
- One cautionary note: keep in mind these are computed statistics. Always interpret and use them with the specific data conditions in mind. They may not be as 'portable' as you think, which blunts the very motivation that leads many to mixture modeling in the first place.
FMM output, the performance section (sort of),
- The probability estimate serves as an indication of how 'good' the model is. Data sometimes defy the K value (or range thereof) you intend to put in. In some cases, you have to 'insist' on or 'hold on' to your K value even if the data support 'better' mixing probabilities otherwise.
- Professor Malthouse's book, mentioned above, provides great examples using PROC KDE and PROC SGPLOT to assist with K value insights. KDE on large data sets can also be practiced with another SAS in-memory product, IMSTAT (the KDE statement there), and with SAS Visual Analytics (VA).
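As a rough illustration of how a kernel density estimate can hint at K, here is a minimal sketch in plain Python (not PROC KDE, and not taken from the book). It smooths the data with a Gaussian kernel and counts the prominent bumps. The bandwidth rule, the 10% height threshold, the function names, and the synthetic data are all assumptions made for the example.

```python
import math
from statistics import NormalDist, stdev

def gaussian_kde(data, grid, bandwidth):
    """Gaussian kernel density estimate evaluated at each grid point."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in data)
            for g in grid]

def count_prominent_modes(data, grid_size=200, min_rel_height=0.10):
    """Count local maxima of the KDE that reach at least min_rel_height of the peak."""
    n = len(data)
    # Normal-reference rule of thumb for the bandwidth (an assumption, not a SAS default).
    h = 1.06 * stdev(data) * n ** (-0.2)
    lo, hi = min(data) - 3.0 * h, max(data) + 3.0 * h
    grid = [lo + (hi - lo) * i / (grid_size - 1) for i in range(grid_size)]
    dens = gaussian_kde(data, grid, h)
    floor = min_rel_height * max(dens)
    return sum(1 for i in range(1, grid_size - 1)
               if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1] and dens[i] >= floor)

def normal_quantiles(mean, sd, n):
    """Deterministic stand-in for a normal sample: n evenly spaced quantiles."""
    dist = NormalDist(mean, sd)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]

# Synthetic data: one group versus two well-separated groups.
one_group = normal_quantiles(0.0, 1.0, 500)
two_groups = normal_quantiles(0.0, 1.0, 300) + normal_quantiles(5.0, 1.0, 700)

modes_one = count_prominent_modes(one_group)
modes_two = count_prominent_modes(two_groups)
```

A bump count like this is only a starting point for K: components that overlap heavily will not show up as separate bumps, which is one reason you sometimes have to hold on to a K value that the picture alone does not suggest.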
Autumn of 2014, from Chestnut Hill, Massachusetts