Sunday, December 1, 2013

Why do many feel that ROI on big data is not paying off? 12/2013

        In the past 2 years, I have seen Java developers and computer scientists firing up Java and other fancy programming tools to do ... One example: building objective functions for a regression. With all due respect, for statistical software companies like SAS and SPSS this was cutting edge back when Clinton was in the White House. Some today have asked SAS to open up its far more advanced big data analytics software and show them how its objective function is built, so they can validate the one they built in R. If you don't know David Ricardo's 'comparative advantage', Yahoo it now. Please don't tell me you should spend far more than ten weeks of your time (charging your client ~$300/hour?) building a pair of shoes instead of spending $2000 to buy a pair off the shelf (very likely better built than your homemade pair), because of what? Because you are not a statistician?

    If your goal is to start a business in advanced analytics, hoping to go for an IPO and strike it big, that is fine, and starting from scratch, even coding from ground zero (or asking the open source community to contribute to your cause for free), is probably a necessary path. For 99% of us in big data analytics, it is about enhancing the core business at hand. Why do more and more people today seem to feel that ROI on big data is not paying off? Forgetting your core competence, your specialty and your core business amid the big data fever is one key reason: if you are not able to articulate "why not", this inability quickly becomes "why yes". This way of investing has obvious logic problems and is anti-analytics per se. What is the SIC (Standard Industrial Classification) code for analytics? There is none, because analytics permeates every SIC code. Your job is to hold onto your SIC, and adopt and modernize your analytics.

    I see speeches and blogs where people toss out new terms and concepts on big data analytics. Often 5 minutes later I realize "oh, is that just what statisticians call clustering? Kernel estimation? ....."  Not many read deeply into the past statistics literature these days. For some, if they cannot find it, they start to think they have an innovation on their hands. One day I was asked to take a look at 'a design'. I suggested applying a KS (Kolmogorov-Smirnov) test. That test eventually eliminated ~750K lines of Java code the developer had been writing for more than 3 months. KS test?  Isn't that what statisticians have been doing behind banks' firewalls for more than 15 years? Now you spent another 3 months coding KS in Java. You could not match SAS. You switched to SPSS. Still nowhere close, while SAS and SPSS have turned in consistent results (apart from cosmetic differences) on the same data set... My point? Integrity, whether on big data or small data, is far more important than scalability; scalability is actually the easier part.
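    For readers who have not met the KS test, here is a minimal sketch of what the two-sample KS statistic computes: the largest gap between the empirical distribution functions of two samples. The function name and the sample scores are my own illustrative assumptions, not the design or data from the story above, and the sketch leaves out the p-value entirely. It is here to show the idea; for real work a vetted implementation (SAS, SPSS, R's ks.test) is the whole point of this post.

```python
from bisect import bisect_right


def ks_two_sample_statistic(sample_a, sample_b):
    """Return D = max |F_a(x) - F_b(x)|, the two-sample KS statistic.

    F_a and F_b are the empirical CDFs of the two samples. Both are step
    functions that only change at observed values, so it is enough to
    evaluate the gap at every observed point.
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")

    a = sorted(sample_a)
    b = sorted(sample_b)
    n_a, n_b = len(a), len(b)

    d = 0.0
    for x in a + b:
        # Fraction of each sample that is <= x (right-continuous ECDF).
        cdf_a = bisect_right(a, x) / n_a
        cdf_b = bisect_right(b, x) / n_b
        d = max(d, abs(cdf_a - cdf_b))
    return d


if __name__ == "__main__":
    # Hypothetical, made-up scores for two groups (illustration only).
    group_1 = [0.1, 0.2, 0.3, 0.4]
    group_2 = [0.6, 0.7, 0.8, 0.9]
    print(ks_two_sample_statistic(group_1, group_2))  # fully separated: 1.0
```

    A statistic near 0 says the two samples look alike; a statistic near 1 says they barely overlap. Turning D into a p-value, handling ties and matching another package digit for digit is where the hard-won details live, which is exactly where the Java rewrite in my story got stuck.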

   Instead of checking out the fashion labels on each other's jackets, statisticians and non-statisticians should work together on big data analytics. Recently I had the honor of reviewing a friend's paper. I was very impressed by her creativity and her ability to use R. Then she asked, "When do you think SAS is going to implement it?" "Why do you ask?" I smiled into the webcam. A sheepish look came over her face: "You know, the authenticity part....." Creativity, nimbleness and flexibility should meet and marry the 'king of algorithms'. The offspring should benefit all of us. If you want to exceed a giant, try standing on its head or shoulders to grow. If you choose to start afresh at its side, chances are you will live in its shadow for a long time, if not for life.

   Another friend is a division chief at a big NYC hospital. Two days ago he told me his medical school is hiring computer scientists to work with biostatisticians. I am also seeing banks hire analysts with more diverse backgrounds, like physics majors doing predictive modeling. This trend towards a multidisciplinary mix is healthy. Let us not dumb down or count out any major. Statistics is going to be a stalwart in big data for a long time to come. If you don't learn and adapt quickly, you become irrelevant, regardless of your major. In my experience, learning statistics is harder than learning machine learning. If you 'hate' statistics for that reason, I fully appreciate it and am with you, especially if the market does not appear to pay statisticians as much as it pays data scientists. On the other hand, if you take away the coding, programming and system building, how much analytics is really left in much of data science? See for yourself.