I focus here on association analysis; if you introduce the temporal order of the transactions, you can easily extend the same approach to sequence analysis.
The SASIMSH system used for this post is the same as the one used for my post dated 12/14/2014, "SAS High Performance Analytics and In-Memory Statistics for Hadoop: Two Genuine in-Memory Math Blades Working Together". Here is some information on the data set used.
The data set is a simulated transaction data set consisting of 12 monthly transaction files, 25 million transaction entries each, totaling 300 million rows. The total size of the data set is ~125 GB. Below is the monthly distribution.
T_weekday counts how many transactions happen on Sunday, Monday, Tuesday, ..., Saturday. T_week counts how many transactions happen in week 1 through week 52 of the year. These segment variables are created in case you want to break down your analysis.
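Segment variables like these are simply derived from the transaction date. Here is a minimal Python sketch of the idea (the function name is mine; note it uses ISO week numbering, which can differ from the week convention SAS's WEEK() function applies):

```python
from datetime import date

def segment_vars(txn_date: date):
    """Derive two segment variables from a transaction date:
    T_weekday: day-of-week name (Sunday ... Saturday)
    T_week:    ISO week of the year (1..53)
    """
    names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
    t_weekday = names[txn_date.weekday()]      # weekday() is 0=Monday
    t_week = txn_date.isocalendar()[1]         # ISO week number
    return t_weekday, t_week

print(segment_vars(date(2014, 12, 14)))  # -> ('Sunday', 50)
```

Tabulating these two values over all 300 million rows gives exactly the weekday and weekly distributions reported above.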
Below is the main body of the ARM modeling code.
1. The two "Proc LASR" sections create the LASR in-memory analytics process and load the analytics data set into it. The creation process took ~10 seconds and the loading process took ~15 seconds (see picture).
2. The FREQUENCY statement simply profiles the variables whose distributions I reported above.
3. The ARM statement is where the core activities happen.
- ITEM= is where you specify the product-category variable. You have full control over which level of the product hierarchy to model at.
- TRAN= is where you specify the granular level of the transaction data. There are ~9 million unique accounts in this exercise. If you instead choose a level that has, say, 260 unique values (with proper corresponding product levels), you can easily turn the ARM facility into a BI reporting tool, closer to what IMSTAT's GROUPBY statement does.
- You can use MAXITEMS= (and/or MINITEMS=) to customize the item-set sizes considered during compilation.
- FREQ= is simply the order count of the item. While FREQ= carries a more 'physical, accounting-book' weight (and is therefore less analytical, by definition), WEIGHT= weighting is more analytical and intriguing. I used list price here, essentially compiling support in terms of individual price importance, assuming away any differential price-item elasticity and a lot more. You could easily build a separate model to study this weight input alone, which is beyond the scope of this post.
- The two aggregation options allow you to decide how item aggregation and ID aggregation should happen. If WEIGHT= is left blank, both aggregations ignore the aggregation values you plug in and default to SUM, which simply adds up. Ideally, each aggregation would use its own weight variable; for now, if you specify WEIGHT=, that weight variable is used for both. If you are really that 'weight' sensitive, you can run the aggregations one at a time, which does not take much more time or resources.
- The ITEMSTBL option requests that a temporary table be created in-memory mid-flow for further actions during the in-memory process; this is the table the system-reserved keyword &_tempARMItems_ refers to in the next step. This is different from what the SAVE option generates: SAVE typically outputs a table to a Hadoop directory "when you are done".
- The list of options commented out in GREEN shows that you can customize the support output; when generating rules or association scores, you don't have to follow the same configuration used when the ARM model was fit above.
- The _T_ table is the temporary table created. You can use the PROMOTE statement to make it permanent.
- _SetSize_ simply tells the number of products in the combination.
- _Score_ is the result of your (double) aggregation. Since you can select one of four aggregation options (SUM, MEAN, MIN, MAX) for either aggregation (ITEMAGG and AGG), you need to interpret the score according to your choices.
5. This whole flow, while sounding cliché content-wise, takes only ~8 minutes to finish over 300 million rows.
The gap between CPU time and real time is pretty large, but I care less about that since the overall run is only ~8 minutes.