This is what the client interface looks like. JBoss is not the best, but it works OK.
This random forest model uses 280 interval variables and only 3 categorical variables against a binary target, on ~1.6 million rows. A snapshot of the SAS log is below.
- It took about 22 minutes to finish a random forest model, with 5 other big jobs running concurrently.
- I ran it 5 times. It gives the same result each time, so it is very consistent. The quickest run took 20 minutes 14 seconds and the longest took more than 26 minutes, so the run time does not vary much. I could reduce it to seconds, but real-time is not always necessary.
- I changed vars_to_try from 3 to 17: 17*17=289 is the closest perfect square to 283, the total number of input variables. The model improves quite a bit in terms of misclassification rate. It costs ~5 more minutes on average.
- The data set I have is small, so I ran it on a small Hadoop cluster as a test. For jobs involving bigger data sets, you need to maintain and expand your clusters and grid network.
- This mode, to use a term stolen from the large-scale predictive learning community, is an in-memory model. It appears that SAS is getting ready to 'industrialize' random forest models on large-scale data.
- I plan to publish some practical notes on how to prepare data for random forest modeling. Many have the mindset that building random forest models is like pushing iPhone buttons, a way to avoid the typically lengthy exploratory data analysis involved in building, say, a logistic regression model. GOOD random forest models, however, require data preparation and tuning, just like GOOD logistic regression. The difference, in terms of dollars and sense, can be like heaven and earth in some cases.
- SAS HPA already supports Apache Hadoop. Will SAS run on MapReduce? We will see.
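The vars_to_try choice in the list above appears to follow the familiar rule of thumb of trying roughly the square root of the number of inputs at each split. Here is a minimal sketch of that arithmetic in Python, assuming the 283 inputs (280 interval plus 3 categorical) quoted earlier. The helper name `sqrt_vars_to_try` is hypothetical and is not part of SAS.

```python
import math


def sqrt_vars_to_try(n_inputs: int) -> int:
    """Return the smallest integer whose square is at least n_inputs.

    Hypothetical helper for illustration only; not a SAS function.
    """
    root = math.isqrt(n_inputs)  # floor of the square root
    return root if root * root == n_inputs else root + 1


n_inputs = 280 + 3  # interval + categorical inputs, as stated in the post
k = sqrt_vars_to_try(n_inputs)
print(k, k * k)  # candidate vars_to_try and its square
```

With 283 inputs this gives 17, whose square (289) is the nearest perfect square, which matches the value used in the post.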
Hi Jason,
I am a beginner in predictive analytics in SAS. I am currently using SAS EG 4.3, and PROC HPFOREST is not available.
Please let me know in which version of SAS it is available and how I should proceed to implement it.
Thanks,
Thanks for this. Your comment in #6 above is extremely important. I think such a paper would be well received and quite an important contribution.