At the recent Big Data Workshop held by the Boston Predictive Analytics group, airline analyst and R user Jeffrey Breen gave a step-by-step guide to setting up an R and Hadoop infrastructure. First, as a local virtual instance of Hadoop with R, using VMware and Cloudera's Hadoop Demo VM. (This is a great way to get familiar with Hadoop.) Then, as a single-machine cloud-based instance with lots of RAM and CPU, using Amazon EC2. (Good for more Hadoop experimentation, now with more realistic data sizes.) And finally, as a true distributed Hadoop cluster in the cloud, using Apache Whirr to spin up multiple nodes running Hadoop and R.
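For the curious, the Whirr step boils down to a properties file and a single command. Here's a minimal sketch, not Jeffrey's exact configuration: the cluster name, node counts, and credential placeholders are illustrative, and getting R onto the nodes is an additional step covered in the tutorial.

    # hadoop.properties -- minimal Whirr cluster definition (illustrative)
    whirr.cluster-name=rhadoop
    # one master (namenode + jobtracker) and four workers
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,4 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}

    $ bin/whirr launch-cluster --config hadoop.properties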
With that infrastructure set up, you're ready to start using RHadoop to write map-reduce jobs in R. The final part of Jeffrey's workshop is a tutorial on the rmr package, with a worked example of loading a large data set of airline departures and arrivals into HDFS and using an R-based map-reduce task to calculate the scheduled (orange) and actual (yellow) total flight hours in the US over the last decade or so. (Actual time spent in the air is also shown, in blue.)
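If you just want the flavor of what such a job looks like, here is a minimal sketch (not Jeffrey's exact code). It assumes the rmr2-style API, a hypothetical HDFS path, and the standard ASA on-time data layout, where columns 1, 12, 13, and 14 hold Year, ActualElapsedTime, CRSElapsedTime, and AirTime, all in minutes.

    # Minimal sketch of an rmr map-reduce job totalling flight hours by year.
    # Assumptions: rmr2-style API; airline CSV (header removed) already in
    # HDFS; ASA on-time layout: col 1 = Year, col 12 = ActualElapsedTime,
    # col 13 = CRSElapsedTime, col 14 = AirTime (all in minutes).
    library(rmr2)

    flight.hours <- mapreduce(
      input        = "/data/airline",    # hypothetical HDFS path
      input.format = "text",             # one CSV line per record
      map = function(., lines) {
        f <- do.call(rbind, strsplit(lines, ","))
        mins <- suppressWarnings(cbind(
          scheduled = as.numeric(f[, 13]),
          actual    = as.numeric(f[, 12]),
          air       = as.numeric(f[, 14])))
        keyval(f[, 1], mins / 60)        # key = year, values in hours
      },
      reduce = function(year, hours) {
        keyval(year, t(colSums(hours, na.rm = TRUE)))
      })

    # the per-year totals are small, so pull them back into R for plotting
    totals <- from.dfs(flight.hours)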
By the way, if you're not familiar with RHadoop, project lead Antonio Piccolboni presented an overview at last month's Strata conference; I've included his slides from that presentation below.
Jeffrey Breen: Slides from today’s Big Data Step-by-Step Tutorials: Infrastructure series and Intro to R+Hadoop with RHadoop’s rmr
Please try this link: http://ihadoop.blogspot.in/ for a basic tutorial and step-by-step hands-on instructions.
Posted by: itrainer | November 12, 2012 at 10:06
There are many Big Data problems whose output is also Big Data. In this presentation we show Splout SQL, which can serve an arbitrarily big dataset by partitioning it. Splout is to Hadoop + SQL what Voldemort or ElephantDB are to Hadoop + Key/Value. When the output of a Hadoop process is big, there isn't a satisfactory solution for serving it. Splout decouples database creation from database serving, which makes it efficient and safe to deploy Hadoop-generated datasets. Splout is not a "fast analytics" engine; it is made for demanding web or mobile applications where query performance is critical. On top of that, Splout is scalable, flexible, RESTful, and open-source.
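As a rough illustration of the serving side, here is a hypothetical sketch of querying a deployed tablespace from R. The QNode port, endpoint shape, and the tablespace/table/key names are assumptions for illustration, not taken verbatim from the Splout SQL docs.

    # Hypothetical sketch: querying a deployed Splout SQL tablespace from R.
    # The QNode REST endpoint shape and default port are assumptions here;
    # the tablespace, table, and partitioning key are made up.
    qnode <- "http://localhost:4412"
    sql   <- URLencode("SELECT * FROM flights WHERE origin = 'BOS' LIMIT 10")
    resp  <- readLines(sprintf("%s/api/query/flights?key=BOS&sql=%s", qnode, sql),
                       warn = FALSE)
    cat(resp)   # JSON result set; the 'key' parameter picks the partition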
Check it out and give us 3 votes if you like our project ;-)
http://hadoopsummit2013.uservoice.com/forums/185448-integrating-hadoop/suggestions/3409779-splout-sql-when-big-data-output-is-also-big-data-
Posted by: BigData | December 14, 2012 at 07:36