If you want to get started doing data science with R in the cloud, a good place to start is Stephen Elston's free O'Reilly report, Data Science in the Cloud with Azure ML and R. But if you learn better with a show-and-tell approach, he now also has an O'Reilly Video Training course, Data Science with Microsoft Azure and R. The first part of the course is free, and includes an overview of Azure ML Studio (the browser-based drag-and-drop data science workflow tool), using the built-in data import, manipulation, and modeling modules in Azure ML, and using the Execute R Script node to run custom R code. Stephen takes you through the step-by-step process of writing and testing the R code in RStudio, then running it as part of the workflow with ML Studio.
The remainder of the course must be purchased to view (current price is USD$119.99), and covers advanced R topics including the dplyr and ggplot2 packages, statistical modeling (including regression, time series and random forests), and writing functions in R. There's also a chapter on publishing Azure ML models as Web services. To get started with the free part of the course, follow the link below.
O'Reilly Video Training: Data Science with Microsoft Azure and R
by Joseph Rickert
Early October: somewhere the leaves are turning brilliant colors, temperatures are cooling down and that back to school feeling is in the air. And for more people than ever before, it is going to seem to be a good time to commit to really learning R. I have some suggestions for R courses below, but first: What does it mean to learn R anyway? My take is that the answer depends on a person's circumstances and motivation.
I find the following graphic to be helpful in sorting things out.
The X axis is time on Malcolm Gladwell's "Outliers" scale. His idea is that it takes 10,000 hours of real effort to master anything, be it R, Python or Rock and Roll Guitar. The Y axis lists increasingly difficult R tasks, and the arrows within the plot area label increasingly proficient types of R users.
The point I want to make here is that a significant amount of very productive R work happens in the area around the red ellipse. So, while there is no avoiding "10,000" hours of hard work to become an R Jedi knight, a curious and motivated person can master enough R to accomplish his/her programming goals with a more modest commitment. There are three main reasons for this:
If you have some background in some area of statistics or data science a viable strategy for learning R is to identify a resource that works for you and just jump into the middle of things, picking up R as you go along.
The lists below link to courses that can either start you on a formal programming path, or help you become a productive R user in a particular application area. Some of the courses are "live events" that you take with a cohort of students, others are set up for self study.
The courses devoted to teaching R as a programming language are
The first two courses above are from Coursera's Data Science Specialization sequence. Taught by Roger Peng, Jeff Leek and Brian Caffo, they are probably the gold standard for MOOC R courses. I am a little late with this post: The Data Scientist's Toolbox started this past Monday, but there is still time to catch up. The third course, Introduction to R Programming, is a relatively new edX course from Microsoft's online offerings that is getting great reviews. The fourth course on the list is a solid introduction to R from DataCamp. R Programming - Introduction 1 is a beginner's introduction to R taught by Paul Murrell or Tal Galili. Next listed are a Spanish language introduction to R from Coursera and O'Reilly's interactive Code School course.
These next three lists contain courses from DataCamp and statistics.com and online resources from RStudio that introduce more advanced features of R by building on basic R programming skills. Note that the final course on the DataCamp list introduces Big Data features of Revolution R Enterprise, which is available in the Azure Marketplace.
This next section lists courses from the major MOOCs, and from the non-MOOCs DataCamp and statistics.com, that use R to teach various quantitative disciplines.
Coursera Courses
edX Courses
Udacity Course
DataCamp
statistics.com
Finally, here are a couple of google apps and Swirl, a new platform for teaching and learning R that may be useful for learning on the go.
It's time to "go back to school" and make some headway against those 10,000 hours.
Microsoft is sponsoring another free MOOC starting on September 24: Data Science and Machine Learning Essentials. This course provides a five-week introduction to machine learning and data science concepts, including the open-source programming tools for data science: R and Python. (Read more about the course in this post on TechNet.) The course is organized into 5 weekly modules, each concluding with a quiz (and, if you wish, you can purchase a verified certificate from edX to show off your passing grade).
The course is presented by Cynthia Rudin (Professor of Statistics at MIT) and Steve Elston (author of Data Science in the Cloud with Azure ML and R), who will also participate in the course forum, and host office hours to answer questions that come up during the course.
If you're new to R, you might want to get prepared by reviewing the materials from the previous Microsoft-sponsored edX course, Introduction to R. The new course on Data Science Essentials begins online on September 24, and you can register for free at the link below.
by Ari Lamstein, Software Engineer and Data Analyst
Creating an email course for my R packages has significantly increased the number of people who use the packages. It has also reduced the learning curve for the packages and brought me into greater contact with my users. In this post I will share the 5 steps I took to create my course Learn to Map Census Data in R.
My hope is that this will encourage other package authors to create similar courses, which will in turn improve the accessibility of R to the public.
Before explaining the steps, though, I’d like to share how the course came to be. When I first started blogging in March I took John Sonmez’s free email course on blogging. I found the format to be both novel and effective. The lessons were small, well spaced and contained manageable homework assignments. Also, following up with John was as easy as hitting “reply” on the email. So when I started getting requests for an online version of the tutorial I ran in May, I decided to create my own email course.
Your course needs an online “home” where you can announce it, people can signup, and so on. WordPress has become the most popular option for this. While you can host your site for free at wordpress.com, it will limit the amount of customization you can do. I pay $3.49/month to host my site with bluehost. Bluehost allows me to customize the site in any way I see fit.
After creating your blog, be sure to add it to R-bloggers. This will make your posts immediately visible to the R world.
WordPress comes with functionality that allows anyone to subscribe to your blog. This allows them to be emailed when you publish new posts. Email courses, however, are more complicated than this. You want people to receive a series of pre-written emails that are spaced over several days.
To achieve this I use the email automation feature in MailChimp. It costs $10/month. Automation (sometimes called “autoresponders”) is now a fairly standard feature among email providers. Other companies in the space are AWeber and Drip.
How you structure your course is up to you. When creating an automation MailChimp defaults to five emails. I decided to stick with that, which meant that the first email would be an introduction and the last email would be a conclusion. This left three emails to be the main content of the course. What were the three most important things I wanted people to learn? My answer was: “Create maps of census data for states, counties and ZIP codes.”
My next question was how to structure the lessons. In each email I walk through one example and then assign a closely related homework assignment. Because I want to communicate with my students, and because one-to-one email doesn’t scale, I asked people to tweet me their solutions using the hashtag #CensusCourse. This format seems to have worked well.
Now that you have a course you need people to sign up for it. Signup forms (or “Opt In Forms”) are the forms that make this happen. The signup form needs to be clearly visible and easy to understand while not being spammy. There are countless WordPress plugins for this. I personally like Optin Cat, which is free.
Once people can sign up for your course you need to announce it to the R community. I recommend writing a blog post that explains the goal and lessons of the course in some detail. You can see the post I wrote introducing my course here.
Note that if your blog is a part of R-bloggers (see Step 1) then a large number of people will automatically know about your course. I also recommend tweeting about your course with the hashtag #rstats.
These are the 5 steps I took to create my R email course Learn to Map Census Data in R. While some industries have embraced email courses as a way to introduce people to a product, it has not yet caught on for R packages. That being said, my experience using it has been very favorable. More importantly, my users seem to have really enjoyed it. I hope that after reading this post more package authors give email courses a try, and that it leads to greater engagement in the R community.
Ari Lamstein is a Software Engineer and Data Analyst in San Francisco. He blogs at AriLamstein.com. You can sign up for his free email course Learn to Map Census Data here.
by Joseph Rickert
Last week, I was fortunate enough to attend the R Summit & Workshop, an invitation-only event held at the Copenhagen Business School. The abstracts for the public talks presented are online and well worth a look. Collectively they provide a snapshot of the state of development of R and the R Community, as well as some insight into the directions in which researchers are moving to expand the boundaries of R.
Real highlights of the event were talks by Jennifer Bryan and Mine Çetinkaya-Rundel, two educators who are channeling enormous amounts of energy into teaching statistics and statistical programming, and into developing new pedagogical methods to improve the learning experience for both students and teachers alike. Both Mine and Jennifer are committed to R, as well as to using state of the art developer tools such as RStudio, R Markdown, Git and GitHub.
If you clicked on the link to Jennifer’s university home page above and expected to see more content there you are probably not running with the in crowd. Anybody who aspires to bask in the faintest glow of tech cool is hanging out on GitHub. So go to github.com/jennybc to find a place where social media, software development and best practices collide to generate a state-of-the-art learning platform.
What Jennifer has created there is the R version of a full immersion experience for learning a new language. Just like you can’t separate the highs and awkward lows of human interactions from tripping over the grammar while learning Japanese on the streets of Tokyo, at jennybc you have to cope with GitHub, R Markdown and complying with best practices while doing your homework, seeking help, and learning from your peers.
As Jennifer writes in her synopsis of her R Summit & Workshop talk:
I've formed strong opinions about workflows for R Markdown + GitHub and what the big wins are.
Mine, who focuses on undergraduate education, points out that: “R is attractive because, unlike software designed specifically for courses at this level, it is relevant beyond the introductory statistics classroom, and is more powerful and flexible.” Poke around Mine’s GitHub page and you will see that she is all about R, open source, reproducibility and teaching good habits and right values. In addition to her work in the classroom, Mine has developed the Coursera course: Data Analysis and Statistical Inference, is a coauthor of three, free R based textbooks and is a driving force behind the ASA Datafest competitions.
Students are often ambivalent as to whether they are looking for education or training. But if you are a student of either Mine or Jennifer you are going to get some of both, and have a real shot at launching a productive R fueled career.
by Bob Horton, Data Scientist, Revolution Analytics
From electronic medical records to genomic sequences, the data deluge is affecting all aspects of health care. The Masters of Science in Health Informatics (MSHI) program at the University of San Francisco, now in its second year, is designed to help students develop the practical computing skills and quantitative perspicacity they need to manage and exploit this wealth of data in health care applications.
This spring, I am privileged to participate in this effort by developing and teaching a new course, “Statistical Computing for Biomedical Data Analytics”, intended to motivate and prepare students for further studies in data science, such as the intensive summer courses of the MSAN bootcamp. The syllabus is on github.
As you’ve probably guessed, we will be using R. Other courses in the curriculum use Python, which seems to be favored by engineers; in contrast, R was developed by and for statisticians. We want the students to be exposed to both perspectives, and to have the technical background needed to make use of the extensive repositories of code available from CRAN and Bioconductor.
Data science is an interdisciplinary endeavor born of the synergy between computing, statistics, data management, and visualization. This can make it challenging to get started, because you have to know so many things before you get to the good stuff. We’re going to try to ease into it by starting with computational explorations of mathematical and statistical concepts. R is a fantastic environment for this; you can see a bell-shaped curve emerge from an example as simple as
plot(0:20, choose(n=20, k=0:20))
Note the expressive power of the vector of k values, and the easy convenience of having a world of statistical functions at your fingertips. Imagine how this little plot would have delighted Sir Francis Galton.
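To make the "bell-shaped curve" remark concrete, here is a minimal sketch that is not part of the course materials. It rescales the same binomial coefficients into fair-coin probabilities and overlays a normal density with matching mean and standard deviation. The variable names (`probs`, `normal_density`) are illustrative choices only.

```r
# Binomial coefficients for n = 20, as in the one-line plot above
n <- 20
k <- 0:n
coefs <- choose(n, k)

# Dividing by 2^n turns the counts into fair-coin binomial probabilities
probs <- coefs / 2^n

# Normal density with the same mean (n / 2) and standard deviation (sqrt(n) / 2)
normal_density <- dnorm(k, mean = n / 2, sd = sqrt(n) / 2)

# Spikes show the exact probabilities; the red line is the normal curve
plot(k, probs, type = "h", xlab = "k", ylab = "probability")
lines(k, normal_density, col = "red")
```

The vectorized `k` does the same work here as in the original one-liner: every function call operates on all 21 values at once.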
Data science is a journey. The enormous breadth of material and the rapid pace of development mean that the most important thing to learn is how to learn more. We’ll explore many fantastic resources for learning data science and R. For example, Coursera has excellent offerings, exemplified by the series of mini-courses from Johns Hopkins; our students will take at least one of these as a course project.
Of course, the R community itself is the biggest and most important resource. One class will be a field trip to a Bay Area useR Group (BARUG) meeting, and the comments in response to this post will be required reading. Ideas or suggestions regarding the syllabus or course materials from the github repository are welcome, as are observations or ruminations on the process of learning data science and R.
Finally, we are very interested in helping our students find outstanding internship opportunities in health-related organizations. Please don’t hesitate to contact me through community@revolutionanalytics.com if you are interested in working with us. Stay tuned for progress reports.
Harvard University is offering a free 5-week on-line course on Statistics and R for the Life Sciences on the edX platform. The course promises you will learn the basics of statistical inference and the basics of using R scripts to conduct reproducible research. You'll just need a background in basic math and programming to follow along and complete homework in the R language.
Since this is a new course, I haven't seen any of the content, but the presenters Rafael Irizarry and Michael Love are active contributors to the Bioconductor project, so it should be good. The course begins January 19 and registration is open through 27 April at the link below.
by Joseph Rickert
The American Statistical Association (ASA) Undergraduate Guidelines Workgroup recently published the report Curriculum Guidelines for Undergraduate Programs in Statistical Science. Although intended for educators setting up or revamping Stats programs at colleges and universities, this concise, 17 page document should be good reading for anyone who wants to take charge of their own education in learning to "think with data". Whether you are just getting started with your education or you are a working professional contemplating what to learn next to expand your knowledge and update your skills you should find the ASA report helpful.
The report places good statistical practice firmly on the foundation of the scientific method and locates statistical knowledge and skills squarely in the center of modern data analysis.
However, it is far from being a complacent panegyric to statistics. The ASA report challenges educators to help students see that the "discipline of statistics is more than a collection of unrelated tools" and explicitly calls for an increased emphasis on data science and big league computational skills. Graduates of statistical programs:
should be facile with professional statistical software and other appropriate tools for data exploration, cleaning, validation, analysis, and communication. They should be able to program in a higher-level language, to think algorithmically, to use simulation-based statistical techniques . . . Graduates should be able to manage and manipulate data, including joining data from different sources and formats and restructuring data into a form suitable for analysis.
The expectations for communication skills are particularly noteworthy. The report says:
Graduates should be expected to write clearly, speak fluently, and construct effective visual displays and compelling written summaries. They should demonstrate ability to collaborate in teams and to organize and manage projects.
One could argue about the details of the topics that should be included in an undergraduate program. But clearly the committee is aiming for far more than producing minimally competent, employable graduates. They are outlining a way of life, a competent way of being in a data driven world.
Hidden among the white papers listed on the ASA curriculum guidelines page is a treasure: Tim Hesterberg's paper on What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Curriculum. This is a lucid and fairly deep explication of bootstrapping and resampling techniques that deserves wide circulation. Tim writes that he had three goals in producing the paper: "(1) To show the enormous potential of bootstrapping and permutation tests to help students understand statistical concepts . . . (2) To dig deeper . . . (3) To change statistical practice . . ."
Point (3) may sound astoundingly ambitious. However, it is grounded in a revolution that has been quietly gaining strength and whose time has come. Textbooks that rely on R based simulations to teach probability (e.g. Baclawski) and statistics (e.g. Matloff) have been available for some time, and Tim points out that undergraduate textbooks such as Chihara and Hesterberg which use resampling as the fundamental unifying idea are beginning to appear. Moreover, data scientists outside of the community of academic statisticians are well aware that programming skills more than compensate for a traditional statistics education that presents the subject as a collection of unrelated tests and techniques as this Strata Hadoop world presentation from John Rauser makes clear.
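For readers who have not met the bootstrap, the following is a minimal sketch of the basic idea. It is not code from Tim's paper. The data are simulated, and the sample size, seed, and number of resamples are arbitrary assumptions chosen for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical sample: 50 draws from a skewed (exponential) distribution
x <- rexp(50, rate = 1)

# Resample the data with replacement many times, recording the mean each time
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))

# Bootstrap standard error and a 95% percentile interval for the mean
boot_se <- sd(boot_means)
boot_ci <- quantile(boot_means, c(0.025, 0.975))

# For comparison, the textbook formula s / sqrt(n)
formula_se <- sd(x) / sqrt(length(x))
```

Comparing `boot_se` with `formula_se` shows students that the resampling answer agrees closely with the formula they are usually asked to memorize, and the same few lines work unchanged for statistics such as the median, where no simple formula exists.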
It is very good, indeed, to see the ASA leading the charge for change.
by Jeremy Reynolds
Senior R Trainer, Revolution Analytics
Last week, Revolution Analytics released its first massive open online course through a partnership with datacamp.com: Introduction to Revolution R Enterprise for Big Data Analytics. You can sign up for the free course here.
This course provides a look at some of the tools provided by the RevoScaleR package that ships with Revolution R Enterprise. The course and the interactive training framework provided by the platform allow you to get a feel for how you can manipulate, visualize, and analyze large datasets with RevoScaleR.
There are four “chapters” within the course:
We are very excited about our partnership with datacamp.com — the platform provides a unique, hands-on training environment in which you can practice in a live R environment, and you can “learn by doing.” We are diligently working to create more extensive content, including courses on the Fundamentals of the R Programming Language, Introductory Statistics with Revolution R Enterprise, Predictive Modeling, and Advanced R Programming. Stay tuned for more!
DataCamp: Introduction to Revolution R Enterprise for Big Data Analytics
By Neera Talbert, VP Services and Ben Wiley, R Programmer at Revolution Analytics
By now, everyone should be familiar with the data scientist boom. Simply logging onto LinkedIn reveals a seemingly infinite number of people with words and phrases like “Data Scientist”, “Big Data Specialist”, and “Analytics” in their title. A few weeks ago, an article floated around the internet about how R programmers are the highest paid software engineers in industry. But the career of a data scientist is hot not only because it’s highly lucrative; drawing conclusions from data is itself a rewarding process, since these conclusions often shape our future.
As anyone would expect in such an attractive new and emerging field, a lot of people are noticing. So how do you distinguish yourself in a job application to an analytics position? Or, from a company’s perspective, how can you sift through the numerous applications of individuals with analytics backgrounds and choose the one that is best suited to finish the project? One of the tough aspects of the data scientist role is that its definition is extremely broad. Upon a closer inspection of LinkedIn profiles with analytics positions, backgrounds include a variety of fields like applied and computational math, statistics, computer science, and so forth. With analytical aspects, even fields like biology or political science appear in these searches.
In other words, having one particular background cannot bar you from being a data scientist. At the same time, however, this in no way implies that the path to becoming a data scientist is easy. In fact, the loosely defined background requirements do more to attract top talent from many fields, rather than attracting only the talent from one. Having expertise in statistics and computer science no doubt helps quite a bit, but sometimes this is not enough to distinguish yourself on an application. There are many popular programming languages used for data analysis, and because these are often new and emerging, it can be difficult to assess one’s understanding of a particular language.
Certifications can be one effective way to convey to an employer that you truly know and understand a program or concept. Revolution Analytics now offers a professional certification that tests the most sought after R analytics skills for the enterprise, and can be an effective way to assess which applicants possess the necessary backgrounds in R and ScaleR programming. With R being the most widely used statistical language today, the Revolution R Enterprise Professional Certification can be a sure way to attract attention in the job market.
Another option for standing out as a data scientist is to attend graduate school. Of course, this path is much longer than obtaining a certification, but the effects can be lucrative as well. While selecting a good data science grad program is a blog post in and of itself, the obviously attractive fields for a data scientist are statistics and computer science. (There are now also a number of graduate programs devoted specifically to data science.) That’s not to say that other fields aren’t good options either – fields like genomics, biology, physics, economics, and others that heavily rely on data can be attractive paths for the prospective data scientist as well. The only concern to consider is again verifying that the skills gained in a grad program reflect industry’s expectations.
Finally, experience also helps. Having multiple years in an analytics position is a great way to convey one’s understanding of data science to employers, and is often a substantial consideration in a company’s evaluation of a candidate. Having a background as a programmer or analyst can be a good way to step into the data science position. Oftentimes a lack of experience is the greatest hurdle to entering the analytics profession, and so not everyone has the background described above.
Despite the difficulty in attracting an employer’s attention, entering the field of data science is totally worth it. With articles about “Big Data” and “Cloud Computing” emerging every day on the internet, being a data scientist no doubt puts you at the edge of modern-day technological development, and gives you the ability to make a substantial contribution to society. Plus there’s the pay…
Revolution Analytics AcademyR: Revolution R Enterprise Professional Certification