by Micheleen Harris
Microsoft Data Scientist
As a Data Scientist, I refuse to choose between R and Python, the top contenders currently fighting for the title of top Data Science programming language. I am not going to argue about which is better or pit Python and R against each other. Rather, I simply suggest playing to the strengths of each language and considering using them together in the same pipeline if you don't want to give up the advantages of one over the other. This is not a novel concept. Both languages have packages/modules that allow the other language to be used within them (rpy2 in Python and rPython in R). Even in Jupyter notebooks running the python kernel, one can use "R magics" to execute native R code (which actually relies on rpy2).
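To give a flavor of the direct route, here is a minimal sketch of calling R from Python with rpy2 outside of any notebook (the toy expression is mine, purely for illustration):

import rpy2.robjects as robjects

# Evaluate an R expression and bring the result back as an R vector
r_mean = robjects.r("mean(c(1, 2, 3, 4))")
print(r_mean[0])  # 2.5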
I learned R and Python at about the same time. Having pretty equal footing in both languages makes pipelining them together, when the need arises, an attractive option, as I have my favorite aspects of each. It is widely agreed that R has crisp, clean, journal-quality graphics as well as an incredible arsenal of statistical packages. Python is a general-purpose language that many consider truly production-ready. But who says you can't do the heavy statistics, machine learning and/or graphics in R from within Python? This post is not about comparing the two languages, however; it is simply about options for pipelining them, and maybe a bit on why you would want to do so.
First things first. We need to decide on a platform, and here I'm focusing on notebooks. We could actually do all of this outside a notebook environment, but in general notebook systems are more sharable, interactive, and completely appropriate for demonstrations. If I were well-funded and wanted much more than a notebook, I'd probably go with Sense, a notebook-like, IDE-like pipelining system (you can demo it here). If I wanted to be daring, I'd go with Beaker notebooks, a promising new open-source polyglot notebook project. This time around, however, I'm going to go with the more established Jupyter notebook project, running a python 3.4 kernel and interacting with R 3.2.3 via "R magics" and the rpy2 python module.
Why notebooks? In a nutshell, I've found them a great place to learn, teach, share, and test code (see my Aside below for further explanation). It took some getting used to, but notebooks are booming right now in university courses and at both academic and industry conferences. Why Jupyter notebooks? One reason I particularly like is that kernels for over 50 languages have been developed for the Jupyter notebook system thus far, including ones for Scala and Julia, two increasingly popular languages in the data science arena.
Getting back to R and Python, here are two notebook snippets I created.
The first demonstrates loading the R ipython extension, creating a python pandas dataframe, passing this as input to R, and graphing the data with R's ggplot2 package.
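Roughly, that first notebook looks like the sketch below (this is not the actual notebook; the column names and values are made up for illustration, and depending on your rpy2 version you may also need the pandas2ri.activate() step mentioned in the comments). The first cell loads the extension and builds a small dataframe:

%load_ext rpy2.ipython

import pandas as pd
df = pd.DataFrame({"x": list(range(10)), "y": [i ** 2 for i in range(10)]})

Then, in a separate cell, the R magic pulls df in and plots it with ggplot2:

%%R -i df
# df arrives as an R data.frame
library(ggplot2)
p <- ggplot(df, aes(x = x, y = y)) + geom_point()
print(p)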
The second demonstrates creating some data in python with numpy, passing this as input to R, performing a linear fit, graphing the results with R's plot, and passing the results of the fit (the coefficients) back to python for printing.
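Again roughly sketched (the data-generating code here is illustrative, though the magic line %%R -i x,y -o mycoef matches what is in the notebook):

import numpy as np
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + np.random.normal(size=50)

Then an R magic cell does the fit and hands the coefficients back:

%%R -i x,y -o mycoef
fit <- lm(y ~ x)       # linear fit in R
plot(x, y)             # base R scatter plot of the data
abline(fit)            # overlay the fitted line
mycoef <- coef(fit)    # intercept and slope, exported back to python

Back in python, print(mycoef) shows the fitted coefficients.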
Aside: When I first came across notebook systems, I really disliked them. I'd start writing a chunk of code in a cell and end up switching to my favorite IDE instead, abandoning the notebook. So, what happened? I started teaching. Creating modules in notebook format that students could open, interactively run, modify, and test seemed like an excellent (and fun) way to learn. In fact, I started learning this way myself, looking at notebooks I found online. And I've found recently that workshop presenters at conferences are using notebooks to teach. I could grab these notebooks and go back to them over and over on my own notebook server (which is fairly easy to set up). In the end it was a shift in perspective and comfort for me, much like learning a new language. I use notebooks now for teaching, learning, testing, sharing code, and documentation. I might even start blogging in a notebook system soon (this is already a thing people do).
Also, excitingly, Jupyter Notebook had a new release at the time of writing (4.1 came out January 8, 2016). Check out the announcement. Now we have multi-cell selection, find-and-replace, "Restart and Run All Cells", and some other nifty features.
Note: if you are trying to get the latest rpy2 working with the latest Jupyter notebook on Windows, be warned that you might run into a console-writing issue. That is, print statements might write to the terminal instead of the notebook browser window. If this happens, the bug is likely still active; contact the rpy2 maintainers. This happened to me with rpy2 2.7.6, Jupyter notebook 4.1, and R 3.2.2 on Windows 10.
Links: My sample notebooks above are here. For a different option for using R in Jupyter see Andrie de Vries's blog post. For more information on the rpy2 project and downloads look here.
For my purposes, showing and running the code has always been only half the battle, with the other half being in generating documents and automated reports to share the results with others, especially non-programmers. R + RStudio Server + knitr + Markdown/LaTeX has been my go-to for this. Can the R/Python + Jupyter system rival this?
Posted by: Steve | January 27, 2016 at 13:59
Hi Steve. I have done lots of reporting with knitr in RStudio and I know what you mean regarding the reporting battle. To me your question revolves around sharing results with a mostly non-technical audience. For that purpose I would stick with knitr+RStudio for two main reasons: 1) one can suppress the code output which might confuse non-tech folks and 2) I believe it simply looks more professional at this point.
For my purposes here, I wish to combine python and R into a pipeline, sharing variables from one code cell to the next. In knitr+RStudio, when using only R, variables persist from code chunk to code chunk. However, if you have a code chunk using the python engine in your Rmd document, variables do not persist to the next code chunk (even if your whole knitr markdown document uses the python engine). Knitr+RStudio markdown files are not what we like to call 'polyglot' (yet).
That being said, one can automate the creation of a report using a notebook file and the jupyter nbconvert command-line tool. See here.
Hope that helps.
Posted by: Micheleen Harris | January 27, 2016 at 16:09
Neat! I completely agree with your point - we should use the tools that are best fit for the job, instead of making unfounded/irrational decisions.
The first example notebook doesn't seem to work out of the box; I had to insert two lines before creating the dataframe object:
from rpy2.robjects import pandas2ri
pandas2ri.activate()
After this, things worked like a charm; the second example notebook worked fine the first time.
I'm using R 3.2.3 on Ubuntu 14.04, jupyter 4.0.6, python 3.4.3
Posted by: Edward Vanden Berghe | January 28, 2016 at 04:09
Hello Edward. Please ensure you ran the code cell:
%load_ext rpy2.ipython
You should not need to use pandas2ri if using the latest rpy2.
Which versions of rpy2 and pandas do you have? These are my specs:
R 3.2.3
Python 3.4.3
OS: Mac OS X 10.10.5
rpy2==2.7.6
pandas==0.16.2
notebook==4.0.0
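A quick way to check your own versions from a notebook cell (a minimal sketch; each of these packages exposes a __version__ attribute):

import rpy2, pandas, notebook
print(rpy2.__version__, pandas.__version__, notebook.__version__)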
Posted by: Micheleen Harris | January 28, 2016 at 09:38
Micheleen, awesome post. Beaker sounds like an awesome way to combine R, python and Julia. My only issue with notebooks is that they are not nearly as user-friendly as editing an R markdown file in an editor. Or is there a way to edit a notebook file as something other than a JSON file? Thanks
Posted by: Ricardo | January 28, 2016 at 19:10
Hi,
Thank you for your post. I really liked it.
However, I tried to install rpy2 on Windows 7 and even Ubuntu (trusty) with no luck. Is there a special trick for making it work, like which version of python you used, and whether it was 32-bit or 64-bit? I think if you put a video on YouTube it would get a lot of views, since it seems to be very hard to install rpy2.
Thanks,
Luis
Posted by: Luis | January 29, 2016 at 01:14
Hi, just want to say that I did spend a lot of time (8+ hrs) trying to make rpy2 work, with no luck, on both windows and ubuntu. Maybe I should just give up on the task.
Posted by: Luis | January 29, 2016 at 01:43
Hello Ricardo. Thank you for your feedback! To edit as markdown, I'd convert my notebook file (in the JSON .ipynb format) to markdown, LaTeX, or maybe just a script with, for example:
jupyter nbconvert --to markdown mynotebook.ipynb
Posted by: Micheleen Harris | January 29, 2016 at 12:23
Hello Luis. Thanks for the feedback! Yes, rpy2 can be tricky to install (especially on Windows). My suggestion would be to stick with ubuntu for now. You could look into using wheels (pre-built packages) for rpy2, which is how I got things working on Windows. I have 64-bit python (v3.4.3). I will take the suggestion of an instructional video on this seriously.
Posted by: Micheleen Harris | January 29, 2016 at 12:36
Hi Micheleen,
Nice blog entry.
Yes, unfortunately there is no official support for Windows at the moment. However, contributed binaries for Windows can be found here: http://www.lfd.uci.edu/~gohlke/pythonlibs/
Regarding the issue on Windows you are experiencing, there is a patch in the repository and it will be included in the next bugfix release (rpy2-2.7.8).
Otherwise, main page for the project is now http://rpy2.bitbucket.org
Best,
Laurent
Posted by: Laurent Gautier | January 30, 2016 at 12:02
Hi Micheleen, I managed to make it work in the end. I had problems with the R_HOME and R_USER paths. Also, I had to modify the Rprofile.site so the starting working directory of RStudio and its libraries wouldn't keep changing paths when working in RStudio.
Thanks again,
Luis
Posted by: Luis | January 31, 2016 at 05:30
Hi Micheleen, where did you get the syntax `%%R -i x,y -o mycoef`? I've looked extensively in the rpy2 documentation pages but could not find this elegant way of calling R.
Can you provide further info? Thanks!
Sean
Posted by: Sean | February 02, 2016 at 13:09