Jamie Raines was born a girl, but at age 18 he began a three-year journey to transform his body to match his identity with hormone therapy. Every day Jamie took a selfie to document the change, and the dramatic transformation is shown in this time-lapse video:

If you're in the UK, you can watch the documentary Girls to Men on Channel 4 on Tuesday (October 13), and there's also a nice interview with Jamie on BuzzFeed.

That's all from us for this week. See you on Monday!

In case you missed them, here are some articles from September of particular interest to R users.

A tutorial on using R with Jupyter Notebooks and how to control the size of R graphics therein.

A new version of Revolution R Open is available, featuring multi-threaded computing for R 3.2.2.

One benefit of fitting statistical models to large data sets: learning curves.

Using the AzureML package to publish R functions as web services.

The R Consortium forms a committee to oversee projects, headed by Hadley Wickham.

Functions for interpolation in R.

The EARL London conference (preview here) included many applications of R, from AstraZeneca, Allstate, Douwe Egberts coffee and others.

A new online Data Science and Machine Learning course, featuring R and sponsored by Microsoft.

Reading financial time series data into R with the zoo package.

An update to the checkpoint package brings support for knitr and rmarkdown documents in reproducible projects.

The new Microsoft Data Science User Group Program offers sponsorships for R user groups worldwide.

A series on model validation in R using: basic methods; in-training set measures; out-of-sample procedures; and cross-validation techniques.

BlueSky Statistics, a new open-source GUI for R.

Accessing data in Google spreadsheets with the googlesheets package for R.

Antony Unwin on the care of datasets in R packages.

General interest stories (not related to R) in the past month included: building a scale model of the solar system, a new way to visualize the Discrete Fourier Transform, and a Portal-themed remodel.

As always, thanks for the comments and please send any suggestions to me at david@revolutionanalytics.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

by Joseph Rickert

Early October: somewhere the leaves are turning brilliant colors, temperatures are cooling down and that back-to-school feeling is in the air. And for more people than ever before, it is going to seem like a good time to commit to really learning R. I have some suggestions for R courses below, but first: what does it mean to learn R anyway? My take is that the answer depends on a person's circumstances and motivation.

I find the following graphic to be helpful in sorting things out.

The X axis is time on Malcolm Gladwell's "Outliers" scale. His idea is that it takes 10,000 hours of real effort to master anything: R, Python or rock and roll guitar. The Y axis lists increasingly difficult R tasks, and the arrows within the plot area label increasingly proficient types of R users.

The point I want to make here is that a significant amount of very productive R work happens in the area around the red ellipse. So, while there is no avoiding the "10,000" hours of hard work to become an R Jedi knight, a curious and motivated person can master enough R to accomplish his or her programming goals with a more modest commitment. There are three main reasons for this:

- R's functional programming style is very well suited for statistical modeling, data visualization and data science tasks
- The 7,000+ packages available in the R ecosystem provide tens of thousands of functions that make it possible to accomplish quite a bit without having to write much code
- Numerous, high-quality books and online materials are devoted to teaching statistical theory and data science with R

If you have a background in some area of statistics or data science, a viable strategy for learning R is to identify a resource that works for you and just jump into the middle of things, picking up R as you go along.

The lists below link to courses that can either start you on a formal programming path, or help you become a productive R user in a particular application area. Some of the courses are "live events" that you take with a cohort of students, others are set up for self study.

The courses devoted to teaching R as a programming language are

- The Data Scientist’s toolbox
- R Programming
- Introduction to R Programming
- Introduction to R
- R Programming - Introduction 1
- Introducción a la programación estadística con R
- O’Reilly Code School

The first two courses above are from Coursera's Data Science Specialization sequence. Taught by Roger Peng, Jeff Leek and Brian Caffo, they are probably the gold standard for MOOC R courses. I am a little late with this post: The Data Scientist's Toolbox started this past Monday, but there is still time to catch up. The third course, Introduction to R Programming, is a relatively new edX course from Microsoft's online offerings that is getting great reviews. The fourth course on the list is a solid introduction to R from DataCamp. R Programming - Introduction 1 is a beginner's introduction to R taught by Paul Murrell or Tal Galili. Rounding out the list are a Spanish-language introduction to R from Coursera and O'Reilly's interactive Code School course.

These next three lists contain courses from DataCamp and statistics.com, and online resources from RStudio, that introduce more advanced features of R by building on basic R programming skills. Note that the final course on the DataCamp list introduces Big Data features of Revolution R Enterprise, which is available in the Azure Marketplace.

- Intermediate R
- Data Visualization in R with ggvis
- Data Manipulation with dplyr
- Data Analysis in R, the data.table Way
- Reporting with R Markdown
- Big data Analysis with Revolution R Enterprise

This next section lists courses from the major MOOCs, and from the non-MOOCs DataCamp and statistics.com, that use R to teach various quantitative disciplines.

**Coursera Courses**

- Data Analysis and Statistical Inference
- Developing Data Products
- Exploratory Data Analysis
- Getting and Cleaning Data
- Introduction to Computational Finance and Financial Econometrics
- Measuring Causal Effects in the Social Sciences
- Regression Models
- Reproducible Research
- Statistical Inference
- Statistics One

**edX Courses**

- Data Analysis for Life Sciences 1: Statistics and R
- Data Analysis for Life Sciences 2: Introduction to Linear Models and Matrix Algebra
- Data Analysis for Life Sciences 6: High-performance Computing for Reproducible Genomics
- Explore Statistics with R
- Sabermetrics 101: Introduction to Baseball Analytics

**Udacity Course**

DataCamp

statistics.com

Finally, here are a couple of Google apps and Swirl, a new platform for teaching and learning R, that may be useful for learning on the go.

It's time to "go back to school" and make some headway against those 10,000 hours.

by Jens Carl Streibig, Professor Emeritus at University of Copenhagen

*Editor's introduction: for background on the miniCRAN package, see our previous blog posts:*

miniCRAN saves my neck when I am out in regions where a seamless, running internet connection is the exception rather than the rule. R is definitely the programme to offer universities and research institutions in agriculture: it is open source, there is no money involved, and the help, although sometimes a bit nerdy, is easy to access. I usually tell my students not to buy books on specific topics, because R is dynamic and within a couple of years some of the functions in a book will be obsolete and thus discourage the average user. Instead, look at the documentation at r-project.org or rseek.org.

I have recently been teaching in Turkey and Iran. Sometimes the internet is OK; other times it is not. Previously it was a struggle to get the particular packages downloaded and installed via RStudio. In a workshop in Iran we could not download the essential packages. A shrewd student downloaded the dependencies and distributed the zip files to her fellow students. After some glitches we got everything up and running.

When I became aware of miniCRAN at the useR! 2015 meeting, almost all my R problems were solved. With help from the maintainer, Andrie de Vries at Revolution Analytics, we got it to work when giving a workshop on dose-response, also in Iran, two weeks ago. Everything went all right for those students who could not install the packages at home. Some Windows installations were in a poor state of repair, so they could not run RStudio and we had to provide all the dependencies; but that was no problem, since they were all in the miniCRAN repository.
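The workflow this describes can be sketched with miniCRAN's pkgDep() and makeRepo() functions. This is a minimal sketch, not the actual workshop setup: the package names, local path, and install call are illustrative assumptions.

```r
library(miniCRAN)

cran <- c(CRAN = "https://cran.r-project.org")

# Resolve the full dependency tree for the packages the course needs
# (package names here are illustrative, e.g. drc for dose-response work)
pkgs <- pkgDep(c("drc", "ggplot2"), repos = cran, suggests = FALSE)

# Build a local repository, including Windows binaries for students
# whose machines cannot compile packages from source
local_repo <- "~/miniCRAN"
dir.create(local_repo, showWarnings = FALSE)
makeRepo(pkgs, path = local_repo, repos = cran,
         type = c("source", "win.binary"))

# On a student's machine (e.g. copied to a USB stick), install offline:
install.packages("drc", repos = "file:///path/to/miniCRAN",
                 type = "win.binary")
```

The repository directory can then be zipped and distributed without any internet connection at all.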

For more than six years, the New York Times has been using the R language to develop and implement much of the fantastic data journalism on the website and in the newspaper. A few months ago graphics editor Amanda Cox was interviewed for the Data Stories podcast, where she described the process for creating the interactive data visualizations at the Times. Some highlights of the podcast include Amanda describing R as "The greatest software on Earth", and the background behind the visualization below which tells a unique story based on where you live.

This chart, by the way, was cited by White House Chief Data Scientist DJ Patil as an exemplar of the positive impact of open data, in his keynote at the Strata conference last week.

To listen to the interview, and to find useful links to the resources Amanda mentions during the interview, follow the link below.

Data Stories: Amanda Cox on Working With R, NYT Projects, Favorite Data

I just got back from the Strata+Hadoop World conference in New York, and amongst the usual talks on the technology and applications of big data and data science ran a new thread: data ethics. DJ Patil, the US government's chief data scientist, made a call for comments on data ethics in his keynote and in a follow-up discussion session. But the talk that still sticks in my head is the keynote by Maciej Ceglowski, where he compared Big Data to the nuclear energy industry: a new technology once revered (you could at one time purchase radium underpants!) but which is now reviled by the public. It sounds alarmist, but the talk is genuinely thought-provoking, and a warning that if we don't begin to seriously consider the consequences of wide-scale data collection and analysis, our industry risks the same fate. Despite the dark message it's entertainingly delivered, and well worth your 20 minutes.

Something to think about over the weekend. We'll be back on Monday -- see you then.

Hadley Wickham, RStudio's Chief Scientist and prolific author of R books and packages, conducted an AMA (Ask Me Anything) session on Reddit this past Monday. The session was tremendously popular, generating more than 500 questions/comments and promoting the AMA to the front page of Reddit.

If you're not familiar with Hadley's work (which would be a surprise if you're an R user), his own introduction in the Reddit AMA post will fill you in:

*Broadly, I'm interested in the process of data analysis/science and how to make it easier, faster, and more fun. That's what has led to the development of my most popular packages like ggplot2, dplyr, tidyr, stringr. This year, I've been particularly interested in making it as easy as possible to get data into R. That's led to my work on the DBI, haven, readr, readxl, and httr packages. Please feel free to ask me anything about the craft of data science.*

*I'm also broadly interested in the craft of programming, and the design of programming languages. I'm interested in helping people see the beauty at the heart of R and learn to master it as easily as possible. As well as a number of packages like devtools, testthat, and roxygen2, I've written two books along those lines: Advanced R, which teaches R as a programming language, mostly divorced from its usual application as a data analysis tool; and R packages, which teaches software development best practices for R: documentation, unit testing, etc.*

Check out the comments at the link below, where you'll find insights from Hadley on the best way to teach R, Big Data in R, the elegance (or otherwise) of the R language, being productive, the best BBQ, and much more.

by Joseph Rickert

I have been a big fan of R user groups since I attended my first meeting. There is just something about the vibe of being around people excited about what they are doing that feels good. From a speaker's perspective, presenting at an R user group meeting must be the rough equivalent of doing "stand-up" at a club where you know almost everyone and are pretty sure people are going to like your material. So while user groups don't necessarily ignite R creativity (people don't do their best work just to present at an R user group meeting), they do help to shine the spotlight on some really good stuff.

I attend all of the Bay Area useR group meetings, and quite a few other R related events throughout the year, but I only get to experience a small fraction of what is going on in the R world. In the spirit of sharing the "wish I was there" feeling, here are a few recent user group presentations from around the globe that look like they were informative, entertaining and motivating.

Tommy O'Dell gave a "Welcome to dplyr" talk to the Western Australia R Group (WARG) on September 10th. This is a very good presentation until near the very end, when it becomes an absolutely great presentation! Apparently motivated by a desire to use dplyr with R 2.12, an older version of R not supported by dplyr, Tommy deconstructed the dplyr "magic" to write his own package, rdplyr. This is a wonderful example of how curiosity and open source can open up many possibilities. The following slide comes from the section where Tommy explains some of the problems he encountered and how he worked through them.

On the 16th of September, Kevin Little gave a talk to MadR about how he recovered after "hitting the wall" in a failed first attempt to interface with the SurveyMonkey API using the Rmonkey package. Kevin's description of how he worked through the process, which included wading into some JSON scripting, is a motivational case study. Kevin wrote a blog post that provides background for the project and has made his slides available here.

Also in September Jim Porzak, a long-time contributor to the San Francisco Bay Area R community, described a detailed customer segmentation analysis in a presentation to BARUG. The following slide examines the stability of the clusters.

Finally, there is a small treasure trove of relatively recent work at the BaselR presentations page. These include a presentation from Aimee Gott on the Mango Solutions development environment and one from Anne Kuemmel on using simulations to calculate confidence intervals in pharma applications. Also have a look at Daniel Sabanes Bove's presentation on using R to produce Microsoft PowerPoint presentations, and some thoughtful advice from Reinhold Koch on how to go about creating a lively R community within your company.


Let us all adopt this mindset!!

by Andrie de Vries

A few weeks ago I wrote about the Jupyter notebooks project and the R kernel. In the comments, I was asked how to resize the plots in a Jupyter notebook.

The answer is that the IRkernel project contains not only the IRkernel package itself, but also the repr package. The repr package provides "*String and byte representations for all kinds of R objects*".

I had to dig a little to uncover the meaning behind this rather cryptic description. What I found was that the package provides wrappers around all kinds of R objects, including plots. Now, anybody who has used R has at some point asked the question "How do I save a plot as an image on disk?". The answer is well known: use a device like png() to capture the output and save the plot to a png file on disk.
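For reference, the device technique looks like this. The file name and plot are illustrative; the pattern is simply open device, draw, close device.

```r
# Open a png device: all subsequent plotting goes to this file,
# not to the screen. Width and height are in pixels by default.
png("scatter.png", width = 480, height = 360)

# Draw the plot while the device is active
plot(mpg ~ wt, data = mtcars, main = "Fuel economy vs weight")

# Close the device, which flushes the finished image to disk
dev.off()
```

Until dev.off() is called, the file may be empty or incomplete, which is a common stumbling block for newcomers.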

Now, the IRkernel uses exactly this technique, and the repr package gives you control over the device.

Very simply, you need to modify two repr settings, using a call to options(). The default repr settings are for plots to be 7 inches wide and 7 inches high.

To set the plot width and height to something else, e.g. 4 inches wide and 3 inches high, use:

options(repr.plot.width=4, repr.plot.height=3)

Here is an example of setting the plot width to different sizes in the same notebook. In the first plot I set the width to 4 inches, and in the second I set the width to 8 inches. In both cases the height is the same: 3 inches rather than the default 7 inches.

Here is the full code listing: