The City of Melbourne has collected data on the more than 70,000 trees in the urban forest of this Australian metropolis. The data include the species, the health status of the tree and its life expectancy, all shown on a lovely map.

As you can see from the image above, each tree also has a unique email address. The idea is that citizens can report problems with trees, like disease or a fallen limb. But as the Atlantic reported in 2015, the addresses have also been used to write charming letters to the trees. For example, this email to a Golden Elm:

21 May 2015

I’m so sorry you're going to die soon. It makes me sad when trucks damage your low hanging branches. Are you as tired of all this construction work as we are?

Sometimes the trees even reply, like this Willow Leaf Peppermint:

29 Jan 2015

Hello Mr Willow Leaf Peppermint, or should I say Mrs Willow Leaf Peppermint?

Do trees have genders?

I hope you've had some nice sun today.

Regards

L

30 Jan 2015

Hello

I am not a Mr or a Mrs, as I have what's called perfect flowers that include both genders in my flower structure, the term for this is Monoicous. Some trees species have only male or female flowers on individual plants and therefore do have genders, the term for this is Dioecious. Some other trees have male flowers and female flowers on the same tree. It is all very confusing and quite amazing how diverse and complex trees can be.

Kind regards,

Mr and Mrs Willow Leaf Peppermint (same Tree)

You can find a few more letters in this news.com.au article as well.

That's all from us for this week. Hope you have a great weekend (perhaps amongst the trees?) and we'll be back with more next week.

R 3.4.4 has been released, and binaries for Windows, Mac and Linux are now available for download on CRAN. This update (codenamed "Someone to Lean On", likely a Peanuts reference, though I couldn't find which one with a quick search) is a minor bugfix release, and shouldn't cause any compatibility issues with scripts or packages written for prior versions of R in the 3.4.x series.

This update improves automatic timezone detection on some systems, and adds fixes for some unusual corner cases in the statistics library. For a complete list of the changes, check the NEWS file for R 3.4.4 or follow the link below.

R-announce mailing list: R 3.4.4 is released

In case you missed them, here are some articles from February of particular interest to R users.

The R Consortium opens a new round of grant applications for R-related user groups and projects, and has issued US$0.5M in grants to date for R-related projects and events.

Microsoft R Client 3.4.3 and Microsoft ML Server 9.3, both built with R 3.4.3, have been released.

An 8-step, 5-minute tutorial for setting up a cluster in Azure for use with the sparklyr package.

A smartphone app uses R and the keras package to identify "spells" using accelerometer data.

"Machine Learning with R and Tensorflow", JJ Allaire's keynote presentation at RStudio::conf.

A guide to styling base graphics in R.

A list of applications, open source projects, and extensions from Microsoft related to R.

Several upcoming R conferences offer diversity scholarships.

Introducing the DataExplorer package, for quick data summaries and visualizations.

A video overview of the Data Science Virtual Machine, featuring R.

"Asking the right questions about AI", an essay about the ethical implications of predictions.

And some general interest stories (not necessarily related to R):

- A sci-fi short about faster-than-light travel
- A robot holds a door open so another robot can pass
- SpaceX launches a car into space
- A bizarre CGI short film, Time for Sushi

As always, thanks for the comments and please send any suggestions to me at davidsmi@microsoft.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

In the latest Redmonk language rankings, R has risen to the #12 position, up from #14 in the June 2017 rankings. (Python remains steady in the #3 position.) The Redmonk rankings are based on activity in StackOverflow (as a proxy for user engagement) and Github (as a proxy for developer engagement). Here's the chart from January 2018 of Github popularity ranking versus StackOverflow popularity ranking.

Here's what Redmonk analyst Stephen O'Grady had to say about Powershell, R and Typescript. (R isn't a Microsoft property, but Microsoft is a founding member of the R Consortium and incorporates R into several products including Microsoft ML Server, SQL Server and Power BI.)

Powershell (+1) / R (+2) / TypeScript (+3): Of all of the vendors represented on this list, Microsoft has by a fair margin the most to crow about. Its ops-oriented language Powershell continues its steady rise, and R had a bounceback from earlier slight declines. TypeScript, meanwhile, pulled off a contextually impressive three spot jump from #17 to #14. Given that growth in the top twenty comes at a premium, hitting the ranking that a widespread language like R enjoyed in our last rankings is an impressive achievement. From a macro perspective, it’s also worth noting that Microsoft is seeing growth across three distinct categories in operations, analytics/data science and application development. More on this later, but it’s a strong indication that Microsoft’s multi-language approach to the broader market is paying dividends.

**[Update March 15]** Here's a visualization of the Redmonk language rankings over time:

You can find the complete top 20 rankings and analysis of the other languages in the list by following the link below.

Tecosystems: The RedMonk Programming Language Rankings: January 2018

I've been getting a lot more into podcasts lately, and one of my favorites (other than Not So Standard Deviations, of course), is Bombshell. Hosted by Radha Iyengar Plumb, Loren DeJonge Schulman, and Erin Simpson, it's a whip-smart, approachable and entertaining look at current events in foreign policy, national security and military affairs. Now, those aren't topics I typically immerse myself in, but between the in-depth knowledge of the hosts and their choice of guests, I'm finding a different perspective that makes it a fascinating subject area. (Interesting side note: the podcast recently entered its second year, and it wasn't until the 1-year anniversary episode that the first male guests appeared on the show.)

Also, as a statistician, I love that the standard questions asked of guests include: "What's your favorite statistical distribution", and "What's your favorite use of statistics or data?".

If you haven't listened before, the December 19, 2017 episode is a great place to start (other than going all the way back to the first episode). It introduces the three hosts, and is an interesting and educational review of the events of 2017. It's also a live episode, with questions from the audience at the Maxwell Air Force Base.

New Bombshell episodes are released every 2 weeks, and you can listen online or via your preferred podcast app. As for us, that's all from the blog here until next week. Have a great weekend!

*by Antony Unwin, University of Augsburg, Germany*

There are many different methods for identifying outliers and a lot of them are available in **R**. But are outliers a matter of opinion? Do all methods give the same results?

Articles on outlier methods use a mixture of theory and practice. Theory is all very well, but outliers are outliers because they don’t follow theory. Practice involves testing methods on data, sometimes with data simulated based on theory, better with ‘real’ datasets. A method can be considered successful if it finds the outliers we all agree on, but do we all agree on which cases are outliers?

The Overview Of Outliers (O3) plot is designed to help compare and understand the results of outlier methods. It is implemented in the **OutliersO3** package and was presented at last year’s useR! in Brussels. Six methods from other **R** packages are included (and, as usual, thanks are due to the authors for making their functions available in packages).

The starting point was a recent proposal of Wilkinson’s, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the **OutliersO3** vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.

Wilkinson’s algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them. If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.

Trying another method with tolerance level=0.05 (*mvBACON* from **robustX**) identifies 5 outliers, all ones found for more than one variable combination by *HDoutliers*. However, no outliers are found for the whole dataset and only one of the three variable combinations where outliers are found is a combination where *HDoutliers* finds outliers. Of course, the two methods are quite different and it would be strange if they agreed completely. Is it strange that they do not agree more?

There are four other methods available in **OutliersO3**, and using all six methods on stackloss with a tolerance level of 0.05 identifies the following numbers of outliers:

```
##    HDo    PCS    BAC adjOut    DDC    MCD
##     14      4      5      0      6      5
```

Each method uses what I have called the tolerance level in a rather different way. Sometimes it is called alpha and sometimes (1-alpha). As so often with **R**, you start wondering if more consistency would not be out of place, even at the expense of a little individuality. **OutliersO3** transforms where necessary to ensure that lower tolerance level values mean fewer outliers for all methods, but no attempt has been made to calibrate them equivalently. This is probably why `adjOutlyingness` finds few or no outliers (results of this method are mildly random). The default value, according to `adjOutlyingness`’s help page, is an alpha of 0.25.

The stackloss dataset is an odd dataset and small enough that each individual case can be studied in detail (cf. Dodge’s paper for just how much detail). However, similar results have been found with other datasets (milk, Election2005, diamonds, …). The main conclusion so far is that different outlier methods identify different numbers of different cases for different combinations of variables as different from the bulk of the data (i.e. as potential outliers)—or are these datasets just outlying examples?

There are other outlier methods available in **R** and they will doubtless give yet more different results. The recommendation has to be to proceed with care. Outliers may be interesting in their own right, they may be errors of some kind—and we may not agree whether they are outliers at all.
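To get a feel for how two rules can disagree, here is a minimal sketch in base R. It does not use **OutliersO3** or any of the six methods discussed above; instead it swaps in two much simpler rules (a classical Mahalanobis distance cutoff and the univariate boxplot rule) purely as an illustration. The cutoff of 0.05 and the names `tol`, `maha_out` and `box_out` are assumptions made for this example.

```r
# Illustration only: two simple outlier rules on stackloss, base R only.
# These are NOT the methods compared in the O3 plots above.
data(stackloss)
tol <- 0.05  # assumed cutoff, chosen to echo the tolerance level in the text

# Rule 1: classical Mahalanobis distance, flagged against a chi-squared quantile
d2 <- mahalanobis(stackloss, colMeans(stackloss), cov(stackloss))
maha_out <- which(d2 > qchisq(1 - tol, df = ncol(stackloss)))

# Rule 2: a case is flagged if any single variable is a boxplot outlier
flagged <- lapply(stackloss, function(x) x %in% boxplot.stats(x)$out)
box_out <- which(Reduce(`|`, flagged))

# Compare the row numbers each rule flags
list(mahalanobis = maha_out, boxplot = box_out)
```

Comparing the two vectors of row numbers (and changing `tol`) shows how much the answer can depend on the rule and its settings, which is the question the O3 plot is designed to explore in a more systematic way.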

[Find the R code for generating the above plots here: OutliersUnwin.Rmd]

Excel users starting to use R likely have some established concepts about data: where it's stored, how functions apply to data, etc. In general, R does things differently to Excel (or any spreadsheet, in fact). In a useful guide, Steph de Silva from Rex Analytics explains the concepts of data management in R and how they differ from Excel, which provides a useful mental model for those making the transition or working in both.

Click the image above for the complete guide, but in summary the differences are:

- Excel stores data in the grid structure of the worksheet, whereas R stores data in individual objects, accessed and manipulated by the R language.
- In Excel, code is formulas associated with cells of the worksheet. In R, code is functions provided by the R language or that you write yourself.
- In Excel, you store the results of calculations in the worksheet, but in R results (like data) are stored in objects.
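As a rough sketch of the three points above, here is how they might look in R. The data and the names `sales`, `total_units` and `share` are made up for this example and do not come from the guide.

```r
# Data lives in a named object rather than in a worksheet grid
# (hypothetical data, for illustration only)
sales <- data.frame(
  region = c("North", "South", "East"),
  units  = c(120, 95, 143)
)

# Code is a function call, not a formula attached to a cell;
# the result is stored in another object
total_units <- sum(sales$units)

# You can also write your own function and apply it to a whole column
share <- function(x) x / sum(x)
sales$share <- share(sales$units)

print(total_units)
print(sales)
```

Nothing here refers to a cell address: the data, the calculation and the result are each an object with a name.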

With these differences in mind, Excel users will have a much easier time adapting to R. Find the complete details at the link below.

Rex Analytics: Where do things live in R? R for Excel Users (via Mara Averick)

Bringing over a cake is *so* passé. If you want to meet the neighbours, just invite them over to dance (via TH).

That's all from the blog for this week. We'll be back next week: have a great weekend (ideally with dancing!).

For students looking to try out cloud computing, but who don't have access to a credit card, there's a new way to get access to Azure. Microsoft now offers a free Azure account to students in 140 countries, with free access to dozens of services — plus $100 in Azure credits for everything else. This gives students direct access to a powerful platform for statistical computing and AI application development in the cloud, including:

- 128 GB of Managed Disk storage, for persisting data files
- 5 GB of Azure Cosmos DB, for data queries
- 1500 hours of Azure B1S virtual machines (split 50:50 between Windows and Linux VMs), which you can use to run the Data Science Virtual Machine
- 100-module experiments in Machine Learning Studio
- Free use of the Face API, Speech API and Translator Text API, up to specified query and rate limits

This is similar to the Free Azure account available to everyone, except that the free Azure credits can be used over 12 months (instead of 30 days) and, again, no credit card is required.

This is just a summary, and you can find the complete details here: Free Azure for Students.