*by Bob Horton, **Data Scientist, Revolution Analytics*

From electronic medical records to genomic sequences, the data deluge is affecting all aspects of health care. The Masters of Science in Health Informatics (MSHI) program at the University of San Francisco, now in its second year, is designed to help students develop the practical computing skills and quantitative perspicacity they need to manage and exploit this wealth of data in health care applications.

This spring, I am privileged to participate in this effort by developing and teaching a new course, “Statistical Computing for Biomedical Data Analytics”, intended to motivate and prepare students for further studies in data science, such as the intensive summer courses of the MSAN bootcamp. The syllabus is on github.

As you’ve probably guessed, we will be using R. Other courses in the curriculum use Python, which seems to be favored by engineers; in contrast, R was developed by and for statisticians. We want the students to be exposed to both perspectives, and to have the technical background needed to make use of the extensive repositories of code available from CRAN and Bioconductor.

Data science is an interdisciplinary endeavor born of the synergy between computing, statistics, data management, and visualization. This can make it challenging to get started, because you have to know so many things before you get to the good stuff. We’re going to try to ease into it by starting with computational explorations of mathematical and statistical concepts. R is a fantastic environment for this; you can see a bell-shaped curve emerge from an example as simple as

`plot(0:20, choose(n=20, k=0:20))`

Note the expressive power of the vector of k values, and the easy convenience of having a world of statistical functions at your fingertips. Imagine how this little plot would have delighted Sir Francis Galton.

Data science is a journey. The enormous breadth of material and the rapid pace of development mean that the most important thing to learn is how to learn more. We’ll explore many fantastic resources for learning data science and R. For example, Coursera has excellent offerings, exemplified by the series of mini-courses from Johns Hopkins; our students will take at least one of these as a course project.

Of course, the R community itself is the biggest and most important resource. One class will be a field trip to a Bay Area useR Group (BARUG) meeting, and the comments in response to this post will be required reading. Ideas or suggestions regarding the syllabus or course materials from the github repository are welcome, as are observations or ruminations on the process of learning data science and R.

Finally, we are very interested in helping our students find outstanding internship opportunities in health-related organizations. Please don’t hesitate to contact me through community@revolutionanalytics.com if you are interested in working with us. Stay tuned for progress reports.

Johns Hopkins Biostatistics Professor (and presenter of Data Analysis at Coursera) Jeff Leek has published his list of awesome things other people did in 2014. It's well worth following the links in his 38 entries, where you'll find a wealth of useful resources in teaching, statistics, data science, and data visualization.

Many of the entries are related to R, including shout-outs to: the data wrangling, exploration, and analysis with R class at UBC; this paper on R Markdown and reproducible analysis; Hadley Wickham's R Packages; Hilary Parker's guide to writing R packages from scratch; the broom package (for tidying up statistical output in R); Karl Broman's hipsteR tutorial; Rocker (Docker containers for R); and Packrat and R markdown v2 from RStudio. I was also chuffed to see that this blog got a mention, too:

Another huge reason for the movement with R has been the outreach and development efforts of the Revolution Analytics folks. The Revolutions blog has been a must read this year.

Thanks, Jeff! Check out Jeff's complete list of awesome things at SimplyStatistics by following the link below.

SimplyStatistics: A non-comprehensive list of awesome things other people did in 2014

by Joseph Rickert

The American Statistical Association (ASA) Undergraduate Guidelines Workgroup recently published the report Curriculum Guidelines for Undergraduate Programs in Statistical Science. Although intended for educators setting up or revamping Stats programs at colleges and universities, this concise, 17 page document should be good reading for anyone who wants to take charge of their own education in learning to "think with data". Whether you are just getting started with your education or you are a working professional contemplating what to learn next to expand your knowledge and update your skills you should find the ASA report helpful.

The report places good statistical practice firmly on the foundation of the scientific method and locates statistical knowledge and skills squarely in the center modern data analysis.

However, it is far from being a complacent panegyric to statistics. The ASA report challenges educators to help students see that the "discipline of statistics is more than a collection of unrelated tools" and explicitly calls for an increased emphasis on data science and big league computational skills. Graduates of statistical programs:

should be facile with professional statistical software and other appropriate tools for data exploration, cleaning, validation, analysis, and communication. They should be able to program in a higher-level language, to think algorithmically, to use simulation-based statistical techniques . . . Graduates should be able to manage and manipulate data, including joining data from different sources and formats and restructuring data into a form suitable for analysis.

The expectations for communication skills are particularly noteworthy. The report says:

Graduates should be expected to write clearly, speak fluently, and construct effective visual displays and compelling written summaries. They should demonstrate ability to collaborate in teams and to organize and manage projects.

One could argue about the details of the topics that should be included in an undergraduate program. But, clearly the committe is aiming for far more than producing minimly competent, employable graduates. They are outlining a way of life, a competent way of being in a data driven world.

Hidden among the white papers listed on the ASA curriculum guidelines page is a treasure: Tim Hesterberg's paper on What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Curriculum. This is a lucid and fairly deep explication of bootstrapping and resampling techniques that deserves wide circulation. Tim writes that he had three goals in producing the paper: (1) To show the enormous potential of bootstrapping and permutation tests to help students understand statistical concepts . . .(2) To dig deeper . . .(3) To change statistical practice . . ."

Point (3) may sound astoundingly ambitious. However, it is grounded in a revolution that has been quietly gaining strength and whose time has come. Textbooks that rely on R based simulations to teach probability (e.g. Baclawski) and statistics (e.g. Matloff) have been available for some time, and Tim points out that undergraduate textbooks such as Chihara and Hesterberg which use resampling as the fundamental unifying idea are beginning to appear. Moreover, data scientists outside of the community of academic statisticians are well aware that programming skills more than compensate for a traditional statistics education that presents the subject as a collection of unrelated tests and techniques as this Strata Hadoop world presentation from John Rauser makes clear.

It is very good, indeed, to see the ASA leading the charge for change.

If you're a current graduate or undergraduate student and have a knack for data visualization, why not submit a paper to the 2014 ASA Statistical Graphics Student Paper Competition? Many of the past winners used R to create interesting displays of data, or created a new package for R (general statistical computing applications are also eligible). Winners will get free registration and a travel allowance to attend the Joint Statistics Meetings, which in 2015 will be held in Seattle. (You'll also get to attend the Statistical Computing Mixer at JSM and participate in the famous door prize giveaway!)

To be considered, you'll need to submit a six-page manuscript by December 14, 2014. I've copied the complete details of the announcement below. Good luck!

The Statistical Computing and Statistical Graphics Sections of the ASA are co-sponsoring a student paper competition on the topics of Statistical Computing and Statistical Graphics. Students are encouraged to submit a paper in one of these areas, which might be original methodological research, some novel computing or graphical application in statistics, or any other suitable contribution (for example, a software-related project). The selected winners will present their papers in a topic-contributed session at the 2015 Joint Statistical Meetings. The Sections will pay registration fees for the winners as well as a substantial allowance for transportation to the meetings and lodging.

Anyone who is a student (graduate or undergraduate) on or after September 1, 2014 is eligible to participate. An entry must include an abstract, a six page manuscript (including figures, tables and references), blinded versions of the abstract and manuscript (with no authors and no references that easily lead to identifying the authors), a C.V., and a letter from a faculty member familiar with the student's work. The applicant must be the first author of the paper. The faculty letter must include a verification of the applicant's student status and, in the case of joint authorship, should indicate what fraction of the contribution is attributable to the applicant. We prefer that electronic submissions of papers be in Postscript or PDF. All materials must be in English.

Students may submit papers to no more than two sections and may accept only one section's award. Students must inform both sections applied to when he or she wins and accepts an award, thereby removing the student from the award competition for the second section.

All application materials MUST BE RECEIVED by 5:00 PM EST, Sunday, December 14, 2014 at the address below. They will be reviewed by the Student Paper Competition Award committee of the Statistical Computing and Graphics Sections. The selection criteria used by the committee will include innovation and significance of the contribution as well as the professional quality of the manuscript. Award announcements will be made by January 15th, 2015.

Additional important information on the competition can be accessed on ASA's "Student Paper Competition/Travel Award to Attend the Joint Statistical Meetings".

Inquiries and application materials should be emailed or mailed to:

Student Paper Competition

c/o Aarti Munjal

Colorado School of Public Health

University of Colorado Denver

aarti.munjal@ucdenver.edu

ASA Statistical Computing and Statistical Graphics Section: Student Paper Competition 2014

by Norman Matloff

The American Statistical Association (ASA) leadership, and many in Statistics academia. have been undergoing a period of angst the last few years, They worry that the field of Statistics is headed for a future of reduced national influence and importance, with the feeling that:

- The field is to a large extent being usurped by other disciplines, notably Computer Science (CS).
- Efforts to make the field attractive to students have largely been unsuccessful.

I had been aware of these issues for quite a while, and thus was pleasantly surprised last year to see then-ASA president Marie Davidson write a plaintive editorial titled, “Aren’t *We* Data Science?”

Good, the ASA is taking action, I thought. But even then I was startled to learn during JSM 2014 (a conference tellingly titled “Statistics: Global Impact, Past, Present and Future”) that the ASA leadership is so concerned about these problems that it has now retained a PR firm.

This is probably a wise move–most large institutions engage in extensive PR in one way or another–but it is a sad statement about how complacent the profession has become. Indeed, it can be argued that the action is long overdue; as a friend of mine put it, “They [the statistical profession] lost the PR war because they never fought it.”

In this post, I’ll tell you the rest of the story, as I see it, viewing events as a statistician, computer scientist and R activist.

**CS vs. Statistics**

Let’s consider the CS issue first. Recently a number of new terms have arisen, such as data science, Big Data, and analytics, and the popularity of the term machine learning has grown rapidly. To many of us, though, this is just “old wine in new bottles,” with the “wine” being Statistics. But the new “bottles” are disciplines outside of Statistics–especially CS.

I have a foot in both the Statistics and CS camps. I’ve spent most of my career in the Computer Science Department at the University of California, Davis, but I began my career in Statistics at that institution. My mathematics doctoral thesis at UCLA was in probability theory, and my first years on the faculty at Davis focused on statistical methodology. I was one of the seven charter members of the Department of Statistics. Though my departmental affiliation later changed to CS, I never left Statistics as a field, and most of my research in Computer Science has been statistical in nature. With such “dual loyalties,” I’ll refer to people in both professions via third-person pronouns, not first, and I will be critical of both groups. However, in keeping with the theme of the ASA’s recent actions, my essay will be Stat-centric: What is poor Statistics to do?

Well then, how did CS come to annex the Stat field? The primary cause, I believe, came from the CS subfield of Artificial Intelligence (AI). Though there always had been some probabilistic analysis in AI, in recent years the interest has been almost exclusively in predictive analysis–a core area of Statistics.

That switch in AI was due largely to the emergence of Big Data. No one really knows what the term means, but people “know it when they see it,” and they see it quite often these days. Typical data sets range from large to huge to astronomical (sometimes literally the latter, as cosmology is one of the application fields), necessitating that one pay key attention to the computational aspects. Hence the term data science, combining quantitative methods with speedy computation, and hence another reason for CS to become involved.

Involvement is one thing, but usurpation is another. Though not a deliberate action by any means, CS is eclipsing Stat in many of Stat’s central areas. This is dramatically demonstrated by statements that are made like, “With machine learning methods, you don’t need statistics”–a punch in the gut for statisticians who realize that machine learning really IS statistics. ML goes into great detail in certain aspects, e.g. text mining, but in essence it consists of parametric and nonparametric curve estimation methods from Statistics, such as logistic regression, LASSO, nearest-neighbor classification, random forests, the EM algorithm and so on.

Though the Stat leaders seem to regard all this as something of an existential threat to the well-being of their profession, I view it as much worse than that. The problem is not that CS people are doing Statistics, but rather that they are doing it poorly: Generally the quality of CS work in Stat is weak. It is not a problem of quality of the researchers themselves; indeed, many of them are very highly talented. Instead, there are a number of **systemic reasons** for this, structural problems with the CS research “business model”:

**CS, having grown out of a research on fast-changing software and hardware systems, became accustomed to the “24-hour news cycle”**–very rapid publication rates, with the venue of choice being (refereed) frequent conferences rather than slow journals. This leads to research work being less thoroughly conducted, and less thoroughly reviewed, resulting in poorer quality work. The fact that some prestigious conferences have acceptance rates in the teens or even lower doesn’t negate these realities.- Because CS Depts. at research universities tend to be housed in Colleges of Engineering, there is
**heavy pressure to bring in lots of research funding, and produce lots of PhD students**. Large amounts of time is spent on trips to schmooze funding agencies and industrial sponsors, writing grants, meeting conference deadlines and managing a small army of doctoral students–instead of time spent in careful, deep, long-term contemplation about the problems at hand. This is made even worse by the rapid change in the fashionable research topic du jour. making it difficult to go into a topic in any real depth. Offloading the actual research onto a large team of grad students can result in faculty not fully applying the talents they were hired for; I’ve seen too many cases in which the thesis adviser is not sufficiently aware of what his/her students are doing. **There is rampant “reinventing the wheel.”**The above-mentioned lack of “adult supervision” and lack of long-term commitment to research topics results in weak knowledge of the literature. This is especially true for knowledge of the Stat literature, which even the “adults” tend to have very little awareness of. For instance, consider a paper on the use of unlabeled training data in classification. (I’ll omit names.) One of the two authors is one of the most prominent names in the machine learning field, and the paper has been cited over 3,000 times, yet the paper cites nothing in the extensive Stat literature on this topic, consisting of a long stream of papers from 1981 to the present.- Again for historical reasons, CS research is largely empirical/experimental in nature. This causes what in my view is
**one of the most serious problems plaguing CS research in Stat – lack of rigor**. Mind you, I am not saying that every paper should consist of theorems and proofs or be overly abstract; data- and/or simulation-based studies are fine. But there is no substitute for precise thinking, and in my experience, many (nominally) successful CS researchers in Stat do not have a solid understanding of the fundamentals underlying the problems they work on. For example, a recent paper in a top CS conference incorrectly stated that the logistic classification model cannot handle non-monotonic relations between the predictors and response variable; actually, one can add quadratic terms, and so on, to models like this. **This “engineering-style” research model causes a cavalier attitude towards underlying models and assumptions.**Most empirical work in CS doesn’t have any models to worry about. That’s entirely appropriate, but in my observation it creates a mentality that inappropriately carries over when CS researchers do Stat work. A few years ago, for instance, I attended a talk by a machine learning specialist who had just earned her PhD at one of the very top CS Departments. in the world. She had taken a Bayesian approach to the problem she worked on, and I asked her why she had chosen that specific prior distribution. She couldn’t answer – she had just blindly used what her thesis adviser had given her–and moreover, she was baffled as to why anyone would want to know why that prior was chosen.**Again due to the history of the field, CS people tend to have grand, starry-eyed ambitions–laudable, but a double-edged sword**. On the one hand, this is a huge plus, leading to highly impressive feats such as recognizing faces in a crowd. But this mentality leads to an oversimplified view of things, with everything being viewed as a paradigm shift. Neural networks epitomize this problem. Enticing phrasing such as “Neural networks work like the human brain” blinds many researchers to the fact that neural nets are not fundamentally different from other parametric and nonparametric methods for regression and classification.(Recently I was pleased to discover–“learn,” if you must–that the famous book by Hastie, Tibshirani and Friedman complains about what they call “hype” over neural networks; sadly, theirs is a rare voice on this matter.) Among CS folks, there is a failure to understand that the celebrated accomplishments of “machine learning” have been mainly the result of applying a lot of money, a lot of people time, a lot of computational power and prodigious amounts of tweaking to the given problem – not because fundamentally new technology has been invented.

All this matters – a LOT. In my opinion, the above factors result in highly lamentable opportunity costs. Clearly, I’m not saying that people in CS should stay out of Stat research. But the sad truth is that the usurpation process is causing precious resources–research funding, faculty slots, the best potential grad students, attention from government policymakers, even attention from the press–to go quite disproportionately to CS, even though Statistics is arguably better equipped to make use of them. This is not a CS vs. Stat issue; Statistics is important to the nation and to the world, and if scarce resources aren’t being used well, it’s everyone’s loss.

**Making Statistics Attractive to Students**

This of course is an age-old problem in Stat. Let’s face it–the very word statistics sounds hopelessly dull. But I would argue that a more modern development is making the problem a lot worse – the Advanced Placement (AP) Statistics courses in high schools.

Professor Xiao-Li Meng has written extensively about the destructive nature of AP Stat. He observed, “Among Harvard undergraduates I asked, the most frequent reason for not considering a statistical major was a ‘turn-off’ experience in an AP statistics course.” That says it all, doesn’t it? And though Meng’s views predictably sparked defensive replies in some quarters, I’ve had exactly the same experiences as Meng in my own interactions with students. No wonder students would rather major in a field like CS and study machine learning–without realizing it is Statistics. It is especially troubling that Statistics may be losing the “best and brightest” students.

One of the major problems is that AP Stat is usually taught by people who lack depth in the subject matter. A typical example is that a student complained to me that his AP Stat teacher could not answer his question as to why it is customary to use n-1 rather than n in the denominator of s^2 , even though he had attended a top-quality high school in the heart of Silicon Valley. But even that lapse is really minor, compared to the lack among the AP teachers of the broad overview typically possessed by Stat professors teaching university courses, in terms of what can be done with Stat, what the philosophy is, what the concepts really mean and so on. AP courses are ostensibly college level, but the students are not getting college-level instruction. The “teach to the test” syndrome that pervades AP courses in general exacerbates this problem.

The most exasperating part of all this is that AP Stat officially relies on TI-83 pocket calculators as its computational vehicle. The machines are expensive, and after all we are living in an age in which R is free! Moreover, the calculators don’t have the capabilities of dazzling graphics and analyzing of nontrivial data sets that R provides – exactly the kinds of things that motivate young people.

So, unlike the “CS usurpation problem,” whose solution is unclear, here is something that actually can be fixed reasonably simply. If I had my druthers, I would simply ban AP Stat, and actually, I am one of those people who would do away with the entire AP program. Obviously, there are too many deeply entrenched interests for this to happen, but one thing that can be done for AP Stat is to switch its computational vehicle to R.

As noted, R is free and is multi platform, with outstanding graphical capabilities. There is no end to the number of data sets teenagers would find attractive for R use, say the Million Song Data Set.

As to a textbook, there are many introductions to Statistics that use R, such as Michael Crawley’s Statistics: An Introduction Using R, and Peter Dalgaard’s Introductory Statistics Using R. But to really do it right, I would suggest that a group of Stat professors collaboratively write an open-source text, as has been done for instance for Chemistry. Examples of interest to high schoolers should be used, say this engaging analysis on OK Cupid.

This is not a complete solution by any means. There still is the issue of AP Stat being taught by people who lack depth in the field, and so on. And even switching to R would meet with resistance from various interests, such as the College Board and especially the AP Stat teachers themselves.

But given all these weighty problems, it certainly would be nice to do something, right? Switching to R would be doable–and should be done.

*[Crossposted with permission from the Mad (Data) Scientist blog.]*

Check out this tweet:

```
```Just found out that @rdpeng @jtleek @bcaffo have enrolled 1,000,000 students in statistics classes on @coursera in last 2 years.

— Simply Statistics (@simplystats) May 30, 2014

Those courses include Roger Peng's Computing for Data Analysis and Jeff Leek's "Data Analysis", and they all use the R language. That means more than 1,000,000 new students have been exposed to R in the last two years! On this basis alone it seems clear that the estimate of two million R users wordwide is an *underestimate*.

by Joseph Rickert

Norman Matloff professor of computer science at UC Davis, and founding member of the UCD Dept. of Statistics has begun posting as Mad(Data)Scientist. (You may know Norm from his book, *The Art of R Programming:* NSP, 2011.) In his second post (out today) on the new R package, freqparcoord, that he wrote with Yinkang Xie, Norm looks into outliers in baseball data.

> library(freqparcoord) > data(mlb) > freqparcoord(mlb,-3,4:6,7)

We would like to welcome Norm as a new R blogger and we are looking forward to future posts!

Mad(Data)Scientist: More on freqparcoord

If you're a student and you'd like to improve R by developing a new R package for 3 months over the summer — and get paid $5000 by Google in the process — why not apply for an R project in the Google Summer of Code? All you need to do browse the list of R projects available (if you have any questions about the projects, ask on the gsoc-r google group), and apply to GSOC 2014 using this template. The deadline for student applications is March 21, so act quickly. Students accepted into the program will be announced on April 21.

R Wiki: Google Summer of Code 2014

by James Paul Peruvankal, Senior Program Manager at Revolution Analytics

At Revolution Analytics, we are always interested in how people teach and learn R, and what makes R so popular, yet ‘quirky’ to learn. To get some insight from a real pro we interviewed Bob Muenchen. Bob is the author of *R for SAS and SPSS Users* and, with Joseph M. Hilbe, *R for Stata Users*. He is also the creator of r4stats.com, a popular web site devoted to analyzing trends in analytics software and helping people learn the R language. Bob is an Accredited Professional Statistician™ with 30 years of experience and is currently the manager of OIT Research Support (formerly the Statistical Consulting Center) at the University of Tennessee. He has conducted research for a variety of public and private organizations and has assisted on more than 1,000 graduate theses and dissertations. He has written or coauthored over 60 articles published in scientific journals and conference proceedings.

Bob has served on the advisory boards of SAS Institute, SPSS Inc., StatAce OOD, the Statistical Graphics Corporation and PC Week Magazine. His suggested improvements have been incorporated into SAS, SPSS, JMP, STATGRAPHICS and several R packages. His research interests include statistical computing, data graphics and visualization, text analysis, data mining, psychometrics and resampling.

**James: How did you get started teaching people how to use statistical software? **

Bob: When I came to UT in 1979, many people were switching from either FORTRAN or SPSS to SAS. There was quite a lot of demand for SAS training, and I enjoyed teaching the workshops. Back then SAS could save results, like residuals or predicted values, much more easily than SPSS, which drove the switch.

When the Windows version of SPSS came out, people started switching back. The SPSS user interface designer, Sheri Gilley, really understood what ease of use was all about, and the SAS folks didn’t get that until quite recently. I was just as happy teaching the SPSS workshops. However, many SPSS users at UT avoid programming, which I think is a big mistake. Pointing-and-clicking your way through an analysis can be a time-saving way to work, but I always keep the program so I have a record of what I did.

I started teaching R workshops in 2005 and attendance was quite sparse. Now it’s one of our Research Computing Support team’s most popular topics.

**James: Is there anything special about teaching people how to use R, any particular difficulties?**

Bob: In other analytics software, the focus is on variables. It sounds too simple to even bother saying: "Every procedure accepts variables." There are very few ways to specify them, such as by simple name, A, B, C, or lists like A TO Z or A—Z.

Rather than just variables, R has a variety of objects such as vectors, factors and matrices. Some procedures (called functions in R) require particular kinds of objects and there are many more ways to specify which objects to use. From a new user's perspective that may seem like needless complexity. However it provides significant benefits. Once an R user has defined a categorical variable as a factor, analyses will then try to “do the right thing” with that variable. For instance, you could include it in a regression equation and R would create the indicator variables needed to handle a categorical variable automatically.

Another important benefit to R’s object orientation is that it allows a total merger of what would normally be a separate matrix language into the main language of R. This attracts developers, who are helping grow R’s capabilities very rapidly.

**James: How do you handle such a broad range of backgrounds in your classes?**

Bob: The workshop participants do come from a very wide range of fields, but they share a common set of knowledge: what a variable is, how to analyze data, and so on. So I save a great deal of time by not having to explain all that. Instead, I redirect it into pointing out where R is likely to surprise them. You can have variables that are *not* in a data set? That’s a bizarre concept to a SAS, SPSS or Stata user. You can have X in one data set and Y in another, but include both in the same regression model? That sounds very strange at first and, of course, it’s quite risky if you’re not careful. I introduce most topics with, “You’re expecting this, but here comes something very different…”. Different doesn’t necessarily mean better, of course. SAS, SPSS and Stata are all top-quality packages and they do some things with less effort. I love R, but I like to point out where I think the others do a better job.

**James: How do you find teaching online compared to classroom courses?**

Bob: I teach my workshops in-person at The University of Tennessee and I’ve taught at the American Statistical Association’s Joint Statistical Meeting as well as the UseR! Conference. Teaching “live” is great fun, and being able to see the participants’ expressions is helpful in adjusting the presentation pace and knowing when to stop and ask for questions.

However, live workshops have major drawbacks. Travel costs can easily exceed the fee for a workshop, but worse, minimizing those expenses means cramming too much material into a short timeframe. That’s why I teach my webinars in half-day stretches skipping a day in between. We break every hour and fifteen minutes so people can relax. On their days off they can catch up on their regular work, review the workshop material, work on the exercises and email me with questions. At the end of a live workshop people are happy but exhausted and they leave quickly. At the end of a webinar-based workshop, they often stay for a long time afterwards asking questions. I stay online as long as it takes to answer them all.

**James: Some people like learning actively, with their hands on the keyboard. Others prefer to focus more on what’s being said and taking notes. How do you handle these styles?**

Bob: This is an excellent question! When I take a workshop myself, I usually prefer hands-on but sometimes I don’t. Each of my workshop attendees receives setup instructions a week early so their computer has the software installed and the files in the right place by the time we start. They’re ready for whichever learning style they prefer.

For hands-on learners, I use a single R program that contains the course notes as programming comments interspersed with executable code. Since the “slides” are right in front of them, they never need to take their eyes off their screens. The examples are designed to be easy to covert to their own projects. They build in a step-by-step fashion, going from simple to more complex to make sure no one gets lost. Participants can run each example as I cover it, and see the results on their own computers.

For people focused more on listening and taking notes, everyone also has a complete set of slides. The slides have the notes that describe each concept, then the code for it, followed by the output. The notes follow a numbering scheme that is used in both the program and the handout. This way, both types of learners stay in sync.

This dual approach has another benefit. It’s very easy to switch from one style to the other at any time. If someone gets tired of typing, or his or her computer malfunctions, switching to the notes is seamless. Conversely, if someone is following the printed notes and want switch to run an example, it’s very easy to find.

**James: What motivated you to start writing books?**

Bob: I’ve always enjoyed writing newsletter and journal articles. My books on R started out just as a set of notes that I kept for myself. When I put them online, they started getting thousands of hits and Springer called to ask if I could make it a book. I really didn’t think I had enough information, but it kept growing. The second edition of R for SAS and SPSS Users is 686 pages and I have notes on a few topics that I wish I had added. If I ever find time for a third edition, they’ll be in there.

**James: Thank you Bob for your time!**

If you are looking to learn R and are already familiar with software like SAS, SPSS or Stata, do check out Bob’s upcoming workshops here and here.

*by Joseph Rickert*

One of the key ideas in topological data analysis is to consider a data set to be a sample from a manifold in some high dimensional topological space and then to use the tools of algebraic topology to reconstruct the manifold. It turns out that the converse problem of taking a random sample from a given topological manifold also has some very useful applications in statistics. In their 2012 paper, Sampling From A Manifold, the mathematical statisticians Persi Diaconis, Susan Holmes and Mehrdad Shahshahani develop a general approach to this converse problem using fundamental ideas from geometric measure theory. They show how this topological / geometrical approach can be used not only to construct algorithms for generating test data for topological statistics, but also for testing the goodness of fit of sufficient statistics for exponential family distributions and for other statistical problems that may be conceptualized as sampling from a manifold.

The first example in the paper shows how to correctly sample from a torus:

M = { [(R+rcos(q))cos(y), (R+rcos(q))sin(y),sin(q)]}

where q and y are both in [0,2p) and R > r > 0

The authors point out that naively drawing q and y from uniformly distributed random variables will lead to sampled points that are denser than they should be in regions of higher curvature such as the inside of the torus. The correct way to sample is to use the theory they develop to derive the joint density, g(q, y), of q and y on the manifold. The theorem that enables this is an extension of the change of variables formula from calculus. Readers who have used Jacobians to work through tricky problems involving the sums and products of random variables in a probability course will recognize what is going on.

As it turns out, g(q, y) factors into into:

g1(q) = (1/2p)(1 + (r/R)cos(q))

where q e [0,2p) and g2(y) is uniform on [0,2p).

To actually sample from g1(q) the authors provide the R code for a rejection sampling algorithm.

Created by Pretty R at inside-R.org

The following plot shows the histogram and density curve for 10,000 draws from g1(q).

(Look here for the code to draw the plot: Download G1(theta)_histogram)

The math in Sampling from a Manifold is intense. However, the authors help where they can. Much of the paper is expository providing an introduction to geometric measure theory and amid the literature review the authors are kind enough to point out more elementary references that they think are useful. This is hard work, but it is nice to see that R code has a place among all the abstract ideas.