Revolutions: statistics

April 14, 2020

Forecasting Best Practices, from Microsoft

Microsoft has released a GitHub repository to share best practices for time series forecasting. From the repo:

Time series forecasting is one of the most important topics in data science. Almost every business needs to predict the future in order to make better decisions and allocate resources more effectively.

This repository provides examples and best practice guidelines for building forecasting solutions. The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in forecasting algorithms to build solutions and operationalize them. Rather than creating implementations from scratch, we draw from existing state-of-the-art libraries and build additional utilities around processing and featurizing the data, optimizing and evaluating models, and scaling up to the cloud.

The repository includes detailed examples of various time series modeling techniques, as Jupyter Notebooks for Python, and R Markdown documents for R. It also includes Python notebooks to fit time series models in the Azure Machine Learning service, and then operationalize the forecasts as a web service.

The R examples demonstrate several techniques for forecasting time series, specifically data on refrigerated orange juice sales from 83 stores (sourced from the bayesm package). The forecasting techniques vary (mean forecasting with interpolation, ARIMA, exponential smoothing, and additive models), but all make extensive use of the tidyverts suite of packages, which provides "tidy time series forecasting for R". The forecasting methods themselves are explained in detail in the book (readable online) Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos (Monash University).
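To give a feel for how such methods are compared, here is a minimal sketch that uses only base R rather than the tidyverts packages the repository relies on. It fits a mean forecast, a seasonal ARIMA model and Holt-Winters exponential smoothing to the built-in AirPassengers series (an assumption made for convenience; this is not the orange juice data) and scores each on a held-out year. The model orders are illustrative choices, not those used in the repository.

```r
# Illustrative only: base R stand-ins for the methods named above,
# applied to the built-in AirPassengers series (not the orange juice data).
y     <- log(AirPassengers)
train <- window(y, end = c(1959, 12))
test  <- window(y, start = c(1960, 1))
h     <- length(test)

# 1. Mean forecast: every future value is the historical mean
mean_fc <- rep(mean(train), h)

# 2. Seasonal ARIMA(0,1,1)(0,1,1)[12]; the orders are an illustrative choice
fit_arima <- arima(train, order = c(0, 1, 1),
                   seasonal = list(order = c(0, 1, 1), period = 12))
arima_fc <- predict(fit_arima, n.ahead = h)$pred

# 3. Exponential smoothing with trend and seasonality (Holt-Winters)
fit_hw <- HoltWinters(train)
hw_fc  <- predict(fit_hw, n.ahead = h)

# Root mean squared error on the held-out year
rmse <- function(actual, fc) sqrt(mean((as.numeric(actual) - as.numeric(fc))^2))
accuracy <- c(mean         = rmse(test, mean_fc),
              arima        = rmse(test, arima_fc),
              holt_winters = rmse(test, hw_fc))
print(round(accuracy, 3))
```

Holding out the final year and comparing errors is the same evaluation idea the repository applies at larger scale across stores and brands.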

You can try out the examples yourself by cloning the repository and knitting the RMarkdown files in R. If you have git installed, a quick and easy way to do this is with RStudio. Choose File > New Project > Version Control > Git, and enter https://github.com/microsoft/forecasting in the Repository URL field. (You might prefer to fork the repository first.)

Open each .Rmd file in turn, accept the prompt to install packages, and click the Knit button to generate the document. The computations can take a while (particularly the Prophet Models example), but if you have a multi-core machine the notebooks do use the parallel package to speed things up. If you don't want to wait, the repository does include HTML versions of the rendered documents, made available at the links below via GitHub Pages:

This repository will be updated over time, and contributions are welcome as pull requests to the repository linked below.

GitHub (Microsoft): Forecasting Best Practices

Posted by David Smith at 09:34 in Microsoft, python, R, statistics

February 01, 2019

Tutorial: Sequential Pattern Mining in R for Business Recommendations

by Allison Koenecke, Data Scientist, AI & Research Group at Microsoft, with acknowledgements to Amita Gajewar and John-Mark Agosta.

In this tutorial, Allison Koenecke demonstrates how Microsoft could recommend to customers the next set of services they should acquire as they expand their use of the Azure Cloud, by using a temporal extension to conventional Market Basket Analysis.

Problem Statement

Market Basket Analysis (MBA) answers a standard business question: given a set of grocery store receipts, can we find bundles of products often purchased together (e.g., peanut butter and jelly)? Suppose we instead want to model the evolution of a customer’s services, such as determining whether buying peanut butter in the past indicates a higher likelihood of buying bread in the future. For this, we apply a sequential version of MBA, sometimes called “sequential itemset mining” or “sequential pattern mining”, to introduce a time component to the analysis [1]. Sequential itemset mining has been applied across many industries, from determining a patient’s sequence of medical prescriptions [2] to detecting misuse intrusions such as application layer attacks [3]. In this tutorial, when given a series of purchases over time, we determine whether we can (1) find product bundles that we expect to be consumed simultaneously, and also (2) examine how these bundles evolve over time.

In the following tutorial, we answer both questions using the R package arulesSequences [4], which implements the SPADE algorithm [5]. Concretely, given data in an Excel spreadsheet containing historical customer service purchase data, we produce two separate Excel sheet deliverables: a list of service bundles, and a set of temporal rules showing how service bundles evolve over time. We will focus on interpreting the latter result by showing how to use temporal rules in making predictive sales recommendations.

Our running example below is inspired by the need for Microsoft’s Azure Services salespeople to suggest which additional products to recommend to customers, given the customers’ current cloud product consumption services mix. We’d like to know, for instance, if customers who have implemented web services also purchase web analytics within the next month. Actual Azure Service names have been removed for confidentiality reasons.
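Before turning to arulesSequences, it may help to see what a temporal rule measures. The sketch below is a base R toy, not the SPADE algorithm and not the package's API: the purchase table, the service names and the helper `follows()` are all hypothetical. It computes the support and confidence of the rule "web, then analytics" across customers.

```r
# Toy purchase history: one row per (customer, month, service).
# Customers and service names are hypothetical.
purchases <- data.frame(
  customer = c(1, 1, 1, 2, 2, 3, 3, 4),
  month    = c(1, 2, 3, 1, 2, 1, 3, 1),
  service  = c("web", "analytics", "storage", "web", "analytics",
               "web", "storage", "analytics"),
  stringsAsFactors = FALSE
)

# TRUE if the customer bought `earlier` strictly before buying `later`
follows <- function(df, earlier, later) {
  t_earlier <- df$month[df$service == earlier]
  t_later   <- df$month[df$service == later]
  length(t_earlier) > 0 && length(t_later) > 0 && min(t_earlier) < max(t_later)
}

by_customer <- split(purchases, purchases$customer)
has_web  <- vapply(by_customer, function(d) "web" %in% d$service, logical(1))
has_rule <- vapply(by_customer, follows, logical(1),
                   earlier = "web", later = "analytics")

support    <- mean(has_rule)                 # share of all customers with web -> analytics
confidence <- sum(has_rule) / sum(has_web)   # share of web customers who later add analytics
print(c(support = support, confidence = confidence))
```

A rule with high confidence is a candidate recommendation: customers who already have the first service are the ones to whom the second is suggested.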

Step 1: Reformat Data into Transaction Matrix

Continue reading "Tutorial: Sequential Pattern Mining in R for Business Recommendations" »

Posted by Guest Blogger at 10:30 in Microsoft, R, statistics

October 24, 2018

When the numbers don't tell the whole story

Anscombe's Quartet is a famous collection of four small data sets, each just 11 (x,y) pairs, that was developed in the 1970s to emphasize the fact that sometimes, numerical summaries of data aren't enough. (For a modern take on this idea, see also the Datasaurus Dozen.) In this case, it takes visualizing the data to realize that the four data sets are qualitatively very different, even though the means, variances, and regression coefficients are all the same. In the video below for Guy in a Cube, Buck Woody uses R to summarize the data (which is conveniently built into R) and visualize it using an R script in Power BI.

The video also makes for a nice introduction to R for those new to statistics, and also demonstrates using R code to generate graphics in Power BI.
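The claim about matching summaries is easy to check directly, since the quartet ships with R as the `anscombe` data frame. This short sketch (not the code from the video) computes the means, variances and fitted regression coefficients for each of the four sets.

```r
# anscombe is part of base R's datasets package: columns x1..x4 and y1..y4
summaries <- t(sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(mean_x    = mean(x),
    var_x     = var(x),
    mean_y    = mean(y),
    var_y     = var(y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]))
}))
rownames(summaries) <- paste("set", 1:4)

# One row per data set; the rows agree to about two decimal places
print(round(summaries, 2))
```

Plotting each pair of columns, for example with `plot(anscombe$x1, anscombe$y1)`, is what reveals how different the four sets really are.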

Posted by David Smith at 14:19 in graphics, Microsoft, R, statistics

October 23, 2018

Computer Vision for Model Assessment

One of the differences between statistical data scientists and machine learning engineers is that while the latter group are concerned primarily with the predictive performance of a model, the former group are also concerned with the fit of the model. A model that misses important structures in the data — for example, seasonal trends, or a poor fit to specific subgroups — is likely to be lacking important variables or features in the source data. You can try different machine learning techniques or adjust hyperparameters to your heart's content, but you're unlikely to discover problems like this without evaluating the model fit.

One of the most powerful tools for assessing model fit is the residual plot: a scatterplot of the predicted values of the model versus the residuals (the difference between the predictions and the original data). If the model fits well, there should be no obvious relation between the two. For example, visual inspection shows us that the model below fits fairly well for 19 of the 20 subgroups in the data, but there may be a missing variable that would explain the apparent residual trend for group 19.
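As a small illustration of the idea, the sketch below simulates data with a curved relationship (the data and coefficients are made up for the example), fits one model that omits the curvature and one that includes it, and compares their residuals. The plot is drawn only in an interactive session; the correlation at the end is a rough numeric stand-in for what the eye picks up in the plot.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + 0.15 * x^2 + rnorm(n)   # simulated: the true relationship is curved

fit_linear <- lm(y ~ x)             # omits the quadratic term
fit_quad   <- lm(y ~ x + I(x^2))    # includes it

if (interactive()) {
  # Residual plot: a U-shaped pattern signals structure the model has missed
  plot(fitted(fit_linear), resid(fit_linear),
       xlab = "Fitted values", ylab = "Residuals",
       main = "Linear fit to curved data")
  abline(h = 0, lty = 2)
}

# Correlation between the residuals and the omitted curvature
curvature <- (x - mean(x))^2
leftover_linear <- cor(resid(fit_linear), curvature)
leftover_quad   <- cor(resid(fit_quad), curvature)
print(c(linear = leftover_linear, quadratic = leftover_quad))
```

In real problems the omitted structure is not known in advance, which is why the visual check, or an automated reading of it, is valuable.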

Assessing residual plots like these is something of an art, which is probably why it isn't routine in machine learning circles. But what if we could use deep learning and computer vision to assess residual plots like these automatically? That's what Professor Di Cook from Monash University proposes, in the 2018 Belz Lecture for the Statistical Society of Australia, Human vs computer: when visualising data, who wins?

(The slides for the presentation, created using R, are also available online.) The first part of the talk also includes a statistical interpretation of neural networks, and an intuitive explanation of deep learning for computer vision. The talk also references a related paper, Visualizing Statistical Models: Removing the Blindfold (by Hadley Wickham, Dianne Cook and Heike Hofmann, published in the ASA Data Science Journal in 2015), which is well worth a read.

Posted by David Smith at 11:25 in AI, R, statistics

September 14, 2018

How many deaths were caused by the hurricane in Puerto Rico?

President Trump is once again causing distress by downplaying the number of deaths caused by Hurricane Maria's devastation of Puerto Rico last year. Official estimates initially put the death toll at 15 before raising it to 64 months later, but it was clear even then that those numbers were absurdly low. The government of Puerto Rico commissioned an official report from the Milken Institute School of Public Health at George Washington University (GWU) to obtain a more accurate estimate, and with its interim publication the official toll now stands at 2,975.

Why were the initial estimates so low? I read the interim GWU report to find out. The report itself is clearly written, quite detailed, and composed by an expert team of social and medical scientists, demographers, epidemiologists and biostatisticians, and I find its analysis and conclusions compelling. (Sadly however the code and data behind the analysis have not yet been released; hopefully they will become available when the final report is published.) In short:

In the earliest days of the hurricane, the death-recording office was closed and without power, which suppressed the official count.
Even once death certificates were collected, it became clear that officials throughout Puerto Rico had not been trained on how to record deaths in the event of a natural disaster, and most deaths were not attributed correctly in official records.

Given these deficiencies in the usual data used to calculate death tolls (death certificates), the GWU team used a different approach. The basis of the method was to estimate excess mortality: how many deaths occurred in the post-Maria period compared to the number of deaths that would have been expected if the hurricane had never happened. This calculation required two quantitative studies:

An estimate of what the population would have been if the hurricane hadn't happened. This was based on a GLM model of monthly data from the prior years, accounting for factors including recorded population, normal emigration and mortality rates.
The total number of deaths in the post-Maria period, based on death certificates from the Puerto Rico government (irrespective of how the cause of death was coded).
(A third study examined the communication protocols before, during and after the disaster. This study did not affect the quantitative conclusions, but formed the basis of some of the report's recommendations.)

The difference between the actual mortality, and the estimated "normal" mortality formed the basis for the estimate of excess deaths attributed to the hurricane. You can see those estimates of excess deaths one month, three months, and five months after the event in the table below; the last column represents the current official estimate.
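The arithmetic of an excess-mortality estimate can be sketched in a few lines of R. Everything in this example is synthetic: the monthly counts are simulated, the "observed" post-event figures are invented, and the Poisson GLM with month and year terms is a much simpler model than the one in the report, which also accounts for population change. The code and data behind the report were not available, so this shows only the general shape of the calculation.

```r
set.seed(42)

# Synthetic baseline: seven years of monthly death counts with a seasonal cycle.
# These numbers are invented for illustration; they are NOT the Puerto Rico data.
baseline <- expand.grid(month = 1:12, year = 2010:2016)
seasonal <- 1 + 0.1 * cos(2 * pi * (baseline$month - 1) / 12)
baseline$deaths <- rpois(nrow(baseline), lambda = 2400 * seasonal)
baseline$month  <- factor(baseline$month, levels = 1:12)

# Model of "normal" mortality: seasonal effect plus a linear trend over years
fit <- glm(deaths ~ month + year, family = poisson, data = baseline)

# Hypothetical observed deaths in the four months after the event
post <- data.frame(month    = factor(9:12, levels = 1:12),
                   year     = 2017,
                   observed = c(2900, 3000, 2700, 2750))

# Expected deaths had nothing unusual happened, and the resulting excess
post$expected <- as.numeric(predict(fit, newdata = post, type = "response"))
post$excess   <- post$observed - post$expected
total_excess  <- sum(post$excess)

print(post)
print(round(total_excess))
```

Because the estimate is a difference between observed and modelled counts, its uncertainty depends heavily on how well the baseline model captures normal mortality.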

These results are consistent in scale with an earlier study by Nishant Kishore et al. (The data and R code behind this study are available on GitHub.) This study attempted to quantify deaths attributed to the hurricane directly, by visiting 3299 randomly chosen households across Puerto Rico. At each household, inhabitants were asked about any household members who had died and their cause of death (related to or unrelated to the hurricane), and whether anyone had left Puerto Rico because of the hurricane. From this survey, the paper's authors extrapolated the number of hurricane-related deaths to the entire island. The headline estimate of 4,625 at three months is somewhat larger than the middle column of the study above, but due to the small number of recorded deaths in the survey sample the 95% confidence interval is also much wider: 793 to 8498 excess deaths. (Gelman's blog has some good discussion of this earlier study, including some commentary from the authors.)

With two independent studies reporting excess deaths well into the thousands attributable directly to Hurricane Maria, it's a fair question to ask whether a more effective response before and after the storm could have reduced the scale of this human tragedy.

Milken Institute School of Public Health: Study to Estimate the Excess Deaths from Hurricane Maria in Puerto Rico

Posted by David Smith at 05:49 in current events, R, statistics

March 08, 2018

Compare outlier detection methods with the OutliersO3 package

by Antony Unwin, University of Augsburg, Germany

There are many different methods for identifying outliers and a lot of them are available in R. But are outliers a matter of opinion? Do all methods give the same results?

Articles on outlier methods use a mixture of theory and practice. Theory is all very well, but outliers are outliers because they don’t follow theory. Practice involves testing methods on data, sometimes with data simulated based on theory, better with ‘real’ datasets. A method can be considered successful if it finds the outliers we all agree on, but do we all agree on which cases are outliers?

The Overview Of Outliers (O3) plot is designed to help compare and understand the results of outlier methods. It is implemented in the OutliersO3 package and was presented at last year’s useR! in Brussels. Six methods from other R packages are included (and, as usual, thanks are due to the authors for making their functions available in packages).

An O3 plot of the stackloss dataset. There is one row for each variable combination (defined by the columns to the left) for which outliers were found, and one column for each case identified as an outlier (the columns to the right).

The starting point was a recent proposal of Wilkinson’s, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the OutliersO3 vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.

Wilkinson’s algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them. If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.
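The count of 15 combinations comes from taking every non-empty subset of the four stackloss variables. The sketch below enumerates them in base R and applies a simple stand-in rule to each: squared Mahalanobis distance compared against a chi-square cutoff. This is deliberately not HDoutliers or any of the OutliersO3 methods, and the helper `flag_outliers()` is hypothetical, so the cases it flags should not be expected to match the plots.

```r
vars <- names(stackloss)   # 4 variables, 21 cases

# Every non-empty subset of the variables: 2^4 - 1 = 15 combinations
combos <- unlist(
  lapply(seq_along(vars), function(k) combn(vars, k, simplify = FALSE)),
  recursive = FALSE
)

# Stand-in rule (an assumption for illustration): flag cases whose squared
# Mahalanobis distance exceeds the chi-square quantile at the tolerance level
flag_outliers <- function(cols, tol = 0.05) {
  X  <- as.matrix(stackloss[, cols, drop = FALSE])
  d2 <- mahalanobis(X, colMeans(X), cov(X))
  unname(which(d2 > qchisq(1 - tol, df = length(cols))))
}

flagged <- lapply(combos, flag_outliers)
names(flagged) <- vapply(combos, paste, character(1), collapse = " + ")

# Keep only the combinations where something was flagged, as an O3 plot does
print(Filter(length, flagged))
```

Lowering `tol` raises the cutoff, so fewer cases are flagged, which mirrors the convention OutliersO3 adopts for its tolerance level.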

An O3 plot comparing outliers identified by HDoutliers and mvBACON in the stackloss dataset.

Trying another method with tolerance level=0.05 (mvBACON from robustX) identifies 5 outliers, all ones found for more than one variable combination by HDoutliers. However, no outliers are found for the whole dataset and only one of the three variable combinations where outliers are found is a combination where HDoutliers finds outliers. Of course, the two methods are quite different and it would be strange if they agreed completely. Is it strange that they do not agree more?

There are four other methods available in OutliersO3, and using all six methods on stackloss with a tolerance level of 0.05 identifies the following numbers of outliers:

##    HDo    PCS    BAC adjOut    DDC    MCD 
##     14      4      5      0      6      5

An O3 plot of stackloss using the methods HDoutliers, FastPCS, mvBACON, adjOutlyingness, DetectDeviatingCells, covMCD. The darker the cell, the more methods agree. If they all agree, the cell is coloured red and if all but one agree then orange. No case is identified by all the methods as an outlier for any combination of variables when the tolerance level is set at 0.05 for all.

Each method uses what I have called the tolerance level in a rather different way. Sometimes it is called alpha and sometimes (1-alpha). As so often with R, you start wondering if more consistency would not be out of place, even at the expense of a little individuality. OutliersO3 transforms where necessary to ensure that lower tolerance level values mean fewer outliers for all methods, but no attempt has been made to calibrate them equivalently. This is probably why adjOutlyingness finds few or no outliers (results of this method are mildly random). The default value, according to adjOutlyingness’s help page, is an alpha of 0.25.

The stackloss dataset is an odd dataset and small enough that each individual case can be studied in detail (cf. Dodge’s paper for just how much detail). However, similar results have been found with other datasets (milk, Election2005, diamonds, …). The main conclusion so far is that different outlier methods identify different numbers of different cases for different combinations of variables as different from the bulk of the data (i.e. as potential outliers)—or are these datasets just outlying examples?

There are other outlier methods available in R and they will doubtless give yet more different results. The recommendation has to be to proceed with care. Outliers may be interesting in their own right, they may be errors of some kind—and we may not agree whether they are outliers at all.

[Find the R code for generating the above plots here: OutliersUnwin.Rmd]

Posted by Guest Blogger at 09:30 in packages, R, statistics

March 06, 2018

Choosing Priors: Double-Yolk Bayesian Egg

by Subhadeep (Deep) Mukhopadhyay and Douglas Fletcher, Department of Statistical Science, Temple University

Bayesians and Frequentists have long been ambivalent toward each other. The concept of “Prior” remains the center of this 250-year-old tug-of-war: frequentists view the prior as a weakness that can cloud the final inference, whereas Bayesians view it as a strength to incorporate expert knowledge into the data analysis. So, the question naturally arises: how can we develop a Bayes-frequentist consolidated data analysis workflow that enjoys the best of both worlds?

To develop a “defendable and defensible” Bayesian learning model, we have to go beyond blindly ‘turning the crank’ based on a “go-as-you-like” [approximate guess] prior. A lackluster attitude towards prior modeling could lead to disastrous inference, impacting various fields from clinical drug development to presidential election forecasts. The real questions are: How can we uncover the blind spots of the conventional wisdom-based prior? How can we develop the science of prior model-building that combines both data and science [DS-prior] in a testable manner – a double-yolk Bayesian egg? Unfortunately, these questions are outside the scope of business-as-usual Bayesian modus operandi and require new ideas, which is the goal of the paper, “Bayesian Modeling via Goodness of Fit”. In the following, we demonstrate how to prepare the “Bayesian omelet” — the operational part — using the R package BayesGOF.

Our model-building approach proceeds sequentially as follows:

it starts with a scientific (or empirical) parametric prior \(g(\theta;\alpha,\beta)\),
inspects the adequacy and the remaining uncertainty of the elicited prior using a graphical exploratory tool,
estimates the necessary “correction” for assumed \(g\) by looking at the data,
generates the final statistical estimate \(\hat \pi(\theta)\), and
executes macro and micro-level inference.

Our algorithmic solution yields answers to all five of the phases using one single algorithm, which we will now demonstrate for rat tumor data. The rat tumor data consists of observations of endometrial stromal polyp incidence in \(k=70\) groups of rats. For each group, \(y_i\) is the number of rats with polyps and \(n_i\) is the total number of rats in the experiment. The dataset is available in the R package BayesGOF.

The Rat-data model: \(y_i\,\overset{{\rm ind}}{\sim}\,\mbox{Binomial}( n_i, \theta_i)\), \(i=1,\ldots,k\), where the unobserved parameters \(\theta=(\theta_1,\ldots,\theta_k)\) are independent realizations from the unknown \(\pi(\theta)\).

Step 1. We begin by finding the starting parameter values for parametric conjugate \(g \sim Beta(\alpha, \beta)\):

library(BayesGOF)
set.seed(8697)
data(rat)
###Use MLE to determine starting values
rat.start <- gMLE.bb(rat$y, rat$n)$estimate

We use our starting parameter values to run the main DS.prior function:

rat.ds <- DS.prior(rat, max.m = 6, rat.start, family = "Binomial")

Next we will discuss how to interpret and use this rat.ds object for exploratory Bayes modeling and prior uncertainty quantification.

Step 2. We display the U-function to quantify and characterize the uncertainty of the a priori selected \(g\):

plot(rat.ds, plot.type = "Ufunc")

The deviations from the uniform distribution (the red dashed line) indicate that our initial selection for \(g\), \(\text{Beta}(\alpha = 2.3,\beta = 14.1)\), is incompatible with the observed data and requires repair; the data indicate that there are, in fact, two different groups of incidence in the rats.

Step 3a. Extract the parameters for the nonparametrically corrected prior \(\hat{\pi}\):

rat.ds

## $g.par
##     alpha      beta 
##  2.304768 14.079707 
## 
## $LP.coef
##        LP1        LP2        LP3 
##  0.0000000  0.0000000 -0.5040361

Therefore, our estimated DS(G, m) prior is given by: \[\hat{\pi}(\theta) = g(\theta; \alpha,\beta)\Big[1 - 0.50\,T_3(\theta;G) \Big].\]

The DS-prior has a unique two-component structure that combines parametric \(g\), and a nonparametric \(d\) (which we call the U-function). Here \(T_j(\theta;G)\), \(j = 1,\ldots,m\) are a specialized orthonormal basis given by \(\text{Leg}_j[G(\theta)]\), members of LP-class of rank-polynomials. Note that \({\rm DS}(G,m=0) \equiv g(\theta;\alpha,\beta)\). The truncation point \(m\) reflects the concentration of true unknown \(\pi\) around the pre-selected \(g\).
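To make the formula concrete, the following base R sketch evaluates the corrected prior from the printed \(g\) parameters and LP coefficient, without calling BayesGOF. It assumes that \(\text{Leg}_j\) denotes the orthonormal shifted Legendre polynomials on \([0,1]\), so that \(\text{Leg}_3(u) = \sqrt{7}\,(20u^3 - 30u^2 + 12u - 1)\); the function names `leg3` and `ds_prior` are ours, not the package's.

```r
# Values copied from the printed rat.ds output above
alpha <- 2.304768
beta  <- 14.079707
lp3   <- -0.5040361

# Assumption: Leg_3 is the third orthonormal shifted Legendre polynomial on [0, 1]
leg3 <- function(u) sqrt(7) * (20 * u^3 - 30 * u^2 + 12 * u - 1)

# DS prior: parametric g multiplied by the correction d(G(theta)) = 1 + LP3 * Leg_3(G(theta))
ds_prior <- function(theta) {
  u <- pbeta(theta, alpha, beta)            # G(theta), the CDF of g
  dbeta(theta, alpha, beta) * (1 + lp3 * leg3(u))
}

# Leg_3 integrates to zero on [0, 1], so the correction leaves the total mass at 1
total_mass <- integrate(ds_prior, 0, 1)$value
print(total_mass)

if (interactive()) {
  curve(ds_prior(x), from = 0, to = 0.5, xlab = "theta", ylab = "density")
  curve(dbeta(x, alpha, beta), add = TRUE, lty = 2)   # original parametric g
}
```

A truncated orthogonal expansion of this kind is not guaranteed to stay non-negative everywhere, so this direct evaluation is best read as an illustration of the formula rather than a replacement for the package's own plotting and inference functions.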

Step 3b. Plot the estimated DS prior \(\hat{\pi}\) along with the original parametric \(g\):

plot(rat.ds, plot.type = "DSg")

MacroInference

The term "MacroInference" aims to answer the following question: How to combine \(k\) binomial parameters to come up with an overall, macro-level aggregated statistical behavior of \(\theta_1,\ldots,\theta_k\)? This is often important in applied analysis, as the limited sample size of a single study hardly provides adequate evidence for a definitive conclusion.

Step 4. Here we are interested in the overall macro-level inference by combining the \(k=70\) parallel studies. The group-specific modes along with their SEs can be computed as follows:

rat.macro.md <- DS.macro.inf(rat.ds, num.modes = 2 , iters = 25, method = "mode")

rat.macro.md

##      1SD Lower Limit   Mode 1SD Upper Limit
## [1,]          0.0161 0.0340          0.0520
## [2,]          0.1442 0.1562          0.1681

plot(rat.macro.md)

MicroInference

"MicroInference" refers to the process of using information from historical studies to improve the estimates of one or more studies of particular interest. This is known as “borrowing strength” in Bayesian inference literature. It is worth noting that the classical Stein’s shrinkage does not work for rat data due to the presence of multiple partially exchangeable studies. Our adaptive (or selective) shrinkage technology selectively borrows strength from ‘similar’ experiments in an automated manner, by answering the important question: where to shrink?

Step 5. In addition to the earlier \(k=70\) studies for the rat tumor data, we have a current experimental study that shows \(y_{71}=4\) out of \(n_{71}=14\) rats developed tumors. The following code performs the desired microinference for \(\theta_{71}\) (posterior distribution along with its mean and mode):

rat.y71.micro <- DS.micro.inf(rat.ds, y.0 = 4, n.0 = 14)
rat.y71.micro

## Posterior summary for y = 4, n = 14:
##  Posterior Mean = 0.1897
##  Posterior Mode = 0.1833
## Use plot(x) to generate posterior plot

The left plot (a) compares the posterior distributions for the parametric \(g\) (blue) and the DS posterior (red). The right plot (b) compares our adaptive shrinkage with Stein’s estimates. The vertical red triangles indicate the modes of the DS prior, while the blue triangle is the mode of the parametric \(g\). For additional real-data examples, please see the package vignette.
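For reference, the parametric comparison can be worked out by hand, because a Beta prior is conjugate to the binomial likelihood: the posterior under \(g\) is \(\text{Beta}(\alpha + y, \beta + n - y)\). The sketch below computes its mean and mode from the printed \(g\) parameters and checks the closed form against numerical integration. It covers only the parametric \(g\) posterior, not the DS posterior whose summary is printed above.

```r
# Parametric prior g = Beta(alpha, beta), values copied from the printed rat.ds output
alpha <- 2.304768
beta  <- 14.079707

# New study: 4 of 14 rats developed tumors
y0 <- 4
n0 <- 14

# Conjugate update: Beta(alpha + y, beta + n - y)
a_post <- alpha + y0
b_post <- beta + (n0 - y0)
post_mean <- a_post / (a_post + b_post)
post_mode <- (a_post - 1) / (a_post + b_post - 2)

# Numerical check of the closed form: prior times likelihood, normalized
unnorm   <- function(theta) dbeta(theta, alpha, beta) * dbinom(y0, n0, theta)
evidence <- integrate(unnorm, 0, 1)$value
numeric_mean <- integrate(function(theta) theta * unnorm(theta), 0, 1)$value / evidence

print(c(mean = post_mean, mode = post_mode, numeric_mean = numeric_mean))
```

Any gap between these values and the DS posterior summary reflects the nonparametric correction to the prior.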

Conclusion

Almost all modern scientific research utilizes domain knowledge and data to come up with breakthrough results. But the fundamental problem of how to fuse this “approximate” scientific prior knowledge with the data at hand is not a settled issue even 250 years after the discovery of Bayes' law (for more details, see “Bayes’ Theorem in the 21st Century” and “Statistical thinking for 21st century scientists”). Bayesian modeling via goodness-of-fit technology, synthesized in the R package BayesGOF, allows us to determine a scientific prior that is consistent with the data at hand, in a systematic and principled way.

Posted by David Smith at 09:30 in packages, R, statistics

December 18, 2017

The Trouble with Bias, by Kate Crawford

Bias is a major issue in machine learning. But can we develop a system to "un-bias" the results? In this keynote at NIPS 2017, Kate Crawford argues that treating this as a technical problem means ignoring the underlying social problem, and has the potential to make things worse.

You can read more about biases in AI systems in this article at the Microsoft AI blog.

Posted by David Smith at 09:00 in AI, Microsoft, statistics

December 05, 2017

On the biases in data

Whether we're developing statistical models, training machine learning recognizers, or developing AI systems, we start with data. And while the suitability of that data set is, lamentably, sometimes measured by its size, it's always important to reflect on where those data come from. Data are not neutral: the data we choose to use have profound impacts on the resulting systems we develop. A recent article in Microsoft's AI Blog discusses the inherent biases found in many data sets:

“The people who are collecting the datasets decide that, ‘Oh this represents what men and women do, or this represents all human actions or human faces.’ These are types of decisions that are made when we create what are called datasets,” she said. “What is interesting about training datasets is that they will always bear the marks of history, that history will be human, and it will always have the same kind of frailties and biases that humans have.”
— Kate Crawford, Principal Researcher at Microsoft Research and co-founder of AI Now Institute.

“When you are constructing or choosing a dataset, you have to ask, ‘Is this dataset representative of the population that I am trying to model?’”
— Hanna Wallach, Senior Researcher at Microsoft Research NYC.

The article discusses the consequences of data sets that aren't representative of the populations they are used to analyze, and also the consequences of the lack of diversity in the fields of AI research and implementation. Read the complete article at the link below.

Microsoft AI Blog: Debugging data: Microsoft researchers look at ways to train AI systems to reflect the real world

Posted by David Smith at 07:15 in AI, python, R, statistics

November 08, 2017

Calculating the house edge of a slot machine, with R

Modern slot machines (fruit machines, pokies, or whatever those electronic gambling devices are called in your part of the world) are designed to be addictive. They're also usually quite complicated, with a bunch of features that affect the payout of a spin: multiple symbols with different pay scales, wildcards, scatter symbols, free spins, jackpots ... the list goes on. Many machines also let you play multiple combinations at the same time (20 lines, or 80, or even more with just one spin). All of this complexity is designed to make it hard for you, the player, to judge the real odds of success. But rest assured: in the long run, you always lose.

All slot machines are designed to have a "house edge" — the percentage of player bets retained by the machine in the long run — greater than zero. Some may take 1% of each bet (over a long-run average); some may take as much as 15%. But every slot machine takes something.

That being said, with all those complex rules and features, trying to calculate the house edge, even when you know all of the underlying probabilities and frequencies, is no easy task. Giora Simchoni demonstrates the problem with an R script to calculate the house edge of an "open source" slot machine, Atkins Diet. Click the image below to try it out.

This virtual machine is at a typical level of complexity of modern slot machines. Even though we know the pay table (which is always public) and the relative frequency of the symbols on the reels (which usually isn't), calculating the house edge for this machine requires several pages of code. You could calculate the expected return analytically, of course, but it turns out to be a somewhat error-prone combinatorial problem. The simplest approach is to simulate playing the machine 100,000 times or so. Then we can have a look at the distribution of the payouts over all of these spins:

The x axis here is log(Total Wins + 1), in log-dollars, from a single spin. It's interesting to see the impact of the bet size (which increases variance but doesn't change the distribution), and the number of lines played. Playing one 20-line game isn't the same as playing 20 1-line games, because the re-use of the symbols means multi-line wins are not independent: a high-value symbol (like a wild) may contribute to wins on multiple lines. Conversely, losing combinations have a tendency to cluster together, too. It all balances in the end, but the possibility of more frequent wins (coupled with higher-value losses) is apparently appealing to players, since many machines encourage multi-line play.

Nonetheless, whichever method you play, the house edge is always positive. For Atkins Diet, it's between 3% and 4%. (The simulations suggest 4% for single-line play and about 3.2% for multi-line play, but per-line expected returns are the same in each case.) You can see the details of the calculation, and the complete R code behind it, at the link below.
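The simulation approach is easy to demonstrate on a machine simple enough to solve exactly. The sketch below defines a hypothetical three-reel, single-line machine (the symbols, reel probabilities and pay table are invented and have nothing to do with Atkins Diet), computes its house edge analytically, and then estimates the same quantity from 100,000 simulated spins.

```r
set.seed(2017)

# Hypothetical machine: three identical reels, pays only on three of a kind
symbols   <- c("cherry", "bar", "seven")
reel_prob <- c(cherry = 0.6, bar = 0.3, seven = 0.1)   # chance of each symbol per reel
payout    <- c(cherry = 2, bar = 12, seven = 200)      # return per 1-unit bet

# Exact answer: P(three of a kind) is prob^3 for each symbol
expected_return <- sum(reel_prob^3 * payout)
exact_edge      <- 1 - expected_return

# Simulated answer: spin the reels many times and average the returns
n_spins <- 100000
reels <- matrix(sample(symbols, 3 * n_spins, replace = TRUE, prob = reel_prob),
                ncol = 3)
win     <- reels[, 1] == reels[, 2] & reels[, 2] == reels[, 3]
returns <- ifelse(win, payout[reels[, 1]], 0)
simulated_edge <- 1 - mean(returns)

# Standard error of the simulated edge: rare large payouts make it noisy
std_error <- sd(returns) / sqrt(n_spins)

print(c(exact = exact_edge, simulated = simulated_edge, std_error = std_error))
```

Even with 100,000 spins the standard error is a couple of percentage points here, because the rare jackpot dominates the variance. That is one reason simulated edges for different modes of play can differ by a fraction of a percent while the underlying expected returns are the same.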

Giora Simchoni: Don't Drink and Gamble (via the author)

Posted by David Smith at 14:32 in R, statistics
