by Joseph Rickert
One of the most interesting R-related presentations at last week’s Strata Hadoop World Conference in New York City was the session on Distributed R by Sunil Venkayala and Indrajit Roy, both of HP Labs. In short, Distributed R is an open source project with the end goal of running R code in parallel on data that is distributed across multiple machines. The following figure conveys the general idea.
A master node controls multiple worker nodes each of which runs multiple R processes in parallel.
As I understand it, the primary use case for the Distributed R software is to move data quickly from a database into distributed data structures that can be accessed by multiple, independent R instances for coordinated, parallel computation. The Distributed R infrastructure automatically takes care of the extraction of the data and the coordination of the calculations, including the occasional movement of data from a worker node to the master node when required by the calculation. The user interface to the Distributed R mechanism is through R functions that have been designed and optimized to work with the distributed data structures, and through a special “Distributed R aware” foreach() function that allows users to write their own distributed functions using ordinary R functions.
To make all of this happen, the Distributed R platform contains several components that may be briefly described as follows:
The distributed R package contains:
A really nice feature of the distributed data structures is that they can be populated and accessed by rows, columns and blocks making it possible to write efficient algorithms tuned to the structure of particular data sets. For example, data cleaning for wide data sets (many more columns than rows) can be facilitated by preprocessing individual features.
vRODBC is an ODBC client that provides R with database connectivity. This is the connection mechanism that permits the parallel loading of data from various data sources, including HP’s Vertica database.
The HPdata package contains the functions that allow you to actually load distributed data structures from various data sources.
The HPDGLM package implements parallel, distributed GLM models (presently only linear regression, logistic regression and Poisson regression models are available). The package also contains functions for cross validation and split-sample validation.
The HPdclassifier package is intended to contain several distributed classification algorithms. It currently contains a parallel distributed implementation of the random forests algorithm.
The HPdcluster package contains a parallel, distributed k-means algorithm.
The HPdgraph package is intended to contain distributed algorithms for graph analytics. It currently contains a parallel, distributed implementation of the PageRank algorithm for directed graphs.
The following sample code, taken directly from the HPdclassifier User Guide, but modified slightly for presentation here, is similar to the examples that Venkayala and Roy showed in their presentation. Note that after the distributed arrays are set up they are loaded in parallel with data using the foreach function from the distributedR package.
library(HPdclassifier) # loading the library
Loading required package: distributedR
Loading required package: Rcpp
Loading required package: RInside
Loading required package: randomForest
distributedR_start() # starting the distributed environment
Workers registered  1/1.
All 1 workers are registered.
[1] TRUE
ds <- distributedR_status()   # cluster status; assumed here to supply the Inst column used below
nparts <- sum(ds$Inst) # number of available distributed instances
# Describe the data
nSamples <- 100 # number of samples
nAttributes <- 5 # number of attributes of each sample
nSplits <- 1 # number of splits in each darray
# Create the distributed arrays
dax <- darray(c(nSamples,nAttributes), c(round(nSamples/nSplits),nAttributes))
day <- darray(c(nSamples,1), c(round(nSamples/nSplits),1))
# Load the distributed arrays
foreach(i, 1:npartitions(dax),
function(x=splits(dax,i),y=splits(day,i),id=i){
x <- matrix(runif(nrow(x)*ncol(x)), nrow(x), ncol(x))
y <- matrix(runif(nrow(y)), nrow(y), 1)
update(x)
update(y)
})
# Fit the Random Forest Model
myrf <- hpdrandomForest(dax, day, nExecutor=nparts)
# prediction
dp <- predictHPdRF(myrf, dax)
Notwithstanding all of its capabilities, Distributed R is still clearly a work in progress. It is only available on Linux platforms. Algorithms and data must be resident in memory. Distributed R is not available on CRAN, and even with an excellent Installation Guide, installing the platform is a bit of an involved process.
Nevertheless, Distributed R is impressive, and I think a valuable contribution to open source R. I expect that users with distributed data will find the platform to be a viable way to begin high performance computing with R.
Note that the Distributed R project discussed in this post is an HP initiative and is not in any way related to http://www.revolutionanalytics.com/revolutionrenterprisedistributedr.
Many R scripts depend on CRAN packages, and most CRAN packages in turn depend on other CRAN packages. If you install an R package, you'll also be installing its dependencies to make it work, and possibly other packages as well to enable its full functionality.
My colleague Andrie posted some R code to map package dependencies a couple of months ago, but now you can easily explore the dependencies of any CRAN package at MRAN. Simply search for a package and click the Dependencies Graph tab. Here's a very simple one: the foreach package.
The foreach package depends on two others: iterators and codetools, which will be automatically installed for you by install.packages when you install foreach. (We'll discuss the use of "Suggests" — as here with randomForest — later.) Now let's look at a more complex example: the caret package.
The caret package provides an interface to many of the predictive modeling packages on CRAN, and so it has several dependencies (nine, in fact — you can see the list by clicking on the Dependencies Table tab). But it also Suggests many more packages — these are packages that are not required to run caret, but if you do have them, there are more model types you can use within the caret framework.
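If you prefer to query dependencies from the R console rather than browse the MRAN charts, base R's tools package can report the same relationships. Here is a minimal sketch (the package names are just the examples discussed above):

db <- available.packages()

# Hard dependencies of foreach (Depends, Imports and LinkingTo by default)
tools::package_dependencies("foreach", db = db)

# Packages that caret merely Suggests -- optional, but they enable extra model types
tools::package_dependencies("caret", db = db, which = "Suggests")

# Follow dependencies recursively to see the full installation footprint
tools::package_dependencies("caret", db = db, recursive = TRUE)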
Here's a quick overview of the types of dependencies you'll find in the charts and tables on MRAN:
MRAN is updated daily, and so the Dependencies Graph is always up to date with the latest CRAN packages and their connections. Start exploring at the link below.
MRAN: Explore Packages
The ability to create reproducible research is an important topic for many users of R. So important, that several groups in the R community have tackled this problem. Notably, packrat from RStudio, and gRAN from Genentech (see our previous blog post).
The Reproducible R Toolkit is a new open-source initiative from Revolution Analytics. It takes a simple approach to dealing with R package versions, consisting of an R package, checkpoint, and an associated daily CRAN snapshot archive, checkpoint-server. Here's one illustration of the problem it solves (with apologies to xkcd):
To achieve reproducibility, we store daily snapshots of all CRAN packages. At midnight UTC each day we refresh the CRAN mirror and then store a snapshot of CRAN as it exists at that very moment. You can access these daily snapshots using the checkpoint package, which installs and consistently uses these packages just as they existed at the snapshot date. Daily snapshots exist starting from 2014-09-17.
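Conceptually, each daily snapshot behaves like a frozen CRAN repository that install.packages() could be pointed at directly; the checkpoint package simply automates this. As a rough sketch (the server address below is a placeholder, not the real URL):

# Hypothetical snapshot address -- checkpoint resolves the real location for you
snapshot_url <- "https://<checkpoint-server>/snapshot/2014-09-17"

# Installing against the frozen repository yields package versions exactly as they existed on that date
install.packages("ggplot2", repos = snapshot_url)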
checkpoint package
The goal of the checkpoint package is to solve the problem of package reproducibility in R. Since packages get updated on CRAN all the time, it can be difficult to recreate an environment where all your packages are consistent with some earlier state. To solve this issue, checkpoint allows you to install packages locally as they existed on a specific date from the corresponding snapshot (stored on the checkpoint server) and it configures your R session to use only these packages. Together, the checkpoint package and the checkpoint server act as a "CRAN time machine", so that anyone using checkpoint can ensure the reproducibility of scripts or projects at any time.
How to use checkpoint
Once you have the checkpoint package installed, using the checkpoint() function is as simple as adding the following lines to the top of your script:
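A minimal version of those lines, using the earliest available snapshot date as an example, would look something like this:

library(checkpoint)
checkpoint("2014-09-17")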
Typically, you will use the date you created the script as the argument to checkpoint. The first time you run the script, checkpoint will inspect your script (and other R files in the same project folder) for the packages used, and install the required packages with versions as of the specified date. (The first time you run the script, it will take some time to download and install the packages, but subsequent runs will use the previously installed package versions.)
The checkpoint package installs the packages in a folder specific to the current project (in a subfolder of
If you want to update the packages you use at a later date, just update the date in the checkpoint() call and checkpoint() will automatically update the locally installed packages.
The checkpoint package is available on CRAN:
Worked example
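As a sketch of what a checkpointed script might look like in practice (the package and date here are purely illustrative), consider:

library(checkpoint)
checkpoint("2014-09-17")

# checkpoint scans the project for library() and require() calls, installs those
# packages as they existed on 2014-09-17 into a project-specific library, and
# points this R session at that library.
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")
print(p)

Re-running this script months later should reproduce the same results, because the same ggplot2 version (and its dependencies) will be used.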
The Reproducible R Toolkit was created by the Open Source Solutions group at Revolution Analytics. Special thanks go to Scott Chamberlain who helped with early development.
We'd love to know what you think about checkpoint. Leave comments here on the blog, or via the checkpoint GitHub page.
by Andrie deVries
One of the reasons that R is so popular is the CRAN archive of useful packages. However, with more than 5,900 packages on CRAN, many organisations need to maintain a private mirror of CRAN with only a subset of packages that are relevant to them.
The package miniCRAN makes this possible by determining the dependency tree for a given set of packages, then downloading all of the package dependencies. (My previous post showed how to do this.)
There are many reasons for not creating a complete mirror of CRAN using rsync:
The ambition of miniCRAN is to eventually satisfy many of these considerations. For example, the GitHub version of miniCRAN already allows you to draw a dependency graph using packages on CRAN as well as GitHub. In due course we plan to extend the package to also download packages from any public repository or private file location, as well as GitHub packages.
You can find miniCRAN on CRAN. To install the package, use
install.packages("miniCRAN")
library("miniCRAN")
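As a rough illustration of the basic workflow (the argument names follow the package's introductory vignette, but treat the details as a sketch rather than a definitive recipe):

library(miniCRAN)

# Determine the full dependency tree for the packages you care about
pkgs <- pkgDep(c("foreach", "data.table"), suggests = FALSE)
pkgs

# Download those packages -- and only those -- into a local, CRAN-like repository
pth <- file.path(tempdir(), "miniCRAN")
dir.create(pth)
makeRepo(pkgs, path = pth, repos = "http://cran.r-project.org", type = "source")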
During September I gave a presentation about miniCRAN at the first EARL conference in London:
For more information about miniCRAN, take a look at this Introduction to using miniCRAN. To suggest improvements or other features, please visit my miniCRAN page on GitHub. I hope you find it useful!
Hadley Wickham's dplyr package is a great toolkit for getting data ready for analysis in R. If you haven't yet taken the plunge to using dplyr, Kevin Markham has put together a great hands-on video tutorial for his Data School blog, which you can see below. The video covers the five main data-manipulation "verbs" that dplyr provides: filter, select, arrange, mutate and summarise/group_by. (It also introduces the glimpse function, a handy alternative to str, that I had overlooked before.)
The video also provides an introduction to the %>% ("then") operator from magrittr, which you'll likely find useful for many other applications in addition to dplyr. Also, Kevin's video works from an R Markdown script to show how dplyr works, and so serves as a mini-tutorial for R Markdown as well. It's well worth 40 minutes of your time. Also, check out Kevin's blog post linked below for links to many other useful dplyr resources.
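If you would like a quick taste before watching, here is a minimal sketch of the five verbs and the %>% operator applied to the built-in mtcars data (not the data set used in the video):

library(dplyr)

mtcars %>%
  filter(cyl == 4) %>%                  # keep only 4-cylinder cars
  select(mpg, wt, gear) %>%             # keep a few columns
  arrange(desc(mpg)) %>%                # sort by fuel economy
  mutate(wt_lbs = wt * 1000) %>%        # derive a new column
  group_by(gear) %>%                    # then summarise within each gear group
  summarise(avg_mpg = mean(mpg))

# glimpse() is the compact alternative to str() mentioned above
glimpse(mtcars)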
Data School: Hands-on dplyr tutorial for faster data manipulation in R (via Peter Aldhous)
by Joseph Rickert
One of the most difficult things about R, a problem that is particularly vexing to beginners, is finding things. This is an unintended consequence of R's spectacular, but mostly uncoordinated, organic growth. The R core team does a superb job of maintaining the stability and growth of the R language itself, but the innovation engine for new functionality is largely in the hands of the global R community.
Several structures have been put in place to address various aspects of the finding-things problem. For example, Task Views represent a monumental effort to collect and classify R packages. The RSeek site is an effective tool for web searches. R-bloggers is a good place to go for R applications, and CRANberries lets you know what's new. But how do you find things that you didn't even know you were looking for? For this, the so-called "misc packages" can be very helpful. Whereas the majority of R packages are focused on a particular type of analysis, class of models, or special tool, misc packages tend to be collections of functions that facilitate common tasks. (Look below for a partial list.)
DescTools is a new entry to the misc package scene that I think could become very popular. The description for the package begins:
DescTools contains a bunch of basic statistic functions and convenience wrappers for efficiently describing data, creating specific plots, doing reports using MS Word, Excel or PowerPoint. The package's intention is to offer a toolbox, which facilitates the (notoriously time consuming) first descriptive tasks in data analysis, consisting of calculating descriptive statistics, drawing graphical summaries and reporting the results. Many of the included functions can be found scattered in other packages and other sources written partly by Titans of R.
So far, of the 380 functions in this collection, the Desc function has my attention. This function provides very nice tabular and graphic summaries of the variables in a data frame, with output that is specific to the data type. The d.pizza data frame that comes with the package has a nice mix of data types:
head(d.pizza)
  index       date week weekday        area count rabate  price operator  driver delivery_min temperature wine_ordered wine_delivered
1     1 2014-03-01    9       6      Camden     5   TRUE 65.655   Rhonda  Taylor         20.0        53.0            0              0
2     2 2014-03-01    9       6 Westminster     2  FALSE 26.980   Rhonda Butcher         19.6        56.4            0              0
3     3 2014-03-01    9       6 Westminster     3  FALSE 40.970  Allanah Butcher         17.8        36.5            0              0
4     4 2014-03-01    9       6       Brent     2  FALSE 25.980  Allanah  Taylor         37.3          NA            0              0
5     5 2014-03-01    9       6       Brent     5   TRUE 57.555   Rhonda  Carter         21.8        50.0            0              0
6     6 2014-03-01    9       6      Camden     1  FALSE 13.990  Allanah  Taylor         48.7        27.0            0              0
  wrongpizza quality
1      FALSE  medium
2      FALSE    high
3      FALSE    <NA>
4      FALSE    <NA>
5      FALSE  medium
6      FALSE     low
Here is some of the voluminous output from the function. The data frame as a whole is summarized as follows
'data.frame': 1209 obs. of 16 variables:
 $ index         : int 1 2 3 4 5 6 7 8 9 10 ...
 $ date          : Date, format: "2014-03-01" "2014-03-01" "2014-03-01" "2014-03-01" ...
 $ week          : num 9 9 9 9 9 9 9 9 9 9 ...
 $ weekday       : num 6 6 6 6 6 6 6 6 6 6 ...
 $ area          : Factor w/ 3 levels "Brent","Camden",..: 2 3 3 1 1 2 2 1 3 1 ...
 $ count         : int 5 2 3 2 5 1 4 NA 3 6 ...
 $ rabate        : logi TRUE FALSE FALSE FALSE TRUE FALSE ...
 $ price         : num 65.7 27 41 26 57.6 ...
 $ operator      : Factor w/ 3 levels "Allanah","Maria",..: 3 3 1 1 3 1 3 1 1 3 ...
 $ driver        : Factor w/ 7 levels "Butcher","Carpenter",..: 7 1 1 7 3 7 7 7 7 3 ...
 $ delivery_min  : num 20 19.6 17.8 37.3 21.8 48.7 49.3 25.6 26.4 24.3 ...
 $ temperature   : num 53 56.4 36.5 NA 50 27 33.9 54.8 48 54.4 ...
 $ wine_ordered  : int 0 0 0 0 0 0 1 NA 0 1 ...
 $ wine_delivered: int 0 0 0 0 0 0 1 NA 0 1 ...
 $ wrongpizza    : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ quality       : Ord.factor w/ 3 levels "low"<"medium"<..: 2 3 NA NA 2 1 1 3 3 2 ...
The factor variable driver gets a table and a plot.
10 - driver (factor)

  length      n    NAs  levels  unique  dupes
   1'209  1'204      5       7       7      y

   level      freq   perc  cumfreq  cumperc
1  Carpenter   272   .226      272     .226
2  Carter      234   .194      506     .420
3  Taylor      204   .169      710     .590
4  Hunter      156   .130      866     .719
5  Miller      125   .104      991     .823
6  Farmer      117   .097     1108     .920
7  Butcher      96   .080     1204    1.000
and so does the numeric variable delivery_min.
11 - delivery_min (numeric)

  length      n    NAs  unique     0s    mean  meanSE
   1'209  1'209      0     384      0  25.653   0.312

     .05     .10     .25  median     .75     .90     .95
  10.400  11.600  17.400  24.400  32.500  40.420  45.200

     rng      sd   vcoef     mad     IQR    skew    kurt
  56.800  10.843   0.423  11.268  15.100   0.611   0.095

lowest : 8.8 (3), 8.9, 9 (3), 9.1 (5), 9.2 (3)
highest: 61.9, 62.7, 62.9, 63.2, 65.6

Shapiro-Wilks normality test p.value : 2.2725e-16
Pretty nice for an automatic first look at the data.
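If you want to try this yourself, the output above can be generated with just a couple of lines; as I read the documentation, the plotit argument also triggers the accompanying plots:

library(DescTools)

# Type-specific summaries (and plots) for every variable in the data frame
Desc(d.pizza, plotit = TRUE)

# Or describe a single variable
Desc(d.pizza$delivery_min, plotit = TRUE)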
For some more R treasure hunting have a look into the following short list of misc packages.
- Tools for manipulating data (the No. 1 package downloaded in 2013)
- Convenience wrappers for functions for manipulating strings
- One of the most popular R packages of all time: functions for data analysis, graphics, utilities and much more
- Package development tools
- The "go to" package for machine learning, classification and regression training
- A good SVM implementation and other machine learning algorithms
- Tools for describing data and computing descriptive statistics
- Tools for plotting decision trees
- Functions for numerical analysis, linear algebra, optimization, differential equations and some special functions
- High-level graphics functions for displaying large data sets
- A relatively new package with various functions for survival data, extending the methods available in the survival package
- New this year: miscellaneous R tools to simplify working with data types and formats, including functions for working with data frames and character strings
- Some functions for Kalman filters
- Miscellaneous 3D plots, including isosurfaces
- A new package with utilities for producing maps
- Various programming tools like ASCIIfy() to convert characters to ASCII and checkRVersion() to see if a newer version of R is available
- A grab bag of utilities including progress bars and function timers
by Joseph Rickert
While preparing for the DataWeek R Bootcamp that I conducted this week I came across the following gem. This code, based directly on a Max Kuhn presentation of a couple years back, compares the efficacy of two machine learning models on a training data set.
library(caret)         # train(), trainControl(), resamples() and bwplot() for resamples objects
library(doParallel)    # registerDoParallel(), getDoParWorkers()
# (trainData, the training data frame with a Class column, is assumed to have been created earlier)
#
# SET UP THE PARAMETER SPACE SEARCH GRID
ctrl <- trainControl(method = "repeatedcv",              # use repeated 10-fold cross validation
                     repeats = 5,                        # do 5 repetitions of 10-fold cv
                     summaryFunction = twoClassSummary,  # use AUC to pick the best model
                     classProbs = TRUE)
# Note that the default search grid selects 3 values of each tuning parameter
#
grid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),  # look at tree depths from 1 to 7
                    .n.trees = seq(10, 100, by = 5),         # let iterations go from 10 to 100
                    .shrinkage = c(0.01, 0.1))               # try 2 values of the learning rate parameter
#
# BOOSTED TREE MODEL
set.seed(1)
names(trainData)
trainX <- trainData[, 4:61]
registerDoParallel(4)        # register a parallel backend for train
getDoParWorkers()
system.time(gbm.tune <- train(x = trainX, y = trainData$Class,
                              method = "gbm",
                              metric = "ROC",
                              trControl = ctrl,
                              tuneGrid = grid,
                              verbose = FALSE))
#
# SUPPORT VECTOR MACHINE MODEL
#
set.seed(1)
registerDoParallel(4, cores = 4)
getDoParWorkers()
system.time(
  svm.tune <- train(x = trainX, y = trainData$Class,
                    method = "svmRadial",
                    tuneLength = 9,                   # 9 values of the cost function
                    preProc = c("center", "scale"),
                    metric = "ROC",
                    trControl = ctrl)                 # same as for gbm above
)
#
# COMPARE MODELS USING RESAMPLING
# Having set the seed to 1 before running gbm.tune and svm.tune we have generated
# paired samples for comparing models using resampling.
#
# The resamples function in caret collates the resampling results from the two models
rValues <- resamples(list(svm = svm.tune, gbm = gbm.tune))
rValues$values
#
# BOXPLOTS COMPARING RESULTS
bwplot(rValues, metric = "ROC")   # boxplot
After setting up a grid to search the parameter space of a model, the train() function from the caret package is used to train a generalized boosted regression model (gbm) and a support vector machine (svm). Setting the seed produces paired samples and enables the two models to be compared using the resampling technique described in Hothorn et al., "The design and analysis of benchmark experiments", Journal of Computational and Graphical Statistics (2005), vol. 14 (3), pp. 675-699.
The performance metric for the comparison is the ROC curve. From examining the boxplots of the sampling distributions for the two models it is apparent that, in this case, the gbm has the advantage.
Also, notice that the call to registerDoParallel() permits parallel execution of the training algorithms. (The taskbar showed all four cores of my laptop maxed out at 100% utilization.)
I chose this example because I wanted to show programmers coming to R for the first time that the power and productivity of R comes not only from the large number of machine learning models implemented, but also from the tremendous amount of infrastructure that package authors have built up, making it relatively easy to do fairly sophisticated tasks in just a few lines of code.
All of the code for this example along with the rest of my code from the DataWeek R Bootcamp is available on GitHub.
Google has just released a new package for R: CausalImpact. Amongst many other things, this package allows Google to resolve the classical conundrum: how can we assess the impact of an intervention (for example, the effect of an advertising campaign on website clicks) when we can't know what would have happened if we hadn't run the campaign? For a marketer, the worry is that the spike in clicks was partially or wholly the result of something unrelated (say, a general increase in web traffic) rather than your campaign.
The CausalImpact package uses Bayesian structural time-series models to resolve this question. All you need is a second time series to act as a "virtual" control, which is unaffected by your actions but which is still subject to the extraneous effects you're worried about. (For the marketing example, you might choose web clicks from a region where the campaign didn't run.) Then, you can model the extraneous effects and subtract them from your actual results, to see how things would have played out had the intervention not occurred.
In the chart below (from the Google Open Source blog post) you can see the results of the campaign in black, with the campaign launch at the dotted line. The blue line shows the estimated results had the campaign not run, clearly showing that it was effective.
Google uses R and the CausalImpact package to measure the return on investment on advertising campaigns its customers run:
We've been testing and applying structural time-series models for some time at Google. For example, we've used them to better understand the effectiveness of advertising campaigns and work out their return on investment. We've also applied the models to settings where a randomised experiment was available, to check how similar our effect estimates would have been without an experimental control.
Now that Google has made the CausalImpact package available, you can do the same for any kind of intervention, as long as you have a time series of results before and after the intervention, and a "control" time series where the intervention had no effect. Read more about the package and the methodology behind it at the link below.
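The interface itself is compact: you supply a matrix or data frame whose first column is the response series and whose remaining columns are the controls, along with the pre- and post-intervention periods. Here is a minimal sketch using simulated data, in the spirit of the package documentation (treat the details as illustrative rather than definitive):

library(CausalImpact)

set.seed(1)
x1 <- 100 + arima.sim(model = list(ar = 0.999), n = 100)   # control series
y  <- 1.2 * x1 + rnorm(100)                                # response series
y[71:100] <- y[71:100] + 10                                # simulated intervention effect
data <- cbind(y, x1)

pre.period  <- c(1, 70)     # observations before the intervention
post.period <- c(71, 100)   # observations after the intervention

impact <- CausalImpact(data, pre.period, post.period)
summary(impact)
plot(impact)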
Google Open Source Blog: CausalImpact: A new open-source package for estimating causal effects in time series
If you're looking for just the right package to solve your R problem, you could always browse through the list of available packages on CRAN. But with almost 6,000 entries, that's not going to be the most efficient process. And even then, many very useful packages aren't found on CRAN: there are more than 800 packages hosted on BioConductor and more than 200 commonly-used packages on GitHub (with more than 3 GitHub stars).
Rdocumentation.org, a handy website from the team at DataCamp, makes it much easier to find just the function you're looking for, whether it's found in a package on CRAN, BioConductor or GitHub. Just search for a keyword, and it returns all of the matching packages, plus the full help pages for each of the matching functions. Or, you can just browse the most popular packages, based on their download statistics from the RStudio CRAN mirror.
(Also check out this poster for an analysis of the top-ranked packages from July 2014.)
GitHub and bioconductor: New! Search GitHub and BioConductor packages on RDocumentation
by Joseph Rickert
I think we can be sure that when American botanist Edgar Anderson meticulously collected data on three species of iris in the early 1930s he had no idea that these data would produce a computational storm that would persist well into the 21st century. The calculations started, presumably by hand, when R. A. Fisher selected this data set to illustrate the techniques described in his 1936 paper on discriminant analysis. However, they really got going in the early 1970s when the pattern recognition and machine learning community began using it to test new algorithms or illustrate fundamental principles. (The earliest reference I could find was: Gates, G.W. "The Reduced Nearest Neighbor Rule". IEEE Transactions on Information Theory, May 1972, 431-433.) Since then, the data set (or one of its variations) has been used to test hundreds, if not thousands, of machine learning algorithms. The UCI Machine Learning Repository, which contains what is probably the “official” iris data set, lists over 200 papers referencing the iris data.
So why has the iris data set become so popular? Like most success stories, randomness undoubtedly plays a huge part. However, Fisher’s selecting it to illustrate a discrimination algorithm brought it to people's attention, and the fact that the data set contains three classes, only one of which is linearly separable from the other two, makes it interesting.
For similar reasons, the airlines data set used in the 2009 ASA Sections on Statistical Computing and Statistical Graphics Data Expo has gained a prominent place in the machine learning world and is well on its way to becoming the “iris data set for big data”. It shows up in all kinds of places. (In addition to this blog, it made its way into the RHIPE documentation and figures in several college course modeling efforts.)
Some key features of the airlines data set are:
An additional, really nice feature of the airlines data set is that it keeps getting bigger! RITA, the Research and Innovative Technology Administration Bureau of Transportation Statistics, continues to collect data which can be downloaded in .csv files. For your convenience, we have posted a 143M+ record version of the data set, containing all of the RITA records from 1987 through the end of 2012, on the Revolution Analytics test data site, available for download.
The following analysis from Revolution Analytics’ Sue Ranney uses this large version of the airlines data set and illustrates how a good model, driven with enough data, can reveal surprising features of a data set.
# Fit a Tweedie GLM
tm <- system.time(
  glmOut <- rxGlm(ArrDelayMinutes ~ Origin:Dest + UniqueCarrier + F(Year) + DayOfWeek:F(CRSDepTime),
                  data = airData,
                  family = rxTweedie(var.power = 1.15),
                  cube = TRUE,
                  blocksPerRead = 30)
)
tm
# Build a data frame for three airlines: Delta (DL), Alaska (AS), Hawaiian (HA)
airVarInfo <- rxGetVarInfo(airData)
predData <- data.frame(
  UniqueCarrier = factor(rep(c("DL", "AS", "HA"), times = 168),
                         levels = airVarInfo$UniqueCarrier$levels),
  Year = as.integer(rep(2012, times = 504)),
  DayOfWeek = factor(rep(c("Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun"), times = 72),
                     levels = airVarInfo$DayOfWeek$levels),
  CRSDepTime = rep(0:23, each = 21),
  Origin = factor(rep("SEA", times = 504), levels = airVarInfo$Origin$levels),
  Dest = factor(rep("HNL", times = 504), levels = airVarInfo$Dest$levels)
)
# Use the model to predict the arrival delay for the three airlines and plot
predDataOut <- rxPredict(glmOut, data = predData, outData = predData, type = "response")
rxLinePlot(ArrDelayMinutes_Pred ~ CRSDepTime | UniqueCarrier, groups = DayOfWeek,
           data = predDataOut, layout = c(3,1),
           title = "Expected Delay: Seattle to Honolulu by Departure Time, Day of Week, and Airline",
           xTitle = "Scheduled Departure Time",
           yTitle = "Expected Delay")
Here, rxGlm() fits a Tweedie generalized linear model that looks at arrival delay as a function of the interaction between origin and destination airports, carriers, year, and the interaction between days of the week and scheduled departure time. This function kicks off a considerable amount of number crunching, as Origin is a factor variable with 373 levels and Dest, also a factor, has 377 levels. The F() function makes Year and CRSDepTime factors "on the fly" as the model is being fit. The resulting model ends up with 140,852 coefficients, 8,626 of which are not NA. The calculation takes 12.6 minutes to run on a 5-node (4 cores and 16GB of RAM per node) IBM Platform LSF cluster.
The rest of the code uses the model to predict arrival delay for three airlines and plots the fitted values by day of the week and departure time.
It looks like Saturday is the best time to fly for these airlines. Note that none of the structure revealed in these curves was put into the model, in the sense that there are no polynomial terms in the model.
It will take a few minutes to download the zip file with the 143M airlines records, but please do, and let us know how your modeling efforts go.