by Joseph Rickert
The installr package has some really nice functions for working with the daily package download logs for the RStudio CRAN mirror, which RStudio graciously makes available at http://cran-logs.rstudio.com/. The following code uses the download_RStudio_CRAN_data() function to download a month's worth of .gz compressed daily log files into the test3 directory and then uses the function read_RStudio_CRAN_data() to read all of these files into a data frame. (The portion of the status output provided shows the files being read in one at a time.) Next, the function most_downloaded_packages() calculates that the top six downloads for the month were: Rcpp, stringr, ggplot2, stringi, magrittr and plyr.
    # CODE TO DOWNLOAD LOG FILES FROM RSTUDIO CRAN MIRROR
    # FIND MOST DOWNLOADED PACKAGE AND PLOT DOWNLOADS
    # FOR SELECTED PACKAGES
    # -----------------------------------------------------------------
    library(installr)
    library(ggplot2)
    library(data.table) # used by installr to combine the log files quickly
    # ----------------------------------------------------------------
    # Read data from RStudio site
    RStudio_CRAN_dir <- download_RStudio_CRAN_data(START = '2015-05-15', END = '2015-06-15',
                                                   log_folder = "C:/DATA/test3")
    # Read .gz compressed files from local directory
    RStudio_CRAN_data <- read_RStudio_CRAN_data(RStudio_CRAN_dir)

    #> RStudio_CRAN_data <- read_RStudio_CRAN_data(RStudio_CRAN_dir)
    #Reading C:/DATA/test3/2015-05-15.csv.gz ...
    #Reading C:/DATA/test3/2015-05-16.csv.gz ...
    #Reading C:/DATA/test3/2015-05-17.csv.gz ...
    #Reading C:/DATA/test3/2015-05-18.csv.gz ...
    #Reading C:/DATA/test3/2015-05-19.csv.gz ...
    #Reading C:/DATA/test3/2015-05-20.csv.gz ...
    #Reading C:/DATA/test3/2015-05-21.csv.gz ...
    #Reading C:/DATA/test3/2015-05-22.csv.gz ...

    dim(RStudio_CRAN_data)
    # [1] 8055660 10

    # Find the most downloaded packages
    pkg_list <- most_downloaded_packages(RStudio_CRAN_data)
    pkg_list
    #Rcpp stringr ggplot2 stringi magrittr plyr
    #125529 115282 103921 103727 102083 97183

    lineplot_package_downloads(names(pkg_list), RStudio_CRAN_data)

    # Look at plots for some packages
    barplot_package_users_per_day("checkpoint", RStudio_CRAN_data)
    #$total_installations
    #[1] 359
    barplot_package_users_per_day("Rcpp", RStudio_CRAN_data)
    #$total_installations
    #[1] 23832
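For readers without the logs at hand, the kind of tabulation behind a top-downloads list can be sketched in base R. This is not installr's implementation: the toy data frame and the helper name top_packages() are made up for illustration, and only the package column mirrors the real log format.

```r
# Toy stand-in for the log data; only the 'package' column is used here.
toy_logs <- data.frame(
  package = c("Rcpp", "ggplot2", "Rcpp", "plyr", "ggplot2", "Rcpp"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of installr): count rows per package
# and keep the n most frequent ones.
top_packages <- function(logs, n = 6) {
  counts <- sort(table(logs$package), decreasing = TRUE)
  counts[seq_len(min(n, length(counts)))]
}

top2 <- top_packages(toy_logs, n = 2)
print(top2)
```

Each row of the log is one download, so counting rows per package gives download counts rather than counts of distinct users.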
The function lineplot_package_downloads() produces a multiple time series plot for the top six packages:
The barplot_package_users_per_day() function provides download plots for individual packages. Here we contrast downloads for Revolution Analytics' checkpoint package and Rcpp.
Downloads for the checkpoint package look pretty uniform over the month. checkpoint is a relatively new, specialized package for dealing with reproducibility issues. The download pattern probably represents users discovering it. Rcpp, on the other hand, is essential to an incredible number of other R packages. The right-skewed plot most likely represents the tail end of the download cycle that started after Rcpp was upgraded on 5/1/15.
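The log files include an ip_id column, a daily identifier assigned to each downloading IP address, so one plausible way to approximate users per day is to count distinct ip_id values by date. The base R sketch below does this on made-up data. Whether barplot_package_users_per_day() computes its totals in exactly this way is an assumption, and users_per_day() is a hypothetical helper.

```r
# Toy log data with the three columns this sketch needs.
toy_logs <- data.frame(
  date    = c("2015-05-15", "2015-05-15", "2015-05-15", "2015-05-16", "2015-05-16"),
  package = c("Rcpp", "Rcpp", "checkpoint", "Rcpp", "Rcpp"),
  ip_id   = c(1L, 1L, 2L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: distinct ip_id values per date for one package.
users_per_day <- function(logs, pkg) {
  sub_logs <- logs[logs$package == pkg, ]
  tapply(sub_logs$ip_id, sub_logs$date, function(ids) length(unique(ids)))
}

rcpp_users <- users_per_day(toy_logs, "Rcpp")
print(rcpp_users)
```

Passing a result like rcpp_users to barplot() would give a per-day bar chart of the same general kind as the plots discussed above.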
All of this works well for small amounts of data. However, the fact that read_RStudio_CRAN_data() puts everything in a data frame presents a bit of a problem for working with longer time periods with the 6GB of RAM on my laptop. So, after downloading the files representing the period (5/28/14 to 5/28/15) to my laptop, I first converted the .gz compressed files to .csv files:
    # Convert .gz compressed files to .csv files
    # (pattern is a regular expression, so the dots are escaped)
    in_names <- list.files("C:/DATA/RStudio_logs_1yr_gz", pattern = "\\.csv\\.gz$", full.names = TRUE)
    # Drop the trailing ".gz" to get the output file names
    out_names <- sub("\\.gz$", "", in_names)
    length(in_names)
    for (i in seq_along(in_names)) {
      df <- read.csv(in_names[i])
      write.csv(df, out_names[i], row.names = FALSE)
    }
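The loop above parses every file into a data frame and writes it back out, so each file must fit in memory and write.csv() may format values (quoting, missing values) differently from the original. If plain decompression is all that is needed, one alternative is to stream the lines straight through. The following is a minimal base R sketch of that idea, demonstrated on a small temporary file; gunzip_lines() is a hypothetical helper, not something from installr or the original post.

```r
# Hypothetical helper: copy a .gz text file to an uncompressed file,
# reading and writing a fixed number of lines at a time.
gunzip_lines <- function(gz_path, out_path = sub("\\.gz$", "", gz_path),
                         chunk_lines = 100000L) {
  con_in <- gzfile(gz_path, open = "rt")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(out_path, open = "wt")
  on.exit(close(con_out), add = TRUE)
  repeat {
    lines <- readLines(con_in, n = chunk_lines)
    if (length(lines) == 0L) break
    writeLines(lines, con_out)
  }
  invisible(out_path)
}

# Self-contained demonstration on a temporary compressed file
tmp_gz <- file.path(tempdir(), "2014-05-28.csv.gz")
con <- gzfile(tmp_gz, open = "wt")
writeLines(c("date,package", "2014-05-28,Rcpp", "2014-05-28,plyr"), con)
close(con)

csv_path <- gunzip_lines(tmp_gz)
demo <- read.csv(csv_path, stringsAsFactors = FALSE)
print(demo)
```

The chunk size is an arbitrary choice here; it only bounds how many lines are held in memory at once.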
I then used the external memory algorithms in Revolution R Enterprise to work with the data on disk. First, rxImport() brings all of the .csv files into a single .xdf file and stores it on my laptop. (Note that the rxGetInfo() function indicates that the file has over 90 million rows.) Then, the super efficient rxCube() function tabulates the package counts.
    # REVOSCALE R CODE TO IMPORT A YEAR'S WORTH OF DATA
    data_dir <- "C:/DATA/RStudio_logs_1yr"
    in_names <- list.files(data_dir, pattern = "\\.csv\\.gz$", full.names = TRUE)
    # Drop the trailing ".gz" to get the names of the .csv files
    out_names <- sub("\\.gz$", "", in_names)
    #----------------------------------------------------
    # Import to .xdf file
    # Establish the column classes for the variables
    colInfo <- list(
      list(name = "date", type = "character"),
      list(name = "time", type = "character"),
      list(name = "size", type = "integer"),
      list(name = "r_version", type = "factor"),
      list(name = "r_arch", type = "factor"),
      list(name = "r_os", type = "factor"),
      list(name = "package", type = "factor"),
      list(name = "version", type = "factor"),
      list(name = "country", type = "factor"),
      list(name = "ip_id", type = "integer"))

    num_files <- length(out_names)
    out_file <- file.path(data_dir, "RStudio_logs_1yr")

    # The first file creates the .xdf file; later files are appended to it
    append = FALSE
    for (i in seq_len(num_files)) {
      rxImport(inData = out_names[i], outFile = out_file,
               colInfo = colInfo, append = append, overwrite = TRUE)
      append = TRUE
    }

    # Look at a summary of the imported data
    rxGetInfo(out_file)
    #File name: C:\DATA\RStudio_logs_1yr\RStudio_logs_1yr.xdf
    #Number of observations: 90200221
    #Number of variables: 10

    # Long form tabulation
    cube1 <- rxCube(~ package, data = out_file)
    # Computation time: 5.907 seconds.
    cube1 <- as.data.frame(cube1)
    sort1 <- rxSort(cube1, decreasing = TRUE, sortByVars = "Counts")
    #Time to sort data file: 0.078 seconds
    write.csv(head(sort1, 100), "Top_100_Packages.csv")
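For readers without Revolution R Enterprise, the general idea of tabulating data that does not fit in memory can be illustrated in base R by reading a file a few rows at a time and accumulating counts. This sketch is not how rxCube() is implemented and will be far slower; chunked_counts() is a hypothetical helper, and it assumes a simple comma-separated file with a header row and no line breaks inside fields.

```r
# Hypothetical helper: tabulate one column of a CSV file chunk by chunk,
# so that only chunk_rows rows are held in memory at any time.
chunked_counts <- function(csv_path, column = "package", chunk_rows = 100000L) {
  con <- file(csv_path, open = "rt")
  on.exit(close(con), add = TRUE)
  # Read the header line once and strip any quotes around column names
  header <- strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1]]
  header <- gsub('"', "", header, fixed = TRUE)
  totals <- integer(0)
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break
    chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                      stringsAsFactors = FALSE)
    tab <- table(chunk[[column]])
    counts <- setNames(as.integer(tab), names(tab))
    # Add this chunk's counts to the running totals
    for (nm in names(counts)) {
      old <- if (nm %in% names(totals)) totals[[nm]] else 0L
      totals[[nm]] <- old + counts[[nm]]
    }
  }
  sort(totals, decreasing = TRUE)
}

# Self-contained demonstration on a small temporary file
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("date,package",
             "2014-05-28,Rcpp", "2014-05-28,plyr",
             "2014-05-29,Rcpp", "2014-05-29,Rcpp",
             "2014-05-29,ggplot2"), tmp_csv)

pkg_counts <- chunked_counts(tmp_csv, chunk_rows = 2L)
print(pkg_counts)
```

The tiny chunk_rows value in the demonstration is only there to force several passes through the loop; a real file would use a much larger chunk.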
Here are the download counts for the top 100 packages for the period (5/28/14 to 5/28/15).
You can download this data here: Download Top_100_Packages