by Peter Hickey (@PeteHaitch)

One of the keys to R's success as a software environment for data analysis is the availability of user-contributed packages. Most useRs will be familiar with (and very grateful for) the Comprehensive R Archive Network (CRAN). The packages available on CRAN, nearly 7000 at last count, cover everything from common data analysis tasks, such as importing data and plotting, through to more specialised tasks, such as parsing data from the web, analysing financial time series, or analysing data from clinical trials. What may be less familiar to useRs is another large R package repository and software development project, Bioconductor.

Bioconductor is an open source, open development software project that focuses on providing tools for the analysis of high-throughput genomic data, an area of research known variously as bioinformatics or computational biology. Examples of such data include DNA sequences of human genomes and measurements of gene expression levels in hundreds of tumours. Recent advances in technology mean that such data are a central part of modern biological research, be it medical, agricultural, or basic science.

The Bioconductor project began in 2001, initiated by Robert Gentleman, one of the originators of the R language. Nowadays there is a core team of nine developers, led by Martin Morgan, who develop some of the important core packages and maintain the infrastructure of the project. As with CRAN, it is the user-contributed packages that make the Bioconductor project the valuable resource that it is. There are more than 1000 software packages in the most recent Bioconductor release. In addition to these packages, Bioconductor includes more than 900 annotation packages and 200 experiment data packages. Annotation packages help streamline the oft-tedious bookkeeping and annotation of data associated with bioinformatics research, while the experiment data packages contain processed data and are a valuable teaching resource.

Since its establishment, two of the main goals of the Bioconductor project have been reproducible research and high-quality documentation. In support of these aims, Bioconductor releases packages on a biannual schedule tied to the most recent 'release' version of R, and each Bioconductor software package must contain a vignette. A vignette is a document that provides a task-oriented description of package functionality, more like a book chapter than the technical and often terse function-level documentation accessible via `?` or `help()` at the R console. Some of these vignettes, such as the User's Guide that accompanies the `limma` package (pdf), include multiple case studies and carefully explain the statistical foundations of the methods implemented in the package. There is also a dedicated support forum containing many years' worth of questions on common problems, with answers from experts in the field.
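
Vignettes can be listed and opened without leaving the console. A minimal sketch (the `grid` package is part of base R, so the first call runs anywhere; the limma lines assume that package is installed):

```r
# Vignettes that ship with an installed package; grid is part of
# the base R distribution
v <- vignette(package = "grid")
head(v$results[, c("Item", "Title")])

# For a Bioconductor package such as limma (assuming it is installed),
# the same call lists its vignettes, and vignette("name", "limma")
# opens one by name. browseVignettes() shows every installed
# package's vignettes in a web browser.
```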

Bioconductor has recently begun publishing separate "workflows", along with teaching materials used in courses and conferences, to help users learn how to analyse high-throughput biological data. These are excellent resources for those wishing to learn more about what is available in Bioconductor and how to get the most from the project. The website also hosts detailed instructions on installing Bioconductor on your local machine or trying out a preconfigured setup using Amazon Machine Images or Docker images.
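
At the time of writing, installation goes through the `biocLite()` script rather than `install.packages()`. A sketch (the two package names are example choices, and the `source()` call needs a network connection):

```r
pkgs <- c("limma", "GenomicRanges")   # example choices

# Fetch the Bioconductor installer script and install the packages;
# biocLite() matches package versions to your Bioconductor release
source("https://www.bioconductor.org/biocLite.R")
biocLite(pkgs)

# Re-running biocLite() with no arguments later updates installed
# Bioconductor packages to the versions matching your release
```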

The teaching resources have been further bolstered by material from the recent Bioconductor meeting, held in Seattle, USA on July 21-22. This annual meeting is a great mix of basic science and data analysis methodology talks, presentations on interesting Bioconductor packages, and afternoon workshops where you can learn from the developers themselves. All the workshop materials, and most of the slides from the presentations, can be found here. The meeting was preceded by Developer Day, a less formal get-together including talks and brainstorming sessions about the current state and future directions of the Bioconductor project. There is also an annual European Bioconductor meeting and, for the first time, an Asia-Pacific Bioconductor Developer's Meeting and workshop, to be held as part of GIW/InCoB 2015 in Tokyo, Japan on September 8-11.

With its more specialised focus than CRAN, Bioconductor strongly encourages package developers to make use of the excellent infrastructure provided by existing Bioconductor packages. The intention is to reduce the number of times the wheel is re-invented, as well as to increase the interoperability of objects and methods from different packages. The source code of these core packages can make useful reading for R developers, particularly those wishing to learn more about the S4 object-oriented system. This source code can be accessed using Subversion or via the GitHub mirror of all Bioconductor packages.

Bioconductor has been, and continues to be, an incredibly useful resource for people analysing high-throughput genomic data. The development and maintenance of the project is a considerable undertaking, and a great debt is owed to those who established the project and those who continue its day-to-day running. But just as important is the community of users and developers. It is this community that makes such a project succeed and makes it exciting to be a part of.

by Joseph Rickert

New R packages just keep coming. The following plot, constructed with information from the monthly files on Dirk Eddelbuettel's CRANberries site, shows the number of new packages released to CRAN each month between January 1, 2013 and July 27, 2015 (not quite 31 months).

This is amazing growth! The mean rate is about 125 new packages a month. How can anyone keep up? The direct approach, of course, would be to become an avid, frequent reader of CRANberries. Every day the CRAN:New link presents the relentless roll call of new arrivals. However, that level of tedium is not for everyone.

At MRAN we are attempting to help with the problem of keeping up with what's new through the old-fashioned (pre-machine learning) practice of making some idiosyncratic, but not completely capricious, human-generated recommendations. With every new release of RRO we publish on the Package Spotlight page brief descriptions of packages in three categories: New Packages, Updated Packages and GitHub packages. None of these lists is intended to be comprehensive or complete in any sense.

The New Packages list includes new packages that have been released to CRAN since the previous release of RRO. My general rules for selecting packages for this list are: (1) they should be tools or infrastructure packages that may prove useful to a wide audience, or (2) they should involve a new algorithm or statistical technique that I think will be of interest to statisticians and data scientists working in many different areas. The following two packages respectively illustrate these two selection rules:

metricsgraphics V0.8.5: provides an htmlwidgets interface to the MetricsGraphics.js D3-based JavaScript library for plotting time series data. The vignette shows what it can do.

rotationForest V0.1: provides an implementation of the new Rotation Forest binary ensemble classifier described in the paper by Rodriguez et al.

I also tend to favor packages that are backed by a vignette, paper or url that provides additional explanatory material.

Of course, any scheme like this is limited by the knowledge and biases of the curator. I am particularly worried about missing packages targeted towards biotech applications that may indeed have broader appeal. The way to mitigate the shortcomings of this approach is to involve more people. So if you come across a new package that you think may have broad appeal send us a note and let us know why (open@revolutionanalytics.com).

The Updated Package list is constructed with the single criterion that the fact that the package was updated should convey news of some sort. Most of the very popular and useful packages are updated frequently, some approaching monthly updates. So, even though they are important packages, the fact that they have been updated is generally no news at all. It is also the case that package authors generally do not put much effort into describing their updates. In my experience poking around CRAN I have found that the NEWS directories for packages go mostly unused. (An exemplary exception is the NEWS for ggplot2.)
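
A package's NEWS file, when it exists, can be read without leaving R via `news()`. A quick sketch (the ggplot2 line assumes that package is installed; the live example queries R's own NEWS, which ships with every installation):

```r
# Query R's own NEWS database for entries mentioning "vignette"
n <- news(grepl("vignette", Text, ignore.case = TRUE))
head(n[, c("Version", "Text")])

# The same function reads an installed package's NEWS file:
# news(package = "ggplot2")
```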

Finally, the GitHub list is mostly built from repositories that are trending on GitHub with a few serendipitous finds included.

We would be very interested in learning how you keep up with new R packages. Please leave us a comment.

Post Script:

Note that the information from CRANberries about CRAN's new, updated and removed packages is also available as an RSS feed: Download Index.

The code for generating the plot may be found here: Download New_packages

Also, we have written quite a few posts over the last year or so about the difficulties of searching for relevant packages on CRAN. Here are links to three recent posts:

How many packages are there really on CRAN?

Fishing for packages in CRAN

Working with R Studio CRAN Logs

by John Mount

Data Scientist, Win-Vector LLC

R has a number of very good packages for manipulating and aggregating data (plyr, sqldf, RevoScaleR, data.table, and more), but when it comes to accumulating results the beginning R user is often at sea. The R execution model is a bit exotic, so many R users are uncertain which methods of accumulating results are efficient and which are inefficient.

In this latest "R as it is" we will quickly become expert at efficiently accumulating results in R. To read more please click here.

by Joseph Rickert

The XXXV Sunbelt Conference of the International Network for Social Network Analysis (INSNA) was held last month in Brighton, UK. (And I am still bummed out that I was not there.)

A run of 35 conferences is impressive indeed, but the social network analysts have been at it for even longer than that, and today they are still on the cutting edge of the statistical analysis of networks. The conference presentations have not been posted yet, but judging from the conference workshops program there was plenty of R action in Brighton.

Social network analysis at this level involves some serious statistics and mastering a very specialized vocabulary. However, it seems to me that some knowledge of this field will become important to everyone working in data science. Supervised learning models and statistical models that assume independence among the predictors will most likely represent only the first steps that data scientists will take in exploring the complexity of large data sets.

And, maybe of equal importance, is the fact that working with network data is great fun. Moreover, software tools exist in R and other languages that make it relatively easy to get started with just a few pointers.

From a statistical inference point of view, what you need to know is that Exponential Random Graph Models (ERGMs) are at the heart of modern social network analysis. An ERGM is a statistical model that enables one to predict the probability of observing a given network from a specified class of networks, based on both observed structural properties of the network and covariates associated with its vertices. The exponential part of the name comes from the exponential family of functions used to specify the form of these models. ERGMs are analogous to generalized linear models except that ERGMs take into account the dependency structure of ties (edges) between vertices. For a rigorous definition of ERGMs see sections 3 and 4 of the paper by Hunter et al. in the 2008 special issue of the JSS, or Chapter 6 in Kolaczyk and Csárdi's book *Statistical Analysis of Network Data with R*. (I have found this book to be very helpful and highly recommend it. Not only does it provide an accessible introduction to ERGMs, it also begins with basic network statistics and the igraph package and then goes on to introduce some more advanced topics, such as modeling processes that take place on graphs and network flows.)
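
To make the GLM analogy concrete, here is a minimal sketch of the model-specification style, using the classic Florentine marriage network that ships with the ergm package (the fitting step assumes statnet/ergm is installed):

```r
# Probability of a tie modeled via two structural statistics:
# the edge count and the triangle count of the network
model <- flomarriage ~ edges + triangle

if (requireNamespace("ergm", quietly = TRUE)) {
  library(ergm)        # also loads the network package
  data(florentine)     # flomarriage and flobusiness networks
  fit <- ergm(model)
  summary(fit)         # coefficients are log-odds terms, as in a GLM
}
```

The right-hand side terms (edges, triangle, and many others) are the observed structural statistics described above; vertex covariates enter through terms such as `nodecov()`.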

In the R world, the place to go to work with ERGMs is statnet.org. statnet is a suite of 15 or so CRAN packages that provides a complete infrastructure for working with ERGMs. statnet.org is a real gem of a site, containing documentation for all of the statnet packages along with tutorials, presentations from past Sunbelt conferences, and more.

I am particularly impressed with the Shiny-based GUI for learning how to fit ERGMs. Try it out on the Shiny webpage. Click the **Get Started** button. Then select "built-in network" and "ecoli 1" under **File type**. After that, click the right arrow in the upper right corner. You should see a plot of the ecoli graph.


You will be fitting models in no time. And since the commands used to drive the GUI are similar to specifying the parameters for the functions in the ergm package you will be writing your own R code shortly after that.

by Andrie de Vries

My experience of UseR!2015 drew to an end shortly after I gave a Kaleidoscope presentation discussing "The Network Structure of CRAN".

My talk drew heavily on two previous blog posts, Finding the essential R packages using the pagerank algorithm and Finding clusters of CRAN packages using igraph.

However, in this talk I went further, attempting to create a single visualization of all ~6,700 packages on CRAN. To do this, I did all the analysis in R, then exported a GraphML file, and used Gephi to create a network visualization.

My first version of the graph was in a single colour, where each node is a package and each edge represents a dependency on another package. Although this graph indicates dense areas, it reveals little of the deeper structure of the network.

To examine the structure more closely, I did two things:

- Used the page.rank() algorithm to compute package importance, then changed the font size so that more "important" packages have a bigger font
- Used the walktrap.community() algorithm to assign colours to "clusters". This algorithm uses random walks of a short length to find clusters of densely connected nodes
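
The intuition behind page.rank() can be seen in a few lines of base R: repeatedly redistribute importance along edges, mixed with a uniform "teleport" term. This is a toy sketch of the idea, not igraph's implementation:

```r
# Toy dependency graph: packages 1 and 2 depend on 3, and 3 on 1
edges <- matrix(c(1, 3,
                  2, 3,
                  3, 1), ncol = 2, byrow = TRUE)

n <- 3
# Column-stochastic transition matrix: M[i, j] = P(move j -> i)
M <- matrix(0, n, n)
for (k in seq_len(nrow(edges))) {
  from <- edges[k, 1]; to <- edges[k, 2]
  M[to, from] <- 1 / sum(edges[, 1] == from)
}

d <- 0.85                # damping factor, as in PageRank
r <- rep(1 / n, n)       # start from a uniform distribution
for (i in 1:100) r <- d * (M %*% r) + (1 - d) / n

round(as.vector(r), 3)
# Package 3, with two reverse dependencies, ranks highest
```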

This image (click to enlarge) quite clearly highlights several clusters:

- MASS, in yellow. This is a large cluster of packages that includes lattice and Matrix, together with many others that seem to expose statistical functionality
- Rcpp, in light blue. Rcpp allows any package or script to use C++ for performance-critical code
- ggplot2, in darker blue. This cluster, sometimes called the Hadleyverse, contains packages such as plyr, dplyr and their dependencies, e.g. scales and RColorBrewer.
- sp, in green. This cluster contains a large number of packages that expose spatial statistics features, including spatstat, maps and mapproj

It turns out that Rcpp has a slightly higher page rank than MASS. This made Dirk Eddelbuettel very happy:

You can find my slides at SlideShare and my source code on github.

Finally, my thanks to Gabor Csardi, maintainer of the igraph package, who listened to my ideas and gave helpful hints prior to the presentation.

by Joseph Rickert

The installr package has some really nice functions for working with the daily package download logs for the RStudio CRAN mirror, which RStudio graciously makes available at http://cran-logs.rstudio.com/. The following code uses the download_RStudio_CRAN_data() function to download a month's worth of .gz compressed daily log files into the test3 directory and then uses the function read_RStudio_CRAN_data() to read all of these files into a data frame. (The portion of the status output shown below shows the files being read in one at a time.) Next, the function most_downloaded_packages() calculates that the top six downloads for the month were: Rcpp, stringr, ggplot2, stringi, magrittr and plyr.

```r
# CODE TO DOWNLOAD LOG FILES FROM RSTUDIO CRAN MIRROR,
# FIND MOST DOWNLOADED PACKAGES AND PLOT DOWNLOADS
# FOR SELECTED PACKAGES
# -----------------------------------------------------------------
library(installr)
library(ggplot2)
library(data.table)   # for downloading
# -----------------------------------------------------------------
# Read data from RStudio site
RStudio_CRAN_dir <- download_RStudio_CRAN_data(START = '2015-05-15',
                                               END = '2015-06-15',
                                               log_folder = "C:/DATA/test3")
# Read .gz compressed files from local directory
RStudio_CRAN_data <- read_RStudio_CRAN_data(RStudio_CRAN_dir)
# Reading C:/DATA/test3/2015-05-15.csv.gz ...
# Reading C:/DATA/test3/2015-05-16.csv.gz ...
# Reading C:/DATA/test3/2015-05-17.csv.gz ...
# Reading C:/DATA/test3/2015-05-18.csv.gz ...
# Reading C:/DATA/test3/2015-05-19.csv.gz ...
# Reading C:/DATA/test3/2015-05-20.csv.gz ...
# Reading C:/DATA/test3/2015-05-21.csv.gz ...
# Reading C:/DATA/test3/2015-05-22.csv.gz ...

dim(RStudio_CRAN_data)
# [1] 8055660      10

# Find the most downloaded packages
pkg_list <- most_downloaded_packages(RStudio_CRAN_data)
pkg_list
#   Rcpp stringr ggplot2 stringi magrittr   plyr
# 125529  115282  103921  103727   102083  97183

lineplot_package_downloads(names(pkg_list), RStudio_CRAN_data)

# Look at plots for some packages
barplot_package_users_per_day("checkpoint", RStudio_CRAN_data)
# $total_installations
# [1] 359
barplot_package_users_per_day("Rcpp", RStudio_CRAN_data)
# $total_installations
# [1] 23832
```

The function lineplot_package_downloads() produces a multiple time series plot for the top five packages:

and the barplot_package_users_per_day() function provides download plots. Here we contrast downloads for the Revolution Analytics' checkpoint package and Rcpp.

Downloads for the checkpoint package look pretty uniform over the month. checkpoint is a relatively new, specialized package for dealing with reproducibility issues, and the download pattern probably represents users discovering it. Rcpp, on the other hand, is essential to an incredible number of other R packages. The right-skewed plot most likely represents the tail end of the download cycle that started after Rcpp was upgraded on 5/1/15.

All of this works well for small amounts of data. However, the fact that read_RStudio_CRAN_data() puts everything in a data frame presents a bit of a problem for working with longer time periods with the 6GB of RAM on my laptop. So, after downloading the files covering the period 5/28/14 to 5/28/15 to my laptop,

```r
# Convert .gz compressed files to .csv files
in_names <- list.files("C:/DATA/RStudio_logs_1yr_gz",
                       pattern = "*.csv.gz", full.names = TRUE)
out_names <- sapply(strsplit(in_names, ".g", fixed = TRUE), "[[", 1)
length(in_names)
for (i in 1:length(in_names)) {
  df <- read.csv(in_names[i])
  write.csv(df, out_names[i], row.names = FALSE)
}
```

I used the external memory algorithms in Revolution R Enterprise to work with the data on disk. First, rxImport() brings all of the .csv files into a single .xdf file and stores it on my laptop. (Note that the rxGetInfo() function indicates that the file has over 90 million rows.) Then, I used the super-efficient rxCube() function to tabulate the package counts.

```r
# REVOSCALER CODE TO IMPORT A YEAR'S WORTH OF DATA
data_dir <- "C:/DATA/RStudio_logs_1yr"
in_names <- list.files(data_dir, pattern = "*.csv.gz", full.names = TRUE)
out_names <- sapply(strsplit(in_names, ".g", fixed = TRUE), "[[", 1)
#----------------------------------------------------
# Import to .xdf file
# Establish the column classes for the variables
colInfo <- list(
  list(name = "date", type = "character"),
  list(name = "time", type = "character"),
  list(name = "size", type = "integer"),
  list(name = "r_version", type = "factor"),
  list(name = "r_arch", type = "factor"),
  list(name = "r_os", type = "factor"),
  list(name = "package", type = "factor"),
  list(name = "version", type = "factor"),
  list(name = "country", type = "factor"),
  list(name = "ip_id", type = "integer"))

num_files <- length(out_names)
out_file <- file.path(data_dir, "RStudio_logs_1yr")
append <- FALSE
for (i in 1:num_files) {
  rxImport(inData = out_names[i], outFile = out_file,
           colInfo = colInfo, append = append, overwrite = TRUE)
  append <- TRUE
}

# Look at a summary of the imported data
rxGetInfo(out_file)
```

```r
# File name: C:\DATA\RStudio_logs_1yr\RStudio_logs_1yr.xdf
# Number of observations: 90200221
# Number of variables: 10

# Long form tabulation
cube1 <- rxCube(~ package, data = out_file)
# Computation time: 5.907 seconds.
cube1 <- as.data.frame(cube1)
sort1 <- rxSort(cube1, decreasing = TRUE, sortByVars = "Counts")
# Time to sort data file: 0.078 seconds
write.csv(head(sort1, 100), "Top_100_Packages.csv")
```

Here are the download counts for top 100 packages for the period (5/28/14 to 5/28/15).

You can download this data here: Download Top_100_Packages

by Bill Jacobs, Director Technical Sales, Microsoft Advanced Analytics

In the course of working with our Hadoop users, we are often asked, what's the best way to integrate R with Hadoop?

The answer, in nearly all cases is, It depends.

Alternatives present themselves, ranging from open source R on workstations to parallelized commercial products like Revolution R Enterprise. Between these extremes lies a range of options with differing abilities to scale data size, performance, capability and ease of use.

And so, the right choice or choices depends on your data size, budget, skill, patience and governance limitations.

In this post, I’ll summarize the alternatives using pure open source R and some of their advantages. In a subsequent post, I’ll describe the options for achieving even greater scale, speed, stability and ease of development by combining open source and commercial technologies.

These two posts are written to help current R users who are novices at Hadoop understand and select solutions to evaluate.

As with most things open source, the first consideration is of course monetary. Isn't it always? The good news is that there are multiple alternatives that are free, and additional capabilities under development in various open source projects.

We generally see four options for building R-to-Hadoop integration using entirely open source stacks.

This baseline approach's greatest advantages are simplicity and cost. It's free. End-to-end free. What else in life is?

Through packages that Revolution contributed to open source, including rhdfs and rhbase, R users can directly ingest data from both the HDFS file system and the HBase database subsystems in Hadoop. Both connectors are part of the RHadoop project created and maintained by Revolution and are a go-to choice.

Additional options exist as well. The RHive package executes Hive’s HQL SQL-like query language directly from R, and provides functions for retrieving metadata from Hive such as database names, table names, column names, etc.

The RHive package, in particular, has the advantage that it allows some data operations to be pushed down into Hadoop, avoiding data movement and parallelizing operations for big speed increases. Similar "push-down" can be achieved with rhbase as well. However, neither is a particularly rich environment, and invariably, complex analytical problems will reveal some gaps in capability.
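
A sketch of what push-down looks like with RHive. The function names come from the RHive package; the host and the `cran_logs` table are hypothetical placeholders:

```r
# Build the HQL once; the cran_logs table name is hypothetical
top_n_query <- function(table, n = 10) {
  sprintf(
    "SELECT package, COUNT(*) AS downloads FROM %s
     GROUP BY package ORDER BY downloads DESC LIMIT %d",
    table, n)
}

if (requireNamespace("RHive", quietly = TRUE)) {
  library(RHive)
  rhive.init()
  rhive.connect(host = "hadoop-master")   # placeholder host
  # The GROUP BY executes inside Hadoop; only the small result
  # set comes back to R as a data frame
  top_pkgs <- rhive.query(top_n_query("cran_logs"))
  rhive.close()
}
```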

Beyond the somewhat limited push-down capabilities, R is best at working on modest data sampled from HDFS, HBase or Hive, and in this way current R users can get going with Hadoop quickly.

Once you tire of R's memory barriers on your laptop, the obvious next path is a shared server. With today's technologies, you can equip a powerful server for only a few thousand dollars and easily share it between a few users. Using Windows or Linux with 256GB or 512GB of RAM, R can be used to analyze files into the hundreds of gigabytes, albeit not as fast as perhaps you'd like.

Like option 1, R on a shared server can also leverage push-down capabilities of the rhbase and rhive packages to achieve parallelism and avoid data movement. However, as with workstations, the pushdown capabilities of rhive and rhbase are limited.

And of course, while lots of RAM keeps the dreaded out-of-memory errors at bay, it does little for compute performance, and it depends on sharing skills learned [or perhaps not learned] in kindergarten. For these reasons, consider a shared server a great add-on to R on workstations, but not a complete substitute.

Replacing the CRAN download of R with the Revolution R Open (RRO) distribution enhances performance further. RRO is, like R itself, open source, 100% R, and free to download. It accelerates math computations using the Intel Math Kernel Library (MKL) and is 100% compatible with the algorithms in CRAN and other repositories like Bioconductor. No changes are required to R scripts, and the acceleration the MKL offers varies from negligible to an order of magnitude for scripts making intensive use of certain math and linear algebra primitives. You can anticipate that RRO can double your average performance if you're doing math operations in the language.
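
A quick way to see whether the MKL-backed BLAS is paying off is to time a BLAS-bound operation under plain R and again under RRO (the matrix size here is arbitrary):

```r
set.seed(42)
n <- 1000                        # large enough to exercise the BLAS
A <- matrix(rnorm(n * n), n, n)
B <- matrix(rnorm(n * n), n, n)

# Compare the elapsed times across installations
system.time(C <- A %*% B)            # dense matrix multiply
system.time(chol(crossprod(A)))      # Cholesky of A'A
```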

As with options 1 and 2, Revolution R Open can be used with connectors like rhdfs, and can connect and push work down into Hadoop through rhbase and rhive.

Once you find that your problem set is too big or your patience is being taxed on a workstation or server, and the limitations of rhbase and RHive push-down are impeding progress, you're ready to run R inside of Hadoop.

The open source RHadoop project that includes rhdfs, rhbase and plyrmr also includes rmr2, a package that enables R users to build Hadoop map and reduce operations using R functions. Using mappers, R functions are applied to all of the data blocks that compose an HDFS file, an HBase table or other data set, and the results can be sent to a reducer, also an R function, for aggregation or analysis. All work is conducted inside of Hadoop but is built in R.

Let's be clear. Applying R functions to each HDFS file segment in parallel is a great way to accelerate computation. But for most users, it is the avoidance of moving data that really boosts performance. To achieve this, rmr2 applies R functions to the data residing on the Hadoop nodes rather than moving the data to where R resides.
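
In rmr2's API, that pattern looks roughly like the following count-downloads-per-package job. The `keyval()`, `mapreduce()`, `make.input.format()` and `from.dfs()` names come from rmr2; the HDFS path and the `package` field are hypothetical:

```r
# Map: each mapper sees a chunk of log records on its own node and
# emits (package, 1) pairs; keyval() is rmr2's key-value constructor
map_fn <- function(k, records) keyval(records$package, 1L)

# Reduce: sums the ones emitted for a single package
reduce_fn <- function(package, counts) keyval(package, sum(counts))

if (requireNamespace("rmr2", quietly = TRUE)) {
  library(rmr2)
  counts <- mapreduce(
    input        = "/logs/cran",            # hypothetical HDFS path
    input.format = make.input.format("csv"),
    map          = map_fn,
    reduce       = reduce_fn)
  result <- from.dfs(counts)                # small summary back to R
}
```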

While rmr2 gives essentially unlimited capabilities, as a data scientist or statistician your thoughts will soon turn to computing entire algorithms in R on large data sets. Using rmr2 in this way complicates development, because the R programmer must write the entire logic of the desired algorithm or adapt existing CRAN algorithms, then validate that the algorithm is accurate and reflects the expected mathematical result, and write code for the myriad corner cases such as missing data.

rmr2 requires coding on your part to manage parallelization. This may be trivial for data transformation operations, aggregates, etc., or quite tedious if you’re trying to train predictive models or build classifiers on large data.

While rmr2 can be more tedious than other approaches, it is not untenable, and most R programmers will find rmr2 much easier than resorting to Java-based development of Hadoop mappers and reducers. While somewhat tedious, it is (a) fully open source, (b) helps parallelize computation to address larger data sets, (c) skips painful data movement, (d) broadly used, so you'll find help available, and (e) free. Not bad.

rmr2 is not the only option in this category: a similar package called rhipe is also available and provides similar capabilities. rhipe is described here and here and is downloadable from GitHub.

The range of open source-based options for using R with Hadoop is expanding. The Apache Spark community, for example, is rapidly improving R integration via the predictably named SparkR. Today, SparkR provides access to Spark from R, much as rmr2 and rhipe do for Hadoop MapReduce.

We expect that, in the future, the SparkR team will add support for Spark's MLlib machine learning library, providing execution directly from R. Availability dates haven't been widely published.

Perhaps the most exciting observation is that R has become "table stakes" for platform vendors. Our partners at Cloudera, Hortonworks, MapR and others, along with database vendors, are all keenly aware of the dominance of R among the large and growing data science community, and of R's importance as a means to extract insights and value from the burgeoning data repositories built atop Hadoop.

In a subsequent post, I’ll review the options for creating even greater performance, simplicity, portability and scale available to R users by expanding the scope from open source only solutions to those like Revolution R Enterprise for Hadoop.

by Joseph Rickert

It is incredibly challenging to keep up to date with R packages. As of today (6/16/15), there are 6,789 listed on CRAN. Of course, the CRAN Task Views are probably the best resource for finding what's out there. A tremendous amount of work goes into maintaining and curating these pages, and we should all be grateful for the expertise, dedication and efforts of the task view maintainers. But R continues to grow at a tremendous rate. (Have a look at the growth curve in Bob Muenchen's 5/22/15 post R Now Contains 150 Times as Many Commands as SAS.) CRANberries, a site that tracks new packages and package updates, indicates that over the last few months the list of R packages has been growing by about 100 packages per month. How can anybody hope to keep current?

So, on any given day, expect that finding out which R packages pertain to a particular topic will require some work. What follows is a beginner's guide to fishing for packages in CRAN. This example looks for "Bayesian" packages using some simple web page scraping and elementary text mining.

The Bayesian Inference Task View lists 144 packages. This is probably everything that is really important, but let's see what else is to be found that has anything at all to do with Bayesian inference. In the first block of code, R's available.packages() function fetches the list of packages available from my Windows PC. (This is an extremely interesting function, and I don't do it justice here.) Then this list is used to scrape the package descriptions from the various package web pages. The loop takes some time to run, so I saved the package descriptions both in a .csv file and in a .RData workspace.

```r
library(svTools)
library(RCurl)
library(tm)
#-----------------------------------------
# TWO HELPER FUNCTIONS
# Function to get package description from CRAN package page
getDesc <- function(package){
  l1 <- regexpr("</h2>", package)
  ind1 <- as.integer(l1[[1]]) + 9
  l2 <- regexpr("Version", package)
  ind2 <- as.integer(l2[[1]]) - (46 + nchar("package"))
  desc <- substring(package, ind1, ind2)
  return(desc)
}
# Function to get CRAN package page
getPackage <- function(name){
  url <- paste("http://cran.r-project.org/web/packages/", name,
               "/index.html", sep = "")
  txt <- getURL(url, ssl.verifypeer = FALSE)
  return(txt)
}
#--------------------------------------------
# SCRAPE PACKAGE DATA FROM CRAN
# Get the list of R packages
packages <- as.data.frame(available.packages())
head(packages)
dim(packages)
pkgNames <- rownames(packages)
rm(packages)   # Don't need this any more
pkgDesc <- vector()
for (i in 1:length(pkgNames)){
  pkgDesc[i] <- getDesc(getPackage(pkgNames[i]))
}
length(pkgDesc)   # 6598
#----------------------------------------------
# SOME HOUSEKEEPING
# cranP <- data.frame(pkgNames, pkgDesc)
# write.csv(cranP, "C:/DATA/CRAN/CRAN_pkgs_6_15_15")
# save.image("pkgs.RData")
# load("pkgs.RData")
```

When I did this a few days ago, 6,598 packages were available. The next section of code turns the vector of package descriptions into a document corpus and creates a document term matrix with a row for each package and 20,781 terms. Taking the transpose of the term matrix makes it easier to see what is going on. As the small portion of the matrix shown below illustrates, the matrix is extremely sparse (only a single 1 shows up) and most of the terms are pretty much useless. Removing the sparse terms cuts the matrix down to only 372 terms.

```r
# SOME SIMPLE TEXT MINING
# Make a corpus out of package descriptions
pCorpus <- VCorpus(VectorSource(pkgDesc))
pCorpus
inspect(pCorpus[1:3])

# Function to prepare corpus
prepC <- function(corpus){
  c <- tm_map(corpus, stripWhitespace)
  c <- tm_map(c, content_transformer(tolower))
  c <- tm_map(c, removeWords, stopwords("english"))
  c <- tm_map(c, removePunctuation)
  c <- tm_map(c, removeNumbers)
  return(c)
}
pCorpusPrep <- prepC(pCorpus)
#------------------------------------------------------------
# Create the document term matrix
dtm <- DocumentTermMatrix(pCorpusPrep)
dtm
# <<DocumentTermMatrix (documents: 6598, terms: 20781)>>
# Non-/sparse entries: 142840/136970198
# Sparsity           : 100%
# Maximal term length: 83
# Weighting          : term frequency (tf)

# Work with the transpose to list keywords as rows
inspect(t(dtm[100:105, 90:105]))
#               Docs
# Terms          100 101 102 103 104 105
#   accomodated    0   0   0   0   0   0
#   accompanied    0   0   0   0   0   0
#   accompanies    0   0   0   0   0   0
#   accompany      0   0   0   0   0   0
#   accompanying   0   0   0   0   0   0
#   accomplished   0   0   0   0   0   0
#   accomplishes   0   0   0   0   0   0
#   accordance     0   0   0   0   0   0
#   according      0   0   1   0   0   0
#   accordingly    0   0   0   0   0   0
#   accordinglyp   0   0   0   0   0   0
#   account        0   0   0   0   0   0
#   accounted      0   0   0   0   0   0
#   accounting     0   0   0   0   0   0
#   accountp       0   0   0   0   0   0
#   accounts       0   0   0   0   0   0

# Reduce the number of sparse terms
dtms <- removeSparseTerms(dtm, 0.99)
dim(dtms)   # 6598 372
```

I am pretty much counting on some luck here, hoping that "bayesian" will be one of the remaining 372 terms. This last bit of code finds 229 packages associated with the keyword "bayesian":

```r
# Find the Bayesian packages
dtmsT <- t(dtms)
keywords <- row.names(dtmsT)
bi <- which(keywords == "bayesian")   # Index of an interesting keyword
bayes <- inspect(dtmsT)[bi, ]         # Vexing that it prints to console
bayes_packages_index <- names(bayes[bayes == 1])
# Here are the "Bayesian" packages
bayes_packages <- pkgNames[as.numeric(bayes_packages_index)]
length(bayes_packages)   # 229
# Here are the descriptions of the "Bayesian" packages
bayes_pkgs_desc <- pkgDesc[bayes == 1]
```

Here is the list of packages found.

Not all of these "fish" are going to be worth keeping, but at least we have reduced the search to something manageable. In 10 or 15 minutes of fishing you might catch something interesting.