by Joseph Rickert
New R packages keep rolling into CRAN at a prodigious rate: 184 in May, 195 in June, and July looks like it will continue the trend. I spent some time sorting through them and have picked out a few that are interesting from a data science point of view.
fakeR generates fake data based on a given data set. Factors are sampled from contingency tables and numerical data is sampled from a multivariate normal distribution. The method works reasonably well for small data sets, producing fake data with the same correlation structure as the original data. The following example uses the simulate_dataset() function on a subset of the mtcars data set. The correlation plots look pretty much identical.
library(fakeR)
# Subset of mtcars (mpg, cyl, disp, hp, wt) and a simulated copy of it
df1 <- mtcars[, c(1:4, 6)]
df2 <- as.data.frame(simulate_dataset(df1))
Note, however, that the simulated data might need some processing before being used in a model. The last row in the df2 simulated data set contains a negative value for displacement. fakeR contains functions for both time-dependent and time-independent variables. There is a vignette.
heatmaply produces interactive heatmaps. More than just attractive graphics, these increase the already considerable value of the heatmap as an exploratory analysis tool. The following code from the package vignette shows the correlation structure of variables in the mtcars data set.
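The kind of check described above can be sketched in base R without fakeR itself. The snippet below is only a stand-in: it assumes the numeric part of the method amounts to drawing from a multivariate normal fitted to the data, and the seed and the names sim, max_cor_diff and df2_clean are illustrative choices, not part of the package.

```r
# Stand-in for the numeric sampling step (an assumption, not fakeR's code):
# draw from a multivariate normal with the sample mean and covariance.
set.seed(42)
df1 <- mtcars[, c(1:4, 6)]
mu <- colMeans(df1)
sigma <- cov(df1)
n <- nrow(df1)

# Upper-triangular Cholesky factor, so that t(R) %*% R equals sigma
R <- chol(sigma)
z <- matrix(rnorm(n * ncol(df1)), nrow = n)
sim <- sweep(z %*% R, 2, mu, "+")
df2 <- as.data.frame(sim)
names(df2) <- names(df1)

# Largest absolute gap between the original and simulated correlations
max_cor_diff <- max(abs(cor(df1) - cor(df2)))

# Drop simulated rows with impossible (negative) measurements
df2_clean <- df2[rowSums(df2 < 0) == 0, , drop = FALSE]
```

Because a normal distribution has unbounded support, negative draws for strictly positive quantities such as displacement are always possible, which is why a filtering or truncation step like the last line may be needed before modeling.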
library(heatmaply)
heatmaply(cor(mtcars),
          k_col = 2, k_row = 2,
          limits = c(-1, 1)) %>%
  layout(margin = list(l = 40, b = 40))
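To get a feel for what k_col = 2 and k_row = 2 ask for, the grouping can be approximated in base R. This is a rough sketch that assumes Euclidean distance and complete-linkage clustering; the names cor_mat, hc and groups are illustrative, and the package's own ordering of rows and columns may differ.

```r
# Cluster the variables by the similarity of their correlation profiles
cor_mat <- cor(mtcars)
hc <- hclust(dist(cor_mat))

# Cut the dendrogram into two groups, as k_col = 2 / k_row = 2 would colour them
groups <- cutree(hc, k = 2)

# List the variables that fall into each group
split(names(groups), groups)
```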
mscstexta4r provides an R client for the Microsoft Cognitive Services Text Analytics REST API, a suite of text analytics web services built with Azure Machine Learning that can be used to analyze unstructured text. The vignette explains the text processing capabilities as well as how to get going with it.
mvtboost extends the GBM model to fit boosted decision trees to multivariate continuous variables. The algorithm jointly fits models with multiple outcomes over a common set of predictors. According to the authors, this makes it possible to: (1) choose the number of trees and shrinkage to jointly minimize prediction error in a test set over all outcomes, (2) compare tree models across outcomes, and (3) estimate the "covariance explained" by predictors in pairs of outcomes. For this last point, the authors explain:
The covariance explained matrix can then be organized in a Q(Q+1)/2 × p table where Q is the number of outcomes, and p is the number of predictors. Each element is the covariance explained by a predictor for any pair of outcomes. When the outcomes are standardized to unit variance, each element can be interpreted as the correlation explained in any pair of outcomes by a predictor. Like the R² of the linear model, this decomposition is unambiguous only if the predictors are independent. Below we show the original covariance explained matrix.
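The shape of such a table is easy to sketch in base R. The example below assumes one row per unordered pair of outcomes (each outcome paired with itself included) and one column per predictor, with Q outcomes and p predictors; the outcome and predictor names are hypothetical and the cells are left empty, since no model is fitted here.

```r
Q <- 3  # number of outcomes (illustrative)
p <- 4  # number of predictors (illustrative)
outcomes <- paste0("y", seq_len(Q))
predictors <- paste0("x", seq_len(p))

# Unordered outcome pairs, including each outcome paired with itself
idx <- which(upper.tri(diag(Q), diag = TRUE), arr.ind = TRUE)
pair_names <- paste(outcomes[idx[, "row"]], outcomes[idx[, "col"]], sep = "-")

# Empty table: one row per outcome pair, one column per predictor
covex <- matrix(NA_real_, nrow = length(pair_names), ncol = p,
                dimnames = list(pair_names, predictors))
dim(covex)
```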
In a classification problem, label noise refers to the incorrect labeling of the training instances. The NoiseFiltersR package presents an extensive collection of label noise filtering algorithms along with impressive documentation and references for each algorithm. For example, the documentation for the Generalized Edition (GE), a similarity-based filter, is organized like this:
The package and its vignette comprise a valuable resource for machine learning.
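For readers new to the idea, here is a simplified similarity-based filter in base R. It is only loosely in the spirit of nearest-neighbour editing and is not the package's implementation: the 10% flip rate, the values of k and k_prime, and all variable names are assumptions chosen for illustration.

```r
set.seed(7)
x <- as.matrix(iris[, 1:4])
y_true <- as.character(iris$Species)

# Inject label noise: flip 15 of the 150 labels to a different class
y_noisy <- y_true
flip <- sample(nrow(x), 15)
for (i in flip) {
  y_noisy[i] <- sample(setdiff(unique(y_true), y_true[i]), 1)
}

k <- 5        # neighbours consulted for each instance
k_prime <- 4  # votes required before a label is trusted

d <- as.matrix(dist(x))
diag(d) <- Inf  # an instance is not its own neighbour

# Relabel to the neighbourhood majority if it is strong enough, else mark NA
y_filtered <- vapply(seq_len(nrow(x)), function(i) {
  nn <- order(d[i, ])[seq_len(k)]
  votes <- table(y_noisy[nn])
  if (max(votes) >= k_prime) names(votes)[which.max(votes)] else NA_character_
}, character(1))

# Instances without a strong majority are dropped from the training set
keep <- !is.na(y_filtered)
```

Comparing mean(y_noisy == y_true) with mean(y_filtered[keep] == y_true[keep]) gives a quick sense of how much noise such a filter removes on this toy example.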
preprosim describes itself as a "lightweight data quality simulation for classification". It contains functions to add noise, missing values, outliers, irrelevant features and other transformations that are useful in evaluating classification accuracy. The vignette provides a small example of how missing values and noise can affect accuracy.
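The general idea can be illustrated in base R without preprosim's own functions. The sketch below is an assumption-laden toy: it uses a hand-rolled nearest-centroid classifier, Gaussian noise with standard deviation 1, 20% missing cells filled by mean imputation, and accuracy measured on the training data. None of these choices come from the package.

```r
set.seed(1)
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# Nearest-centroid classifier: assign each row to the class with the closest mean
centroid_accuracy <- function(x, y) {
  centroids <- apply(x, 2, function(col) tapply(col, y, mean))
  pred <- apply(x, 1, function(row) {
    d <- rowSums(sweep(centroids, 2, row)^2)
    names(which.min(d))
  })
  mean(pred == y)
}

# Contamination 1: additive Gaussian noise on every feature
x_noisy <- x + matrix(rnorm(length(x), sd = 1), nrow = nrow(x))

# Contamination 2: 20% of cells set to missing, then mean-imputed per column
x_missing <- x
x_missing[sample(length(x), size = 0.2 * length(x))] <- NA
for (j in seq_len(ncol(x_missing))) {
  x_missing[is.na(x_missing[, j]), j] <- mean(x_missing[, j], na.rm = TRUE)
}

acc <- c(clean = centroid_accuracy(x, y),
         noisy = centroid_accuracy(x_noisy, y),
         missing = centroid_accuracy(x_missing, y))
acc
```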
Finally, there are two packages I recommend installing just for the sake of sanity.
gaussfacts provides random "facts" about Carl Friedrich Gauss. One or two of these and you should feel fine again.
rmsfact invokes random quotes from Richard M. Stallman. To be used when gaussfacts doesn't quite do it for you.