by Boxuan Cui, Data Scientist at Smarter Travel
Once upon a time, there was a joke:
In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for prepare data.
— Big Data Borat (@BigDataBorat) February 27, 2013
According to a Forbes article, cleaning and organizing data is the most time-consuming and least enjoyable data science task. DataExplorer is one of the many resources that address this, and its sole mission is to minimize that 80% and make it enjoyable. As a result, one fundamental design principle is to be extremely user-friendly. Most of the time, one function call is all you need.
Data manipulation is powered by data.table, so tasks involving big datasets usually complete in a few seconds. In addition, the package is flexible with input data classes, so you should be able to throw in any data.frame-like object. However, certain functions require a data.table class object as input due to the update-by-reference feature, which I will cover later in the post.
Now enough said and let's look at some code, shall we?
Take the BostonHousing dataset from the mlbench library:
library(mlbench)
data("BostonHousing", package = "mlbench")
Initial Visualization
Without knowing anything about the data, my first 3 tasks are almost always:
library(DataExplorer)
plot_missing(BostonHousing) ## Are there missing values, and what is the missing data profile?
plot_bar(BostonHousing) ## What does the categorical frequency for each discrete variable look like?
plot_histogram(BostonHousing) ## What is the distribution of each continuous variable?
While there are not many interesting insights from plot_missing and plot_bar, below is the output from plot_histogram.
Upon scrutiny, the variable rad looks discrete, and I want to group crim, zn, indus and b into bins as well. Let's do so:
## Set `rad` to factor
BostonHousing$rad <- as.factor(BostonHousing$rad)
## Create new discrete variables
for (col in c("crim", "zn", "indus", "b")) {
  BostonHousing[[paste0(col, "_d")]] <- as.factor(ggplot2::cut_interval(BostonHousing[[col]], 2))
}
## Plot bar chart for all discrete variables
plot_bar(BostonHousing)
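To make the binning step concrete, here is a minimal sketch of what ggplot2::cut_interval does on a tiny vector. The values in x are made up for illustration; with 2 bins, the range of x is split into two intervals of equal width, and each value is assigned to one of them.

```r
library(ggplot2)

# Hypothetical toy values, not taken from BostonHousing
x <- c(0, 1, 5, 10)

# Split the range of x into 2 equal-width intervals
bins <- cut_interval(x, 2)

# Count how many values fall into each interval
table(bins)
```

Because the intervals have equal width rather than equal counts, a skewed variable such as crim can end up with most observations in a single bin.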
At this point, we have a much better understanding of the data distribution. Now assume we are interested in medv (median value of owner-occupied homes in USD 1000's), and would like to build a model to predict it. Let's plot it against all other variables:
plot_boxplot(BostonHousing, by = "medv")
plot_scatterplot(
  subset(BostonHousing, select = -c(crim, zn, indus, b)),
  by = "medv", size = 0.5)
plot_correlation(BostonHousing)
And this is how you slice & dice your data, and analyze correlation with merely 3 lines of code.
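If you want to see the numbers behind a correlation heatmap, base R's cor gives the underlying matrix for numeric columns. The sketch below uses the built-in mtcars dataset purely for illustration, and it is not a claim about how plot_correlation is implemented internally.

```r
# Built-in dataset and column choice are for illustration only
num <- mtcars[, c("mpg", "wt", "hp")]

# Pairwise Pearson correlations between the numeric columns
corr <- cor(num)

# Rounded for readability
round(corr, 2)
```

The matrix is symmetric with ones on the diagonal, which is why heatmaps of it mirror across the diagonal.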
Feature Engineering
Feature engineering is a crucial step in building better models. DataExplorer provides a couple of functions to ease the process. All of them require a data.table as the input object, because it is lightning fast. However, if you don't feel like coding in data.table syntax, you may adopt the following process:
## Set your data to `data.table` first
your_data <- data.table(your_data)
## Apply DataExplorer functions
group_category(your_data, ...)
drop_columns(your_data, ...)
set_missing(your_data, ...)
## Set the class back to that of the original object
class(your_data) <- "original_object_name"
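For a plain data.frame, the round trip above can be sketched with data.table functions alone. The toy data and the cleaning step are hypothetical stand-ins for your own data and for the DataExplorer calls; setDF is used here as one possible alternative to assigning the class by hand, assuming the original object was an ordinary data.frame.

```r
library(data.table)

# Toy data frame standing in for `your_data` (hypothetical values)
df <- data.frame(id = 1:3, score = c(10, NA, 30))

# Convert to data.table; this makes a copy, so `df` itself is untouched
dt <- data.table(df)

# Stand-in for a by-reference cleaning step: replace missing scores with 0
set(dt, i = which(is.na(dt$score)), j = "score", value = 0)

# Convert back to a plain data.frame in place
setDF(dt)
class(dt)
```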
Let's return to the BostonHousing dataset. For the rest of this section, we'll assume the data has been converted to a data.table already.
library(data.table)
BostonHousingDT <- data.table(BostonHousing)
Remember those transformed continuous variables? Let's drop them:
drop_columns(BostonHousingDT, c("crim", "zn", "indus", "b"))
Note: Because data.table updates by reference, the original object is updated without the need to re-assign a returned object.
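The update-by-reference behavior can be seen with data.table on its own. In this minimal sketch, drop_in_place is a hypothetical helper written for illustration (it is not a DataExplorer function); it removes a column inside a function, and the caller's object changes even though nothing is re-assigned.

```r
library(data.table)

dt <- data.table(x = 1:3, y = c("a", "b", "c"))

# Hypothetical helper: removes columns in place using `:=`
drop_in_place <- function(data, cols) {
  data[, (cols) := NULL]
  invisible(data)
}

drop_in_place(dt, "y")

# The caller's object has changed without re-assignment
names(dt)
```

The same call on a regular data.frame would need its result assigned back, because data.frame modifications inside a function act on a copy.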
Let's take a look at the discrete variable rad:
plot_bar(BostonHousingDT$rad)
I think categories other than 4, 5 and 24 are too sparse, and might skew my model fit. How could I group all the sparse categories together?
group_category(BostonHousingDT, "rad", 0.25, update = FALSE)
#    rad cnt       pct   cum_pct
# 1:  24 132 0.2608696 0.2608696
# 2:   5 115 0.2272727 0.4881423
# 3:   4 110 0.2173913 0.7055336
Looks like grouping the bottom 25% of rad would give me what I need. Let's do so:
group_category(BostonHousingDT, "rad", 0.25, update = TRUE)
plot_bar(BostonHousingDT$rad)
In addition to categorical frequency, you may also play with the measure argument to group by the sum of a different variable. See ?group_category for more example use cases.
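The idea behind grouping sparse categories by cumulative frequency can be sketched by hand in data.table. This is an illustration of the logic suggested by the output above, not DataExplorer's actual implementation; the toy column grp, its counts, and the "OTHER" label are assumptions made for the example.

```r
library(data.table)

# Toy categorical column (hypothetical counts: a=40, b=30, c=15, d=10, e=5)
dt <- data.table(grp = rep(c("a", "b", "c", "d", "e"), times = c(40, 30, 15, 10, 5)))

# Frequency table sorted from most to least common, with cumulative share
freq <- dt[, .(cnt = .N), by = grp][order(-cnt)]
freq[, pct := cnt / sum(cnt)]
freq[, cum_pct := cumsum(pct)]

# Keep categories whose cumulative share stays within the top 75%;
# everything else (the bottom 25%) is relabeled "OTHER" by reference
threshold <- 0.25
keep <- freq[cum_pct <= 1 - threshold, grp]
dt[!(grp %in% keep), grp := "OTHER"]

# Resulting category counts
dt[, .N, by = grp]
```

Inspecting freq before relabeling plays the same role as calling group_category with update = FALSE: you see which categories survive before committing to the change.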
Data Report
To generate a report of your data:
create_report(BostonHousing)
Currently, there is not much to do with this, but it is my plan to support customization of the generated report, so stay tuned for more features!
I hope you enjoyed exploring the Boston housing data with me, and finally here are some additional resources about the DataExplorer package:
The vignette aircraft examples are a bit misleading, as data needs a bit more cleanup, I think. Airbus is in the list with two different strings, McDonnell Douglas with at least three, and Canada with two. If those were first lumped together into one each, before lumping the long tail together into an "other" bin, this could make a big difference in further modeling, as Airbus would jump to largest group by far, not the third, with about half of the Airbus data being lumped into "other". #oops
Posted by: Jarno Peschier | February 08, 2018 at 20:08
Great stuff! I really like the report. Is it possible to add the dataset name and the boxplots?
Posted by: btadams | February 09, 2018 at 12:05
@JarnoPeschier, thanks for identifying that! I totally overlooked it. I have created issue #55 for tracking, and will update it with the next release!
Posted by: Boxuan Cui | February 09, 2018 at 12:47
@btadams Yes, I plan to improve the reporting functionality with the next release. See issue #41. Thanks for using DataExplorer!
Posted by: Boxuan Cui | February 09, 2018 at 12:49
I like the package, but why the inconsistent ggplot theming: defaults for boxplots, but odd semi-transparent bars with not really pretty black outlines for the barplots and histograms? Sticking to ggplot standards would have been nicer imho.
plot_str just gives:
Error: C stack usage 7970280 is too close to the limit
Posted by: Holger Brandl | February 13, 2018 at 04:44
Do you know of a similar tool for python?
Posted by: Hoang | February 21, 2018 at 17:20