*by Błażej Moska, computer science student and data science intern*

Stuck with a dataset that is too large? Is R's speed driving you mad? Divide, parallelize and go with Rcpp!

One of the frustrating moments when working with data is needing results urgently while the dataset is too large to process in time. This often happens when we need to use an algorithm with high computational complexity. I will demonstrate it with an example I've been working on.

Suppose we have a large dataset of association rules, and for some reason we want to slim it down. Whenever two rules have the same consequent and one rule's antecedent is a subset of the other rule's antecedent, we want to keep the rule with the smaller antecedent (a smaller set is more likely to occur than a larger set that contains it). This is illustrated below, where the rules in bold are the redundant ones:

**{A,B,C}=>{D}**

{E}=>{F}

**{A,B}=>{D}**

{A}=>{D}

How can we achieve that? For example, with the pseudo-algorithm below:

```
flag[1..n] = 0                      # 0 = keep the rule, 1 = rule is redundant
for i = 1 to n:
    for j = i+1 to n:
        if consequents[i] = consequents[j]:
            # check if antecedent[i] contains antecedent[j]
            if antecedent[i] contains antecedent[j]:
                flag[i] = 1         # rule i is the larger one
            # otherwise check if antecedent[j] contains antecedent[i]
            else if antecedent[j] contains antecedent[i]:
                flag[j] = 1         # rule j is the larger one
```
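The pseudo-algorithm above can be sketched in plain C++ as follows. This is only an illustration, not the `issub` function from this post: the `Rule` struct, `is_subset` and `flag_redundant` are hypothetical names, and the sketch assumes that item names are stored as sorted vectors of strings so that `std::includes` can do the subset test.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical representation of one association rule: antecedent => consequent.
// Assumption: both sides are stored as sorted vectors of item names.
struct Rule {
    std::vector<std::string> antecedent;
    std::vector<std::string> consequent;
};

// True if every item of `small` also occurs in `big` (both must be sorted).
bool is_subset(const std::vector<std::string>& small,
               const std::vector<std::string>& big) {
    assert(std::is_sorted(small.begin(), small.end()));
    assert(std::is_sorted(big.begin(), big.end()));
    return std::includes(big.begin(), big.end(), small.begin(), small.end());
}

// Returns one flag per rule: 1 if another rule with the same consequent has an
// antecedent contained in this rule's antecedent (so this rule is redundant),
// 0 otherwise. Of two identical rules, the earlier one is flagged.
std::vector<int> flag_redundant(const std::vector<Rule>& rules) {
    std::vector<int> flag(rules.size(), 0);
    for (std::size_t i = 0; i < rules.size(); ++i) {
        for (std::size_t j = i + 1; j < rules.size(); ++j) {
            if (rules[i].consequent != rules[j].consequent) continue;
            if (is_subset(rules[j].antecedent, rules[i].antecedent)) {
                flag[i] = 1;  // rule i is the larger one
            } else if (is_subset(rules[i].antecedent, rules[j].antecedent)) {
                flag[j] = 1;  // rule j is the larger one
            }
        }
    }
    return flag;
}
```

Applied to the four rules from the illustration, in the order shown, this sketch flags the two rules in bold and leaves `{E}=>{F}` and the rule with antecedent `{A}` unflagged.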

How many operations do we need to perform with this simple algorithm?

For the first `i` we need to iterate \(n-1\) times, for the second `i` \(n-2\) times, for the third `i` \(n-3\) times, and so on, finally reaching \(n-(n-1)\). This leads to (proof can be found here):

\[ \sum_{i=1}^{n-1}{i}= \frac{n(n-1)}{2} \]

So the above has an asymptotic complexity of \(O(n^2)\). This means, more or less, that the computational cost grows with the square of the size of the data. For a dataset containing around 1,300,000 records, this becomes a serious issue. With R I was unable to perform the computation in a reasonable time. Since a compiled language performs better with simple arithmetic operations, the second idea was to use Rcpp. Yes, it is faster to some extent, but with such a large data frame I was still unable to get results in a satisfactory time. So are there any other options?
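To get a feel for the scale, the pair count \(n(n-1)/2\) can be evaluated for the dataset size mentioned above. The value \(n = 1{,}300{,}000\) is taken here as a round figure, since the text only says "around":

\[ \frac{n(n-1)}{2} = \frac{1\,300\,000 \cdot 1\,299\,999}{2} = 844\,999\,350\,000 \approx 8.4 \times 10^{11} \]

That is the number of rule pairs to compare, and each comparison is itself a subset test, not a single arithmetic operation.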

Yes, there are. If we take a look at our dataset, we can see that it can be split in such a way that each individual "chunk" consists of records with exactly the same consequent:

*{A,B}=>{D}*

*{A}=>{D}*

**{C,G}=>{F}**

**{Y}=>{F}**
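A minimal sketch of this splitting step in plain C++ is shown below. It is an illustration under assumptions, not the code used for the post: `Rule`, `Chunks`, `split_by_consequent`, `pair_count` and `pairs_within_chunks` are hypothetical names, and item names are assumed to be stored as vectors of strings. The two counting helpers are included only to show how much pairwise work the split saves.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical representation of one association rule: antecedent => consequent.
struct Rule {
    std::vector<std::string> antecedent;
    std::vector<std::string> consequent;
};

// Chunks keyed by consequent; each chunk holds the rules sharing that consequent.
using Chunks = std::map<std::vector<std::string>, std::vector<Rule>>;

// Put every rule into the chunk that belongs to its consequent.
Chunks split_by_consequent(const std::vector<Rule>& rules) {
    Chunks chunks;
    for (const Rule& r : rules) {
        assert(!r.consequent.empty());  // a rule without a consequent is malformed
        chunks[r.consequent].push_back(r);
    }
    return chunks;
}

// Number of unordered pairs among n items: n(n-1)/2.
unsigned long long pair_count(unsigned long long n) {
    return n < 2 ? 0 : n * (n - 1) / 2;
}

// Total number of pairs when comparisons are done only inside each chunk.
unsigned long long pairs_within_chunks(const Chunks& chunks) {
    unsigned long long total = 0;
    for (const auto& entry : chunks) {
        total += pair_count(entry.second.size());
    }
    return total;
}
```

For the four rules shown above, `split_by_consequent` produces two chunks of two rules each, so only 2 pairs need to be compared instead of the 6 pairs among all four rules.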

After this division I got 3300 chunks, so the average number of observations per chunk was around 400. The next step was to rerun the algorithm sequentially on each chunk. Since our algorithm has quadratic complexity, this is faster than running it on the whole dataset at once: if the chunks were all about that size, each would need roughly \(400 \cdot 399 / 2\), or about 80,000, comparisons, which gives about \(2.6 \times 10^8\) in total, compared with about \(8.4 \times 10^{11}\) for the unsplit data. While R failed again, Rcpp finally returned a result (after 5 minutes).

But there is still room for improvement. Since our chunks can be processed independently, it is possible to run the computation in parallel using, for example, the foreach package (which I demonstrated in a previous article). While passing R functions to foreach is a simple task, parallelizing Rcpp is a little more time-consuming. We need to follow the steps below:

- Create a `.cpp` file that contains all of the functions you need.

- Create a package using Rcpp. This can be achieved using, for example:

`Rcpp.package.skeleton("nameOfYourPackage",cpp_files = "directory_of_your_cpp_file")`

- Install your Rcpp package from source:

`install.packages("directory_of_your_rcpp_package", repos=NULL, type="source")`

- Load your library:

`library(name_of_your_rcpp_package)`

Now you can use your Rcpp function in foreach:

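```
# Assumes a parallel backend is already registered (e.g. with doParallel)
# and that `data` is a list holding one chunk per element.
results <- foreach(k = seq_along(data),
                   .packages = "name_of_your_package") %dopar% {
  your_cpp_function(data[[k]])
}
```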

Even with foreach I waited forever for the R results, but Rcpp gave them in approximately 2.5 minutes. Not too bad!
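The post parallelizes from R with foreach. Purely to illustrate why independent chunks parallelize so easily, here is a sketch of the same idea in plain C++ with a small pool of `std::thread` workers. It swaps foreach for standard-library threads, so it is not the method used above. `Antecedent`, `Chunk`, `flag_chunk` and `flag_chunks_parallel` are hypothetical names, and each chunk is assumed to hold the sorted antecedents of rules that already share one consequent.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

using Antecedent = std::vector<std::string>;  // sorted item names (assumption)
using Chunk = std::vector<Antecedent>;        // rules sharing one consequent

// Pairwise subset check inside one chunk: flag 1 marks the larger antecedent.
std::vector<int> flag_chunk(const Chunk& chunk) {
    std::vector<int> flag(chunk.size(), 0);
    for (std::size_t i = 0; i < chunk.size(); ++i) {
        for (std::size_t j = i + 1; j < chunk.size(); ++j) {
            const Antecedent& a = chunk[i];
            const Antecedent& b = chunk[j];
            if (std::includes(a.begin(), a.end(), b.begin(), b.end())) {
                flag[i] = 1;  // a contains b, so rule i is redundant
            } else if (std::includes(b.begin(), b.end(), a.begin(), a.end())) {
                flag[j] = 1;  // b contains a, so rule j is redundant
            }
        }
    }
    return flag;
}

// Process all chunks with `workers` threads. Each thread repeatedly claims the
// next unprocessed chunk index, so no two threads ever write the same result.
std::vector<std::vector<int>> flag_chunks_parallel(const std::vector<Chunk>& chunks,
                                                   unsigned workers) {
    assert(workers > 0);
    std::vector<std::vector<int>> results(chunks.size());
    std::atomic<std::size_t> next{0};
    auto work = [&]() {
        for (std::size_t k = next.fetch_add(1); k < chunks.size();
             k = next.fetch_add(1)) {
            results[k] = flag_chunk(chunks[k]);
        }
    };
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) pool.emplace_back(work);
    for (std::thread& t : pool) t.join();
    return results;
}
```

A call such as `flag_chunks_parallel(chunks, std::thread::hardware_concurrency())` would use one worker per hardware thread, provided that function returns a non-zero value. On some toolchains the program must be compiled with `-pthread`. Because no chunk reads or writes another chunk's data, no locking is needed beyond the atomic counter.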

- Here is the Rcpp code for the `issub` function
- Here is the R code that partitions the data and calls the `issub` function in parallel

Here are some conclusions. Firstly, it's worth knowing more languages and tools than just R. Secondly, there is often an escape from the large dataset trap. There is little chance that somebody will face exactly the same task as in the example above, but a much higher chance that someone will face a similar problem that can be solved in the same way.

Looks like the link to R code was uploaded two times, here is the github gist link for the Rcpp code: https://gist.github.com/bmoska/965c79e641ba609762095d779328c56f

Regards,

Błażej

Posted by: Błażej | January 04, 2018 at 11:26

Thanks for this post.

The gist link is still the R code. Can you provide the Rcpp code with some data to reproduce the timings here please?

PS: the sum goes to (n-1).

Posted by: Privefl | January 05, 2018 at 01:44

I've updated the link to the Rcpp code above.

Posted by: David Smith | January 08, 2018 at 07:52