*by Błażej Moska, computer science student and data science intern*

Suppose that we have performed clustering K-means clustering in R and are satisfied with our results, but later we realize that it would also be useful to have a membership matrix. Of course it would be easier to repeat clustering using one of the fuzzy kmeans functions available in R (like `fanny`

, for example), but since it is slightly different implementation the results could also be different and for some reasons we don’t want them to be changed. Knowing the equation we can construct this matrix on our own, after using the `kmeans`

function. The equation is defined as follows (source: Wikipedia):

$$w_{ij} = \frac{1}{ \sum_ {k=1}^c ( \frac{ \| x_{i} - c_{j} \| }{ \| x_{i} - c_{k} \| }) ^{ \frac{2}{m-1} } } $$

\(w_{ij}\) denotes to what extent the \(i\)th object belongs to the \(j\)th cluster. So the total number of rows of this matrix equals number of observation and total number of columns equals number of variables included in clustering. \(m\) is a parameter, typically set to \(m=2\). \(w_{ij}\) values range between 0 and 1 so they are easy and convenient to compare. In this example I will use \(m = 2\) so the Euclidean distance will be calculated.

To make computations faster I also used the Rcpp package, then I compared speed of execution of function written in R with this written in C++.

In implementations `for`

loops were used, although it is not a commonly used method in R (see this blog post for more information and alternatives), but in this case I find it more convenient.

### Rcpp (C++ version)

#include <Rcpp.h> #include <math.h> using namespace Rcpp; // [[Rcpp::export]] NumericMatrix fuzzyClustering(NumericMatrix data, NumericMatrix centers, int m) { /* data is a matrix with observations(rows) and variables, centers is a matrix with cluster centers coordinates, m is a parameter of equation, c is a number of clusters */ int c=centers.rows(); int rows = data.rows(); int cols = data.cols(); /*number of columns equals number of variables, the same as is in centers matrix*/ double tempDist=0; /*dist and tempDist are variables storing temporary euclidean distances */ double dist=0; double denominator=0; //denominator of “main” equation NumericMatrix result(rows,c); //declaration of matrix of results for(int i=0;i<rows;i++){ for(int j=0;j<c;j++){ for(int k=0;k<c;k++){ for(int p=0;p<cols;p++){ tempDist = tempDist+pow(centers(j,p)-data(i,p),2); //in innermost loop an euclidean distance is calculated. dist = dist + pow(centers(k,p)-data(i,p),2); /*tempDist is nominator inside the sum operator in the equation, dist is the denominator inside the sum operator in the equation*/ } tempDist = sqrt(tempDist); dist = sqrt(dist); denominator = denominator+pow((tempDist/dist),(2/(m-1))); tempDist = 0; dist = 0; } result(i,j) = 1/denominator; // nominator/denominator in the main equation denominator = 0; } } return result; }

We can save this in a file with .cpp extension. To compile it from R we can write:

sourceCpp("path_to_cpp_file")

If everything goes right our function `fuzzyClustering`

will then be available from R.

### R version

fuzzyClustering=function(data,centers,m){ c <- nrow(centers) rows <- nrow(data) cols <- ncol(data) result <- matrix(0,nrow=rows,ncol=c) #defining membership matrix denominator <- 0 for(i in 1:rows){ for(j in 1:c){ tempDist <- sqrt(sum((centers[j,]-data[i,])^2)) #euclidean distance, nominator inside a sum operator for(k in 1:c){ Dist <- sqrt(sum((centers[k,]-data[i,])^2)) #euclidean distance, denominator inside a sum operator denominator <- denominator +((tempDist/Dist)^(2/(m-1))) #denominator of an equation } result[i,j] <- 1/denominator #inserting value into membership matrix denominator <- 0 } } return(result); }

Result looks as follows. Columns are cluster numbers (in this case 10 clusters were created), rows are our objects (observations). Values were rounded to the third decimal place, so the sums of rows can be slightly different than 1:

1 2 3 4 5 6 7 8 9 10 [1,] 0.063 0.038 0.304 0.116 0.098 0.039 0.025 0.104 0.025 0.188 [2,] 0.109 0.028 0.116 0.221 0.229 0.080 0.035 0.116 0.017 0.051 [3,] 0.067 0.037 0.348 0.173 0.104 0.066 0.031 0.095 0.018 0.062 [4,] 0.016 0.015 0.811 0.049 0.022 0.017 0.009 0.023 0.007 0.031 [5,] 0.063 0.048 0.328 0.169 0.083 0.126 0.041 0.079 0.018 0.045 [6,] 0.069 0.039 0.266 0.226 0.102 0.111 0.037 0.084 0.017 0.048 [7,] 0.045 0.039 0.569 0.083 0.060 0.046 0.025 0.071 0.015 0.046 [8,] 0.070 0.052 0.399 0.091 0.093 0.054 0.034 0.125 0.022 0.062 [9,] 0.095 0.037 0.198 0.192 0.157 0.088 0.038 0.121 0.019 0.055 [10,] 0.072 0.024 0.132 0.375 0.148 0.059 0.025 0.081 0.015 0.067

### Performance comparison

Shown below is the output of `Sys.time`

for the C++ and R versions, running against a simulated matrix with 30000 observations, 3 variables and 10 clusters.

The hardware I used was a low-cost notebook Asus R556L with Intel Core i3-5010 2.1 GHz processor and 8 GB DDR3 1600 MHz RAM memory.

C++ version:

user system elapsed 0.32 0.00 0.33

R version:

user system elapsed 15.75 0.02 15.94

In this example, the function written in C++ executed about 50 times faster than the equivalent function written in pure R.

Very Nice Article. K mean clustering technique is useful in various field of data science. Thank you.

Posted by: Sukesh Kumar | August 25, 2017 at 05:10

I’m really appreciated you for this Epic Article.

K-means stores k centroids that it uses to define clusters..

thank you...

Posted by: statswork | August 29, 2017 at 04:36