One of the key data analysis tools that the BellKor team used to win the Netflix Prize was the Singular Value Decomposition (SVD) algorithm. Stored as a standard dense matrix (about 480,000 members' ratings of about 18,000 movies, at 8 bytes per cell), the Netflix Prize data would occupy about 65GB of memory -- far too large for the standard in-memory data model of open-source R. But in the video below, Bryan Lewis shows us how to use the sparseMatrix class from R's Matrix package to store the data (about 99 million actual movie ratings) efficiently, and the irlba package (which provides a fast, memory-efficient truncated SVD algorithm for big data) to perform the SVD analysis on the Netflix data in R.
Big Computing: Bryan Lewis's Vignette on IRLBA for SVD in R
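To get a feel for the approach before watching, here is a minimal self-contained sketch of the same two-step recipe on simulated ratings (the user and movie counts and the density below are arbitrary stand-ins, not the Netflix data):

library(Matrix)
library(irlba)

set.seed(1)
n_users <- 10000; n_movies <- 500            # toy sizes
idx <- sample(n_users * n_movies, 50000)     # 50,000 distinct rated cells
i <- ((idx - 1) %% n_users) + 1              # user (row) indices
j <- ((idx - 1) %/% n_users) + 1             # movie (column) indices
A <- sparseMatrix(i = i, j = j, x = sample(1:5, 50000, replace = TRUE))

# Truncated SVD: only the 5 largest singular triplets are computed,
# and A is never expanded into a dense matrix.
S <- irlba(A, nu = 5, nv = 5)
S$d                                          # the five largest singular values

The same two calls -- sparseMatrix() to hold the rating triplets and irlba() for the truncated SVD -- scale up to the full 99-million-rating matrix, as the session in the comments below shows.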
There is also the 'svd' package (http://cran.r-project.org/web/packages/svd/index.html), which provides several state-of-the-art SVD implementations. It should probably be faster than irlba, since the latter is pure R.
Posted by: Anton Korobeynikov | May 31, 2011 at 14:09
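For comparison, a toy example of that package (a sketch assuming its propack.svd() interface, on a small dense matrix rather than the Netflix data):

library(svd)
set.seed(1)
A <- matrix(rnorm(200 * 50), 200, 50)
s <- propack.svd(A, neig = 5)   # truncated SVD: five largest triplets
s$d                             # singular values, analogous to irlba's S$d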
That's awesome.
Posted by: Frank | June 01, 2011 at 11:54
Any guidance on converting the Netflix download training_set directory to cmr.txt?
Thanks
Posted by: fionn | June 02, 2011 at 04:23
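One possible pure-R conversion (a sketch assuming the standard training_set layout: one mv_*.txt file per movie, with the movie ID plus ':' on its first line and 'customerID,rating,date' on each line after it; the output file name and the '|' delimiter are arbitrary choices):

files <- list.files("training_set", pattern = "^mv_.*\\.txt$", full.names = TRUE)
con <- file("ratings.txt", "w")
for (f in files) {
  lines <- readLines(f)
  movie <- sub(":", "", lines[1], fixed = TRUE)      # "123:" -> "123"
  fields <- strsplit(lines[-1], ",", fixed = TRUE)
  writeLines(vapply(fields, function(r)
    paste(r[1], r[2], movie, r[3], sep = "|"),       # customer|rating|movie|date
    character(1)), con)
}
close(con)

Note that this writes four '|'-separated fields per line, so a scan() call reading it would need a matching four-element what= list; the session below instead reads a six-field file produced by the Ruby script.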
I'm getting different sizes and counts than in the video [any ideas?], but still impressed.
data: http://stackoverflow.com/questions/1407957/netflix-prize-dataset
Ruby Code from: http://www.syleum.com/2006/10/03/convert-netflix-prize-data-to-csv/
I added a '|' delimiter on line 11 of that script --> out = CSV::Writer.create(File.open('ratings.txt', 'w'), '|')
8GB RAM on Intel(R) Core(TM) i7-2620 CPU @ 2.70GHz
> system.time(x <- scan('C:/.../download/ratings.txt', what=list(integer(),numeric(),NULL,integer(),NULL,NULL), sep='|')[c(1,4,2)])
Read 51073786 records
   user  system elapsed
 302.95    5.12  309.87
> library('Matrix')
Loading required package: lattice
Attaching package: 'Matrix'
The following object(s) are masked from 'package:base':
det
> N <- sparseMatrix(i=x[[1]], j=x[[2]], x=x[[3]])
> object.size(N)
612957936 bytes #compare to 1188937848
> nnzero(N)
[1] 51073786
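> # base svd() first coerces the sparse matrix to a dense one -- too big here: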
> S <- svd(N, nu=5, nv=5)
Error in asMethod(object) :
Cholmod error 'problem too large' at file:../Core/cholmod_dense.c, line 106
> library('irlba')
> system.time(S <- irlba(N, nu=5, nv=5))
   user  system elapsed
  33.13    3.14   36.52
> plot(S$d, main="Five largest singular values")
Posted by: Fionn | June 03, 2011 at 19:29
Actually, it is not quite true that SVD was the algorithm used by BellKor and their partners to win the Netflix Prize.
They used (among other things) several regularized matrix factorization models that are somewhat similar to SVD, but not the same.
Standard SVD will not give you very good predictive accuracy in a rating prediction scenario.
Besides that: Nice video, good to see that you can work with rather large sparse matrices in R.
Posted by: Zeno | June 09, 2011 at 08:00
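For the curious, here is a minimal sketch of the kind of regularized matrix factorization Zeno describes, fit by stochastic gradient descent on simulated ratings (the rank, step size, and penalty below are illustrative choices, not BellKor's actual model):

set.seed(1)
n_users <- 100; n_items <- 50; k <- 5        # toy sizes and latent rank
obs <- data.frame(u = sample(n_users, 2000, replace = TRUE),
                  i = sample(n_items, 2000, replace = TRUE),
                  r = sample(1:5,     2000, replace = TRUE))
P <- matrix(rnorm(n_users * k, sd = 0.1), n_users, k)   # user factors
Q <- matrix(rnorm(n_items * k, sd = 0.1), n_items, k)   # item factors
eta <- 0.01; lambda <- 0.05                  # SGD step size, L2 penalty
for (epoch in 1:20) {
  for (t in sample(nrow(obs))) {             # visit observed ratings in random order
    u <- obs$u[t]; i <- obs$i[t]
    pu <- P[u, ]; qi <- Q[i, ]
    e <- obs$r[t] - sum(pu * qi)             # error on this rating only
    P[u, ] <- pu + eta * (e * qi - lambda * pu)
    Q[i, ] <- qi + eta * (e * pu - lambda * qi)
  }
}
# Predicted rating for user u and item i: sum(P[u, ] * Q[i, ])

The key difference from plain SVD is Zeno's point: only the observed ratings enter the loss, missing entries are never treated as zeros, and the lambda penalty keeps the factors from overfitting the entries that are present.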
Cool. I'm trying to get my hands on the dataset to try the different techniques. Does anybody have an idea where I can get it?
Posted by: earmijo | June 20, 2011 at 21:58
The dataset is still online in Asia (stackoverflow.com/questions/1407957/netflix-prize-dataset).
Also, the Netflix Prize data was not 65Gb in size: the training set was only ~2GB unzipped.
Posted by: Jay Julian Payne | July 13, 2013 at 07:40