
May 31, 2011



There is also the 'svd' package (http://cran.r-project.org/web/packages/svd/index.html), which provides several state-of-the-art SVD implementations. It should probably be faster than irlba, since the latter is pure R.

That's awesome.

Any guidance on converting the netflix download training_set directory to cmr.txt?
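
One possible approach to the conversion asked about above, as a minimal sketch only. It assumes the training_set directory holds one text file per movie, each starting with a "<movieID>:" header line followed by "customerID,rating,date" lines. The helper name convert_training_set and the customer|movie|rating output layout are assumptions made for illustration; the exact layout of cmr.txt is not given in this thread.

```r
# Hypothetical helper: flatten the per-movie files into a single
# pipe-delimited file with one "customer|movie|rating" line per rating.
convert_training_set <- function(in_dir, out_file) {
  files <- list.files(in_dir, pattern = "\\.txt$", full.names = TRUE)
  out <- file(out_file, open = "w")
  on.exit(close(out))
  for (f in files) {
    lines <- readLines(f)
    if (length(lines) < 2) next                    # header only, no ratings
    movie <- as.integer(sub(":$", "", lines[1]))   # header line, e.g. "1:"
    parts <- strsplit(lines[-1], ",", fixed = TRUE)
    customer <- vapply(parts, `[`, character(1), 1)
    rating   <- vapply(parts, `[`, character(1), 2)
    writeLines(paste(customer, movie, rating, sep = "|"), out)
  }
  invisible(length(files))
}
```

A file written this way could then be read with something like scan(out_file, what=list(integer(), integer(), numeric()), sep='|'). Reading every file with readLines is simple rather than fast, so expect it to take a while on the full data.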


I'm getting different sizes and counts (any ideas why?), but I'm still impressed.

data: http://stackoverflow.com/questions/1407957/netflix-prize-dataset
Ruby Code from: http://www.syleum.com/2006/10/03/convert-netflix-prize-data-to-csv/
added '|' delimiter to line 11 --> out = CSV::Writer.create(File.open('ratings.txt', 'w'),'|')

8GB RAM on Intel(R) Core(TM) i7-2620 CPU @ 2.70GHz

> x <- scan('C:/.../download/ratings.txt', what=list(integer(),numeric(),NULL,integer(),NULL,NULL), sep='|')[c(1,4,2)]
Read 51073786 records
user system elapsed
302.95 5.12 309.87

> library('Matrix')
Loading required package: lattice

Attaching package: 'Matrix'

The following object(s) are masked from 'package:base':


> N=sparseMatrix(i=x[[1]], j=x[[2]], x=x[[3]])
> object.size(N)
612957936 bytes #compare to 1188937848
> nnzero(N)
[1] 51073786
> S <- svd(N, nu=5, nv=5)
Error in asMethod(object) :
Cholmod error 'problem too large' at file:../Core/cholmod_dense.c, line 106
> library('irlba')
> S <- irlba(N, nu=5, nv=5)
user system elapsed
33.13 3.14 36.52
> plot(S$d, main="Five largest singular values")

Actually, it is not quite true that SVD was the algorithm that was used by BellKor and their partners to win the Netflix prize.

They used (among other things) several regularized matrix factorization models that are somewhat similar to SVD, but not the same.

Standard SVD will not give you very good predictive accuracy in a rating prediction scenario.

Besides that: Nice video, good to see that you can work with rather large sparse matrices in R.
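
To make the distinction above concrete, here is a minimal base-R sketch of a regularized matrix factorization fitted by stochastic gradient descent over the observed ratings only. It is a generic textbook-style illustration, not the BellKor models themselves. The toy ratings, the helper names fit_mf and predict_mf, and all parameter values (k, lambda, lr, epochs) are assumptions chosen for the example.

```r
# Toy ratings in (user, item, rating) triplet form; the values are made up.
ratings <- data.frame(
  user   = c(1, 1, 2, 2, 3, 3, 4),
  item   = c(1, 2, 1, 3, 2, 3, 1),
  rating = c(5, 3, 4, 1, 2, 5, 4)
)

# Hypothetical helper: model r_ui as mu + p_u . q_i and fit the factors by
# SGD with an L2 penalty (lambda). Unlike a plain SVD of the full matrix,
# missing entries are skipped instead of being treated as zeros.
fit_mf <- function(ratings, k = 2, lambda = 0.05, lr = 0.05,
                   epochs = 500, seed = 1) {
  set.seed(seed)
  n_users <- max(ratings$user)
  n_items <- max(ratings$item)
  mu <- mean(ratings$rating)                       # global mean baseline
  P <- matrix(rnorm(n_users * k, sd = 0.1), n_users, k)
  Q <- matrix(rnorm(n_items * k, sd = 0.1), n_items, k)
  for (epoch in seq_len(epochs)) {
    for (idx in sample(nrow(ratings))) {           # shuffled pass over ratings
      u <- ratings$user[idx]
      i <- ratings$item[idx]
      err <- ratings$rating[idx] - (mu + sum(P[u, ] * Q[i, ]))
      p_old <- P[u, ]                              # keep old p_u for q_i update
      P[u, ] <- P[u, ] + lr * (err * Q[i, ] - lambda * P[u, ])
      Q[i, ] <- Q[i, ] + lr * (err * p_old - lambda * Q[i, ])
    }
  }
  list(mu = mu, P = P, Q = Q)
}

# Predicted rating for each (user, item) pair.
predict_mf <- function(model, user, item) {
  model$mu + rowSums(model$P[user, , drop = FALSE] *
                     model$Q[item, , drop = FALSE])
}

model <- fit_mf(ratings)
pred  <- predict_mf(model, ratings$user, ratings$item)
rmse  <- sqrt(mean((ratings$rating - pred)^2))     # training error only
```

The penalty term lambda shrinks the factors toward zero, which is what keeps such models from overfitting users or movies with very few ratings. A real evaluation would measure RMSE on held-out ratings rather than on the training set as done here.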

Cool. I'm trying to get my hands on the dataset to try the different techniques. Does anybody have an idea where I can get it?

The dataset is still online in Asia (stackoverflow.com/questions/1407957/netflix-prize-dataset).

Also, the Netflix Prize data was not 65 GB in size; the training set was only ~2 GB unzipped.

