June 07, 2011

Comments

It seems like maybe the horizontal clusters are due to the fact that income varies on a much larger scale than age. Without renormalizing, the distance measures might be dominated by income.

Does this seem right? If so, I might think this effect is more an artifact of the scale than a real pattern in the data. Renormalizing so that each variable is on the same scale might be reasonable.

I'm asking without any prior knowledge of k-means or the functions you're using, so I apologize if my ignorance is resulting in a silly question.

Thanks,
David
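
The scaling concern raised in the comment above can be checked with a small experiment. What follows is a minimal sketch in base R, not the RevoScaleR code from the post: the data are synthetic and invented purely for illustration, base `kmeans()` and `scale()` stand in for whatever the post used, and the helper `age_spread` is a hypothetical name introduced here. It compares how far apart the cluster centers sit along the age axis with and without standardizing the columns.

```r
# Synthetic data, invented for illustration only (not the data set from the post).
set.seed(42)
n <- 1000
age <- runif(n, min = 20, max = 65)           # varies over tens of years
incwage <- runif(n, min = 0, max = 100000)    # varies over tens of thousands
x <- cbind(age = age, incwage = incwage)

# k-means on the raw values: Euclidean distance is dominated by incwage.
raw_fit <- kmeans(x, centers = 4, nstart = 10)

# k-means after centering each column and scaling it to unit standard deviation.
scaled_fit <- kmeans(scale(x), centers = 4, nstart = 10)

# Hypothetical helper: spread of the cluster centers along age,
# expressed as a fraction of the observed age range.
age_spread <- function(centers_age) {
  diff(range(centers_age)) / diff(range(age))
}

# Map the scaled centers back to years so both fits are compared in the same units.
scaled_centers_age <- scaled_fit$centers[, "age"] * sd(age) + mean(age)

raw_spread <- age_spread(raw_fit$centers[, "age"])
scaled_spread <- age_spread(scaled_centers_age)

print(c(raw = raw_spread, scaled = scaled_spread))
```

If the unscaled centers barely differ in age while the scaled ones do, the horizontal bands are more plausibly an effect of the units than a pattern in the data. Whether that holds for the data in the post would need to be checked on that data.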

Can you explain what the "~ age + incwage" argument represents, and how we can get it to run kmeans on the elements in the vectors?

Also, what format is dataFile in? Can we simply use an existing XDF file (created from a CSV by import) instead of including the following lines?
rxGetInfoXdf(dataFile, getVarInfo=TRUE)
rxDataStepXdf(inFile=dataFile, outFile="AgeInc",
    varsToKeep=c("age","incwage"), overwrite=TRUE)

Thanks.

Can we use it to cluster 30k points in a 49-dimensional space to find 4-5 clusters?
I do not mind if it takes a long time (on the order of 1-2 days), as long as it gives a good result.
I use a quad-core PC with 16MB RAM.
Thanks

By 30k points I mean 30,000 vectors of size 49! ;)
