
June 07, 2011



It seems like maybe the horizontal clusters are due to the fact that income varies on a much larger scale than age. Without renormalizing, the distance measures might be dominated by income.

Does this seem right? If so, I might think this effect is more an artifact of the scale than a real pattern in the data. Renormalizing so that each variable is on the same scale might be reasonable.

I'm asking without any prior knowledge of k-means or the functions you're using, so I apologize if my ignorance is resulting in a silly question.
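The scale effect described in this comment can be checked on simulated data. The following is a minimal sketch using base R's kmeans() and scale(), not the rx functions or the census data from the post; the variable ranges and the two age groups are assumptions chosen only for illustration.

```r
# Simulated data: two age groups, with income independent of age.
# The ranges are illustrative assumptions, not values from the post's data.
set.seed(42)
n <- 200
age <- c(rnorm(n, mean = 25, sd = 3), rnorm(n, mean = 60, sd = 3))
incwage <- runif(2 * n, min = 0, max = 100000)
dat <- data.frame(age = age, incwage = incwage)
true_group <- rep(1:2, each = n)

# k-means on the raw variables: squared distances are dominated by incwage.
raw_fit <- kmeans(dat, centers = 2, nstart = 25)

# k-means after standardizing each column to mean 0 and standard deviation 1.
scaled_fit <- kmeans(scale(dat), centers = 2, nstart = 25)

# Agreement with the age groups, allowing for arbitrary cluster labels.
agreement <- function(cluster, truth) {
  a <- mean(cluster == truth)
  max(a, 1 - a)
}
raw_agreement <- agreement(raw_fit$cluster, true_group)
scaled_agreement <- agreement(scaled_fit$cluster, true_group)
```

Comparing raw_agreement with scaled_agreement shows how much of the age structure each fit recovers. Whether standardizing is appropriate for the real data is a modeling decision; this only illustrates the mechanism.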


Can you explain what the "~ age + incwage" argument represents, and how we can get it to run kmeans on the elements in the vectors?

Also, what format is dataFile in, and can we simply use an existing xdf (created from a csv using import) instead of including these lines?
rxGetInfoXdf(dataFile, getVarInfo=TRUE)
rxDataStepXdf(inFile=dataFile, outFile="AgeInc",
              varsToKeep=c("age","incwage"), overwrite=TRUE)
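On the formula question: in R generally, a one-sided formula such as ~ age + incwage has no response on the left-hand side and simply names the variables to use. The sketch below is a base-R analogy on a hypothetical toy data frame, not the rx code from the post, and it does not show how the rx functions process the formula internally.

```r
# Hypothetical toy data frame; the extra column is ignored by the formula.
df <- data.frame(
  age     = c(23, 35, 47, 52, 61, 29),
  incwage = c(18000, 42000, 55000, 61000, 30000, 25000),
  sex     = c("F", "M", "F", "M", "F", "M")
)

# A one-sided formula: no response, just the variables to use.
f <- ~ age + incwage
vars <- all.vars(f)  # names of the variables referenced by the formula

# model.frame() evaluates the formula against the data and keeps
# only the columns the formula mentions.
mf <- model.frame(f, data = df)

# Base R's kmeans() takes a numeric matrix or data frame rather than a
# formula, so the selected columns are passed directly (standardized here).
set.seed(1)
fit <- kmeans(scale(mf), centers = 2)
```

Each row of mf is one observation (age, incwage), and it is these rows that get assigned to clusters.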


Can we use it to cluster 30k points in 49-dimensional space to find 4-5 clusters?
I do not mind if it takes a long time (on the order of 1-2 days), as long as it gives a good result.
I use a quad-core PC with 16MB RAM.

30k points = 30,000 vectors of size 49! ;)
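For a sense of scale, a 30,000 x 49 matrix of doubles occupies 30000 * 49 * 8 bytes, about 11.8 MB. The following is a rough sketch with base R's kmeans() on simulated data, not the rx function from the post; the five Gaussian blobs and all parameter values are assumptions made only to have something to cluster. Timing on the real data can be measured the same way.

```r
# Simulated stand-in for the data: 30,000 points in 49 dimensions drawn
# around 5 random centers (an assumption purely for illustration).
set.seed(123)
n <- 30000
p <- 49
k <- 5
true_centers <- matrix(rnorm(k * p, sd = 3), nrow = k)
labels <- sample(k, n, replace = TRUE)
x <- true_centers[labels, ] + matrix(rnorm(n * p), nrow = n)

# Size of the data matrix in bytes.
x_bytes <- as.numeric(object.size(x))

# Several random starts guard against a poor local optimum;
# system.time() reports how long the whole fit takes.
timing <- system.time(
  fit <- kmeans(x, centers = k, nstart = 5, iter.max = 50)
)

cluster_sizes <- table(fit$cluster)
```

If the 49 variables are on different scales, standardizing them first (for example with scale(x)) may be worth considering, and trying both 4 and 5 for k and comparing fit$tot.withinss is one simple way to choose between them.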

