by Luba Gloukhov
The first time I had an Islay single malt, my mind was blown. In my first foray into the world of whiskies, I took the plunge into the smokiest, peatiest beast of them all — Laphroig. That same night, dreams of owning a smoker were replaced by the desire to roam the landscape of smoky single malts.
As an Islay fan, I wanted to investigate whether distilleries within a given region do in fact share taste characteristics. For this, I used a dataset profiling 86 distilleries based on 12 flavor categories.
The data was obtained from https://www.mathstat.strath.ac.uk/outreach/nessie/nessie_whisky.html.
whiskies <- read.csv("data/whisky/whiskies.txt", row.names = 1, stringsAsFactors = FALSE)
I first went ahead and ensured that the dataset had no missing observations. I generated a subset of the data that included only the 12 flavor variables, rescaled for comparability using scale().
sum(is.na(whiskies)) # no missing observations
## [1] 0
whiskies_k <- scale(whiskies[2:13]) # rescale selected vars for kmeans
K-means clustering assigns each observation membership to one of k clusters in such a way that minimizes the distance between each observation and it's cluster's mean. K-means clustering requires us to specify the number of clusters. Below, we iterate through kmeans() with clusters argument varying from 1 to maxCluster and plot the within groups sum of squares for each iteration.
ssPlot <- function(data, maxCluster = 9) {
# Initialize within sum of squares
SSw <- (nrow(data) - 1) * sum(apply(data, 2, var))
SSw <- vector()
for (i in 2:maxCluster) {
SSw[i] <- sum(kmeans(data, centers = i)$withinss)
}
plot(1:maxCluster, SSw, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
}
ssPlot(whiskies_k)
Naturally, the within groups sum of squares decreases as we increase the number of clusters. However, there is a trend of diminishing marginal returns as we increase the number of clusters. I select the number of clusters based on the point at which the marginal return of adding one more cluster is less than was the marginal return for adding the clusters prior to that.
fit <- kmeans(whiskies_k, 4) # 4 cluster solution
# append cluster assignment
whiskies <- data.frame(whiskies, fit$cluster)
whiskies$fit.cluster <- as.factor(whiskies$fit.cluster)
Cluster centers can inform on how taste profiles differ between clusters.
fit$centers
## Body Sweetness Smoky Medicinal Tobacco Honey Spicy Winey
## 1 -0.6255 0.10478 -0.39342 -0.1825 -0.3606 -0.3720 -0.3647 -0.5765
## 2 -0.2541 0.05944 -0.04039 0.1214 2.7407 -0.2862 0.3606 0.2036
## 3 0.5113 0.05944 0.04733 -0.3071 -0.3606 0.7438 0.3220 0.7721
## 4 1.7163 -1.10234 2.46845 2.8149 1.7069 -1.2630 0.3606 -0.5111
## Nutty Malty Fruity Floral
## 1 -0.2395 -0.2673 0.06586 0.29654
## 2 -0.3632 0.5792 -0.38788 0.15866
## 3 0.4297 0.3624 0.13698 -0.07171
## 4 -0.3632 -0.7455 -0.81553 -1.79062
Based on these centers, I anticipate that my love for the full bodied, smoky and medicinal lies in cluster 4.
subset(whiskies, fit.cluster == 4)
## Distillery Body Sweetness Smoky Medicinal Tobacco Honey Spicy Winey
## 4 Ardbeg 4 1 4 4 0 0 2 0
## 22 Caol Ila 3 1 4 2 1 0 2 0
## 24 Clynelish 3 2 3 3 1 0 2 0
## 58 Lagavulin 4 1 4 4 1 0 1 2
## 59 Laphroig 4 2 4 4 1 0 0 1
## 78 Talisker 4 2 3 3 0 1 3 0
## Nutty Malty Fruity Floral Postcode Latitude Longitude fit.cluster
## 4 1 2 1 0 \tPA42 7EB 141560 646220 4
## 22 2 1 1 1 \tPA46 7RL 142920 670040 4
## 24 1 1 2 0 \tKW9 6LB 290250 904230 4
## 58 1 1 1 0 PA42 7DZ 140430 645730 4
## 59 1 1 0 0 PA42 7DU 138680 645160 4
## 78 1 2 2 0 IV47 8SR 137950 831770 4
I identified the most representative whisky of each cluster by seeking out the observation closest to the center based on all 12 variables.
whiskies_r <- whiskies[c(2:13, 17)]
# extract just flavor variables & cluster
candidates <- by(whiskies_r[-13], whiskies_r[13], function(data) {
# we apply this function to observations for each level of fit.cluster
dists <- sapply(data, function(x) (x - mean(x))^2)
# for each variable, calc each observation's deviation from average of the
# variable across observations
dists <- rowSums(dists)
# for each observation, sum the deviations across variables
rownames(data)[dists == min(dists)]
# obtain the row number of the smallest sum
})
candidates <- as.numeric(unlist(candidates))
whiskies[candidates, ]
## Distillery Body Sweetness Smoky Medicinal Tobacco Honey Spicy Winey
## 42 Glenallachie 1 3 1 0 0 1 1 0
## 70 RoyalBrackla 2 3 2 1 1 1 2 1
## 1 Aberfeldy 2 2 2 0 0 2 1 2
## 4 Ardbeg 4 1 4 4 0 0 2 0
## Nutty Malty Fruity Floral Postcode Latitude Longitude fit.cluster
## 42 1 2 2 2 AB38 9LR 326490 841240 1
## 70 0 2 3 2 IV12 5QY 286040 851320 2
## 1 2 2 2 2 \tPH15 2EB 286580 749680 3
## 4 1 2 1 0 \tPA42 7EB 141560 646220 4
The dataset contains coordinates that I used to investigate how flavor profiles differ geographically. The dataset's Latitude and Longitude variables are coordinates defined according to Great Britain's Ordnance Survey National Grid reference system. I converted the coordinates to standard latitude and longitude in order to plot them using ggmap.
library(maptools)
## Loading required package: sp Checking rgeos availability: TRUE
library(rgdal)
## rgdal: version: 0.8-14, (SVN revision 496) Geospatial Data Abstraction
## Library extensions to R successfully loaded Loaded GDAL runtime: GDAL
## 1.10.1, released 2013/08/26 Path to GDAL shared files:
## C:/Revolution/R-Enterprise-6.2/R-2.15.3/library/rgdal/gdal GDAL does not
## use iconv for recoding strings. Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March
## 2012, [PJ_VERSION: 480] Path to PROJ.4 shared files:
## C:/Revolution/R-Enterprise-6.2/R-2.15.3/library/rgdal/proj
whiskies.coord <- data.frame(whiskies$Latitude, whiskies$Longitude)
coordinates(whiskies.coord) = ~whiskies.Latitude + whiskies.Longitude
proj4string(whiskies.coord) = CRS("+init=epsg:27700") # Specify that our coords are in osgb grid coord
whiskies.coord <- spTransform(whiskies.coord, CRS("+init=epsg:4326")) # spTransform to convert osgb grid to lat/lon
whiskies <- cbind(whiskies, whiskies.coord)
Alternatively, the ggmap package ships with a geocode function which uses Google Maps to determine the lat/lon based on a character string specifying the location.
library("ggmap")
## Loading required package: ggplot2
whiskies <- cbind(whiskies, geocode(paste(whiskies$Location, "Scotland", sep = " ,")))
whiskyMap <- qmap(location = "Scotland", zoom = 6, legend = "topleft", maptype = "terrain",
color = "bw", darken = 0.5)
## Map from URL :
## http://maps.googleapis.com/maps/api/staticmap?center=Scotland&zoom=6&size=%20640x640&scale=%202&maptype=terrain&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL :
## http://maps.googleapis.com/maps/api/geocode/json?address=Scotland&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
whiskyMap + geom_point(data = whiskies, aes(x = whiskies.Latitude, y = whiskies.Longitude,
colour = fit.cluster, size = 2))
I zoomed in and examine which Distilleries lie within the Islay region.
whiskyMap <- qmap(location = "Islay", zoom = 10, legend = "topleft", maptype = "terrain",
color = "bw", darken = 0.5)
## Map from URL :
## http://maps.googleapis.com/maps/api/staticmap?center=Islay&zoom=10&size=%20640x640&scale=%202&maptype=terrain&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
## Information from URL :
## http://maps.googleapis.com/maps/api/geocode/json?address=Islay&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms
whiskyMap + geom_point() + geom_text(data = whiskies, aes(x = whiskies.Latitude,
y = whiskies.Longitude, label = Distillery, color = fit.cluster, face = "bold"))
## Warning: Removed 78 rows containing missing values (geom_text).
The results indicate that there is a lot of variation in flavor profiles within the different scotch whisky regions. Note that initial cluster centers are chosen at random. In order to replicate the results, you will need to run the following code before your analysis.
set.seed(1)
Further data analysis would be required to determine whether proximity to types of water sources or terrain types drive common flavor profiles. This could be done by obtaining shape files and adding them as an additional layer to the ggmap plot.
For me, I have identified my next to-try single malt. Talisker is still within the familiar realm of cluster 4 but a little more malty, fruity and spicy. Sounds like the perfect holiday mix.
Great post. I took the liberty to build a little app around it (just for a demo). http://www.econinfo.de/2013/12/31/clustering-whiskies-by-taste/
Posted by: Ojessen | December 31, 2013 at 12:35
Very nice analysis and graphics. We had a bottle of 18 year old Talisker at my wedding and it was wonderful; pleasant amount of peat with a lot of complexity underneath. Hope you enjoy :)
Posted by: www.google.com/accounts/o8/id?id=AItOawmS-r_AzCinRaXkJU4Rak-2zBBN6J7XALo | December 31, 2013 at 12:42
Congratulations!
You correctly discovered that Talisker is, in fact, the perfect Whisky.
;-)
All kidding aside, this is awesome work and just the right mixture of nerdiness and tastiness.
Posted by: Fidepus | January 01, 2014 at 03:37
Nice work! Looking forward to playing around with your code examples.
I'm assuming the data set is several years old, since it doesn't have Kilchoman on Islay. They opened in 2005, and while I haven't tried all their expressions, their Machir Bay should fit well in cluster 4. (And I wonder if flavor profiles for any of these 86 scotches have evolved/changed since the data was put together?)
But ah yes, Talisker! It was to me what Laphroaig was to you--the one that first introduced me to the wonders of peat. You're going to love it!
Posted by: Jeff Jetton | January 01, 2014 at 05:46
Playing around with the number of clusters, it's amazing how stable the Ardberg-cluster is, it has the same members with 4 or 9 clusters.
Posted by: Ojessen | January 01, 2014 at 09:40
Yes, Talisker is your next stop. Either that, or you need to explore the landscape in a different way, to bound your tastes (ie go as far away as possible, finding something you dislike, and closing in on the boundaries of your taste - perhaps, to achieve that, you may need to go well beyond Scotch, if it is possible that you like all Scotches or even all whisky/ey).
Posted by: BenK | January 01, 2014 at 10:13
Hi,
great job :-p and it happens at the right time!! :-D
I am a little sad, because I'm not able to read end the command line in almost all cases!
Posted by: ElCep | January 01, 2014 at 11:51
It'd be interesting to have tasting notes for all expressions for one distillery and do the same thing for them - how are they grouped, or does the additional ageing (or dare I say, blending) change their categories?
Posted by: Peter | January 02, 2014 at 00:53
Awesome, I ran into this while drinking a Laphroaig. Funny how all the bottles I have right now are peaty. Laphroaig, Talisker, Lagavulin.
Posted by: Bryan | January 02, 2014 at 03:56
There is at least one other scotch tasting site with hundreds (possibly thousands) of individuals' tasting notes with which I believe a similar analysis could be performed. I know the site's admin personally and even suggested he consider a "genome"-type analysis such as the Music Genome Project a few years ago to characterize the subjective experiences in a good dram. If interested in a introduction to the admin, hit me with an email.
Posted by: Evan | January 02, 2014 at 05:46
This analysis will not tell me anything about the difference between Ledaig and Tobermory. Who knows how it would choke on all the outputs of Bruichladdich over the last decade!
Garbage in, garbage out. First off, Laphroaig is not *just* peaty; it's also got strong notes of iodine, seaweed and such. And that's just the house style. The difference between, say, Laph 10 OB, Laph new make, and a sherry finished independent bottling of Laphroaig at 25 years is wildly different even within that house style. At least you figured out that Islay malts might be a bit smoky, so yay that I guess.
(The main way one figures out peat levels in a bottle of malt is the peating level as measured in ppm phenol. Do that and you'll see a closer correlation between sensed peat and ppm phenols. This is not going to be a perfect correlation, as aged malt tends to whack you less with the peat reek than younger malts.)
Secondly, the data is not granular enough and is of unknown provenance. There is a list of malts with a series of categories rated 0-4. Which bottlings of the distillery? Who did the sensory analysis? Why those categories? I realize that you're just experimenting with a readily available data set, but you'll miss out if you rely on it exclusively. The data sets from the scorings at the Malt Maniacs's website (here) is better as a data set, though it's twelve evaluators only and there's only an aggregate score without location (hint: location is not necessarily causation). Another option would be the word clouds that pop up on this nose's website (here, and make sure to check the SGP scores for more useful details).
If you like Laphroaig, then Islay malts might be up your alley. If you just want to get hit in the head with a brick of dried peat, then try younger peated malts, and don't forget off-island ones like Ledaig or weird ones like Lost Spirits' Leviathan or Balcones Brimstone. Perhaps this analysis can indicate the limitations of the "plug and chug" method before someone tries to do this with, say Abstract Expressionist art....
Posted by: Chap | January 02, 2014 at 05:54
Another lesson implied by this analysis: It's better sometimes to have someone who knows the subject deeply get a quant to crunch data than the other way around. You should see some of the odd number crunching the Maniacs and related friends do...
Posted by: Chap | January 02, 2014 at 05:57
I think you would like to have a look at http://www.amazon.de/gp/aw/d/B0092L4ZXQ/
Posted by: Kristian Köhntopp | January 02, 2014 at 07:18
My friend and I liked Bunnahabhain. It was just the regular one. I am sure the 12 and 25 year olds are better.
Posted by: Shashi | January 02, 2014 at 13:46
Shouldn't we start with the question, "Are there clusters of whiskies by flavor at all?" before asking what characteristics each cluster has and the geographical distributions of them? Nothing you've shown refutes the null hypothesis that each flavor component is independently distributed amongst the whiskies. If that's true, then the whole cluster analysis is a waste of time, no? Something as simple as a multi-dimensional scaling plot of all the whiskies should quickly reveal whether there appears to be evidence of clustering or not.
Posted by: N | January 03, 2014 at 17:45
Evan--I would love to have a look at that site, whether or not the data are used in this manner. I'm currently expanding this (fantastic) idea to use a fuzzy-c means clustering (so each scotch is not locked into only one cluster), and would love to access that dataset.
When I finish that code (if I ever get to it) I'll post a link here to the source for everyone, though it will probably be in Python instead of R. Thanks Luba for a great post/idea (I'll cite you on all future work of course)!
Posted by: Andy K | January 04, 2014 at 10:45
This is very nice. I tried this using spherical k-means, which uses the cosine distance, instead of kmeans. The cosine distance might make more sense if you want to match each of those 12 qualities. Choosing k=4 clusters and your criteria of full bodied, smoky and medicinal, I got the same six whiskies you get plus a few more. This is the full list:
Ardbeg Balblair Bowmore Bruichladdich Caol Ila
Clynelish GlenGarioch GlenScotia HighlandPark Isle of Jura Lagavulin Laphroig Oban OldFettercairn OldPulteney
Springbank Talisker Teaninich Tormore
The mean for this group was Isle of Jura. The grouping does not seem to depend on geo coordinates. I don't drink whisky but someone who does might be able to make sense of this.
Posted by: Meena | January 04, 2014 at 15:41
My favourite for many years was Lagavulin, and I've always loved the Islay malts, so when I visited Scotland in 2007 I was utterly surprised to discover Mortlach, from Dufftown, eclipsed it for me.
So we bought a bottle of Mortlach and brought it home to New Zealand for some theoretical special occasion that was never quite good enough and a few days ago I was looking at it, still unopened after seven years in the cupboard.
In a few months we're moving to the other side of the world with Scotland only a short ferry journey away. Friends were around so I decided to crack it open and it is just absolutely as I remember it, once again blitzing all memory of Bowmores, Laphroig, Talisker, Scapa or even Lagavulin.
Posted by: Andrew McMillan | January 06, 2014 at 02:52
I suspect the smokier "Talisker Storm" would be a better fit for cluster 4, and probably more in line with your other tastes.
Posted by: Antony | January 06, 2014 at 20:25
Nice stats! I have a bottle of coal ila, really good whisky! So just try the 12 year old, good quality for little money
Posted by: Harro | January 19, 2014 at 12:40