AT&T has recently announced it will no longer offer unlimited data plans for new iPhone users in the US, and now some carriers in the UK have followed suit. In each case, the providers claim that only a very small number of users actually use enough data to warrant an unlimited plan, and most users use relatively little and will benefit from the cheaper, capped plans. But what does "most users" mean? Journalist Charles Arthur of the Guardian, in an article about the impact of the data caps, illustrated the distribution of data usage amongst mobile users with this chart, created in R:

Reader Barry Rowlingson thought this chart looked odd (is that really the 97th percentile?), but given that the chart lacks axis labels, it's hard to make sense of anyway. So he set out to recreate the chart: given the mean (200 MB) and one quantile (the 97% quantile is 500 MB), he worked out the unstated standard deviation of the Normal distribution (about 159 MB) and recreated the graph with axis labels:

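# Grid of usage values (MB) over which to draw the density
ss <- seq(0, 700, length.out = 100)

# Normal density with mean 200 MB and standard deviation 159 MB
plot(ss, dnorm(ss, mean = 200, sd = 159), col = "red", type = "l")

# Mark the 97% quantile at 500 MB
abline(v = 500, col = "black")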

I added the black vertical line to indicate the 97th percentile at 500 MB. Anyway, it's nice that R is being used to illustrate statistical concepts like this; it's just a shame that the chart wasn't quite right. Thanks to Barry for providing this alternative.
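The standard deviation quoted above can be recovered from the two published numbers. This is a minimal sketch, assuming (as the recreated chart does) that usage is Normally distributed; the variable names are illustrative only.

```r
# Assumed inputs, taken from the post: mean usage and the 97% quantile (in MB).
mu  <- 200
q97 <- 500

# For a Normal distribution, q97 = mu + z * sd, where z = qnorm(0.97).
z      <- qnorm(0.97)
sd_est <- (q97 - mu) / z

sd_est                               # roughly 159.5
pnorm(q97, mean = mu, sd = sd_est)   # recovers 0.97
```

The result only says what standard deviation a Normal curve would need to match those two numbers; it says nothing about whether the Normal assumption is reasonable.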

The Guardian: Why file-sharing has killed 'unlimited' mobile data contracts

Cellphone data usage is almost certainly a fat-tailed distribution, not a normal one. The second graph says absolutely nothing about the first, and shame on this article for promoting the idea that every distribution is normal.

I hate the cellphone companies too, but this is silly.

Bill Mill

Posted by: Bill Mill | June 16, 2010 at 08:46

It would be great to see the data, because this cannot be normal and must be fat-tailed (as Bill stated): with that SD, the 99% quantile is less than 700, not the 800 shown in the chart.
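That figure is easy to check in R. A minimal sketch, assuming the mean of 200 MB and the standard deviation of 159 MB from the post:

```r
# 99% quantile of a Normal distribution with the post's mean and SD (in MB).
q99 <- qnorm(0.99, mean = 200, sd = 159)
q99   # about 570, well short of 800
```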

Posted by: John Christie | June 16, 2010 at 09:09

Not to mention that the probability of using less than 0 MB of data is 0. With an "average" of 200 MB and a standard deviation of 159 MB, a normal distribution is going to be significantly wrong. My guess is a gamma distribution, but others could make sense.

Also, I'd take the "average" of 200 MB as the mode, not the mean, given the graph.

Posted by: Nick N | June 16, 2010 at 09:20

You can't model these data by a normal distribution. Note that your model implies that a large number of cellphone users use NEGATIVE megabytes. The real usage is bounded below by zero, but is unbounded above. Simple models for this situation include lognormal and gamma.
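Both points can be put into numbers. The sketch below first computes how much probability the Normal model places below zero, then matches a lognormal to the same two published figures (mean 200 MB, 97% quantile 500 MB). Treating 200 MB as the mean rather than the mode, and choosing the lognormal over the gamma, are assumptions made purely for illustration.

```r
mu   <- 200   # mean usage from the post (MB)
q97  <- 500   # 97% quantile from the post (MB)
sdev <- 159   # standard deviation implied by the Normal model

# Share of users the Normal model places below zero usage.
p_neg <- pnorm(0, mean = mu, sd = sdev)
p_neg   # about 0.10

# Lognormal with the same mean and 97% quantile:
#   mean = exp(meanlog + sdlog^2 / 2)
#   q97  = exp(meanlog + qnorm(0.97) * sdlog)
# Substituting the first equation into the second leaves one unknown, sdlog.
z <- qnorm(0.97)
f <- function(s) (log(mu) - s^2 / 2 + z * s) - log(q97)

# The equation has two roots; the interval picks the smaller, less extreme one.
sdlog   <- uniroot(f, interval = c(0.01, 1), tol = 1e-9)$root
meanlog <- log(mu) - sdlog^2 / 2

qlnorm(0.97, meanlog, sdlog)   # 500, as required
plnorm(0, meanlog, sdlog)      # 0: no negative usage
```

Two summary numbers cannot identify the true shape of the distribution, so this only shows that a model bounded below by zero can reproduce the published figures.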

Posted by: Rick | June 16, 2010 at 13:02

Use "Empirical Cumulative Distribution Function", ecdf() or something instead.
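As a rough illustration of that suggestion, here is a minimal sketch using `ecdf()`. The real usage data are not available, so the `usage` vector below is simulated and entirely hypothetical; only the function calls are the point.

```r
set.seed(1)

# Hypothetical data: 1000 simulated monthly usage figures (MB).
# The lognormal parameters are arbitrary and chosen only for illustration.
usage <- rlnorm(1000, meanlog = 5, sdlog = 0.6)

# ecdf() returns a step function giving the share of observations <= x.
Fn <- ecdf(usage)

Fn(500)                 # empirical share of users at or below 500 MB
quantile(usage, 0.97)   # empirical 97% quantile

plot(Fn, main = "Empirical CDF",
     xlab = "Monthly data usage (MB)", ylab = "Cumulative share of users")
```

With real data, the empirical CDF makes statements like "97% of users consume less than 500 MB" directly readable from the chart, with no distributional assumption.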

I use about 18 GB of mobile data monthly with a 7 Mbit/s modem. New modems are 21 Mbit/s. Theoretically my plan is capped at 3 GB/month, but the operator doesn't seem to care. This costs 13 euros a month.

Posted by: N. N. | June 16, 2010 at 14:47

hmmmm....

It's highly likely (i.e. it's a regular mistake) that the top distribution is a fit to data with some super-user peaks that are not apparent (should have loessed it).

So the 97th percentile is (probably) in the right place, but the curve itself is a poor representation.

Posted by: Stephen | June 17, 2010 at 02:03

Agree with the above comments: this is not a normal distribution. The leftmost bound of the x-axis should be 0 (not negative), and using the standard deviation of a normal distribution is not correct. Need to use another distribution...

-Ralph Winters

Posted by: Ralph Winters | June 18, 2010 at 12:55