With so many more devices and instruments connected to the "Internet of Things" these days, there's a lot more time series data available to analyze. But time series are typically quite noisy: how do you distinguish a short-term tick up or down from a true change in the underlying signal? To solve this problem, Twitter created the BreakoutDetection package for R, which decomposes a time series into a sequence of segments, each of one of three types:
- Steady state: The time series follows a fixed mean (with random noise around the mean);
- Mean shift: The time series jumps directly from one steady state to another;
- Ramp up / down: The time series transitions linearly from one steady state to another, over a fixed period of time.
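To make the three segment types concrete, here is a minimal Python sketch that builds a synthetic series containing each of them. It is an illustration only and is not part of the BreakoutDetection package; the helper names `steady` and `ramp`, the noise level, and the segment lengths are all assumptions chosen for the example.

```python
import random

def steady(mean, n, noise=1.0, rng=random):
    """Steady state: n points scattered around a fixed mean."""
    return [mean + rng.gauss(0, noise) for _ in range(n)]

def ramp(start, end, n, noise=1.0, rng=random):
    """Ramp up / down: n points moving linearly from start toward end."""
    step = (end - start) / (n + 1)
    return [start + step * (i + 1) + rng.gauss(0, noise) for i in range(n)]

rng = random.Random(42)  # fixed seed so the example is repeatable

series = (
    steady(10, 50, rng=rng)     # steady state around 10
    + steady(20, 50, rng=rng)   # mean shift: jumps directly to a new level of 20
    + ramp(20, 5, 30, rng=rng)  # ramp down from 20 to 5 over 30 points
    + steady(5, 50, rng=rng)    # new steady state around 5
)

print(len(series))
```

Plotting `series` should show a flat noisy band, an abrupt jump, a gradual slide, and a second flat band: the kinds of transitions a breakout detector is meant to locate.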
Given a univariate time series (and a few tuning parameters), the breakout function will return a list of breakout points: times when these state transitions are detected. It uses a non-parametric algorithm (E-Divisive with Medians) to detect the breakout points, so no assumptions are made about the underlying distribution of the time series.
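To give a feel for why medians are useful here, the sketch below finds a single mean shift by trying every split point and keeping the one where each side is best described by its own median. This is a deliberately simplified stand-in, not the E-Divisive with Medians algorithm and not the package's `breakout` function; the names `l1_cost`, `find_single_shift`, and `min_size` are hypothetical and exist only for this example.

```python
import random
from statistics import median

def l1_cost(segment):
    """Sum of absolute deviations of a segment from its own median."""
    m = median(segment)
    return sum(abs(x - m) for x in segment)

def find_single_shift(series, min_size=10):
    """Return the split index with the lowest total L1 cost.

    Returns None when the series is too short to hold two segments
    of at least min_size points each.
    """
    n = len(series)
    if n < 2 * min_size:
        return None
    return min(
        range(min_size, n - min_size + 1),
        key=lambda t: l1_cost(series[:t]) + l1_cost(series[t:]),
    )

# Synthetic data (an assumption for the demo): the level moves
# from 10 to 20 at index 60, with Gaussian noise around each level.
rng = random.Random(0)
series = [10 + rng.gauss(0, 1) for _ in range(60)]
series += [20 + rng.gauss(0, 1) for _ in range(40)]

print(find_single_shift(series))
```

Because the median and absolute deviations are used instead of the mean and squared errors, a few extreme spikes have limited influence on where the split lands. Like any non-parametric approach, the sketch assumes nothing about the noise distribution, but it handles only one shift and no ramps.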
Twitter uses this R package to monitor the user experience on the Twitter network and detect when things are "Breaking Bad". Data scientist Randy Zwitch used the package to identify the dates of blog posts or references on Hacker News from his blog traffic data. (He also compared the algorithm to anomaly detection with the Adobe Analytics API.) And the University of Louisville School of Medicine has looked at using the package to identify past influenza outbreaks from CDC data.
For more information about the BreakoutDetection package, check out Twitter's blog post linked below. You can download the BreakoutDetection R package itself from GitHub.
Twitter Engineering blog: Breakout detection in the wild (via FlowingData)