"Big Data" generates a lot of news these days, but sometimes small data still means big computation.
Indiana's Department of Workforce Development has the responsibility to forecast future employment rates in the State of Indiana. And not just the number of jobs available: the department also needs to forecast the types of jobs that will be available, so the administration can link up future job requirements with training and education policies and make sure a skilled workforce can fill the future demand.
The State of Indiana contracted with analytics professional services firm Inquidia Consulting to develop a system to generate the forecast, as part of the "Demand-Driven Workforce System" project. After assembling a database of employment data, the team quickly found that this wasn't exactly a big-data problem: each forecast was based on just 1KB to 3KB of data, and could easily be computed using the R language in less than 2 seconds. The problem was that once the combinations of counties and selected variables were taken into account, millions of different forecasts needed to be made, immediately making this a problem that demanded a significant amount of computing time.
Fortunately, this is exactly the kind of problem that can be tackled using parallel programming: since each of the models can be computed independently, a large number of machines can compute models simultaneously and dramatically reduce the required computation time. Using Microsoft Azure, the team could spin up hundreds of virtual machines to perform the R processing, orchestrated by a PDI/carte process to implement a simple (yet reliable) fault-tolerant distributed system. (The team considered using HDInsight and Spark for the distribution framework, but decided given the small data sizes that it would be overkill for their needs.)
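As a rough single-machine illustration of that structure (not the production setup, which used Azure VMs orchestrated by PDI/Carte), the independent forecasts could be farmed out with R's parallel package; the series, horizon, and model choice below are placeholders.

```r
## Minimal sketch of the "embarrassingly parallel" structure, assuming a list of
## small monthly time series (one per county/variable combination). The real
## system distributed this work across Azure VMs rather than local cores.
library(parallel)
library(forecast)

# Hypothetical stand-in data: a few short monthly employment series
county_series <- replicate(4, ts(cumsum(rnorm(60, 10, 2)), frequency = 12),
                           simplify = FALSE)

cl <- makeCluster(max(1, detectCores() - 1))
clusterEvalQ(cl, library(forecast))

# Each forecast is independent, so the models can be fit simultaneously
forecasts <- parLapply(cl, county_series, function(y) {
  forecast(auto.arima(y), h = 12)
})

stopCluster(cl)
```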
For the actual forecasts, the Inquidia team used R's forecast package, noting that it offers "a robust set of options for projecting time series" and "allows for easy projections from any set of models for an arbitrarily large forecast period". However, vector autoregressive models can only accommodate a relatively small number of variables, and there are many macroeconomic variables to choose from when forecasting employment. This created a difficult variable-selection problem, which the team solved with a brute-force model forecasting and evaluation process.
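A minimal sketch of what such a brute-force search might look like, assuming a small set of hypothetical candidate series and a simple hold-out RMSE criterion (the project's actual variables, lag order, and evaluation metric aren't described here):

```r
## Illustrative brute-force variable selection for a small VAR: fit a model for
## every subset of candidate macroeconomic series and keep the one with the best
## hold-out accuracy. All series names and settings below are placeholders.
library(vars)

macro <- data.frame(  # hypothetical monthly series
  employment = cumsum(rnorm(120, 5, 1)),
  gdp        = cumsum(rnorm(120, 3, 1)),
  cpi        = cumsum(rnorm(120, 2, 1)),
  housing    = cumsum(rnorm(120, 1, 1))
)
candidates <- setdiff(names(macro), "employment")

train <- macro[1:108, ]
test  <- macro[109:120, "employment"]

best <- NULL
for (vars_in in combn(candidates, 2, simplify = FALSE)) {
  fit  <- VAR(train[, c("employment", vars_in)], p = 2)
  pred <- predict(fit, n.ahead = 12)$fcst$employment[, "fcst"]
  rmse <- sqrt(mean((pred - test)^2))
  if (is.null(best) || rmse < best$rmse) best <- list(vars = vars_in, rmse = rmse)
}
best  # variable subset with the lowest hold-out error
```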
Finally, to present the forecasts to State of Indiana administrators, the team created a dashboard with the shinydashboard package. This dashboard gave administrators the flexibility to customize the forecasts by adjusting for seasonality, combining multiple models, and including new employment data.
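A bare-bones shinydashboard skeleton along these lines might look like the following; the input names and placeholder plot are illustrative only, not the project's actual dashboard.

```r
## Minimal shinydashboard sketch with the kinds of controls described above
## (seasonal adjustment, model choice). Data and inputs are placeholders.
library(shiny)
library(shinydashboard)

ui <- dashboardPage(
  dashboardHeader(title = "Employment Forecasts"),
  dashboardSidebar(
    checkboxInput("seasonal", "Adjust for seasonality", value = TRUE),
    selectInput("model", "Model", choices = c("ARIMA", "VAR", "Ensemble"))
  ),
  dashboardBody(
    box(plotOutput("forecast_plot"), width = 12)
  )
)

server <- function(input, output) {
  output$forecast_plot <- renderPlot({
    # In a real system the forecast would be filtered or recomputed from
    # input$seasonal and input$model; here we just plot a placeholder series.
    plot(ts(cumsum(rnorm(60)), frequency = 12), ylab = "Employment (placeholder)")
  })
}

shinyApp(ui, server)
```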
As a result, the Indiana DWD now receives a much more fine-grained and accurate analysis of job skills and needs in Indiana, with details available at the county and metro level, without waiting for months. This is thanks to the system built with R on Microsoft Azure: a computation engine capable of producing more than 1,000 models per second at a fraction of the cost of an on-premises configuration.
To learn more about the forecasting system Inquidia Consulting built for the Indiana DWD, you can watch this archived webinar presented by the project leads. Or check out the white paper linked below, which gives details about the underlying architecture and implementation.
Microsoft Advanced Analytics and IoT: White Paper: Dive deep into small data with big data techniques