by Joseph Rickert
At a Bay Area R User Group (BARUG) meeting this month hosted by Cisco, Dag Lohmann (co-founder of KatRisk) gave an electrifying talk on catastrophe modeling for the insurance industry. Catastrophes (cyclones, hurricanes, floods, earthquakes, terrorist attacks) are, from a statistical point of view, rare events that cause losses and human suffering over large geographic areas. Insurance companies build models of these events both for underwriting, where they need estimates of local risk at various locations, and for portfolio management, where it is imperative to estimate the correlation of risk across locations and to have a means of aggregating risk.
Two aspects of catastrophe models that Dag's talk really drove home are the astounding amount of data consumed and the scope and sophistication of the modeling techniques employed. A typical professionally built catastrophe model might use 150 years' worth of meteorological data (roughly 30 terabytes) over 100 million locations while simulating 100,000 or so atmospheric and hydraulic scenarios. As Dag pointed out, (100K scenarios) x (100M locations) x (1/50 probability of occurrence) yields 200 billion records to feed the financial models. To get an idea of the level of modeling sophistication involved, consider that a typical model might employ detailed fluid dynamics simulations to calculate hazards, vectorized time series models to compute correlations, advanced statistical methods for variable reduction, validation, and much more.
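As a quick sanity check on that back-of-the-envelope arithmetic, the record count works out in a couple of lines of R (the rounded figures below are the ones quoted in the talk, not KatRisk's actual numbers):

```r
# Rough sanity check on the record count quoted above (rounded figures only)
scenarios <- 1e5        # ~100,000 simulated atmospheric and hydraulic scenarios
locations <- 1e8        # ~100 million locations
p_occur   <- 1 / 50     # quoted probability that a scenario affects a location
scenarios * locations * p_occur
# [1] 2e+11   i.e. 200 billion records for the financial models
```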
No less impressive, but not surprising, is the fact that Dag can do all of this with an open source stack built around R and supplemented with Leaflet and MapServer. As Dag pointed out: "R is deeply embedded in the Insurance Industry".
For a serious introduction to catastrophe models, have a look at Dag's slides and then work through the R code of the elaborate sample model available on the KatRisk website.
The following plot comes from the first part of the model.
It shows a grid superimposed on a map of England colored by the hazard for an imaginary catastrophic event. For a real event, this type of plot would be the output of an extensive data analysis and modeling effort. To produce the example plot, however, an elliptical copula was defined using R's copula package to create a multivariate distribution with fixed correlation among the marginal distributions. The hazard grid was then filled in by sampling from this distribution, as sketched below. This is only the beginning. After simulating the hazard events, the code goes on to simulate exposure and vulnerability, build an event loss table, work through a financial model, construct AEP (Aggregate Exceedance Probability) and OEP (Occurrence Exceedance Probability) curves for both expected and sampled losses, estimate secondary uncertainty, and compute quite a few performance measures.
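For readers who want to see the mechanics of that first step, here is a minimal sketch, not the KatRisk sample code, of using an elliptical (Gaussian) copula from the copula package to simulate correlated hazard intensities on a grid. The grid size, correlation value, and lognormal margins are illustrative assumptions only.

```r
library(copula)

n_cells <- 100   # hypothetical 10 x 10 hazard grid
rho     <- 0.5   # assumed common pairwise correlation

# Elliptical (Gaussian) copula with an exchangeable correlation structure
cop <- normalCopula(param = rho, dim = n_cells, dispstr = "ex")

# Identical lognormal margins for every grid cell yield a multivariate
# distribution with fixed correlation among the marginal distributions
mv <- mvdc(cop,
           margins      = rep("lnorm", n_cells),
           paramMargins = rep(list(list(meanlog = 0, sdlog = 1)), n_cells))

set.seed(1)
hazard <- rMvdc(1000, mv)   # 1,000 simulated events x 100 grid cells
dim(hazard)                 # 1000 100
```

Further downstream, the OEP and AEP curves mentioned above can be read off an event loss table by taking, respectively, the largest and the total loss in each simulated year and computing empirical exceedance probabilities. A toy version, again with made-up numbers:

```r
set.seed(2)
n_years <- 10000
elt <- data.frame(                                     # hypothetical event loss table
  year = sample.int(n_years, 50000, replace = TRUE),   # simulated year of each event
  loss = rlnorm(50000, meanlog = 12, sdlog = 1)        # made-up event losses
)

occ_loss <- tapply(elt$loss, elt$year, max)   # largest event loss per year (OEP)
agg_loss <- tapply(elt$loss, elt$year, sum)   # total annual loss (AEP)

# Empirical exceedance probabilities; years with no events count as zero loss
ep_curve <- function(x) {
  x <- sort(c(x, numeric(n_years - length(x))), decreasing = TRUE)
  data.frame(loss = x, prob = seq_along(x) / n_years)
}
oep <- ep_curve(occ_loss)
aep <- ep_curve(agg_loss)

plot(aep$loss, aep$prob, type = "l", log = "y",
     xlab = "Loss", ylab = "Annual exceedance probability")
lines(oep$loss, oep$prob, lty = 2)   # OEP curve lies at or below the AEP curve
```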
Never having worked in this field myself, I found the Lloyd's publication "Catastrophe Modelling: Guidance for Non-Catastrophe Modellers" helpful.
"It shows a grid superimposed on a map of England"
No, it shows a grid superimposed on a map of the United Kingdom, a map that includes England, Wales, Scotland and a little bit of Northern Ireland. Four countries. I know from long experience that the "Britain = England" error is a pervasive one in the US, but if a data science blog can't get this sort of detail right, what hope do we have?
Posted by: Dan | December 21, 2013 at 12:04