By Srini Kumar, Director of Data Science at Microsoft
Who doesn't hate being stopped and given a traffic ticket? Invariably, we feel it is unfair that we got one while everyone else did not. I am no different, and living in the SF Bay Area, I have often wondered whether I could get data about traffic tickets, particularly since there may be some unusual patterns in it.
While I could not find any such data for the SF Bay Area, I supposed I could get some vicarious pleasure out of analyzing it for some place for which I could get data. As luck would have it, I did find something: it turns out that Montgomery County in Maryland puts all of its traffic violation information online. Of course, they take out personally identifiable information, but they do publish a lot of other detail. If any official from that county is reading this, thanks for putting it online! Hopefully, other counties follow suit.
By the standards of the data sizes I have worked with, this is rather small; it is a little over 800K records, though it has about 35 columns, including latitude and longitude. I find it convenient to park the data in a relational database, analyze it using SQL first, and then get slices of it into R.
In this post, I'll be using Microsoft SQL Server to import and prepare the traffic violations data, and then use R to visualize the data.
The SQL language
If you are unfamiliar with SQL, this SQL tutorial has been around for a long time, and it is where I got started with SQL. Another thing I find convenient is to use an ETL tool. ETL, if you are not already familiar with it, stands for Extract-Transform-Load; ETL tools are as pervasive in corporate IT infrastructures as plumbing and electrical wiring are in homes, and often just as invisible, noticed only when unavailable. If you prefer to stay open source, you can use PostgreSQL for the database (I prefer PostgreSQL over MySQL, but choose whatever you are comfortable with) and, say, Talend for the ETL. For this exercise, I had access to the Microsoft SQL Server 2016 preview and ended up using it. Conveniently, it already includes an ETL tool, SQL Server Integration Services, and a wizard that makes it easy to do it all in one place.
Now, why do I spend so much time suggesting all this, when you could very easily just download a file, read it into R, and do whatever you want with it? In a nutshell, because of all the goodies you get with a database. For example, you can detect problems with the data as part of the load, and you can run SQL against it. You can handle data sizes that do not fit in memory. And with MS SQL it gets even better: there is an easy wizard for loading the data that helps you spot problems even before you load it, and you can also run R from within the database, at scale.
Here are the fields in the table. The query is quite simple, and accesses the metadata that is available in every relational database.
select column_name, data_type from INFORMATION_SCHEMA.columns where table_name = 'Montgomery_County_MD_Traffic_Violations'
order by ORDINAL_POSITION asc
Date Of Stop date
Time Of Stop time
Agency varchar
SubAgency varchar
Description varchar
Location varchar
latitude float
longitude float
Accident varchar
Belts varchar
Personal Injury varchar
Property Damage varchar
Fatal varchar
Commercial License varchar
HAZMAT varchar
Commercial Vehicle varchar
Alcohol varchar
Work Zone varchar
State varchar
VehicleType varchar
Year bigint
Make varchar
Model varchar
Color varchar
Violation Type varchar
Charge varchar
Article varchar
Contributed To Accident varchar
Race varchar
Gender varchar
Driver City varchar
Driver State varchar
DL State varchar
Arrest Type varchar
Geolocation varchar
month_of_year nvarchar
day_of_week nvarchar
hour_of_day int
num_day_of_week int
num_month_of_year int
The last five columns were created by me. It is trivial to create new columns in SQL Server (or any other relational database) and have them computed each time a new row is added.
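As a sketch of what that looks like (the expressions are assumptions based on the column names above, not necessarily the exact definitions I used):

-- A sketch: derive hour-of-day and day-of-week as computed columns
-- (the table name comes from the metadata query above)
ALTER TABLE Montgomery_County_MD_Traffic_Violations
    ADD hour_of_day     AS DATEPART(hour, [Time Of Stop]),
        num_day_of_week AS DATEPART(weekday, [Date Of Stop]),
        day_of_week     AS DATENAME(weekday, [Date Of Stop]);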
With RODBC/RJDBC, it is easy to get data from an RDBMS and do what we want with it. Here is an interesting plot of violations by race, and whether they concentrate in certain areas.
The code to create the above chart uses, as you may have guessed, the ggmap library. Installing ggmap pulls in the ggplot2 library as well. To plot this, we first get the geocode for Montgomery County, and then use the get_map function to fetch the map. The get_map function also takes a zoom level that we can play with if we need to zoom in or out instead of using the default.
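A minimal sketch of that approach, assuming an ODBC DSN named "Traffic" pointing at the database (the DSN name and the selected columns are illustrative, and recent ggmap versions may require a registered key for the map service):

library(RODBC)
library(ggplot2)
library(ggmap)

# Pull the columns we need into a data frame
ch <- odbcConnect("Traffic")
alldata <- sqlQuery(ch,
  "select latitude, longitude, Race, hour_of_day, num_month_of_year, VehicleType
   from Montgomery_County_MD_Traffic_Violations
   where latitude is not null and longitude is not null")
odbcClose(ch)

# Geocode the county, fetch a base map, and overlay the stops coloured by race
mc    <- geocode("Montgomery County, MD")
mcmap <- get_map(location = c(lon = mc$lon, lat = mc$lat), zoom = 10)
ggmap(mcmap) +
  geom_point(data = alldata,
             aes(x = longitude, y = latitude, colour = Race),
             size = 0.5, alpha = 0.3)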
Here is another one on violation concentrations related to the time of day (the scale is hours from midnight: 0 is midnight and 12 is noon):
The code to do this follows the same pattern as the map above.
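A sketch along those lines, reusing the alldata and mcmap objects from the earlier snippet and colouring each stop by the derived hour_of_day column:

ggmap(mcmap) +
  geom_point(data = alldata,
             aes(x = longitude, y = latitude, colour = hour_of_day),
             size = 0.5, alpha = 0.3) +
  scale_colour_gradient(low = "darkblue", high = "yellow")   # 0 = midnight, 12 = noon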
It looks like violations are quite sparse in the wee hours of the morning. How about violations by month, by vehicle?
Looks like it is mostly cars, but there are some light duty trucks as well. But how did we get the numerical month of the year? Again, SQL makes it easy:
ALTER TABLE <table name> ADD num_month_of_year AS datepart(month, [Date Of Stop])
After that, it is a routine matter of using RODBC/RJDBC to get the data from the database and plotting a bar chart.
ggplot(data=alldata, aes(num_month_of_year)) + geom_bar(aes(fill=VehicleType))
When we look at the Make of the cars, we see how messed up that attribute is. For example, I have seen HUYAND, HUYND, HUYNDAI, HYAN, HYANDAI, HYN, HYND, HYNDAI. I am pretty sure I even saw a GORRILLA, and have no idea what it might mean. That does not make for a very good plot, or for that matter any analysis. Can we do better? Not really, unless we have reference data. We could build one using the car makes and models that we know of, but that is a nontrivial exercise.
A little SQL goes a long way, particularly if the data is very large and needs to be sampled selectively. For example, here is the kind of query I mean:
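Something along these lines (a sketch; the bracketed column names come from the schema above):

-- Count violations by month, aggregating inside the database
select datepart(month, [Date Of Stop]) as num_month_of_year,
       count(*) as frequency
  from Montgomery_County_MD_Traffic_Violations
 group by datepart(month, [Date Of Stop])
 order by num_month_of_year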
The square brackets [] are MS SQL syntax for working with field names that contain spaces and other special characters. To practice SQL itself, you don't have to get an RDBMS first: R has the sqldf library, where you can practice your SQL against data frames if you don't have an actual database. For example:
bymonth <- sqldf("select num_month_of_year, sum(frequency) as frequency from res group by num_month_of_year")
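For a self-contained illustration (using a built-in data set rather than the traffic data; the query itself is just an example):

library(sqldf)   # runs SQL against in-memory data frames via SQLite
sqldf("select cyl, count(*) as n, round(avg(mpg), 1) as mean_mpg
       from mtcars group by cyl order by cyl")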
Since I have recently been introduced to what I call the "rx" functions from Microsoft R Server (the new name for Revolution Analytics' R), I decided to check them against what I normally use, which is ctree from the party package and, of course, glm. My intent was not to get insights from the data here; it was simply to see how the rx functions would fare. So I simply tried to relate the number of violations to as many variables as I could use.
The rx functions (in my limited testing) worked when the CRAN R functions did not. Note that I did not go looking for places where one outperformed the other. This was simply to figure out what was doable with the rx functions.
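A sketch of that comparison, assuming the aggregated counts are sitting in a data frame called res (the formula and column names are illustrative, and RevoScaleR ships only with Microsoft R Server / R Client):

library(party)        # ctree
fit_glm   <- glm(frequency ~ hour_of_day + num_day_of_week + num_month_of_year,
                 family = poisson(), data = res)
fit_ctree <- ctree(frequency ~ hour_of_day + num_day_of_week + num_month_of_year,
                   data = res)

library(RevoScaleR)   # the "rx" counterparts
fit_rxglm  <- rxGlm(frequency ~ hour_of_day + num_day_of_week + num_month_of_year,
                    family = poisson(), data = res)
fit_rxtree <- rxDTree(frequency ~ hour_of_day + num_day_of_week + num_month_of_year,
                      data = res)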
Finally, we can call R from within SQL Server too. Here is an example:
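A sketch of in-database R using sp_execute_external_script (this requires SQL Server 2016 with R Services installed and external scripts enabled; the query and the R snippet are illustrative):

-- Count violations by hour in SQL, then compute each hour's share in R
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'
        OutputDataSet <- InputDataSet
        OutputDataSet$share <- OutputDataSet$frequency / sum(OutputDataSet$frequency)',
    @input_data_1 = N'
        select hour_of_day, count(*) as frequency
        from Montgomery_County_MD_Traffic_Violations
        group by hour_of_day'
WITH RESULT SETS ((hour_of_day int, frequency int, share float));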
It is no surprise that the fewest violations occur during the wee hours of the morning, but one surprise for me was the state of the vehicle brand names. Of course, one can create a reference set of names and use it to clean this data, so that one can try to see which brands have a higher chance of violations, but that is not an elegant solution. Chances are that as we get more such data, errors in human input will be a serious issue to contend with.
Hopefully, this article also illustrates how powerful SQL can be in manipulating data, though it barely scratches the surface of what SQL can do.
I'm curious to know why some columns are in snake_case (like "driver_city") whereas others are in PascalCase (like "VehicleType").
Posted by: KS | April 11, 2016 at 07:20
Great article! I use sqlQuery all the time. It's pretty powerful!
My only problem was the violations by vehicle bar chart. It's pretty hard to tell what vehicles are represented because the colors are similar. If there are only 4-5 significant values (that can be seen), then I'd probably just stick to those and make the colors starkly different.
Posted by: Jeff | April 13, 2016 at 07:24
KS, the snake case might have been me, as I was moving the data to the database. I prefer the column names with snake case. BTW, thanks - I did not know that it was called snake case.
Jeff, thanks for the compliment, and your point on the colors is well taken. Next time, I will probably explicitly set colors to be distinctly different.
Posted by: Srini Kumar | April 18, 2016 at 10:42