By Srini Kumar, Director of Data Science at Microsoft
Who does not hate being stopped and given a traffic ticket? Invariably, we think it unfair that we got one and everyone else did not. I am no different, and living in the SF Bay Area, I have often wondered if I could get data about traffic tickets, particularly since there may be some unusual patterns.
While I could not find any such data for the SF Bay Area, I supposed I could get some vicarious pleasure out of analyzing it for some place for which data was available. As luck would have it, I did find something: Montgomery County in Maryland puts all of its traffic violation information online. Of course, they take out personally identifiable information, but they do publish a lot of other detail. If any official from that county is reading this, thanks for putting it online! Hopefully, other counties will follow suit.
By the standards of the data sizes I have worked with, this is rather small: a little over 800,000 records, with about 35 columns, including latitude and longitude. I find it convenient to park the data in a relational database, analyze it using SQL first, and then pull slices of it into R.
In this post, I'll be using Microsoft SQL Server to import and prepare the traffic violations data, and then use R to visualize the data.
The SQL language
If you are unfamiliar with SQL, this SQL tutorial has been around for a long time, and it is where I got started with SQL. Another thing I find convenient is an ETL tool. ETL, if you are not already familiar with it, stands for Extract-Transform-Load. ETL tools are as pervasive in corporate IT infrastructures as plumbing and electrical wiring are in homes, and often just as invisible, noticed only when unavailable. If you are determined to stay open source, you can use PostgreSQL for the database (I prefer PostgreSQL over MySQL, but choose whatever you are comfortable with) and, say, Talend for the ETL. For this exercise, I had access to the Microsoft SQL Server 2016 preview, and ended up using it. Conveniently, it already ships with an ETL tool, SQL Server Integration Services (SSIS), and an import wizard that makes it easy to do it all in one place.
Now, why spend so much time on all this, when you could just download a file, read it into R, and do whatever you want with it? In a nutshell, because of everything you get with a database. You can detect problems with the data as part of the load, and you can run SQL against it. You can handle data sizes that do not fit in memory. And with MS SQL it gets better still: the import wizard helps you spot problems with the data even before you load it, and you can also run R from within the database, on a large scale.
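As a taste of what running R from within the database looks like, here is a minimal sketch using SQL Server 2016's `sp_execute_external_script`. The table name `TrafficViolations` and the column `SubAgency` are my assumptions for illustration, not necessarily what your import will produce, and this requires the R Services feature to be installed:

```sql
-- Sketch: tabulate violation counts by agency using in-database R.
-- Assumes SQL Server 2016 R Services is installed, and that the data
-- was imported into a (hypothetical) table named TrafficViolations.
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- as.data.frame(table(InputDataSet$SubAgency))',
    @input_data_1 = N'SELECT SubAgency FROM TrafficViolations'
WITH RESULT SETS ((SubAgency NVARCHAR(100), ViolationCount INT));
```

The R script receives the query result as the data frame `InputDataSet` and returns whatever it assigns to `OutputDataSet`, so the aggregation runs next to the data instead of being pulled across the wire first.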
Here are the fields in the table. The query is quite simple, and accesses the metadata that is available in every relational database.
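A sketch of such a metadata query against the standard `INFORMATION_SCHEMA` views (supported by SQL Server, PostgreSQL, MySQL, and others) might look like this; the table name `TrafficViolations` is again my assumption:

```sql
-- List each column of the imported table along with its type.
SELECT COLUMN_NAME,
       DATA_TYPE,
       CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'TrafficViolations'   -- assumed table name
ORDER BY ORDINAL_POSITION;
```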