There were a couple of problems with the data. The biggest was that it didn’t contain the longitude and latitude of the incidents, only the addresses. That meant that if I wanted to map all 14,000 incidents, I’d have to geocode them. Geocoding 14,000 addresses is no small feat: Google Maps only allows 2,500 requests a day, so working through the whole set would take nearly a week.
Luckily, there’s a DIY approach: using Postgres’s PostGIS extension with the TIGER data set. It turned out to be a nightmare to install. After searching for hours, I found a script that I had to download and heavily modify, and I had to run through the install several times because it kept ending up misconfigured. Not user friendly at all, yet very impressive technology.
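Once it’s installed, the TIGER geocoder exposes a `geocode()` function you can call straight from SQL. Here’s a minimal sketch of what a single lookup looks like from Python, assuming a local database (called "geocoder" here) with the extension set up; the connection details and sample address are placeholders, not values from my setup.

```python
# A minimal sketch of a single lookup against the PostGIS/TIGER geocoder.
# Assumes psycopg2 and a local database named "geocoder" with the
# postgis_tiger_geocoder extension installed; the address is a placeholder.
import psycopg2

conn = psycopg2.connect(dbname="geocoder")
with conn.cursor() as cur:
    cur.execute(
        """
        SELECT g.rating,                      -- lower rating = better match
               ST_X(g.geomout) AS lon,
               ST_Y(g.geomout) AS lat,
               pprint_addy(g.addy) AS normalized_address
        FROM geocode(%s, 1) AS g              -- 1 = return only the best match
        """,
        ("1201 N Main St, Springfield",),
    )
    print(cur.fetchone())                     # None if nothing matched
conn.close()
```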
Unleashing the dataset on my home-brewed geocoder was an error-prone process, and I learned various things along the way (there’s a sketch of how I worked around them after this list):
There are bugs in the geocoder that can crash the database connection on certain addresses
Some addresses couldn’t be sanely geocoded (e.g., the address would resolve to another state)
The address listed in an incident may be a street intersection rather than a street address
Addresses may be ambiguous (1201 Main could refer to 1201 S Main or 1201 N Main)
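To cope with these, I ended up wrapping each lookup in something like the sketch below. This is illustrative rather than my exact code: the connection parameters, expected state abbreviation, and the retry-once-on-crash behavior are all placeholders for one way of surviving the problems above.

```python
# Illustrative wrapper around the PostGIS/TIGER geocode() function.
# Assumptions: psycopg2 is installed, the tiger geocoder is set up, and
# conn_params / expected_state are placeholders for your own values.
import psycopg2

def safe_geocode(conn_params, address, expected_state):
    """Geocode one address, tolerating crashed connections and bad matches."""
    for _attempt in range(2):  # retry once with a fresh connection
        conn = psycopg2.connect(**conn_params)
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT ST_X(g.geomout), ST_Y(g.geomout), "
                    "       pprint_addy(g.addy), (g.addy).stateabbrev "
                    "FROM geocode(%s, 1) AS g",
                    (address,),
                )
                row = cur.fetchone()
        except psycopg2.Error:
            # Certain addresses crash the connection; reconnect and retry once.
            conn.close()
            continue
        conn.close()
        if row is None:
            return None          # the geocoder found no match at all
        lon, lat, norm_addr, state = row
        if state != expected_state:
            return None          # resolved in another state; treat as a miss
        return lon, lat, norm_addr
    return None

# Example call with placeholder values:
# safe_geocode({"dbname": "geocoder"}, "1201 N Main St", "TX")
```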
So I cleaned up the data as best I could as I went along. Still, when all was said and done, it probably took me more than a week to geocode the whole dataset because of all the errors I ran into. Each address took about 5 seconds to geocode, which translates into nearly a full day of constant geocoding SQL queries (Postgres was pegged at 100% the whole time).
I don’t envy services that geocode for a living. Seems tough!
Below is the Python code to transform the original 14,000-incident dataset into one that contains the latitude, longitude, and normalized address for each incident.