In this first, slightly more technical post, I present some considerations on geocoding. Geocoding is the action of assigning geographic coordinates to a place described by an address. This can be the address of a store or the summit of the Matterhorn, for instance. Geocoding therefore allows data to be georeferenced; it is the basis of any cartographic work. With the ever-increasing importance of data in our society, we are increasingly confronted with the need to geocode in order to make the best use of it.
As part of a data mining project, various geocoding options are available to us. Consider, for example, the following 5 options, in the form of APIs (this is not an exhaustive list):
– Google via its geocoding API,
– Nokia with its Here API, perhaps a little less well known,
– Nominatim, which is based on open data from OpenStreetMap and is rather more low-profile,
– Opencage (an R package), which is also based on OpenStreetMap but on other sources as well.
What about the fifth option? I will test the R package ggmap, which has a geocoding function tied to the Google API. I just want to make sure the results are stable; one can never be too careful with third-party data. I will not go into the programming here: there are many resources for learning it, and it would only lengthen this post unnecessarily.
For our test, the addresses were retrieved as part of this project. I filtered the database to keep those from the canton of Neuchâtel, Switzerland. The address format is not my choice: these are raw data, collected as is. I do not present the table of geocoding results (the geographic coordinates) but rather the distances between the georeferenced points for the same address, because this is more meaningful. The results of the Nokia API are used as the point of comparison (an arbitrary choice). The distance between pairs of points is obtained with the distVincentyEllipsoid() function from the R “geosphere” package.
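To make the distance step concrete, here is a minimal sketch in Python (the post itself relies on R; Python is used here purely for illustration). It uses the haversine formula on a sphere, which is a simpler stand-in and not the ellipsoidal Vincenty method behind distVincentyEllipsoid(); over the short distances discussed in this post the two should differ only marginally. The function name and the sample coordinates are hypothetical.

```python
import math

# Mean Earth radius in metres (spherical approximation, not the WGS84 ellipsoid)
EARTH_RADIUS_M = 6_371_008.8

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Hypothetical coordinates for one address as returned by two geocoders
here_point = (46.9920, 6.9310)
google_point = (46.9921, 6.9312)
print(round(haversine_m(*here_point, *google_point), 1), "m")
```

Applying such a function to each pair of results for the same address gives one deviation per address, which is what the comparison columns contain.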
Table 2: Geocoding deviations, in meters. “NOK_GOO”, “NOK_OPEN” and “NOK_OSM” are the differences between the Nokia Here API and, respectively, the Google API, the R Opencage package and the Nominatim API. “GOO_GGMAP” is the deviation between the Google API called directly and called through the R ggmap package. Since the Here and Google APIs returned coordinates for all addresses, any missing value in a column means that the API being compared returned no result (a notable example: NOK_OSM).
The results of Nokia and Google are thus the most homogeneous and, I imagine, the most accurate (note for later: go to this site and find the coordinates for any serious discussion of this aspect). With a median gap of 0.43 m (first quartile: 0.28 m, third quartile: 2.63 m) between these two APIs, the agreement is near perfect!
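As an illustration of how such a summary can be computed, here is a small Python sketch using only the standard library. The deviation values are made up for the example, and the choice of the "inclusive" quantile method (which matches the default of R's quantile()) is an assumption, since the post does not say how the quartiles were obtained.

```python
import statistics

def summarize(deviations_m):
    """Quartiles and maximum of a list of deviations in metres; None marks a missing result."""
    values = [d for d in deviations_m if d is not None]
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"n": len(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}

# Hypothetical deviations between two geocoders (not the data from this post)
sample = [0.2, 0.3, 0.4, 0.5, 2.6, 210.0, None]
print(summarize(sample))
```

Reporting the median and quartiles rather than the mean keeps a handful of very large gaps from dominating the summary.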
There are still significant gaps, the largest being nearly 3 km. The differences have more than one cause:
– on the one hand, there are imprecise addresses, with the street number missing. This is the case, for example, of “Draizes 2000 Neuchâtel” and of “Avenue des Cadolles 2000 Neuchâtel”, which generate deviations of 460 m and 210 m, respectively. Both APIs do their best here and must assign a value (a mean?) for the street. The difference between the two deviations seems to be related to the spatial extent of each street.
– on the other hand, there is the format of the address. The biggest gap observed is about 3 km and corresponds to “F.C. de Marval 8 2000 Neuchâtel”. For another address, “Rue Frédéric-Carl-de-Marval 1 2000 Neuchâtel”, there is a 10 m gap between the two geocodings. Yet it is the same street, contained in a rectangle of 290 m x 35 m. Taking a closer look at the coordinates we obtained, it seems that Nokia’s engine misses the target for the first address, probably because of the abbreviation in the address.
As for the other sources of geocoding, the quality drops drastically. The graph below shows the geocoding differences between Here and Nominatim. As shown by the blue lines connecting, one to one, the coordinates assigned to the (identical) addresses, the geocoding gaps are large: the farthest points are more than 100 km apart!
The graph below shows the geocoding gaps between Here and Opencage. As shown by the green lines, built on the model of the previous graph, the most distant points are nearly 90 km apart. An additional problem should be highlighted here: many addresses (64/81) are geocoded at the same point (circled in red; if the red circle is not visible, please reload, as I think WordPress or its D3 plugin sometimes does not work very well)!
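This kind of collapse onto a single point can be spotted without a map. The Python sketch below counts how many addresses share the same rounded coordinates; the function name, the rounding precision and the sample data are all assumptions made for the illustration.

```python
from collections import Counter

def most_common_point(results, precision=5):
    """Return the most frequent rounded (lat, lon) pair and how many addresses share it.

    `results` maps each address to the (lat, lon) returned by a geocoder.
    """
    rounded = [(round(lat, precision), round(lon, precision)) for lat, lon in results.values()]
    point, count = Counter(rounded).most_common(1)[0]
    return point, count

# Hypothetical geocoding results: two different addresses fall on the same point
results = {
    "Address A": (46.98960, 6.92930),
    "Address B": (46.98960, 6.92930),
    "Address C": (47.00210, 6.95120),
}
point, count = most_common_point(results)
print(f"{count} of {len(results)} addresses share the point {point}")
```

A large count for a single point usually means the geocoder fell back to a coarser level, such as a town or postal code centroid, instead of resolving the street address.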
What conclusions can be drawn from this test?
First, this test was carried out with Swiss addresses; we can expect different behavior from country to country.
Considering Nominatim and Opencage, the first has no defined limit as to the number of geocodings, except that one must not abuse the servers. Nominatim does not always return a result. Opencage limits the geocodings to 2’500 per day and always returns a result. On the other hand, both give results whose precision leaves much to be desired.
If you’re looking for accuracy and effectiveness, Google and Nokia will provide you with the best services. After several tests on other address sets, I found that although Google makes fewer geocoding errors, it does not always return a result (this is rather rare), unlike Nokia. Another aspect to consider is the geocoding limit: 2’500 per day for Google and 15’000 per month for Nokia, all for free. You can obviously pay for more geocodings.
Beware, however, of the general conditions of use (if you follow them): Google requires that all visualizations using its data be displayed on a Google map. In other words, you are not supposed to use them on a map of your own design, which is a shame! As for Nokia, as far as I know, the general conditions do not stipulate this type of restriction.
And this 5th option? The results obtained from the Google API with or without the “filter” of the R package (ggmap) are not identical! 4 addresses out of the 81 tested return different coordinates, with deviations from 15 m to 300 m, and one address is not geocoded, so about 6% of the results (5 out of 81) differ for the same API.
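The 6% figure can be reproduced with a short Python sketch. The counts (81 addresses, 4 with different coordinates, 1 not geocoded) and the 15 m and 300 m bounds come from the text above; the two intermediate deviations and the 1 m tolerance are placeholders chosen only for the illustration.

```python
def disagreement_rate(deviations_m, tolerance_m=1.0):
    """Count missing and differing results, and return their share of all addresses.

    `deviations_m` holds one deviation in metres per address; None marks an
    address that one of the two routes failed to geocode.
    """
    missing = sum(1 for d in deviations_m if d is None)
    different = sum(1 for d in deviations_m if d is not None and d > tolerance_m)
    return missing, different, (missing + different) / len(deviations_m)

# 76 identical results, 4 differing ones (middle values are placeholders), 1 missing
deviations = [0.0] * 76 + [15.0, 80.0, 150.0, 300.0] + [None]
missing, different, rate = disagreement_rate(deviations)
print(f"{different} different, {missing} missing, {rate:.1%} of {len(deviations)} addresses")
```

Counting missing results alongside differing ones matters here, because an address that silently drops out of one data set is as much a discrepancy as one that moves by 300 m.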
So be careful not to mix your sources and your geocoding tools, so as not to introduce errors, which are often large. This is true if you perform the georeferencing yourself, but also when a third party has done it!
Knowing one’s data is a prerequisite for any analysis.
Update: for more information, I have just found this site, where a lot of sources are compared (for the United States).