The mysterious data mines of Argleton-on-Google
There’s been a bunch of online chatter today about Argleton, the mystery town on Google Maps that has never really existed.
“Maybe it’s a trap street,” people have speculated. Google itself appears to be pinning the blame on Tele Atlas, telling the Telegraph: “People can report an issue to the data provider directly and this will be updated at a later date.”
The Telegraph goes on to say: “The data for the programme was provided by Dutch company Tele Atlas. A spokesman said it would now wipe the non-existent town from the map.”
Update: Originally I suggested here that, by reference to extra map data showing up elsewhere in Britain, this looked like something that had been ‘mined’ by Google from web sources. From a couple of comments below based on other Tele Atlas mapping, it does actually appear that this is a superfluous Tele Atlas town, not an invention of Google’s data mining. Nonetheless the data mining story is interesting in itself, so…
The canary starts to wobble
We know that, even before their recent go-it-alone expedition in the States, Google was mining the web and integrating the results into its map data. Wikipedia is the best-known example; Wikipedia articles with co-ordinates have long appeared as ‘active POIs’ on Google Maps. But as time goes on, Google has mined more and more directories, and other web content, to make the maps richer than the raw Tele Atlas data can offer.
It’s a really clever idea.
But sometimes, the parsing fails. Google Maps FAIL has a good example. Google has found a source of addresses somewhere on the web, and pulled out various data from it. But either the source data is dodgy, or more likely, it’s not formatted quite as consistently as Google’s algorithms would like.
So in Google Maps FAIL’s example, the sizeable town of Cirencester has moved to a little village halfway towards Northleach “inhabited by two sheep and a squirrel”, and the historic city of Gloucester has navigated upriver 20 miles and is sitting in a watermeadow outside Tewkesbury.
This was my original guess as to what’s happened at Argleton: dodgy data mining. My guess was that the mined data was in fact a badly OCRed address, meant to be “Aughton” but transcribed as “Argleton”. We already know that Google is OCRing PDFs as it crawls them; or maybe it was OCRed before being uploaded to the web. No matter.
If we need any more proof that they’re mining some fairly imperfect sources, then three miles to the west we find “Downhollnad”. A couple of months ago I was drawing a map of the Leeds & Liverpool Canal there, and I’m pretty sure that it’s called Downholland. It’s spelled correctly on other Tele Atlas-derived mapping, too, such as Multimap’s.
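For the curious, here is roughly how a mined place name could be sanity-checked against a trusted list before it lands on the map. This is only a sketch, not anything Google is known to do: the `GAZETTEER` list is a handful of nearby names I've typed in by hand, and `suggest_correction` and its 0.8 cutoff are my own inventions for illustration.

```python
import difflib

# Hypothetical gazetteer: a hand-typed list of nearby place names,
# standing in for a trusted source such as the licensed map data.
GAZETTEER = ["Aughton", "Downholland", "Ormskirk", "Maghull", "Lydiate"]


def suggest_correction(name, gazetteer=GAZETTEER, cutoff=0.8):
    """Return the closest trusted name, or None if nothing is similar enough.

    The cutoff is an assumed similarity threshold, not a tuned value.
    """
    if name in gazetteer:
        return name
    matches = difflib.get_close_matches(name, gazetteer, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(suggest_correction("Downhollnad"))
print(suggest_correction("Argleton"))
```

A transposition like “Downhollnad” is close enough to “Downholland” to be caught. “Argleton” is not: at this cutoff it is too far from “Aughton” to be matched, which is rather the point. A simple similarity check catches typos, not inventions.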
The canary falls over
How endemic is this faulty mining?
My home-town of Charlbury is well-known as the world centre of innovation in collaborative mapping, especially as performed by ninjas. I was just coming back from church the other day and I met that Artem ‘Mapnik’ Pavlenko walking down the street. So let’s have a look at the data Google has mined for Charlbury.
This is a good start. St Mary’s, where I play the organ (badly), is labelled as ‘Charlbury RC Church’. St Mary’s is not Roman Catholic. It’s Church of England. People have been firebombed in Ulster for less. Charlbury’s Catholic church is, as the full address suggests, a few streets away on Fisher’s Lane. (Incidentally, thank you to my Twitter followers for suggesting that maybe RC meant Radio-Controlled. It could make baptisms a whole lot more fun.)
You can also see that the Bell is in the right place, but the Bull, which should be at the corner, is closer to where the Three Horseshoes actually is.
(Incidentally, there’s a little sponsored link beside the wee Bull for Millie Benjamin Bridal Wear. Curiously, when I looked at this earlier, this in turn triggered a foot-of-page ad for ‘Milly Dress at Shopbop’. So buying one sponsored ad alerts Google to place potential competitors’ ads at the same place? That’s an… interesting loyalty tactic.)
The ‘Cotswold View’ campsite has been placed on a little unpaved street called Cotswold View. As the full address again makes clear, it’s not there. It’s actually on the road to Enstone. Whether it’s actually on ‘Enstone Rd’ is debatable – I’d have said Banbury Hill, and so does Tele Atlas.
Note the non-standard space in the middle of the phone number. A Google search for “Cotswold View” “Enstone Road” “810 314” only returns a few results, two of which are at 192.com (once described as Britain’s most invasive website in a shock-horror exposé, and no strangers to data mining themselves). I’m guessing that Google is either mining 192.com or has licensed the same data.
This is also interesting in that Google clearly aren’t doing a postcode lookup, which would be easy technically but horrible legally. A postcode lookup would put the icon in the right place.
The Fiveways Takeaway appears on the wrong side of the road. Well, big deal. But again, the only result for Fiveways “Sturt Road” “811 555” is 192.com.
(Curious decision on Google’s part not to show ‘Takeaway’ as part of the name, but yet also not to use a custom icon. Fiveways is originally the name of the junction you see just to the south-west. “Turn left at Fiveways” is a common direction in Charlbury. If you took that literally while looking at this map, you’d drive up Sturt Close.)
This one just made me giggle. The problem with having good satellite imagery, as again Google Maps FAIL points out, is that it shows up the inadequacies of the rest of your data. There is clearly a bowls club in this picture but it ain’t where the icon is.
This is a dead canary
So. A small Oxfordshire town, only a handful of mined icons, and around half of them are faulty in some way. Data is being conflated which shouldn’t be (‘Cotswold View’ caravan site on ‘Cotswold View’ street, ‘Charlbury RC Church’ located at a church in Charlbury). Positional accuracy is iffy, at best. How endemic is this faulty mining? It’s pretty endemic.
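The ‘Cotswold View’ mix-up is the kind of conflation that a simple rule could at least flag. A minimal sketch, assuming each mined listing carries a name and the street from its address: the `Listing` class, `pick_street` function and the street list are all hypothetical, and this says nothing about how Google's pipeline actually works.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    name: str    # business name as mined
    street: str  # street as given in the listing's own address


def pick_street(listing, known_streets):
    """Choose a street for the listing and say whether it needs review.

    Returns (street or None, needs_review).
    """
    known = {s.lower(): s for s in known_streets}
    by_address = known.get(listing.street.lower())
    by_name = known.get(listing.name.lower())
    if by_address is not None:
        # The address is explicit, so a street that merely shares
        # the business name is ignored.
        return by_address, False
    if by_name is not None:
        # Only the name matched a street: a guess, flagged for review.
        return by_name, True
    return None, True


campsite = Listing(name="Cotswold View", street="Enstone Road")
print(pick_street(campsite, ["Cotswold View", "Enstone Road", "Sturt Road"]))
```

The design choice is simply that the address wins over the name. If the address street is missing from the map data, the name match is kept only as a flagged guess.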
Even getting to this stage is, of course, a display of awesome technical ability. And there is no doubt that the logic will iterate like every other Google product, becoming more accurate each time.
But it does also point out the limitations of applying search-engine technologies to mapping. If you search Google for something non-trivial, you don’t expect the top result to be the one that answers your question. You hope you’ll find it in the top 10, and if not, you’ll turn the page until you get the answer. It’s fuzzy like that and people accept this.
Map data isn’t fuzzy. You have to get it right, first time. Charlbury Bowls Club’s location is approximate, but nonetheless, wrong. St Mary’s is a church in Charlbury but it’s not the Charlbury RC Church.
Data mining gets you worldwide coverage fast, but takes a long time to get to 95% accuracy: you could argue it never will. Crowdsourcing, OpenStreetMap-style, gets you to 95% accuracy fast, but takes a long time to approach worldwide coverage. Professional surveying à la Tele Atlas gets you both, at a huge cost.
All of this is especially interesting in the light of the superb Mike Dobson interview at SearchEngineLand. If you only read one article about webmapping this year, make it that one. He’s the only commentator I’ve seen who appreciates how much data mining Google is doing:
“It is clear to me that conflation and data mining across redundant sources are major components of [Google’s] update process.”
He then suggests that the strategy is to start with data mining, then refine it via crowdsourcing.
“One of the tenets of crowd sourcing is that the frequency of errors decreases with increased inspection. So, Google might make a wrong change from time to time, but the odds are that someone will correct it.” [See also his later comment on Tele Atlas and GDT.]
In other words, Google’s strategy is to get worldwide coverage via mining, then refine it until it’s accurate by crowdsourcing. That makes a lot of sense. But it remains to be seen whether their reputation can withstand the Telegraph story that will inevitably accompany each excursion into the mines.
Drums. Drums in the deep. They Are Coming.