How Google Counted The World’s 129 Million Books
In a blog post published this week, search giant Google explained the elaborate algorithm its literary offshoot, Google Books, uses to count just how many books exist in the world right now.
Because there’s no official standard for cataloging “tomes” (the term Google finally settled on for defining what is and isn’t worth cataloging in Google Books; a tome is a bound volume that may be printed millions of times, or just once), plenty of existing systems were deemed unreliable.
Take ISBNs (International Standard Book Numbers). They’ve only been around since the 1960s, and only came into prominence in the 70s. They also exclude books not intended for commercial distribution, and are mostly used only in the western world. You’ll also sometimes find up to 1,500 books assigned to the same ISBN, and irrelevant items like CDs, bookmarks and even t-shirts carrying ISBNs.
Other identifiers, like Library of Congress Control Numbers and OCLC accession numbers, suffer from duplication and redundancy, and can drastically under-count series featuring thousands of volumes. That unreliability led Google to devise its own identification system.
The final process involves collecting a massive amount of metadata from hundreds of providers, including catalogues and commercial providers, which is then intensively parsed and analysed. The initial raw data comprises close to a billion records, which drops to 600 million once superficial duplicates are removed.
Then it’s a case of separating the wheat from the chaff, using different attributes and fields to spot duplications and redundancies, even when it’s as confusing as the same book being attributed to several different publishers, or the exact same book appearing under two very different titles. That drops the count down to 210 million.
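To make that de-duplication step a little more concrete, here is a minimal Python sketch of one way such clustering could work. It is not Google’s actual algorithm, which the blog post does not spell out in this detail. The field names (title, author, year, oclc and so on), the sample records and the matching rules are all assumptions made up for illustration: records are merged when they share a normalized title, author and year (ignoring the publisher), or when they share an identifier (which links records whose titles differ).

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so trivial variants match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_keys(record):
    """Yield every key under which this record could match another one."""
    # The bibliographic key deliberately ignores the publisher field.
    yield ("bib", normalize(record["title"]), normalize(record["author"]), record["year"])
    # Shared identifiers link records even when their titles differ.
    # Assumption: identifiers are trusted blindly here, which the article
    # itself warns is unsafe; a real system would have to weigh the evidence.
    for scheme in ("isbn", "oclc", "lccn"):
        if record.get(scheme):
            yield (scheme, record[scheme])


def cluster_records(records):
    """Group record indices that share at least one key, using union-find."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}
    for index, record in enumerate(records):
        for key in match_keys(record):
            if key in first_seen:
                parent[find(index)] = find(first_seen[key])
            else:
                first_seen[key] = index

    clusters = defaultdict(list)
    for index in range(len(records)):
        clusters[find(index)].append(index)
    return list(clusters.values())


# Entirely hypothetical sample records.
records = [
    {"title": "A Sample Tome", "author": "Jane Doe", "year": 1900,
     "publisher": "First Press", "oclc": "X1"},
    {"title": "A sample tome.", "author": "Jane Doe", "year": 1900,
     "publisher": "Second Press"},
    {"title": "An Alternate Title", "author": "Jane Doe", "year": 1900,
     "publisher": "Second Press", "oclc": "X1"},
    {"title": "Unrelated Work", "author": "John Roe", "year": 1950,
     "publisher": "First Press"},
]

clusters = cluster_records(records)
print(len(clusters), "distinct books among", len(records), "records")
```

In this toy data the first three records end up in one cluster: the second matches the first despite its different publisher, and the third joins through the shared identifier despite its different title.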
Then it’s on to excluding non-book items, which Google lists as “microforms (8 million), audio recordings (4.5 million), videos (2 million), maps (another 2 million), t-shirts with ISBNs (about one thousand) and turkey probes (1, added to a library catalog as an April Fools joke).”
Finally, Google reaches the number it has been looking for, and believes the count is a pretty reliable representation of the world’s books: 129,864,880. “At least until Sunday,” Google says.
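For readers who want to see the whole funnel at a glance, the short Python sketch below simply tallies the figures quoted above. The stage labels are paraphrases, and the inputs are the article’s rounded numbers (“close to a billion” is treated as exactly one billion), so the percentages are only approximate.

```python
# Record counts at each stage, as reported above (rounded figures).
stages = [
    ("raw records", 1_000_000_000),  # "close to a billion"
    ("after superficial duplicates are removed", 600_000_000),
    ("after attribute-based de-duplication", 210_000_000),
    ("final count of books", 129_864_880),
]

# Non-book items itemized in Google's quote.
non_books = {
    "microforms": 8_000_000,
    "audio recordings": 4_500_000,
    "videos": 2_000_000,
    "maps": 2_000_000,
    "t-shirts with ISBNs": 1_000,  # "about one thousand"
    "turkey probes": 1,
}

# Compare each stage with the one before it.
for (name, count), (_, previous) in zip(stages[1:], stages):
    print(f"{name}: {count:,} ({count / previous:.0%} of the previous stage)")

itemized = sum(non_books.values())
print(f"itemized non-book records: {itemized:,}")
```

The itemized non-book categories add up to roughly 16.5 million records, which is only part of the drop from 210 million to the final figure; the rest of that reduction is not itemized in the quoted list.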