
Why is there a Google Books? Part 1: Machine Translations and Customer X

January 2, 2014

The Authors Guild has appealed the ridiculous Google Books decision imposed on the world’s authors courtesy of Google’s Silicon Valley hotshot lawyer Daralyn Durie (who is currently representing GoldieBlox against the Beastie Boys, and who appears to be evolving a subspecialty of attacking artists–even if it means unseating the Orrick firm in the rush to get over on creators, whom another Silicon Valley lawyer called those “really good looking people”).

It’s worth taking a longer look at exactly what Google Books is and who it may be meant to serve.

The Ghost in the Machine

Perhaps you have never heard of machine translations, but if you use one of the many translation algorithms available for “free” online, you have experienced a translation by a machine.  The video above is a good summary of how Google uses machine translation of a particular kind–a “corpus machine translation.”

Simply put, “corpus machine translation” is an offshoot of speech recognition, often studied under the name “language technologies” or something similar.  (Carnegie Mellon University has a “Language Technologies Institute,” for example.  It is chaired by Dr. Jaime Carbonell, faculty advisor to a number of Googlers and formerly Chief Scientist at a company called Meaningful Machines.  More about them later.)

The way this works with text-based translations is that machines are taught to recognize written speech patterns in a language.  If the machine wants to translate a sentence from English to French, for example, rather than teaching the machine a language the way humans would study it (conjugating verbs, for example), the machine learns phrases, sentences and expressions–or “strings” of text.

When a machine “translates” a sentence (or string of text) from English to French, it first tries to match the text string in its English “memory” of text strings (a large database).  Then it will try to compare that string to what Secretary Rumsfeld might call a “known known”–a corresponding text string in its French database that the machine has been told is an exact or good enough match to the English string.

A good way to accomplish this is with books in translation.

How would the machine know that?  One way would be if the machine had scanned into its English database a book in English, say Bonfire of the Vanities by Tom Wolfe, that had been translated into French and that had the French translation mapped to the English version so the machine could compare the two.  Or if it had a book in French, say L’Être et le néant by Jean-Paul Sartre, that had been translated into English (Being and Nothingness) and mapped to the French so the machine could compare the two.

If the machine had more than one translation of Bonfire of the Vanities and Being and Nothingness and could compare a number of examples, then the machine could take advantage of all the work done by the translators (and the publishers that paid the translators for their work) and reliably compare strings of text.  The user of the machine could then have a high degree of confidence that the two strings really did mean what they should mean based on their sequential location in the two books, fifty books, or as many languages as the publishers had the books translated into.
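To make the idea concrete, here is a minimal sketch in Python of a translation “memory” built from aligned strings.  It is only an illustration of the general technique described above, not a description of how Google or anyone else actually does it.  The sentence pairs are invented, and the names build_memory and translate are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical aligned strings, as if taken from several English/French
# editions of the same book (invented for illustration only).
aligned_pairs = [
    ("good morning", "bonjour"),
    ("good morning", "bonjour"),
    ("good morning", "bon matin"),
    ("thank you", "merci"),
]


def build_memory(pairs):
    """Count how often each French string is aligned to each English string."""
    memory = defaultdict(Counter)
    for english, french in pairs:
        memory[english][french] += 1
    return memory


def translate(memory, english):
    """Return the most frequent aligned string and its share of the evidence."""
    candidates = memory.get(english)
    if not candidates:
        return None, 0.0  # no "known known" for this string
    french, count = candidates.most_common(1)[0]
    return french, count / sum(candidates.values())


memory = build_memory(aligned_pairs)
best, confidence = translate(memory, "good morning")
```

The more translations agree on a match, the larger that share becomes, which is the sense in which additional translated editions raise the user’s confidence.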

And of course if you have very popular books that have gained a worldwide audience through the talent of the author and the marketing work and expense of the publisher, then those books will likely be translated into many languages.  Perhaps dozens.  Whether it’s the King James Bible, the Koran, The Cult of the Amateur or Harry Potter, the more the book has been translated, the more accurate the machine learning and the better it is for the machine seeking to do the translating.

Or the owner of the machine.

Now consider machine translation at scale.  If one had a very large number of books to make available to the machine translator, then the machine could begin to look for statistically significant text strings that recur in multiple books.  The more books the machine translator had available to it, the more precise its recognition of these strings would be, because it would have an ever larger basis–or “corpus”–on which to recognize text strings.  The more examples of these strings were available to it, the more likely it is that the machine could recognize ever more subtle differences in the way these strings appear–one might call that “context.”
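A rough sketch of what “strings that recur in multiple books” could mean in practice: count short word sequences and keep the ones that show up in more than one book.  This is a toy illustration under my own assumptions (two-word strings, three invented one-line “books”, and a hypothetical function named recurring_strings).  A real system would use far more data and a proper statistical test.

```python
from collections import Counter


def ngrams(text, n):
    """Split text into overlapping strings of n consecutive words."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def recurring_strings(books, n=2, min_books=2):
    """Return n-word strings that appear in at least `min_books` books."""
    book_counts = Counter()
    for text in books:
        # Use a set so each book counts a given string only once.
        book_counts.update(set(ngrams(text, n)))
    return {s: c for s, c in book_counts.items() if c >= min_books}


# Invented stand-ins for scanned books.
books = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a bird sat on a wire",
]
common = recurring_strings(books)
```

With only three tiny “books” very little recurs; the point of scanning millions is that the counts become large enough to trust.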

This would–of course–require digitizing very large numbers of books, and then teaching machines to recognize these strings, or to “read” these strings and the translations of these strings.  This is a very costly process, so if it were going to be done it would need to be done by a private company that was willing to spend a good deal of money to do all the scanning and teaching necessary to scale up this corpus-based machine translation.

And even at that, if the private company were going to license millions upon millions of books, that would be a prohibitively costly project because, of course, the authors, illustrators, photographers, songwriters and publishers would have to be paid for the in-copyright works, and some libraries would have to fork over the out-of-copyright books for free.  In some cases, the delicate, rare and impossible-to-replace out-of-copyright books.

As you know from having used online translation algorithms such as Google’s translator, these devices are offered at no charge to the public.  So if a private company such as Google were going to go to the trouble and expense of the scanning and programming necessary to accomplish enterprise level translation with a high degree of accuracy, at some point they would need to have a customer who would pay money for this service.  They’d probably want to have that customer before they started scanning the millions of books necessary to create this highly reliable translation service.

What kind of customer might that be?  Not someone who was translating a letter to their Polish grandma.  Not someone who wanted to see what a Russian blogger had posted.  Not even a million such people.

No, the kind of customer who would be interested in a highly robust, best in breed, first in class translation engine would be someone who had a bunch of stuff to translate.  Someone who had millions, maybe billions of pages of text they needed to translate and analyze. Maybe someone who collected that scale of text strings to translate on a daily basis.  And maybe that customer, let’s call them Customer X, might have a certain phobia about sunlight.  Maybe Customer X would really like to get this done, but also really didn’t want it known that they were doing it.  Customer X might be that kind of special someone who liked to keep things–shall we say–on the down low.

Or as some might say–liked to keep things…secret.

Yet Customer X was collecting all this data and keeping a beady eye on how high that stack was getting, how big their backlog was growing.  Customer X kind of needed to get this translation thing underway, knowing that it would take a good long time to scan all those books–I mean, do all that digitizing for the Digital Library of Alexandria for all things good and true, pure souled and high minded.

Not to mention how would Customer X get their hands on that many books to scan, and what possible reason could they give to…librarians…for why they needed to check out all the books in their library?  You know, all their buddies in the…librarian…community.

Because Customer X typically did not vote along the same lines as the typical librarian.  And Customer X couldn’t just walk up to these librarians, flash their general’s stars and let them in on what was up with those billions of pages of data that Customer X might not be supposed to have collected.  Now that would be an awkward conversation!

No, Customer X needed someone to…what should we call it…help.  Someone whose intentions would not be questioned, someone who was…well, not evil, you know.

And just like DARPA changed the name of Total Information Awareness to Terrorist Information Awareness, this project would have to have a catchy name that would be an excellent and appealing cover–something like the digital library of Alexandria.

The private company involved would have to be willing to keep all of this very secret–and of course no one could know that this company worked for Customer X because that was itself going to need to be a secret.  How would this be accomplished?  Well, maybe Customer X or its friends could arrange something appealing to the top brass at the contractor, perhaps…oh, maybe some real estate.  You know, like an airfield to keep some planes nearby and a good price on some gasoline.  Customer X had a bunch of smart guys, they’d surely think of something.

In his prescient 2006 book Chatter: Dispatches from the Secret World of Global Eavesdropping, Patrick Radden Keefe observed (at p. 141):

It seems strange that while the technology of espionage has progressed to the point where satellites thousands of miles above the earth can take a clear photograph of a license plate in enemy territory, the challenges posed by the varieties of human speech have remained largely insurmountable.  In 2002, the [National Security Agency] made the rare move of publicly advertising for trained linguists, specifying which languages they wanted, and in so doing acknowledging a dire institutional weakness….In keeping with the tendency to continually reinvest in what it has traditionally been good at, even when that is not precisely what is currently required, the intelligence community proceeded after September 11 to collect even more intercepts–far more than it was able to translate.  As a result, there was a massive backlog of documents and recorded phone calls, sitting in databases, untranslated and unexamined.

So what to do with this massive backlog of untranslated intercepts?  Keefe continues (at p. 145):

[W]hile it may be crucial to maintain a good stable of linguists in the meantime, the end goal [of the NSA] will still be to work toward new technological solutions.  One such solution is automated translation, and that is what Meaningful Machines does.

Why is the Metadata So Bad?

During the pre-Edward Snowden years when I first discussed the idea that Google Books was never intended to be a “Digital Library of Alexandria” but instead was assembled for the value of the nondisplay uses of the digitized text including massive amounts of translations for Customer X, I got a lot of rolling of the eyeballs.

No longer.

Here was another disclosure (we call these “clues”)–Google received many, many complaints about the lack of care paid to the Google Books metadata, i.e., the information about the information in the books Google was scanning.  (Authors’ names, titles and copyright dates are all examples of metadata.  When you are building a registry of copyrights, you want to keep a good eye on that stuff.)  This criticism came almost immediately after Google started the books project in 2004–shortly after the NSA advertised its need for linguists (according to the FY 2012 “black budget” released by Snowden, the government still pays significant bonuses for linguists).

Criticism of the metadata came swiftly from a variety of sources.  Jean-Noel Jeanneney, then president of the National Library of France, published Google and the Myth of Universal Knowledge in 2006 (based on an op-ed from Le Figaro in 2005).  Jeanneney was sharply critical of the sloppy digitizing of the works of leading French authors.  In 2009, Professor Geoffrey Nunberg published “Google’s Book Search: A Disaster for Scholars” in the Chronicle of Higher Education.

As someone who worked on assembling a registry of music that included over 10 million titles, I can tell you that if you let mistakes in the metadata go for several weeks, let alone several years, you will have a devil of a time extracting that evil from that machine.  And if you have a corpus of works that are not properly organized, they are eventually going to be near useless if your goal is to find particular works.

And that is the goal of Google Books, right?  “Organizing the world’s information” and all that?

But you have to ask yourself: if they care about finding things–as opposed to digitizing and owning a massive amount of text–why wouldn’t Google fix the metadata?  What good is the “Digital Library of Alexandria” if you can’t rely on the card catalog?

An illustration from Professor Nunberg:

Start with publication dates. To take Google’s word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler’s Killer in the Rain, The Portable Dorothy Parker, André Malraux’s La Condition Humaine, Stephen King’s Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams’s Culture and Society 1780-1950, and Robert Shelton’s biography of Bob Dylan, to name just a few. And while there may be particular reasons why 1899 comes up so often, such misdatings are spread out across the centuries. A book on Peter F. Drucker is dated 1905, four years before the management consultant was even born; a book of Virginia Woolf’s letters is dated 1900, when she would have been 8 years old. Tom Wolfe’s Bonfire of the Vanities is dated 1888, and an edition of Henry James’s What Maisie Knew is dated 1848.
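Errors of the kind Professor Nunberg describes are not hard to flag automatically, at least in principle.  Here is a minimal sketch of such a check, assuming a hypothetical record format with a publication year and the birth year of the book’s subject or author.  The field names and the second record are invented, and I have no idea what Google’s actual metadata looks like.

```python
def implausible_dates(records):
    """Return titles whose publication year precedes the subject's birth year."""
    return [
        r["title"]
        for r in records
        if r["pub_year"] < r["subject_born"]
    ]


records = [
    # Mirrors the Drucker example: dated 1905, four years before his birth.
    {"title": "A book on Peter F. Drucker", "pub_year": 1905, "subject_born": 1909},
    # Invented control record with a plausible date.
    {"title": "Hypothetical later study", "pub_year": 1990, "subject_born": 1909},
]
flagged = implausible_dates(records)
```

Flagging is the easy part.  Deciding what the correct date actually is still takes a human looking at the book.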

But maybe we should not assume that getting the metadata right was much of a concern for Google, or even getting the scanning done correctly.  Because the sad state of the metadata is not improving: Last week The New Yorker included an article about web-based projects documenting vast numbers of “glitches” (an overused apologia) in the scans, as first noticed by Mr. Jeanneney nine years ago.

I would suggest to you that nine years of mistakes in a corpus of untold millions of books–emphasis on the “untold” as we will see–would have to be fixed by humans, one by one.  That is, practically speaking, impossible to fix. This is likely why Mr. Jeanneney, Professor Nunberg and others are so offended by all the mistakes.  Professor Nunberg makes a very important point:

Google’s book search is clearly on track to becoming the world’s largest digital library. No less important, it is also almost certain to be the last one. Google’s five-year head start and its relationships with libraries and publishers give it an effective monopoly: No competitor will be able to come after it on the same scale. Nor is technology going to lower the cost of entry. Scanning will always be an expensive, labor-intensive project. Of course, 50 or 100 years from now control of the collection may pass from Google to somebody else—Elsevier, Unesco, Wal-Mart. But it’s safe to assume that the digitized books that scholars will be working with then will be the very same ones that are sitting on Google’s servers today, augmented by the millions of titles published in the interim.

This is why you do it right the first time.

If your purpose is to do it right.

Continued in Part II: HAL Gets A Friend, or The Voice of the Machine

  1. AudioNomics
    January 2, 2014 at 12:26

    That actually made a heck of a lot of sense.
    … now, who do we sue to get our dues? The NSA? or Google? (both?)

    Maybe we should start a ‘whistle-blower’ fund to entice insiders to blow the whistle on these rape-and-pillage methodologies
