Google Fouls Up Again: Google Book Search is a Disaster for Scholars AND Copyright Owners

Google’s Book Search: A Disaster for Scholars. Now that title caught my eye, not least because it appeared in the Chronicle of Higher Education. The article is extraordinarily honest and well written, with solid research and supporting evidence.

We’ve become accustomed to librarians and academics uncritically fawning over the disaster that is Google Books (especially those privileged librarians among the sovereignly immune), but give this one by Professor Geoffrey Nunberg a read, particularly regarding the “metadata,” the fields in the Google Book Search that are supposed to contain information like year of publication, title, author, etc.

“Start with publication dates. To take Google’s word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler’s Killer in the Rain, The Portable Dorothy Parker, André Malraux’s La Condition Humaine, Stephen King’s Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams’s Culture and Society 1780-1950, and Robert Shelton’s biography of Bob Dylan, to name just a few. And while there may be particular reasons why 1899 comes up so often, such misdatings are spread out across the centuries. A book on Peter F. Drucker is dated 1905, four years before the management consultant was even born; a book of Virginia Woolf’s letters is dated 1900, when she would have been 8 years old. Tom Wolfe’s Bonfire of the Vanities is dated 1888, and an edition of Henry James’s What Maisie Knew is dated 1848….Google acknowledges the incorrect dates but says they came from the providers.”

Of course, Google claims, per usual, that these mistakes are the copyright owners’ fault–but what is interesting about this catalog of boneheaded errors is that the mistakes always seem to make the works OLDER than they actually are, and therefore more likely to be out of copyright and non-infringing, rather than NEWER than they actually are, and therefore more likely to be in copyright and infringing. In fact–“to take Google’s word for it”–it seems a safe guess that all the listed works would be in the public domain according to the incorrect dates that Google has placed in its metadata. Which benefits Google. Of course, not even Google can make a book into a public domain work when it isn’t, but the pattern does suggest that Google could say that an unlicensed work in copyright got into the claws of Google Books “by mistake”–an “innocent” infringement because the metadata said the work was in the public domain.

As one commenter to Professor Nunberg’s article notes: “Here’s another one: a recurring problem with date of publication is that all volumes of a journal are assigned the date of volume 1.” That is–the oldest possible date. God knows what they did with the sheet music.

The applications running the Google Books registry will need to distinguish works in copyright from works out of copyright, and that distinction turns on a date that matters a great deal under the settlement agreement. Where do you think that date is going to come from? It is starting to look like it will come from incorrect data–data that makes the publication dates MUCH OLDER than they actually are.

Who’s going to check to see that the millions of copyright dates are correct? Nobody. And it’s yet one more thing for the overburdened copyright owner to sort out as Google continues its “cultural rape.”

“Innocent” infringement versus “intentional” infringement makes a rather large difference in how the infringement is punished on judgment day–which would be the later of the date the plaintiff gets a final non-appealable judgment against Google for copyright infringement and the date the author dies penniless. An “innocent” infringement is also likely to foreclose criminal prosecution.

Perhaps this all has something to do with the mysteries of advertising placement? Professor Nunberg says he was told that “[t]he ad placement on Google’s book search right now is often comical, as when a search for Leaves of Grass brings up ads for plant and sod retailers.”

Hmmm. I think we noted that possibility two years ago in my review of “Google and the Myth of Universal Knowledge” where the absurdity of selling advertising in books was well argued by the prescient Jean-Noël Jeanneney, former president of the Bibliothèque Nationale of France:

“Recall Google CEO Eric Schmidt’s statements to the Wall Street Journal on the eve of the Viacom lawsuit: When asked to respond to the idea that “content” has intrinsic value, he said “prove it”. Which has to be one of the dumber, but yet illuminating, remarks to come from a Silicon Valley CEO on the subject of art and culture.

No wonder M. Jeanneney tells us that ‘[t]he visit I received from several Google executives after the beginning of my campaign [against Google Books] didn’t do much to reassure me.’

These statements echo and confirm one of the most important points raised in Google the Myth: M. Jeanneney writes, ‘What pays for the digitization of materials are linked advertisements from companies that have an interest in associating their image with old or recent works likely to promote that image. As a result, books will necessarily be hierarchized in favor of those best suited to satisfy the demands of advertisers, again, chosen according to the principle of the highest bidder [as is Google AdWords]. I wouldn’t want to see–although I’m amused by the thought–the text of Saint-Exupéry’s Le Petit prince accompanied by an ad for a sheep merchant.’”

Right on cue, Professor Nunberg tells us:

“Google’s fine algorithmic hand is also evident in a lot of classifications of recent works….[Google assigns a “]Religion[“] tag…to a 2001 biography of Mae West that’s subtitled An Icon in Black and White [and] the Health & Fitness label on a 1962 number of the medievalist journal Speculum….
But even when it gets the [bookseller’s standard] categories roughly right, the more important question is why Google would want to use those headings in the first place. People from Google have told me they weren’t included at the publishers’ request, and it may be that someone thought they’d be helpful for ad placement.”

So before you write off “cultural rape” as mere French “Yankee go home” hyperbole, think again.

Professor Nunberg sums it up: “[Y]ou need reliable metadata about dates and categories, which is why it’s so disappointing that the book search’s metadata are a train wreck: a mishmash wrapped in a muddle wrapped in a mess.”

Maybe it’s not hyperbole, and maybe it is cultural rape for real. All those statements about what a great idea Google Books is, how it will make “millions” of titles available to the public–maybe that is yet more evidence of Google’s charm offensive to mask what an unmitigated disaster this project is from a cultural, copyright, antitrust and now scholarly perspective.

It’s nice to find an academic being honest about Google’s screw-ups. Since Professor Nunberg is at Berkeley, Google might actually listen to him.

But it’s unlikely. The do-over to fix the metadata and cataloging system will take a very long time at vast cost. “Fixing” Google Books is not in Google’s interest and who can make them? As the scanning keeps going every minute of every day, it is becoming increasingly clear that Google thinks of Google Books as Google’s books–the entire intellectual capital of the world.

It shows again what happens when you put the sole retailer in charge of the metadata–the monopsonist buyer has no incentive to act to the benefit of its sellers, particularly when the sellers’ only enforcement mechanism is costly litigation against the monopsonist and the monopsonist can tap the public financial markets to raise its defense funds. (Those defense costs evidently are not deemed “material” by its public accounting firms, and thus the true litigation exposure is not reflected in its public financial filings.)

Swiss they ain’t.