Saturday, December 28, 2013

COUNTING WORDS

(from Kirkus
by Gregory McNamee)

Merry Christmas.

It’s long been said that the phrase owes to Charles Dickens, and more specifically to the Charles Dickens of A Christmas Carol, a short novel published in 1843. A few sources claim that Dickens coined the phrase, others merely that he popularized it. It’s the kind of thing that lexicographers live for, the kind of thing that fuels rarefied research that would require decades of reading and a slew of grants, library support and assistants.

Or, failing that, a quick visit to the computer and the Google Books Ngram Viewer, which tells us that “Merry Christmas” came along in about 1800, spiked in the 1830s, dropped to almost nothing, rose again in the mid-1840s, and then steadily climbed in popularity, beating the stuffing out of its closest competitor, “Happy Christmas.”

There’s meaning to be drawn from those numbers: first, in establishing that Dickens didn’t coin it; second, in suggesting that he may very well have popularized it; and third, in proving once again that you can’t count on received wisdom for much of anything.

A second example: Eminent historians have maintained for decades that before the Civil War, Americans referred to the United States as a plural entity: “The United States are a nation devoted to liberty.” Come the Civil War and Union victory, and, presto, Americans referred to the country in the singular: “The United States is a nation devoted to liberty.” Received wisdom, once again. As Erez Aiden and Jean-Baptiste Michel point out in their new book, Uncharted: Big Data as a Lens on Human Culture, it is not so. Instead, they note, the transition from the plural to the singular “was gradual, starting in the 1810s and continuing into the 1980s—a span of more than a century and a half.”

It’s the “big data”—a term that itself, Google’s Ngram Viewer tells us, has been spiking in usage for the last 20 years—of Aiden and Michel’s subtitle that sheds light on such matters, data that in this case is derived from a robotic look at the vast corpus of words, all 500 billion or so of them, that Google has been assembling through, among other vehicles, its on-again, off-again, controversial program of scanning whole libraries into digital form. “There are no complex equations here,” data-science entrepreneur Michel tells Kirkus. “When it comes down to it, it’s simple: We’re counting words.”

The term “big data” has a bit of Big Brother about it in a time when the National Security Agency, it seems, knows more about us than we do. But, says geneticist Aiden, big data “is incredibly democratizing. There’s a first-person adventure story possible in big data, where a student can approach a very big question in very simple ways and make big discoveries.” Adds Michel, “This is especially true of bodies of data that are way too large to go through by hand,” among them the thousands of books that the Ngram Viewer sorts through to answer questions about the prevalence of phrases such as “the United States is” and “Merry Christmas.”

Michel, who, like Aiden, is in his early 30s and therefore too young to remember much of the world without computers, recalls that a few years ago the two gave a presentation at a library conference in which they talked about how to use Ngrams to interrogate vast bodies of data. It was the first time, in fact, that what would become the Ngram Viewer was shown outside Google proper. “Librarians are supposed to be pretty quiet,” he says, “but at the end of the talk, after we showed our results, we asked if there were any queries. They went wild. They were yelling over each other: Try ‘pirates’! Try ‘ninja’! Try ‘vampires’! They were a raucous bunch.” Adds Aiden, “In the early searches, there were lots of swear words, too.”

It’s easy enough to try for yourself, as I did, testing “ninja.” It’s said that Ian Fleming introduced the Japanese term into English in one of his James Bond novels, You Only Live Twice. He didn’t, and though Ngram doesn’t quite have the power yet to home in on the very first appearance of the term (for that we’ll have to scan billions more words into the corpus, a project on which thorny copyright issues still ride), what it does show is that sure enough, after 1964, when Fleming’s book appeared, the term becomes steadily more common, rising to its current everyday status in our culture.

Swear words and vampires aside, Aiden, who is now on the faculty of the Baylor College of Medicine, recalls talking with a couple of dermatologists who had made discoveries about several obscure medical conditions simply by asking the right Ngram questions. The doctors were probably very good at their work, but they were also not trained in the arts of Boolean logic or the fine details of search-engine syntax. Instead, they were ordinary people making use of data that was meaningful to them—another instance, the authors note, of the democratizing possibilities of big data made freely available.

“The earliest optical lenses helped us look into the cosmos and ask questions about the biology of organisms,” says Aiden. “The lens in itself is a pretty simple device, but it can do cool stuff. In the same way, the big data lens we’re talking about here is pretty simple, but it allows us to ask really meaningful questions about ourselves. We can repartition all kinds of research. Ordinary people without vast amounts of training can ask exciting questions. I bet that in 20 years, there will be hundreds of major breakthroughs in our knowledge—and that half of them will have been pulled off by teenagers.”

“We hope that, with our book, our contribution is about the possibility that anyone can have a big data adventure,” Aiden adds. “And we hope that our readers will feel energized in knowing that we now have measurement tools that allow us to look at things that we could never have imagined measuring before.” How many times “Merry Christmas” appeared in books before 1843 is just a start, and Uncharted makes for a fine user’s manual in a discipline that we have only just begun to explore.


Gregory McNamee is a contributing editor at Kirkus Reviews.
