Online Musings of a Public Historian

Quantifying Culture? Culturomics and the Google Books Corpus

Can tracing linguistic changes over time reflect shifts in cultural trends?

According to Jean-Baptiste Michel and the other minds behind the culturomic analysis movement, the answer is a resounding “yes.”

Working with the team responsible for the Google Books online collection, Michel and his fellow researchers constructed a corpus of almost 5.2 million digitized books. Using this Google Books corpus, the team conducted a quantitative study of the relationship between linguistic and cultural change between 1800 and 2000. Calling this quantitative approach to measuring cultural trends “culturomics,” Michel and company used their findings to produce the Google Ngram Viewer, an online research tool through which everyday users can conduct their own studies within the Google Books dataset. Users simply enter a word or phrase (called an “ngram”) into the Viewer’s search bar, and the tool generates a line graph charting that ngram’s level of usage within the corpus across the two-hundred-year timeframe the study samples.
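For readers curious about what sits behind such a graph, here is a minimal sketch in Python of the general idea: count how often a phrase appears in each year's texts and divide by the total number of phrases of the same length published that year. This is an illustration under simplifying assumptions, not the culturomics team's actual code. The function name, the toy corpus, and the naive whitespace tokenization are all hypothetical.

```python
from collections import Counter


def ngram_frequency_by_year(corpus, phrase):
    """Return {year: share of that year's n-grams that match `phrase`}."""
    target = tuple(phrase.lower().split())
    n = len(target)
    hits, totals = Counter(), Counter()
    for year, text in corpus:
        # Naive tokenization (assumption): lowercase and split on whitespace.
        words = text.lower().split()
        # Slide a window of n words across the text.
        grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        hits[year] += grams.count(target)
        totals[year] += len(grams)
    # Normalize by yearly totals so years with more text do not dominate.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


# Hypothetical toy corpus of (publication year, text) pairs; not real data.
toy_corpus = [
    (1860, "abraham lincoln won the election"),
    (1860, "the election divided the nation"),
    (1900, "the nation remembered the war"),
]

series = ngram_frequency_by_year(toy_corpus, "Abraham Lincoln")
```

Plotting the values in `series` against their years would give a (very small) version of the line graph described above. Dividing by each year's total matters because far more books survive from some years than from others, so raw counts alone would be misleading.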

Sample Ngram Viewer study tracing the name “Abraham Lincoln”

As seen in the sample Ngram Viewer study above, users can also use the line graph to link particular peaks in ngram usage to significant historical events and cultural movements. Using the name of one of our nation’s most renowned leaders, “Abraham Lincoln,” as an example, we can see that the initial spike in usage of his name in published works falls (predictably) within the period of his election, his presidency, and the Civil War. Subsequent spikes occur in the years following World War I – a period of intense nationalism, during which Lincoln and other figures came to be looked upon as national heroes – and during the time surrounding the Civil Rights Movement, when the Emancipation Proclamation and the ending of slavery in the U.S. were strongly linked to the mid-20th century struggle for racial equality.

Michel and company make a strong case that the quantitative methods associated with culturomics can offer fuller insight into traditionally humanist topics, but it is clear that the field still has a long way to go. Glancing at the Culturomics FAQ page set up by the project’s participants, one can see that there are still several kinks to be worked out with this method of study, particularly in relation to the quality of the data.

Despite the undoubtedly large size of the Google Books corpus, the 5.2 million digitized works still make up only around four percent of all published materials. Similarly, the study focuses primarily on the years between 1800 and 2000 (despite the presence of materials dating as far back as the 16th century), since the data from that period have proven most reliable.

Do these constraints undermine the quality of data produced by the Ngram Viewer? What can be done to widen these parameters? What should we as historians bear in mind while using the culturomic approach in our own work?