Online Musings of a Public Historian

Posts tagged ‘culturomics’

What in the Wordle?

A few posts ago, we examined the concept of culturomic analysis and explored the Google Ngram Viewer, a digital research tool that tracks word-usage patterns across a large body of texts and links linguistic trends to specific historical periods. It is not the only word-based tool in the shed (so to speak): the Ngram Viewer is one of several similar culturomic platforms available on the web.

One such alternative is Wordle, a “word-cloud” generator that takes text as input and gives “greater prominence to words that appear more frequently in the source text.” Whereas the Google Ngram tool matches linguistic patterns in written documents with historical events and trends, Wordle serves more as a means of identifying a specific document’s major themes and features.

To test the generator’s efficacy, I uploaded an old research paper of mine discussing the circumstances surrounding England’s 1605 “Gunpowder Plot” and subsequent annual Bonfire Night celebrations commemorating the event.

Behold, the visually appealing result:

 


“Gunpowder, Treason, and Plot”
via Wordle.

On the whole, Wordle proved relatively successful in conveying the document’s key themes. Looking at the generated word cloud, we can get a general grasp of the individuals involved: the two figures most prominently associated with the plot, Robert Catesby and Guy Fawkes, appear as the two largest (most frequent) words, which accurately reflects their significance. The plot’s religious connotations (along with those surrounding the ensuing holiday celebrations) are similarly highlighted. Surprisingly absent, however, are any references to the Bonfire Night celebrations that followed the plot, on which a significant portion of the document focuses.

Additionally, several very common words (“although,” “new,” “well”) are given prominence within the cloud, along with a fair number of common names (“Robert,” “Elizabeth,” “John”) and word variations that are counted separately (“Plot” vs. “plot,” “Catholics” vs. “Catholic”).
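The behavior described above can be reproduced with a simple word count. What follows is only a minimal sketch in Python, not Wordle’s actual code: the function name word_weights, the short stop-word list, and the sample sentence are all made up for illustration. It shows why filler words and capitalization variants crowd a cloud unless they are filtered out or merged.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real tools use much longer ones.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was",
              "although", "well", "new"}

def word_weights(text, stop_words=STOP_WORDS, fold_case=True):
    """Count how often each word appears; in a cloud, a higher count means a bigger word."""
    words = re.findall(r"[A-Za-z']+", text)
    if fold_case:
        # Merge "Plot" and "plot" into a single entry.
        words = [w.lower() for w in words]
    # Drop filler words regardless of how they are capitalized.
    return Counter(w for w in words if w.lower() not in stop_words)

# Invented sample sentence, not taken from the research paper.
sample = ("The Plot failed. Although the plot was Catholic, "
          "not all Catholics joined the plot.")

print(word_weights(sample).most_common(3))
print(word_weights(sample, fold_case=False)["Plot"])
```

With case folding switched on, “Plot” and “plot” collapse into one entry, while “Catholic” and “Catholics” still count separately; merging those would require an extra step such as stemming, which this sketch leaves out.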

Despite these quirks, Wordle still serves as an effective means of tracking a document’s language patterns and highlighting its primary themes. The platform’s personalization features add to its appeal, and include numerous options for font, cloud layout, and color scheme.

Overall, the Wordle cloud generator offers users an entertaining, personalized way to obtain a general picture of a document’s contents and main ideas. The application works best as a starting point in the research process, however; users seeking a more in-depth analysis will be better served by supplementing it with other tools.

 

The Legalities of Culturomics

In previous posts, we’ve discussed issues of fair use and taken a look at the Google Books corpus and the new trend in culturomic analysis. Now, let’s do a mash-up of the two as we examine the legality of the TIME Magazine Corpus of American English:

Entry Page for the TIME Magazine Corpus


Working with Mark Davies, a corpus linguistics professor at Brigham Young University, TIME Magazine has put together its own text database through which users can:

…quickly and easily search more than 100 million words of text of American English from 1923 to the present, as found in TIME Magazine. You can see how words, phrases and grammatical constructions have increased or decreased in frequency and see how words have changed meaning over time.

Compared with the size and scope of the Google Books corpus project (5.2 million digitized books spanning a period of several hundred years), the 100 million words and eighty-year timespan (1923-2006) featured in the TIME corpus appear positively minuscule. This small scale is not necessarily a setback, however, particularly when it comes to matters of copyright and issues of fair use.

Part of the reason for the smaller scope of the TIME corpus is that all of its data is culled from the TIME Magazine archives. As such, all of the data within the corpus is owned by the entity maintaining it. This allows TIME to share the data without worrying about violating anyone else’s copyright, and also gives users of the corpus the opportunity to read the highlighted text in its original context, with access to full articles as they were initially published.

With ownership of all of its content, along with the resultant ability to offer further access to previous publications, the TIME corpus functions safely within the parameters of fair use. This is beneficial for corpus operators and potential users alike, allowing for the corpus to provide a meaningful research experience for all involved.

Quantifying Culture? Culturomics and the Google Books Corpus

Can tracing linguistic changes over time reflect shifts in cultural trends?

According to Jean-Baptiste Michel and the other minds behind the culturomic analysis movement, the answer is a resounding “yes.”

Working with the team responsible for the Google Books online collection, Michel and his fellow researchers constructed a corpus of almost 5.2 million digitized books. Using this Google Books corpus, the team of scholars conducted a quantitative study analyzing the relationship between linguistic and cultural change over the period between 1800 and 2000. Referring to this quantitative approach to measuring cultural trends as “culturomics,” Michel and company used their findings to produce the Google Ngram Viewer, an online research tool through which everyday users can conduct their own studies within the Google Books dataset. Users simply enter a word or phrase (called an “ngram”) into the Viewer’s search bar, and the Viewer produces a line graph charting that particular ngram’s level of usage within the corpus throughout the two-hundred-year timeframe the study samples.
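To make the idea of an ngram’s “level of usage” concrete, here is a minimal Python sketch of the kind of calculation involved: for each year, count how often the phrase appears and divide by the total number of same-length phrases published that year. This is not Google’s actual pipeline. The function names ngrams and yearly_share are hypothetical, and the two-year toy corpus is invented purely for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def yearly_share(corpus, phrase):
    """Map each year to the share of that year's n-grams that match the phrase."""
    target = tuple(phrase.lower().split())
    n = len(target)
    shares = {}
    for year, texts in sorted(corpus.items()):
        counts = Counter()
        for text in texts:
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        # Guard against a year with no text long enough to form an n-gram.
        shares[year] = counts[target] / total if total else 0.0
    return shares

# Invented toy data standing in for each year's digitized books.
toy_corpus = {
    1860: ["abraham lincoln wins the election", "voters back abraham lincoln"],
    1880: ["the railroad reaches the coast"],
}

print(yearly_share(toy_corpus, "Abraham Lincoln"))
```

Plotting those yearly shares as a line is, in spirit, what the Viewer’s graph displays. Dividing by each year’s total matters because far more books survive from 2000 than from 1800, so raw counts alone would exaggerate recent usage.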

Sample Ngram Viewer study tracing the name "Abraham Lincoln"


As seen in the sample Ngram Viewer study above, users can additionally use the data provided in the line graph to link particular peaks in ngram usage to significant historical events and/or cultural movements. Using the name of one of our nation’s more renowned leaders, “Abraham Lincoln,” as an example, we can see that the initial spike in usage of his name in published works falls (predictably) within the period of his election, his presidency, and the Civil War. Subsequent spikes occur in the years following World War I (a period of intense nationalism, during which Lincoln and other figures came to be looked upon as national heroes) and during the time surrounding the Civil Rights Movement, when the Emancipation Proclamation and the ending of slavery in the U.S. were strongly linked to the mid-20th century struggle for racial equality.

Michel and company make a strong case that the quantitative methods associated with culturomics can offer fuller insight into traditionally humanist topics, but it is clear that the field still has a long way to go. Glancing at the Culturomics FAQ page set up by the project’s participants, one can see that there are still several kinks to be worked out with this method of study, particularly in relation to the quality of data.

Despite the undoubtedly large size of the Google Books corpus, the 5.2 million digitized works still make up only around four percent of all published materials. Similarly, the study focuses primarily on the years between 1800 and 2000 (despite the presence of materials dating as far back as the 16th century), since the data originating in that period has proven most reliable.

Do these constraints undermine the quality of data produced by the Ngram Viewer? What can be done to widen these parameters? What should we as historians bear in mind while using the culturomic approach in our own work?