To the casual observer, the political world appears more and more partisan with every new election cycle. Offering data to support that intuition during a recent Becker Brown Bag talk, Matt Taddy of the University of Chicago Booth School of Business presented the results of research he has been conducting with Matthew Gentzkow of Stanford University and Jesse Shapiro of Brown University. The project sets out to answer a seemingly basic question: Is partisan speech a new phenomenon?

The researchers measure partisan speech by counting how often words and phrases appear in politicians’ public statements and linking those counts to the speakers’ party affiliation. For example, Republicans are found to be much more likely to use terms such as “tax relief,” “illegal immigrant,” and “Islamic terrorist,” while Democrats are more likely to use “tax breaks,” “undocumented worker,” and “Muslim American.”
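As a rough illustration of that counting step, the sketch below tallies phrases by party and converts the tallies into each party's share of its own total speech. The toy `speeches` data and the function name `phrase_shares` are hypothetical and are not taken from the researchers' code or data.

```python
from collections import Counter

# Hypothetical toy data: one (party, phrases spoken) pair per speaker.
speeches = [
    ("R", ["tax relief", "tax relief", "illegal immigrant"]),
    ("R", ["tax relief", "budget"]),
    ("D", ["tax breaks", "undocumented worker", "budget"]),
    ("D", ["tax breaks", "tax breaks"]),
]


def phrase_shares(speeches):
    """Return {party: {phrase: share of that party's total phrase count}}."""
    counts = {}
    for party, phrases in speeches:
        counts.setdefault(party, Counter()).update(phrases)

    shares = {}
    for party, party_counts in counts.items():
        total = sum(party_counts.values())
        shares[party] = {p: n / total for p, n in party_counts.items()}
    return shares


shares = phrase_shares(speeches)
for party in sorted(shares):
    print(party, shares[party])
```

In this toy example "tax relief" appears only in the Republican tally and "tax breaks" only in the Democratic one, while "budget" has the same share in both, so it carries no partisan signal.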

To measure trends in the partisanship of political speech, Taddy and colleagues coded and analyzed data from the US Congressional Record from 1879 to 2009.

The immediate challenge, Taddy noted, is that speech is high-dimensional choice data: the vocabulary (the choice set) runs into the millions of phrases. In fact, the researchers found around seven million unique phrases used in Congress since the Civil War, most of which were used very rarely. Taddy also pointed out the potential for severe finite-sample bias: because the researchers observe only a limited amount of speech from each politician, they must infer what each politician would choose to say from a small sample.
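The finite-sample problem can be seen in a small simulation. In the sketch below, two hypothetical parties draw phrases from exactly the same uniform distribution, so their true difference is zero, yet a naive comparison of observed phrase shares reports a large gap when the vocabulary is big relative to the amount of speech. The total variation distance used here is a stand-in chosen for simplicity, not the researchers' partisanship measure, and all sizes are made-up values.

```python
import random
from collections import Counter


def total_variation(sample_a, sample_b):
    """Half the summed absolute difference between observed phrase shares."""
    counts_a, counts_b = Counter(sample_a), Counter(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    phrases = set(counts_a) | set(counts_b)
    return 0.5 * sum(abs(counts_a[p] / n_a - counts_b[p] / n_b) for p in phrases)


def simulated_gap(vocab_size, n_phrases, seed):
    """Naive gap between two parties that share one true phrase distribution."""
    rng = random.Random(seed)
    party_a = rng.choices(range(vocab_size), k=n_phrases)
    party_b = rng.choices(range(vocab_size), k=n_phrases)
    return total_variation(party_a, party_b)


# Same vocabulary, same true distribution; only the amount of speech differs.
small_gap = simulated_gap(vocab_size=1000, n_phrases=200, seed=0)
large_gap = simulated_gap(vocab_size=1000, n_phrases=50_000, seed=0)
print(f"gap with 200 phrases per party:    {small_gap:.3f}")
print(f"gap with 50,000 phrases per party: {large_gap:.3f}")
```

The measured gap shrinks toward zero only as the sample grows, which is why a naive estimate can overstate partisanship when most phrases are observed just a handful of times.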

To work with the raw data, Taddy and the other researchers tokenized it, breaking the text into pieces (usually words or short combinations of words). They worked mainly with word stems, removing prefixes and suffixes, and then tabulated these tokens for each speaker. The per-speaker counts showed that the amount of speech exploded around the 1960s, while partisanship didn’t increase dramatically until later. The increase in verbosity is most likely due to the rise of television.
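A minimal sketch of that tokenize, stem, and tabulate pipeline follows. The suffix list and the helper names `crude_stem` and `tokenize` are illustrative assumptions: a real analysis would use an established stemmer, and this toy version strips only a few English suffixes and ignores sentence boundaries when forming two-word phrases.

```python
import re
from collections import Counter

# Crude, illustrative suffix list; not a substitute for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase the text, stem each word, and return two-word phrases."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(w) for w in words]
    return [" ".join(pair) for pair in zip(stems, stems[1:])]


counts = Counter(tokenize("Cutting taxes helps workers. Tax cuts helped workers."))
print(counts.most_common(3))
```

Because "helps" and "helped" reduce to the same stem, the two sentences contribute to a single "help worker" count instead of two separate phrases, which is the point of stemming before tabulating.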

Taddy then described the model the researchers had set up. For each speaker it includes a vector of phrase counts, party affiliation, and speaker characteristics such as gender, birth year, state, and margin of election, as well as verbosity (the total amount that speaker speaks).

Taddy walked the student audience through the pitfalls and biases that arise when analyzing high-dimensional data with standard statistical methods. He showed how the researchers managed to eliminate most of the bias from the results, finding not only that the average amount of speech increased a great deal from the earliest sample to 2008, but also that speech became more partisan. Moreover, the researchers found that in the past, politicians from opposing parties talked about different topics; more recently, they have talked about the same topics using different party-line language. Interestingly, even when the researchers broke out the vocabularies by topic, they found that the partisan content of words has been diverging over time.

Taddy noted that partisanship in this model is similar to isolation in a neighborhood segregation model: the degree of racial isolation can be predicted from neighborhood characteristics, much as party affiliation can be predicted from word choice.
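One way to make the idea of predicting party from word choice concrete is sketched below. It is an assumption made for illustration, not necessarily the researchers' estimator: an observer with no prior lean hears a single phrase, uses Bayes' rule to guess the speaker's party, and the score is the average probability assigned to the correct party. The function name `partisanship` and the toy phrase shares are hypothetical.

```python
def partisanship(shares_r, shares_d):
    """Average probability a neutral observer gives the true party after one phrase.

    shares_r and shares_d map each phrase to its share of that party's speech.
    A score of 0.5 means phrases reveal nothing; 1.0 means they reveal everything.
    """
    score = 0.0
    for phrase in set(shares_r) | set(shares_d):
        r = shares_r.get(phrase, 0.0)
        d = shares_d.get(phrase, 0.0)
        if r + d == 0.0:
            continue
        # Posterior that the speaker is Republican, starting from a 50/50 prior.
        prob_r = r / (r + d)
        score += 0.5 * r * prob_r + 0.5 * d * (1.0 - prob_r)
    return score


same_language = partisanship({"budget": 0.5, "jobs": 0.5},
                             {"budget": 0.5, "jobs": 0.5})
split_language = partisanship({"tax relief": 1.0}, {"tax breaks": 1.0})
print(same_language, split_language)
```

When both parties use identical language the observer can do no better than a coin flip, and when their vocabularies never overlap a single phrase identifies the party with certainty. Real speech falls between these extremes.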

Amelia Snoblin