Editorial Bias in Claremont Student Newspapers
I used natural language processing (NLP) algorithms to analyze the five student newspapers of the Claremont Colleges.
Highlights of the analysis include that The CMC Forum has the most objective reporting,
and that nobody likes to write articles about Harvey Mudd College.
The Claremont Colleges have 5 student newspapers:
The Student Life,
The Scripps Voice,
The Golden Antlers,
The CMC Forum,
The Claremont Independent.
All together, they've published more than ten thousand articles online.
My goal was to understand what types of articles each of these papers publishes.
To solve this problem, I first downloaded all the articles from each of these newspapers.
Then I applied three different natural language processing techniques to analyze the dataset:
Writing Style Analysis,
Supervised Topic Analysis,
and Unsupervised Topic Analysis.
Each of these techniques has its own section in the writeup below.
I used the NovichenkoBot web scraper to download all the articles that the Claremont student newspapers have ever published online.
NovichenkoBot is a tool that I wrote as part of a larger research project for analyzing the bias in online news.
It takes a newspaper's web address as input (e.g. https://tsl.news for The Student Life),
downloads the newspaper's entire website,
and then extracts all the articles from the website.
For each article, the program also outputs lots of metadata like the date that the article was published.
(The process of extracting articles from a website is surprisingly difficult because not all URLs on a domain correspond to a unique article. For example, the URL https://tsl.news/category/life-and-style/ is not an article for The Student Life, but rather a list of articles. The NovichenkoBot program is able to identify this fact and does not include the contents of these pages in the list of articles. The NovichenkoBot is also able to identify ads and other content that is not part of the article's main body of text, and ensures that these extraneous pieces of information are not included in the analysis.)
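To illustrate the difference between article pages and listing pages, here is a minimal sketch of a URL-based filter. It is not how NovichenkoBot works internally (the writeup does not describe that); the function name, the set of path markers, and the example article URL are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Assumed markers of listing pages; this set is a guess for illustration
# and is not taken from NovichenkoBot.
LISTING_SEGMENTS = {"category", "tag", "author", "page", "archive"}


def looks_like_listing_page(url: str) -> bool:
    """Return True if the URL path suggests a list of articles rather than one article."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        # A bare domain is a homepage, which lists articles.
        return True
    return any(s.lower() in LISTING_SEGMENTS for s in segments)


# The category URL from the text is flagged; a hypothetical article URL is not.
print(looks_like_listing_page("https://tsl.news/category/life-and-style/"))
print(looks_like_listing_page("https://tsl.news/hypothetical-article-slug/"))
```

A real scraper would likely also need to inspect the page contents, since URL patterns differ from site to site.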
The following plot shows the total number of articles published by Claremont student newspapers in recent years:
Several of the newspapers are significantly older than the plot implies.
The Student Life has been publishing continuously since 1889,
and The Scripps Voice has been publishing continuously since 1998.
But their webpages only contain more recent articles,
and this analysis only covers the online articles.
From this plot, we can easily see that The Student Life is the most active of the student newspapers,
publishing about two-thirds of all articles.
Another interesting trend is that The CMC Forum has been publishing fewer articles every year since 2013.
This is the year that The Claremont Independent was founded,
and so it seems likely that many of the writers who would have otherwise contributed to The Forum are contributing to The Independent instead.
Writing Style Analysis
I began my analysis by calculating some simple statistics about the articles each paper publishes.
The following two figures use box and whisker plots to show the typical number of words in the titles and bodies of the articles for each paper.
The red bar indicates the median number of words in an article's title or body.
(The "median length" is a better measure than the "average length" of titles/bodies for our purposes because it is less susceptible to outliers. In particular, it gives us a good idea of the length of a "typical" article we might read from each paper if we picked one at random.)
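A small example shows why the median is less sensitive to outliers than the average. The word counts below are made up for illustration and do not come from the dataset.

```python
from statistics import mean, median

# Made-up article lengths (in words); the last one is an unusually long outlier.
word_counts = [400, 450, 500, 550, 5000]

print(mean(word_counts))    # pulled upward by the single long article
print(median(word_counts))  # stays at the middle value
```

Here the single long article pulls the average well above the length of every other article, while the median is unaffected.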
The first thing I noticed about these plots is that The Golden Antlers has the longest titles and the shortest articles.
This makes sense because they are a satire paper,
and the main joke of each article is in the article's click-baity title.
I also noticed a similar trend in Claremont's two consortium-wide non-satirical papers.
Compared with The Student Life, The Independent uses fewer words in their titles but has longer stories.
This suggested to me that The Independent is likely to have a more academic and in-depth writing style, whereas The Student Life is likely to focus more on traditional journalism.
To test this hypothesis, I used Python's textstat library to calculate the Coleman-Liau Index for each document.
This index estimates the "grade level" that the document is written at.
For example, a document with Coleman-Liau Index 8 would be appropriate for someone who has graduated middle school,
12 would be appropriate for someone who has graduated high school,
and 16 for someone who has graduated college.
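For reference, the Coleman-Liau Index is computed as 0.0588 L - 0.296 S - 15.8, where L is the average number of letters per 100 words and S is the average number of sentences per 100 words. The sketch below implements this formula directly with a simple regex tokenizer, which is my own assumption. The textstat library exposes the same measure as `textstat.coleman_liau_index(text)`, and its scores may differ slightly because it splits words and sentences differently.

```python
import re


def coleman_liau_index(text: str) -> float:
    """Approximate Coleman-Liau grade level using naive word/sentence splitting."""
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    letters = sum(ch.isalpha() for word in words for ch in word)
    letters_per_100_words = letters / len(words) * 100
    sentences_per_100_words = len(sentences) / len(words) * 100
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


simple = "The cat sat. The dog ran."
dense = "Administrators implemented comprehensive institutional restructuring."
print(coleman_liau_index(simple) < coleman_liau_index(dense))
```

Because the formula depends only on letters per word and sentences per word, a text can score higher either by using longer words or by using longer sentences.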
As before, I plotted the results with a box and whisker plot:
We can see that the typical article in The Independent is written at a level roughly 1.5 grades higher than The Student Life,
and more than two grade levels higher than the other papers.
This advanced reading level is not due to longer or more complex sentences.
As seen below, The Independent and The Student Life have similar sentence lengths,
with The Student Life actually being slightly longer.
Instead, it is due to the use of larger, more difficult words.
The following chart shows the number of rare words used per article in each newspaper
(a rare word is defined to be a word that is not in the top ten thousand most commonly used English words):
And the following chart shows the average size of words in each article:
From both of these charts, we can see that The Independent uses rarer, longer words than the other venues,
and this vocabulary choice is what results in their more advanced reading level.
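Both statistics are straightforward to compute. The sketch below assumes a set of common words is already available; the tiny `common_words` set here is a stand-in for the real ten-thousand-word list, and the tokenizer and function names are my own choices.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def count_rare_words(text: str, common_words: set[str]) -> int:
    """Count the words that do not appear in the list of common words."""
    return sum(1 for word in tokenize(text) if word not in common_words)


def average_word_length(text: str) -> float:
    """Average number of characters per word (0.0 for empty text)."""
    words = tokenize(text)
    return sum(len(word) for word in words) / len(words) if words else 0.0


# Stand-in for the top ten thousand most common English words.
common_words = {"the", "paper", "is", "a"}
sentence = "The paper is a hagiography"
print(count_rare_words(sentence, common_words))
print(average_word_length(sentence))
```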
This analysis raises two questions:
Is The Independent's advanced reading level a sign that they are writing better articles that contain more nuanced discussion?
Or is it a sign that they're making obfuscated arguments that are intentionally confusing?
Unfortunately, these are questions that a machine can't answer for us and are up for interpretation.
Next, I wanted to understand whether the different papers generally say positive or negative things.
That is, I wanted to analyze the sentiment expressed in the papers.
Python's textblob library is a popular tool for sentiment analysis.
It uses a simple algorithm that associates each word in the English language with a score between -1 and +1 (for bad and good sentiments respectively).
For example, the word "evil" has score -1,
the word "newspaper" has score 0,
and the word "happy" has score +1.
The sentiment of a sentence is then the average sentiment of all its words.
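The averaging scheme described above can be sketched in a few lines. This is a simplified illustration, not textblob's actual implementation, which is more involved; the toy lexicon contains only the three example words from the text, and words missing from it are assumed to score 0. In textblob itself, the scores are available as `TextBlob(text).sentiment.polarity` and `TextBlob(text).sentiment.subjectivity`.

```python
import re

# Toy lexicon using the three example scores quoted in the text.
TOY_LEXICON = {"evil": -1.0, "newspaper": 0.0, "happy": 1.0}


def average_polarity(text: str, lexicon: dict[str, float]) -> float:
    """Average the per-word scores; words missing from the lexicon count as 0."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(lexicon.get(word, 0.0) for word in words) / len(words)


print(average_polarity("happy newspaper", TOY_LEXICON))
print(average_polarity("evil", TOY_LEXICON))
```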
Using this tool, I calculated the sentiment for each article's title,
and the results are shown below.
We can see that The Golden Antlers publishes the most positive posts, which makes sense for a satire paper.
According to textblob, the most positive posts of all time from The Golden Antlers are:
- CMC: The Best God-Damn Campus in America
- Impressive First-Year Pomona Class Includes Jesus Christ himself
- Princeton Review Rates Claremont McKenna “#2” in “Best SAT Scandal”
Of course, these titles are "positive" in only a superficial way and are actually making fun of CMC and Pomona.
But algorithms can't (yet) measure sarcasm.
I was more surprised to see that The Scripps Voice has the most negative titles.
So again I looked at the three most negative titles:
- Sorry Claremont, Stag Party was Disgusting
- "Women Are Just Bad at Math" Scripps Administration Admits
- “She’s Just So Desperate”
What these articles have in common is that they are criticizing how women are treated and viewed by society.
It definitely makes sense that the newspaper of a women's college would have a lot of negative things to say about how women are treated.
My next task was to determine how subjective the different newspapers are,
and Python's textblob library also contains a tool for measuring the subjectivity of text.
The algorithm works similarly to the sentiment analysis algorithm above.
Each word is given a subjectivity score between 0 and 1,
and then the overall subjectivity of a sentence is the average subjectivity of the component words.
The resulting subjectivity scores are plotted below.
We can see that The Golden Antlers has the most subjective text,
which makes sense because it is a satire paper with over-the-top headlines and articles.
The remaining papers are all relatively objective,
with The Forum being the most objective of them all (but just barely).
(Recall that the red bar in the box and whisker plot indicates a median, so more than half of all articles from The Forum, The Independent, The Scripps Voice, and The Student Life have zero subjectivity (or equivalently 100% objectivity) according to textblob.)
Supervised Topic Analysis
My next goal was to learn about *what* the papers are writing about rather than *how* they are writing it.
Data scientists call this procedure Topic Analysis,
and it has two forms: supervised and unsupervised.
Supervised topic analysis is the simplest version,
and so that's what I started with.
In supervised analysis, we are given a list of topics, and want to know how frequently each newspaper discusses each topic.
For my analysis, I made the list of topics be the five undergraduate Claremont colleges.
I counted the total number of times that articles written in each of the 5 papers mention each of the 5 colleges, and plotted the results below.
(Naively, we can generate these counts with a simple for loop over the entire dataset. There are two complicating factors, however, that need to be considered.
First, some of the colleges can be referred to in multiple ways. For example, Claremont McKenna College could be referred to by its full name, "Claremont McKenna", "McKenna", or "CMC". The code counted the number of times any of these synonyms was mentioned.
Second, pronouns might sometimes refer to a college, and we want to count these references as well. Therefore, before looping over the files to perform the count, I performed a procedure called "coreference resolution" that replaces each pronoun in the text with the subject it refers to. I used the neuralcoref package to do this substitution.)
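The synonym-aware counting step might look roughly like the sketch below. Only the Claremont McKenna aliases come from the text; the aliases for the other four colleges, and all of the names in the code, are assumptions. The coreference resolution step is omitted, so the input is assumed to be text in which pronouns have already been replaced.

```python
import re
from collections import Counter

# Aliases for Claremont McKenna are from the text; the rest are assumed.
COLLEGE_ALIASES = {
    "Claremont McKenna": ["Claremont McKenna College", "Claremont McKenna", "McKenna", "CMC"],
    "Harvey Mudd": ["Harvey Mudd College", "Harvey Mudd", "Mudd", "HMC"],
    "Pitzer": ["Pitzer College", "Pitzer"],
    "Pomona": ["Pomona College", "Pomona"],
    "Scripps": ["Scripps College", "Scripps"],
}


def build_pattern(aliases: list[str]) -> re.Pattern:
    """Match any alias as a whole word, trying longer aliases first so that
    "Claremont McKenna College" counts once rather than as several mentions."""
    ordered = sorted(aliases, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(a) for a in ordered) + r")\b")


def count_mentions(articles: list[str], alias_map: dict[str, list[str]]) -> Counter:
    """Count how many times each college is mentioned across all articles."""
    patterns = {college: build_pattern(aliases) for college, aliases in alias_map.items()}
    counts = Counter()
    for text in articles:
        for college, pattern in patterns.items():
            counts[college] += len(pattern.findall(text))
    return counts


sample_articles = [
    "Claremont McKenna College announced a new dorm. CMC students were pleased.",
    "A robotics team from Mudd won the regional contest.",
]
print(count_mentions(sample_articles, COLLEGE_ALIASES))
```

Running this once per newspaper gives one row of counts per paper, which is the shape of data the plot needs.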
The CMC Forum and The Scripps Voice mention CMC and Scripps colleges respectively a large number of times, which makes sense because these papers have an explicit editorial mission to focus on these colleges.
More surprisingly, The Student Life mentions Pomona College a large number of times, which seems to contradict their stated mission of serving the entire Claremont Consortium equally.
A careful look at the articles written by The Student Life shows that they also tend to say more positive things about Pomona College than about other colleges.
If we look at the 50 articles with the most positive sentiment from The Student Life, we don't find a single mention of any non-Pomona college by name, but we find many articles like:
- Pomona Named Best Value College Again
- Pomona Students Bring Their Best Pop Culture Debates to the Blogosphere
- Vigil Shows Pomona Students at Their Best
The Claremont Independent has the most balanced coverage of the Claremont Colleges,
but notably has a distinct lack of reporting about Harvey Mudd.
In fact, Mudd is among the least mentioned colleges for every newspaper.
My hypothesis is that writers for student newspapers are typically humanities majors,
and therefore less interested in reporting about the engineering-focused events that happen on Mudd's campus.
Unsupervised Topic Analysis
Unsupervised topic analysis is significantly more complicated than the supervised analysis above.
In the unsupervised version, we are not given a set of topics, but we instead try to figure out automatically what the list of "most discussed" topics is for each paper.
There are many algorithms for unsupervised topic analysis,
and they are all significantly more complicated than the algorithms for the previous analyses.
For this project, I used a variant of L1-regularized logistic regression.
(The unsupervised topic analysis figure was created through a pipeline of machine learning algorithms provided by the scikit-learn library.
First, I created bag-of-words features from the articles using 3-gram tokens.
Then, I trained a logistic regression model where the class labels were the newspaper that wrote the article.
I needed to use regularization because the number of features (86151) is significantly greater than the number of articles (10305), and I chose L1 regularization because sparse solutions generate nicer visualizations.
The visualization is generated by plotting the weight coefficient matrix of the trained logistic regression model.
The final matrix has shape 5x86151, but we only plot the 30 columns with the largest L2 norm, since these correspond to the 30 words that are most influential in predicting the class label.
Finally, I rearranged the order of the columns (using a co-clustering method) to make the resulting matrix a bit easier to read.)
The result of the analysis is shown in the figure below.
The y-axis of this figure shows the 5 student papers,
and the x-axis shows the 30 most significant topics.
Each box is colored blue if the corresponding topic and newspaper are highly associated with each other (i.e. the newspaper mentions that topic more than other papers);
the box is white if there is no association,
and the box is red if there is negative association
(i.e. the newspaper never mentions that topic).
From this figure, we can see that The Forum is associated with photo essays and many topics related to CMC.
Some of their top-rated posts on these topics are:
- How to Successfully Stage a Protest at CMC
- CMC Ranked Tenth Best College By U.S. News
- Photo Essay: 5C Art Spots You Wish You Had Known About Earlier
- Photo Essay: Toga Party 2018
We can also see that The Independent is associated with articles on "safe spaces", "blacklivesmatter", and the "lgbtq director".
Looking at their top-rated articles, we see that The Independent tends to criticize these topics:
- How #BlackLivesMatter Failed Black America
- I’m “Wary of White Gays,” “Women,” Says New LGBTQ Director
- Safe Spaces: Where Free Press Dies
The figure above shows few topics from The Scripps Voice or The Student Life.
This is because I only plotted the top 30 topics (so that they would all fit on the screen).
If I had plotted more topics, then we would have seen more topics related to these two papers.
Each of the Claremont newspapers has its niche,
and in this article I used some simple data mining techniques to help me better understand what those niches are,
and to find representative articles about those niches.
These analyses are just scratching the surface of what modern natural language processing techniques can do.
If these types of analysis are interesting to you,
you should consider taking CMC's data science sequence.
This sequence of courses will teach you state-of-the-art natural language processing techniques and how to apply them to a wide variety of application domains.
If you have any questions, feel free to reach out to me at firstname.lastname@example.org.