CNN ScriptScrape

Technologies Used

Node.js, MongoDB

Goal

I was reminded of Betteridge’s law of headlines, a journalistic adage that states: “Any headline that ends in a question mark can be answered by the word no.” I’ve frequently seen such headlines on CNN’s air, so I wanted to find out just how many of CNN’s scripts had at least one headline (also known as a banner) that ended in a question mark.

Approach

All iNEWS scripts are automatically archived immediately at the end of each live show and stored in an application called ScriptSource. Because I didn’t have direct access to this database, the Node script is designed to scrape each script from each hour available in ScriptSource.

The answer to my question would only be meaningful if it was limited to a date range. I chose year-to-date, and let the script go to work. Because of built-in, intentional delays to prevent bombarding ScriptSource with requests, the scraping took over twelve hours to complete!
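The throttled, hour-by-hour crawl described above can be sketched roughly as follows. This is a minimal illustration, not the original scraper: `hoursBetween`, `scrapeSequentially`, and the caller-supplied `fetchHour` function are hypothetical names, and nothing here reflects how ScriptSource is actually addressed.

```javascript
// Hypothetical sketch: none of these names come from the original project.
const HOUR_MS = 60 * 60 * 1000;

// List the start of every hour in the half-open range [start, end).
function hoursBetween(start, end) {
  const hours = [];
  for (let t = start.getTime(); t < end.getTime(); t += HOUR_MS) {
    hours.push(new Date(t));
  }
  return hours;
}

// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visit each hour one at a time, pausing after every request so the
// archive never receives a burst of traffic.
async function scrapeSequentially(hours, fetchHour, delayMs) {
  const results = [];
  for (const hour of hours) {
    results.push(await fetchHour(hour)); // fetchHour is supplied by the caller
    await sleep(delayMs);
  }
  return results;
}
```

Processing the hours strictly in sequence is what makes a year-to-date crawl slow, but it keeps the load on the archive predictable.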

Once the scraping was finished, I had a MongoDB collection of 73,366 individual scripts to query against.

To answer the original question, I used the aggregation framework and some regular expression matching to return every script that qualifies.
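A query along these lines might look like the sketch below. The original query is not reproduced here, so this is only an assumption about its shape: it supposes each script document carries a `banners` array of strings and lives in a collection named `scripts`, neither of which is confirmed by the project.

```javascript
// Assumption: each script document has a `banners` array of strings.
// A banner qualifies if it ends in "?", ignoring trailing whitespace.
const ENDS_IN_QUESTION = /\?\s*$/;

// Aggregation pipeline: keep scripts with at least one qualifying banner,
// then count them. With the Node.js driver it could be run as
//   db.collection("scripts").aggregate(questionPipeline).toArray()
// where the collection name "scripts" is hypothetical.
const questionPipeline = [
  { $match: { banners: { $regex: ENDS_IN_QUESTION } } },
  { $group: { _id: null, count: { $sum: 1 } } },
];

// The same rule applied in memory, handy for checking the regex.
function hasQuestionBanner(script) {
  return (script.banners || []).some((b) => ENDS_IN_QUESTION.test(b));
}

// Made-up sample documents, for illustration only.
const sampleScripts = [
  { banners: ["IS THIS A QUESTION?"] },
  { banners: ["BREAKING NEWS", "WHAT HAPPENS NEXT? "] },
  { banners: ["WHO? WHAT? WHERE"] },
  { banners: [] },
];
console.log(sampleScripts.filter(hasQuestionBanner).length);
```

When a regular expression is matched against an array field, MongoDB treats the document as a match if any element matches, which is why no `$unwind` stage is needed for a per-script count.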

The count came to 4,194. While there are 73,366 total scripts, many exist only to include graphics and contain no copy. Generally speaking, most scripts that have copy also have a banner, so to get the correct denominator, I counted only the non-empty scripts.
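Counting the non-empty scripts could be done with a filter like the one sketched here. It assumes the script text is stored in a string field called `copy`, which is a guess made for illustration; the real field name is not given.

```javascript
// Assumption: each script stores its text in a string field named `copy`.
const HAS_COPY = /\S/; // at least one non-whitespace character

// Filter usable with the Node.js driver, for example
//   await db.collection("scripts").countDocuments(nonEmptyFilter)
// where the collection name "scripts" is hypothetical.
const nonEmptyFilter = { copy: { $regex: HAS_COPY } };

// The same rule applied in memory.
function hasCopy(script) {
  return typeof script.copy === "string" && HAS_COPY.test(script.copy);
}
```

Matching on a non-whitespace character, rather than on the field merely existing, also excludes scripts whose copy is just blank lines.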

The percentage I got at the time was 11.3%. For this writing, with 38,611 non-empty scripts, the number comes to 10.9%. I’m not sure what method I used then, but these queries give approximately the same result.
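The arithmetic behind the 10.9% figure is easy to check, and it also shows how much the choice of denominator matters. The `percent` helper below is just an illustrative name; the three counts are the ones quoted above.

```javascript
// Format part/whole as a percentage with one decimal place.
function percent(part, whole) {
  return ((part / whole) * 100).toFixed(1) + "%";
}

const questionScripts = 4194; // scripts with a banner ending in "?"
const nonEmptyScripts = 38611; // scripts that contain copy
const allScripts = 73366; // every scraped script, including graphics-only ones

console.log(percent(questionScripts, nonEmptyScripts)); // 10.9%
console.log(percent(questionScripts, allScripts)); // 5.7%
```

Dividing by all 73,366 scripts would roughly halve the result, since graphics-only scripts can never carry a question-mark banner.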

Happy with the results, I wanted to see what President of CNN Jeff Zucker thought. Here’s his response from the July 2015 CNN town hall:

Status

This was originally meant to be a single-purpose project, so there is no simple way to access all of the data, nor is it actively scraping new scripts.

However, there could definitely be more applications for this, such as heat-mapping locators by hour to give a visual of everywhere CNN is going. A quick count query shows that Washington is the most fonted location on CNN domestic.
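A count query of that kind might be shaped like the following sketch. It assumes each script has a `locators` array of place names, which is a hypothetical field, and the sample documents are made up purely to exercise the code.

```javascript
// Assumption: each script document has a `locators` array of place names.
// Pipeline: one row per locator, grouped and sorted by frequency.
const topLocatorPipeline = [
  { $unwind: "$locators" },
  { $group: { _id: "$locators", count: { $sum: 1 } } },
  { $sort: { count: -1 } },
  { $limit: 1 },
];

// In-memory equivalent: returns [place, count] pairs, most frequent first.
function tallyLocators(scripts) {
  const counts = new Map();
  for (const script of scripts) {
    for (const place of script.locators || []) {
      counts.set(place, (counts.get(place) || 0) + 1);
    }
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Made-up toy input, not real data from the collection.
const toyScripts = [
  { locators: ["Washington", "New York"] },
  { locators: ["Washington"] },
  { locators: ["Atlanta", "Washington"] },
  {},
];
console.log(tallyLocators(toyScripts)[0]);
```

Grouping by hour as well as by place would be the natural next step toward the heat map mentioned above.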

About The Author

Kyle Anderson
I'm a media and IT professional and JavaScript developer who worked most recently as an Associate Broadcast IT Engineer (Tier II) for CNN in Atlanta. One of my lifelong goals is to help bridge data divides (missing connections between software systems and data stores) by promoting inter-system communication and automation. Many of the projects described here reflect this goal in some way or another.