A quick look at recommendation engines and how the New York Times makes recommendations

A recent prediction that algorithmic curation would be one of the major trends of 2016 got me thinking about news recommendation engines. I’ve always been curious about the technology, so I recently started digging into what makes these engines work and realized there is a whole lot to learn. But a little research and a conversation with a newsroom technologist at the New York Times helped me understand how they work.

First you should know that the basics are easy enough to understand even if the code and academic articles are beyond your grasp, as they were to me. In fact, a recommendation system is a little like a librarian. If she knows you, she can recommend a book based on your interests; if she doesn’t, she can recommend best sellers that everyone seems to be reading. These two scenarios are closely related to the main types of recommendation engines.

Content-based recommendations

Traditionally, news websites have used tags to categorize their content. Every time an article is published, editors add one or more tags that describe the content and make it easy for computers to categorize. At its most basic, a content-based filter would recognize that you read articles tagged “food”, and then recommend other “food” articles. This system is similar to the librarian pointing you to the young adult or mystery section of the library because she knows where your interests lie.
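To make the tag idea concrete, here is a minimal sketch of a tag-based filter. The article IDs, tags, and the function name `recommend_by_tags` are all made up for illustration; this is not any newsroom's actual code, just one simple way to turn "you read food articles" into a ranked list.

```python
from collections import Counter

# Hypothetical catalog: article id -> set of editor-assigned tags.
ARTICLES = {
    "a1": {"food", "travel"},
    "a2": {"food"},
    "a3": {"politics"},
    "a4": {"travel", "politics"},
}

def recommend_by_tags(read_ids, articles, limit=3):
    """Rank unread articles by how often their tags appear in the reader's history."""
    # Count every tag on every article the reader has already read.
    tag_counts = Counter(tag for aid in read_ids for tag in articles[aid])
    # Score each unread article by summing the counts of its tags.
    scores = {
        aid: sum(tag_counts[tag] for tag in tags)
        for aid, tags in articles.items()
        if aid not in read_ids
    }
    # Highest score first; ties are broken by article id for stable output.
    ranked = sorted(scores, key=lambda aid: (-scores[aid], aid))
    return [aid for aid in ranked if scores[aid] > 0][:limit]

recommendations = recommend_by_tags({"a2"}, ARTICLES)
```

A reader who has only read the “food” article `a2` gets `a1` back, because it is the only unread article that shares a tag with their history.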

More sophisticated content-based filtering systems go beyond article tags, and use natural language processing and other techniques to get a deeper understanding of both the story and the reader. The Washington Post’s Clavis tool is one good example.

Collaborative filtering

Collaborative filtering recommends content based on your past reading habits and the habits of other people like you. If an article is popular with readers like you, a collaborative filter is likely to recommend it. In the library example, a librarian who knows your interests are similar to another student's can recommend books that the other student liked.

This method uses the reading history of one reader to recommend articles to another one who has the same preferences or reading history. So if you are consuming content related to technology, you will be recommended what other tech-geeks are reading on the web.
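Here is a rough sketch of that idea, assuming each reader's history is just a set of article IDs. The reader names, the choice of Jaccard overlap as the similarity measure, and the function names are my own assumptions for illustration; production systems use far richer signals.

```python
# Hypothetical reading histories: reader -> set of article ids they have read.
HISTORIES = {
    "ana": {"a1", "a2", "a3"},
    "ben": {"a1", "a2", "a4"},
    "cam": {"a5"},
}

def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_from_neighbors(reader, histories, limit=3):
    """Score unread articles by the similarity of the readers who read them."""
    mine = histories[reader]
    scores = {}
    for other, theirs in histories.items():
        if other == reader:
            continue
        similarity = jaccard(mine, theirs)
        # Articles the other reader has seen but this reader has not.
        for aid in theirs - mine:
            scores[aid] = scores.get(aid, 0.0) + similarity
    ranked = sorted(scores, key=lambda aid: (-scores[aid], aid))
    # Drop articles that only come from readers with no overlap at all.
    return [aid for aid in ranked if scores[aid] > 0][:limit]

suggestions = recommend_from_neighbors("ana", HISTORIES)
```

In this toy data, Ana and Ben share two articles, so Ben's extra article `a4` is suggested to Ana, while Cam's `a5` is ignored because Cam and Ana have nothing in common.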

Hybrid filtering and the New York Times' system

Many recommendation systems combine content-based filtering with collaborative filtering to produce what’s known as a hybrid filtering system. Hybrid systems base their recommendations on both the content of the article and on who is reading that article and how popular it is.

While researching this post, I came across an explainer of the New York Times' system by Alexander Spangher, a data scientist at NYT, that helped me to understand how it works. The first thing to note is that hybrids frequently make better recommendations than content-based or collaborative filters alone, which is why they power many news recommendation engines.

The NYT uses a natural language processing technique called Latent Dirichlet Allocation that allows computers to figure out what an article is about by counting the number of times a particular word appears and comparing that count to other articles, according to Spangher's post. LDA can also help weigh how much of an article is devoted to a particular topic, which allows the system to categorize an article, for instance, as 50% global warming and 40% politics.

 

Latent Dirichlet Allocation is a topic-modelling technique that helps computers figure out what a piece of content is about. (Source: Communications of the ACM)

The NYT system also checks to make sure recommended articles aren't too old or outdated. The assessment checks for the "evergreenness" of each article by looking at “a number of article-features like word-count, frequency of updates, and even the presence of specific words like 'announce' or '<day-of-week>' to predict how long each article will stay fresh in the candidate pool of articles to be recommended,” Spangher said.

One problem with the LDA approach is that the number of times a word appears in a story is not always an accurate reflection of that story's topic. Think about satirical articles that use metaphors or puns, or articles about different places that share a name. In these cases, the NYT’s recommendation engine uses information about who is reading the article (collaborative filtering). If LDA identifies an article as 20% politics and 80% family, but it is primarily read by readers whose interests are 90% politics, the recommendation engine would adjust the article's categorization using both sets of percentages.
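The post doesn't spell out the exact formula for that adjustment, so the following is only a guess at the shape of it: a simple weighted blend of the two topic mixes. The function name `blend_topics`, the 50/50 `reader_weight`, and the assumption that the readers' remaining 10% of interest is “family” are all mine.

```python
def blend_topics(content_topics, reader_topics, reader_weight=0.5):
    """Mix an article's text-derived topics with its readers' average interests."""
    topics = set(content_topics) | set(reader_topics)
    return {
        topic: (1 - reader_weight) * content_topics.get(topic, 0.0)
        + reader_weight * reader_topics.get(topic, 0.0)
        for topic in topics
    }

# Topic mix from the article text (the 20% / 80% example above).
lda_topics = {"politics": 0.2, "family": 0.8}
# Average interests of the people reading it (the 10% family share is assumed).
audience = {"politics": 0.9, "family": 0.1}

adjusted = blend_topics(lda_topics, audience)
```

With equal weights, the article shifts from 20% politics to roughly 55% politics, which is the kind of correction the satire example calls for. How heavily the real system leans on readers versus text is not something I can confirm.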

Understanding readers

So how does the NYT decide that readers are 90% interested in politics? It uses the simple method of applying topic modelling to the articles you have read. If you read seven articles in a week, the NYT recommendation engine would use the numerical average of the topics in those seven articles to decide what your interests are.
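That averaging step is easy to sketch. Here each article is represented as a dictionary of topic shares, which is my own simplification, and `average_topics` is a hypothetical helper name.

```python
def average_topics(article_topic_mixes):
    """Average per-article topic mixes into one reader interest profile."""
    if not article_topic_mixes:
        return {}
    totals = {}
    for mix in article_topic_mixes:
        for topic, share in mix.items():
            totals[topic] = totals.get(topic, 0.0) + share
    # Divide by the number of articles, so topics missing from an article count as 0.
    return {topic: total / len(article_topic_mixes) for topic, total in totals.items()}

# Two articles read this week (made-up topic mixes).
week = [
    {"politics": 1.0},
    {"politics": 0.5, "food": 0.5},
]
profile = average_topics(week)
```

For these two articles the profile comes out as 75% politics and 25% food.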

But wait! What if you did not like three of those seven articles you read? Or maybe you clicked on them but never really read them? And what about the articles you never clicked? The simple answer is that there is no way for a recommendation engine to completely account for such actions. However, to make the system more robust, the NYT applies what it calls the back-off approach. This approach assumes that you “90% like” an article you clicked, and “10% like” an article you did not. The recommendation system then uses these values to weigh your topics of interest, resulting in a more conservative average than a simple numerical calculation would produce.
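One way to read the back-off approach is as a weighted average, where clicked articles count with weight 0.9 and unclicked ones with weight 0.1. That interpretation, the function name `backoff_profile`, and the input format are my assumptions; the 90% and 10% figures are the only parts taken from the description above.

```python
def backoff_profile(articles, clicked_weight=0.9, unclicked_weight=0.1):
    """Weighted average of topic mixes; `articles` is a list of (topic_mix, was_clicked)."""
    totals, weight_sum = {}, 0.0
    for mix, was_clicked in articles:
        # Clicked articles count a lot, unclicked ones still count a little.
        weight = clicked_weight if was_clicked else unclicked_weight
        weight_sum += weight
        for topic, share in mix.items():
            totals[topic] = totals.get(topic, 0.0) + weight * share
    if weight_sum == 0:
        return {}
    return {topic: total / weight_sum for topic, total in totals.items()}

# One politics article that was clicked, one sports article that was shown but ignored.
shown = [
    ({"politics": 1.0}, True),
    ({"sports": 1.0}, False),
]
conservative = backoff_profile(shown)
```

A plain average over clicked articles would call this reader 100% politics. The back-off version keeps a small 10% share for sports, so the profile never fully rules out a topic on the basis of a single missed click.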

Red dots represent articles you haven’t read and green dots represent articles you have. (Source: NYT Open)

The NYT is constantly updating their article-recommendation system and testing new models, Spangher said. One model they plan to further develop in future is weighing different reader behaviors like scroll-depth, dwell-time and social-share to better understand reader preferences. According to Spangher, “results show scroll-depth of 50% to be more indicative of interest than scroll-depth of 100%, which we assume relates to knowledgeable readers not needing to read a full piece.”

The NYT Developers team also plans to enhance their recommendation system by introducing sequence recommendations. “For example,” Spangher explained, “if on cooking.nytimes.com, we see that you are a steak person, we might not want to recommend the top N steak recipes. Rather, a steak recipe followed by a dessert, then a wine pairing might create a better experience.”

Each news organization publishes hundreds of articles and blog posts every day, and as more and more content is produced online, it is important for these organizations to filter the online content to ensure they can direct the most relevant articles to each individual user. “We want to provide the most relevant news and information to our readers so they stay longer and read more,” Spangher added. “Recommendation engine is part of The Times's strategy to engage readers.”
