Semantic APIs, what to consider when picking a text analysis tool

Today, our online experiences are richer and more interconnected than ever. This is in part due to the existence of third-party services called Application Programming Interfaces, or APIs for short. APIs allow computer systems to speak with each other and exchange information. Facebook and Twitter’s APIs, for example, allow Twitter to repost your Facebook updates, and vice versa.

At the Knight Lab, we often make use of semantic APIs. These APIs usually take text in the form of a file or URL and tell you something about it. At their most basic, they might identify which part of a webpage is actually the article (and not the HTML or the advertisements). Semantic APIs offer many other services, but two of the most common are identifying the topic of the text you submit and finding “named entities” in it. (Named entities include people, places, organizations, nations, and similar things.)

Given that there are several APIs that offer similar services, how can we (and you) choose between them?

Case 1: When the API is either right or wrong

Let's say you want to know whether or not a thing (say, an entity) occurs in a document. Then, you can compare entity-extraction APIs based on:

  • False positives, type 1 – does it identify as entities things that are not entities?
  • False positives, type 2 – does it identify entities that don't even appear in the document?
  • False negatives – does it tell you that an entity is not present when it is?

Of course, APIs also assign a 'type' to each entity, based on the API's own ontology, which may or may not match your needs. Think of it as the difference between the Dewey Decimal System and the Library of Congress system; they're both valid ways of breaking things down, yet they are different. So, you can also compare APIs based on:

  • Entity types – Does it identify the entity types you're interested in?
  • Miscategorization – Does it identify the entity type correctly, according to the API's own ontology? If not, this could lead to both false positives and false negatives.

These concepts are usually summarized in terms of precision, recall, and accuracy. Some APIs may have better precision, others better recall. If you don’t care if you miss some things as long as the results you get are always right, then you care about precision. If you don’t care if you get some incorrect results as long as you find all of the correct results, then you care about recall. If you care about both, then a combined measure such as accuracy or the F1 score is the better measure for you. In any case, once you have a hand-checked set of correct answers to compare against, classifier performance is simple to calculate.
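As a rough illustration of that calculation, here is a minimal sketch in Python. It assumes you have hand-labeled the entities that really appear in a document; the function name `score_entities` and the sample entities are hypothetical and do not come from any particular API. Accuracy also needs a count of true negatives, which is hard to define for open-ended entity extraction, so this sketch reports the F1 score (the harmonic mean of precision and recall) as the combined measure instead.

```python
def score_entities(predicted, expected):
    """Compare an API's extracted entities against a hand-labeled set.

    Both arguments are iterables of entity names (hypothetical data).
    """
    predicted, expected = set(predicted), set(expected)
    true_pos = len(predicted & expected)    # entities the API got right
    false_pos = len(predicted - expected)   # entities the API should not have returned
    false_neg = len(expected - predicted)   # entities the API missed

    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if expected else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: the API found two real entities and one bogus one,
# and missed one entity that a human reader identified.
expected = {"Barack Obama", "Chicago", "Northwestern University"}
predicted = {"Barack Obama", "Chicago", "Tuesday"}
scores = score_entities(predicted, expected)
print(scores)
```

Running the same function over the output of each candidate API, on the same labeled documents, gives numbers you can compare directly.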

Case 2: Finding the most relevant subset

Let's say that an API takes your documents and tells you, for each document, what keywords (for example) are present, and how relevant they are. Suppose also that you want the n best-matched keywords for each document.

Then, for any single API, you could choose the best-matched keywords in one of five ways:

  1. Make n a constant – usually 1, 3, 5, or 10 – and always take the top n keywords. This leads to problems when the API returns fewer results than the value you set for n, or when many of the top results have very low relevance.
  2. Identify a threshold, t, for relevancy, and take the keywords with a relevance higher than t. This could return no keywords, or a much larger list of keywords than you'd prefer, of course.
  3. Identify a relevancy threshold, t, but take no more than n keywords. This could return nothing.
  4. Identify a relevancy threshold, t, and a lower-limit, l, taking at least l keywords. This could lead to a set far larger than you'd prefer.
  5. Identify a relevancy threshold, t, a lower-limit, l, and a maximum number, m. Take at least l and no more than m keywords, using t to decide how many to keep within those limits.
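The five methods above can all be expressed as one function with optional limits. The following is a minimal sketch, assuming the API's response has already been parsed into (keyword, relevance) pairs; the function name `select_keywords`, its defaults, and the sample scores are hypothetical.

```python
def select_keywords(scored, t=0.0, l=0, m=None):
    """Pick keywords from (keyword, relevance) pairs.

    t: relevance threshold; keywords must score strictly above it
    l: lower limit; if fewer than l keywords pass t, take the top l instead
    m: maximum number of keywords to return (None means no cap)
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    chosen = [kw for kw, rel in ranked if rel > t]
    if len(chosen) < l:
        # Too few passed the threshold: backfill with the best available.
        chosen = [kw for kw, _ in ranked[:l]]
    if m is not None:
        chosen = chosen[:m]  # the cap wins if l and m conflict
    return chosen


# Hypothetical relevance scores for one document.
scored = [("election", 0.9), ("senate", 0.7), ("ballot", 0.4), ("tuesday", 0.1)]

top_n = select_keywords(scored, l=2, m=2)                  # method 1, n = 2
thresholded = select_keywords(scored, t=0.5)               # method 2
capped = select_keywords(scored, t=0.5, m=1)               # method 3
at_least = select_keywords(scored, t=0.95, l=1)            # method 4
bounded = select_keywords(scored, t=0.3, l=1, m=2)         # method 5
```

Note that when the API returns fewer than l keywords, the function can only return what is there, which is the same shortfall described for method 1.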

So which of these five methods works best? It isn't clear that one is inherently superior to the others, and in fact each method may have different strengths and weaknesses (which may, in turn, be different for a different API). To really answer that question, you'd have to assess how best to use a given API by testing each of these methods for various values of n, l, m, and t, and assessing each with respect to:

  • False positives, type 1 – does it identify as keywords things that are not appropriate for use as keywords?
  • False positives, type 2 – does it identify keywords that don't even appear in the document?
  • Inflated ranking/relevance – are any of the keywords in the selected subset there only because they received a higher relevance or ranking score than they deserve?
  • False negatives – does it entirely miss keywords it should find?
  • Lowball ranking/relevance – does ranking or relevance place a keyword that should be inside the subset outside of it?
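One way to run that kind of test is a small grid search over t, l, and m against keywords you have labeled by hand. The sketch below is only an illustration under several assumptions: the data is made up, the names `select` and `assess` are hypothetical, and adding the error counts together is an arbitrary way to rank settings. It separates keywords the API never returned ("missed") from keywords it returned but ranked too low to make the subset ("lowballed").

```python
from itertools import product


def select(scored, t, l, m):
    """Keep keywords above threshold t, at least l and at most m of them."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    chosen = [kw for kw, rel in ranked if rel > t]
    if len(chosen) < l:
        chosen = [kw for kw, _ in ranked[:l]]
    return chosen[:m]


def assess(scored, expected, t, l, m):
    """Count the kinds of errors one (t, l, m) setting produces."""
    returned = {kw for kw, _ in scored}
    chosen = set(select(scored, t, l, m))
    return {
        "false_positives": len(chosen - expected),        # in the subset, should not be
        "lowballed": len((expected & returned) - chosen),  # returned, but left out
        "missed": len(expected - returned),                # never returned at all
    }


# Hypothetical API output and hand-labeled keywords for one document.
scored = [("election", 0.9), ("tuesday", 0.8), ("senate", 0.7), ("ballot", 0.4)]
expected = {"election", "senate", "ballot", "turnout"}

results = {}
for t, l, m in product([0.3, 0.5, 0.75], [0, 1], [2, 3]):
    results[(t, l, m)] = assess(scored, expected, t, l, m)


def total_errors(setting):
    # "missed" is left out: no choice of t, l, or m can fix it.
    counts = results[setting]
    return counts["false_positives"] + counts["lowballed"]


best = min(results, key=total_errors)
print(best, results[best])
```

In practice you would average these counts over many documents rather than one, and you might weight false positives and lowballed keywords differently depending on which mistake costs you more.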

Naturally, things get more complicated when you want to compare two different APIs, particularly ones made by different companies. Relevance is probably calibrated differently for each API. Furthermore, the best subset criteria for one API may not be the best subset criteria for another API. For example, one API may be able to consistently determine the three most relevant entities, but only get an average of five of the top ten correct. Another might get seven of the top ten correct, but only one or two of the top three. Either strength could be useful, depending on what you're trying to accomplish.
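Because relevance scores are probably not comparable across APIs, one option is to ignore the scores and compare only the order in which each API ranks its results. The sketch below computes precision at k (the share of the top k results that are correct) for two made-up APIs that mirror the example above; `precision_at_k` and all of the data are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k ranked items that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


# Hypothetical hand-labeled set of the ten most relevant entities.
relevant = {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j"}

# Hypothetical ranked output from two APIs ("x" items are wrong answers).
api_a = ["a", "b", "c", "x1", "x2", "d", "x3", "e", "x4", "x5"]
api_b = ["x1", "a", "x2", "b", "c", "d", "e", "f", "x3", "g"]

for name, ranked in [("API A", api_a), ("API B", api_b)]:
    p3 = precision_at_k(ranked, relevant, 3)
    p10 = precision_at_k(ranked, relevant, 10)
    print(f"{name}: P@3 = {p3:.2f}, P@10 = {p10:.2f}")
```

Here the first API is perfect at the top of its list but weaker overall, while the second is the reverse, so which one looks better depends entirely on the k you care about.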

To make matters worse, we still don't know what relevance is. Many APIs return something like relevance, but none explain where the number actually comes from.

The biggest problem with this speculation, of course, is that ultimately, the APIs we use are black boxes. We could do a study that rigorously compared these APIs, and the week after we finish, one of them might change its algorithm without even telling anyone. The result is that APIs are best assessed when you need to make a decision, and perhaps periodically re-assessed if you wish to improve performance. That means that any solution to this conundrum will likely involve creating tools that make the assessments we’re already doing more rigorous, faster, and/or easier.
