Semantic APIs: what to consider when picking a text analysis tool

Today, our online experiences are richer and more interconnected than ever. This is due in part to third-party services called Application Programming Interfaces, or APIs for short. APIs allow computer systems to speak with each other and exchange information. Facebook's and Twitter's APIs, for example, allow Twitter to repost your Facebook updates, and vice versa.

At the Knight Lab, we often make use of semantic APIs. These APIs will usually take text in the form of a file or URL, and tell you something about it. At their most basic, they might identify what part of the webpage is actually the article (and not the HTML, or the advertisements). Semantic APIs offer many other services, but one of the most common identifies the topic of the text you submit and/or finds “named entities” in the text: people, places, organizations, nations, and other similar things.

Given that there are several APIs that offer similar services, how can we (and you) choose between them?

Case 1: When the API is either right or wrong

Let's say you want to know whether or not a thing (say, an entity) occurs in a document. Then, you can compare entity-extraction APIs based on:

  • False positives, type 1 – does it identify as entities things that are not entities?
  • False positives, type 2 – does it identify entities that don't even appear in the document?
  • False negatives – does it tell you that an entity is not present when it is?
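
To make these error types concrete, here is a minimal Python sketch that sorts one API's output into the three buckets above. The document, the hand-labeled "gold" set, and the API response are all invented for illustration:

```python
# Categorize a hypothetical API's entity output against a hand-labeled
# "gold" set of entities a human would accept.
document_text = "Mayor Lori Lightfoot spoke in Chicago on Tuesday."

gold = {"Lori Lightfoot", "Chicago"}
api_output = {"Lori Lightfoot", "Tuesday", "Springfield"}

# Type 1 false positives: strings that do appear in the document,
# but that we don't consider entities.
fp_type1 = {e for e in api_output - gold if e in document_text}

# Type 2 false positives: "entities" that don't even appear in the document.
fp_type2 = {e for e in api_output if e not in document_text}

# False negatives: entities present in the document that the API missed.
false_negatives = gold - api_output

print(fp_type1)         # {'Tuesday'}
print(fp_type2)         # {'Springfield'}
print(false_negatives)  # {'Chicago'}
```

In practice, you would substitute a real API's response and a hand-checked gold set for each test document.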

Of course, APIs also assign a 'type' to each entity, based on the API's own ontology, which may or may not match your needs. Think of it as the difference between the Dewey Decimal System and the Library of Congress system; they're both valid ways of breaking things down, yet they are different. So, you can also compare APIs based on:

  • Entity types – Does it identify the entity types you're interested in?
  • Miscategorization – Does it identify entity type correctly, according to the API's ontology? If not, this could lead to both false positives and false negatives.

These concepts are usually summarized in terms of precision, recall, and accuracy. Some APIs may have better precision, and others better recall. If you don’t care if you miss some things as long as the results you get are always right, then you care about precision. If you don’t care if you get some incorrect results as long as you find all of the correct results, then you care about recall. If you care about both, then accuracy is the better measure for you. In any case, classifier performance is simple to calculate.
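
As a sketch of how these measures fall out of the error counts, here is a small Python example with hypothetical gold and predicted entity sets:

```python
# Precision and recall from hypothetical gold and predicted entity sets.
gold = {"Barack Obama", "Chicago", "Northwestern University", "Knight Lab"}
predicted = {"Barack Obama", "Chicago", "Twitter", "Knight Lab"}

true_positives = len(gold & predicted)       # correctly found: 3
false_positives = len(predicted - gold)      # found, but wrong: 1
false_negatives = len(gold - predicted)      # missed: 1

precision = true_positives / (true_positives + false_positives)  # 0.75
recall = true_positives / (true_positives + false_negatives)     # 0.75

# F1, the harmonic mean of precision and recall, is a common single
# number when you care about both.
f1 = 2 * precision * recall / (precision + recall)               # 0.75
```

One caveat: accuracy as usually defined also counts true negatives (things correctly not extracted), which is hard to pin down for open-ended entity extraction, so F1 is the combined measure you will most often see in its place.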

Case 2: Finding the most relevant subset

Let's say that an API takes your documents and tells you, for each document, what keywords (for example) are present, and how relevant they are. Suppose also that you want the n best-matched keywords for each document.

Then, for any single API, you could choose the best-matched keywords in one of five ways:

  1. Make n a constant – usually 1, 3, 5, or 10 – and always take the top n keywords. This leads to problems when the API returns fewer results than the value you set for n, or when many of the top results have very low relevance.
  2. Identify a threshold, t, for relevancy, and take the keywords with a relevance higher than t. This could return no keywords, or a much larger list of keywords than you'd prefer, of course.
  3. Identify a relevancy threshold, t, but take no more than n keywords. This could return nothing.
  4. Identify a relevancy threshold, t, and a lower-limit, l, taking at least l keywords. This could lead to a set far larger than you'd prefer.
  5. Identify a relevancy threshold, t, a lower-limit, l, and a maximum number, m. Take at least l and no more than m keywords, using t to decide which keywords to keep within those limits.
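
The five strategies are easy to sketch in Python. Here, each API result is assumed to be a (keyword, relevance) pair, already sorted by descending relevance, and the sample data is invented:

```python
# Sketches of the five subset-selection strategies described above.
results = [("politics", 0.92), ("elections", 0.85), ("Chicago", 0.41),
           ("weather", 0.12), ("sports", 0.05)]

def top_n(results, n):                          # method 1: fixed n
    return results[:n]

def above_threshold(results, t):                # method 2: threshold only
    return [r for r in results if r[1] > t]

def threshold_capped(results, t, n):            # method 3: threshold, at most n
    return above_threshold(results, t)[:n]

def threshold_with_floor(results, t, l):        # method 4: threshold, at least l
    kept = above_threshold(results, t)
    return kept if len(kept) >= l else results[:l]

def threshold_floor_ceiling(results, t, l, m):  # method 5: at least l, at most m
    return threshold_with_floor(results, t, l)[:m]

print(top_n(results, 3))                        # first three pairs
print(above_threshold(results, 0.5))            # politics and elections only
print(threshold_with_floor(results, 0.95, 2))   # floor kicks in: top two
```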

So which of these five methods works best? It isn't clear that one is inherently superior to the others, and in fact each method may have different strengths and weaknesses (which may, in turn, be different for a different API). To really answer that question, you'd have to assess how best to use a given API by testing each of these methods for various values of l, m, and t, and assessing each with respect to:

  • False positives, type 1 – does it identify as keywords things that are not appropriate for use as keywords?
  • False positives, type 2 – does it identify keywords that don't even appear in the document?
  • Inflated ranking/relevance – is a keyword included in the selected subset because it received a higher relevance or ranking score than it deserves?
  • False negatives – does it entirely miss keywords it should find?
  • Lowball ranking/relevance – does a too-low ranking or relevance score push a keyword that belongs in the subset out of it?

Naturally, things get more complicated when you want to compare two different APIs, particularly ones made by different companies. Relevance is probably calibrated differently for each API. Furthermore, the best subset criteria for one API may not be the best subset criteria for another API. For example, one API may be able to consistently determine the three most relevant entities, but only get an average of five of the top ten correct. Another might get seven of the top ten correct, but only one or two of the top three. Either strength could be useful, depending on what you're trying to accomplish.
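
One way you might quantify that trade-off is to compare each API's top-n list against a hand-ranked gold list at different values of n. A minimal sketch, with invented rankings:

```python
# Compare two hypothetical APIs on how much of a hand-ranked gold
# top-n they capture, at n = 3 and n = 10. All rankings are invented.
def top_n_overlap(api_ranking, gold_ranking, n):
    """How many of the API's top n appear in the gold top n."""
    return len(set(api_ranking[:n]) & set(gold_ranking[:n]))

gold = list("ABCDEFGHIJ")     # stand-in gold ranking, best first
api_one = list("ABCXYZQRST")  # nails the top 3, misses the rest
api_two = list("BDAFHJCEGI")  # scrambles the order, but finds all ten

print(top_n_overlap(api_one, gold, 3))   # 3
print(top_n_overlap(api_two, gold, 3))   # 2
print(top_n_overlap(api_one, gold, 10))  # 3
print(top_n_overlap(api_two, gold, 10))  # 10
```

Depending on whether you need a reliable top three or broad coverage of the top ten, either API could be the right choice.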

To make matters worse, we still don't know what relevance is. Many APIs return something like relevance, but none explain where the number actually comes from.

The biggest problem with this speculation, of course, is that ultimately, the APIs we use are black boxes. We could do a study that rigorously compared these APIs, and the week after we finish, one of them might change its algorithm without even telling anyone. The result is that APIs are best assessed when you need to make a decision, and perhaps periodically re-assessed if you wish to improve performance. That means that any solution to this conundrum will likely involve creating tools that make the assessments we’re already doing more rigorous, faster, and/or easier.
