Today, our online experiences are richer and more interconnected than ever. This is due in part to third-party services that expose Application Programming Interfaces, or APIs for short. APIs allow computer systems to speak with each other and exchange information. Facebook and Twitter's APIs, for example, allow Twitter to repost your Facebook updates, and vice versa.
At the Knight Lab, we often make use of semantic APIs. These APIs will usually take text in the form of a file or URL and tell you something about it. At their most basic, they might identify which part of a webpage is actually the article (and not the markup or the advertisements). Semantic APIs offer many other services, but one of the most common identifies the topic of the text you submit, and/or finds "named entities" in the text: people, places, organizations, nations, and other similar things.
Given that there are several APIs that offer similar services, how can we (and you) choose between them?
Case 1: When the API is either right, or wrong
Let's say you want to know whether or not a thing (say, an entity) occurs in a document. Then, you can compare entity-extraction APIs based on:
- False positives, type 1 – does it identify as entities things that are not entities?
- False positives, type 2 – does it identify entities that don't even appear in the document?
- False negatives – does it fail to identify an entity that is actually present in the document?
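The second kind of false positive is the easiest to check mechanically. A minimal sketch (the function name, the document, and the API output below are all hypothetical; real checking would also need to handle inflected forms and coreference, which simple substring matching misses):

```python
def entities_not_in_text(entities, text):
    """Flag "type 2" false positives: entities an API returned
    that never literally appear in the document text."""
    lowered = text.lower()
    return [e for e in entities if e.lower() not in lowered]

doc = "Mayor Rahm Emanuel spoke in Chicago on Tuesday."
extracted = ["Rahm Emanuel", "Chicago", "Barack Obama"]  # pretend API output
print(entities_not_in_text(extracted, doc))  # ['Barack Obama']
```

Type 1 false positives and false negatives, by contrast, require a human-labeled gold standard, since they depend on judgments about what counts as an entity.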
Of course, APIs also assign a 'type' to each entity, based on the API's own ontology, which may or may not match your needs. Think of it as the difference between the Dewey Decimal System and the Library of Congress system; they're both valid ways of breaking things down, yet they are different. So, you can also compare APIs based on:
- Entity types – Does it identify the entity types you're interested in?
- Miscategorization – Does it identify entity type correctly, according to the API's ontology? If not, this could lead to both false positives and false negatives.
These concepts are usually summarized in terms of precision, recall, and accuracy. Some APIs may have better precision, others better recall. If you don't mind missing some things as long as the results you do get are always right, then you care about precision. If you don't mind some incorrect results as long as you find all of the correct ones, then you care about recall. If you care about both, then accuracy is the better measure for you. In any case, classifier performance is simple to calculate.
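Simple enough, in fact, to fit in a few lines. A sketch of the standard formulas, applied to hypothetical tallies from checking one API against a hand-labeled sample:

```python
def classifier_metrics(tp, fp, fn, tn):
    """Compute precision, recall, and accuracy from raw counts of
    true positives, false positives, false negatives, true negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Made-up counts for illustration: 100 labeled items total.
p, r, a = classifier_metrics(tp=40, fp=10, fn=20, tn=30)
print(p, r, a)  # 0.8 0.666... 0.7
```

Note that accuracy counts true negatives, which for open-ended extraction tasks (where "everything the API didn't return" is unbounded) can be ill-defined; in those cases precision and recall, or their harmonic mean, are usually the more meaningful numbers.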
Case 2: Finding the most relevant subset
Let's say that an API takes your documents and tells you, for each document, what keywords (for example) are present, and how relevant they are. Suppose also that you want the n best-matched keywords for each document.
Then, for any single API, you could choose the best-matched keywords in one of five ways:
- Make n a constant – usually 1, 3, 5, or 10 – and always take the top n keywords. This leads to problems when the API returns fewer than n results, or when many of the top results have very low relevance.
- Identify a threshold, t, for relevance, and take the keywords with a relevance higher than t. This could, of course, return no keywords at all, or a much larger list than you'd prefer.
- Identify a relevancy threshold, t, but take no more than n keywords. This could return nothing.
- Identify a relevancy threshold, t, and a lower-limit, l, taking at least l keywords. This could lead to a set far larger than you'd prefer.
- Identify a relevancy threshold, t, a lower-limit, l, and a maximum number, m. Take at least l and no more than m keywords; within those limits, include or exclude keywords based on t.
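Since the fifth method subsumes the other four, all five can be sketched as one parameterized function. Everything below is hypothetical: the function name, the sample keyword list, and the assumption that the API returns (keyword, relevance) pairs with relevance scores comparable across calls:

```python
def select_keywords(keywords, t=None, l=0, m=None):
    """Select a subset of (keyword, relevance) pairs.

    t: relevance threshold; l: minimum number to keep regardless of t;
    m: hard cap on how many to keep. Leaving t as None and setting only
    m reproduces method 1 (always take the top m).
    """
    ranked = sorted(keywords, key=lambda kw: kw[1], reverse=True)
    if m is not None:
        ranked = ranked[:m]
    if t is not None:
        kept = [kw for kw in ranked if kw[1] >= t]
        if len(kept) < l:
            # Backfill to the lower limit from the best remaining keywords.
            kept = ranked[:l]
        ranked = kept
    return ranked

results = [("chicago", 0.9), ("politics", 0.7), ("weather", 0.2), ("sports", 0.1)]
print(select_keywords(results, m=3))              # method 1: top n only
print(select_keywords(results, t=0.5))            # method 2: threshold only
print(select_keywords(results, t=0.5, m=1))       # method 3: threshold + cap
print(select_keywords(results, t=0.5, l=3))       # method 4: threshold + floor
print(select_keywords(results, t=0.5, l=3, m=3))  # method 5: all three
```

The sample run makes the trade-offs above concrete: method 2 drops "weather" and "sports", while the floor in methods 4 and 5 pulls "weather" back in despite its low score.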
So which of these five methods works best? It isn't clear that one is inherently superior to the others, and in fact each method may have different strengths and weaknesses (which may, in turn, differ from API to API). To really answer that question, you'd have to assess how best to use a given API by testing each of these methods for various values of l, m, and t, and assessing each with respect to:
- False positives, type 1 – does it identify as keywords things that are not appropriate for use as keywords?
- False positives, type 2 – does it identify keywords that don't even appear in the document?
- Inflated ranking/relevance – are any keywords included in the selected subset because they received a higher relevance or ranking score than they deserve?
- False negatives – does it entirely miss keywords it should find?
- Lowball ranking/relevance – does an undeservedly low ranking or relevance score push a keyword that belongs in the subset out of it?
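The first, second, and fourth of these reduce to set comparisons between the selected subset and a hand-labeled gold set; the two ranking criteria additionally require labeled relevance judgments. A minimal sketch of the set-comparison part, with hypothetical keyword lists:

```python
def subset_errors(selected, gold):
    """Compare one document's selected keywords against a hand-labeled
    gold set, returning (false positives, false negatives)."""
    selected, gold = set(selected), set(gold)
    false_positives = selected - gold   # selected, but shouldn't be
    false_negatives = gold - selected   # should be selected, but missed
    return false_positives, false_negatives

fp, fn = subset_errors(selected=["chicago", "sports"], gold=["chicago", "politics"])
print(fp, fn)  # {'sports'} {'politics'}
```

Run over a labeled corpus for each combination of method and parameter values, tallies like these are what feed the precision and recall calculations from Case 1.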
Naturally, things get more complicated when you want to compare two different APIs, particularly ones made by different companies. Relevance is probably calibrated differently for each API. Furthermore, the best subset criteria for one API may not be the best subset criteria for another API. For example, one API may be able to consistently determine the three most relevant entities, but only get an average of five of the top ten correct. Another might get seven of the top ten correct, but only one or two of the top three. Either strength could be useful, depending on what you're trying to accomplish.
To make matters worse, we still don't know what relevance is. Many APIs return something like relevance, but none explain where the number actually comes from.
The biggest problem with this speculation, of course, is that ultimately, the APIs we use are black boxes. We could do a study that rigorously compared these APIs, and the week after we finish, one of them might change its algorithm without even telling anyone. The result is that APIs are best assessed when you need to make a decision, and perhaps periodically re-assessed if you wish to improve performance. That means that any solution to this conundrum will likely involve creating tools that make the assessments we’re already doing more rigorous, faster, and/or easier.