In machine learning and data labeling, it’s important to think about inter-annotator agreement: how well do the labelers building your datasets agree with each other? Cohen’s kappa statistic is a common way of measuring their agreement, but it has two big limitations: it can only compare two raters, and it can only handle categorical variables. This post describes a powerful alternative known as Krippendorff’s alpha, which generalizes inter-rater reliability to an arbitrary number of raters and a wide variety of data types.
Interested in joining a curated community dedicated to NLP and data labeling? Join our waitlist here!
Introduction
Imagine you’re building a startup to help stores manage their reputation. You’ve scraped thousands of reviews from online forums, and now you want to build a sentiment analysis model.
In order to form a training set, your team of 4 engineers held a labeling party to label 1,000 reviews, and at least one person (often multiple) categorized each review on a 5-star scale.
The problem: while your engineers are top-notch coders, you're not sure you trust their labeling ability! Paul is vegan, so you suspect he may have downgraded every review praising meat; Alice had an early dinner reservation she was itching to go to, so you suspect she may have sped through her labeling tasks; and English isn't Lawrence's first language, so you suspect he may not have picked up on negative sarcasm.
Since many reviews were categorized by multiple people, you're hoping you can measure the dataset's inter-rater reliability (IRR) in order to determine how much to trust it.
Unfortunately, the dataset has several features that make it challenging to calculate the IRR using simple metrics like Cohen's kappa. First, not every rater labeled every single review. Second, many simpler IRR metrics only measure the agreement between two raters, but you have four. Third, there are multiple ways to define what qualifies as "agreement" — does it count if reviewer A gives a business 1 star and reviewer B gives it 2 stars, or do their ratings have to be identical?
In short, we need an IRR metric that can do three things:
- Calculate inter-rater reliability for incomplete data
- Compare an arbitrary number of raters
- Handle “shades of gray” where reviewers might partially agree with each other
Computing Krippendorff’s Alpha
The good news is that there’s a metric that meets all these criteria. It’s called Krippendorff’s Alpha, and at its core it compares the observed weighted percent agreement pₐ to the chance weighted percent agreement pₑ: 𝛼 = (pₐ - pₑ) / (1 - pₑ).
The bad news is that in order to generalize so widely, Krippendorff’s Alpha relies on calculations that are significantly more complex than those of more specialized metrics. At a high level, there are 6 steps:
1. Data cleaning: remove any stores with fewer than two ratings, since a lone rating can’t be compared to anything.
2. Agreement table: make an agreement table with rows for each store and columns for each rating category. The values rᵢₖ in this table are the number of times the reviewers assigned each store i rating k — i.e., the number of agreements for that store+rating combo.
3. Pick weight function: the weight wₖₗ quantifies how similar any two ratings k and l are. The simplest weight function is just the identity function: wₖₗ = 1 if k = l and 0 if k ≠ l. Other weight functions can express “shades of gray”: w(great, good) might equal 0.8, while w(great, bad) might equal 0.2.
4. Calculate pₐ: find how often the reviewers actually agreed using:
- ṝ, the average number of reviewers who rated each store
- ṝᵢₖ₊, the weighted count of how many reviewers gave each store i a rating that fully or partially matched category k
- pₐ|ᵢ, the percentage agreement for each store i
- p'ₐ, the average value of pₐ|ᵢ across all the stores
- The final equation pₐ = p'ₐ(1 - 1/(nṝ)) + 1/(nṝ)
5. Calculate pₑ: find the percent agreement the reviewers would achieve guessing randomly using:
- πₖ, the percentage of the total ratings that fell into each rating category k
- The equation pₑ = Σₖₗ wₖₗπₖπₗ
6. Calculate alpha using the formula 𝛼 = (pₐ - pₑ) / (1 - pₑ)
This is a lot, so let’s see how each step works using the data from our example.
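To make each step concrete, the walkthrough below also includes short Python sketches (separate from the GitHub walkthrough linked at the end). They assume a deliberately simple, hypothetical representation of the labels: a dictionary mapping each store to the list of star ratings it received. The toy values here are purely for illustration, not the article’s actual dataset.

```python
# A hypothetical, toy representation of the labels (NOT the article's actual
# dataset): each store maps to the list of 1-5 star ratings it received.
# A store that only some of the engineers reviewed simply has a shorter list.
raw_ratings = {
    "store_1": [2, 2, 3],   # three reviewers rated store 1
    "store_2": [4, 5],      # only two reviewers rated store 2
    "store_3": [1],         # a single rating -- this store gets dropped in step 1
}

CATEGORIES = [1, 2, 3, 4, 5]   # the q = 5 possible star ratings
```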
1. Cleaning the raw data
First we start with the raw data from the reviews. The table below shows how many stars the four suspect accounts gave to each of 12 stores:
Krippendorff’s alpha is based on calculating percentage agreement, which requires comparing pairs of ratings. This means that while most of the incomplete data from Stores 2, 10 and 11 can be included, any stores with only one rating have to be cut — bye-bye, Store 12.
After that elimination, the total number of stores included in the analysis — represented by n — is 11. This is also a good time to make note of q, the total number of rating categories — in this case, 5.
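Assuming the hypothetical raw_ratings structure from above, step 1 of the sketch is a one-line filter:

```python
# Step 1: keep only stores with at least two ratings, since agreement is
# measured over pairs of ratings. Then record n (stores) and q (categories).
ratings = {store: stars for store, stars in raw_ratings.items() if len(stars) >= 2}

n = len(ratings)        # number of stores in the analysis
q = len(CATEGORIES)     # number of rating categories
```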
2. Building the agreement table
Next, we’re going to build an agreement table that shows how often each store received each rating:
The values in this table are rᵢₖ, the number of times raters assigned store i rating category k. To pick a random example, store 8 got the “1 - Great” rating 3 times, so r₈,₁= 3. This step is what allows us to calculate Krippendorff’s alpha for incomplete data. Now there are no pesky null values, just some stores that have a lower total number of ratings (rᵢ).
While we’re looking at this table, let’s go ahead and calculate ṝ, the average number of reviewers who rated each store. Intuitively, that’s just the total number of ratings divided by the total number of stores. Mathematically, it’s ṝ = (Σᵢ rᵢ) / n, where rᵢ is the total number of ratings store i received.
We’re going to need this number in step 4, calculating pₐ.
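Continuing the hypothetical sketch, step 2 builds the rᵢₖ counts and computes ṝ; a nested dictionary is just one convenient layout for the agreement table:

```python
# Step 2: agreement table. r[i][k] is the number of reviewers who gave
# store i the rating k; missing reviewers simply don't add to any count.
r = {
    store: {k: stars.count(k) for k in CATEGORIES}
    for store, stars in ratings.items()
}

# r_i: total number of ratings each store received.
r_i = {store: sum(counts.values()) for store, counts in r.items()}

# r_bar: average number of ratings per store (total ratings / n).
r_bar = sum(r_i.values()) / n
```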
3. Choosing a weight function
In the next two steps we’re going to be calculating percent agreement. But what exactly does “agreement” mean?
The answer depends on our level of measurement. The reviewers assigned each store a numerical rating from 1 to 5 stars. We can choose to interpret these scores as:
Nominal: Separate categories with no “hierarchy.” A score of “1 star” is just as similar to a score of “3 stars” as it is to a score of “5 stars.” Two raters only agree if they pick identical categories, so the weight function for nominal data is just the identity function: wₖₗ = 1 if k = l and 0 if k ≠ l.
Ordinal: There’s a ranking between the categories — “5 stars” > “4 stars” > “3 stars” > “2 stars” > “1 star” — but no relative degree of difference. For example, we can’t say that the difference between “5 stars” and “4 stars” is half the size of the difference between “5 stars” and “3 stars.” K. Gwet recommends a weight function for ordinal data that gives higher agreement the closer together two non-identical categories are.
Interval: There’s a fixed interval between categories — we can say that the difference between “1 star” and “3 stars” is the same as the difference between “3 stars” and “5 stars.” Interval data uses the quadratic weight function wₖₗ = 1 - (k - l)² / (q - 1)², which gives partial credit that shrinks as two ratings move farther apart.
There are many other weight functions, including ratio weights, circular weights and bipolar weights. For this example, we’ll just consider our data nominal, since using the identity function makes the rest of the math easier to follow.
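In code, a weight function can be any function w(k, l) that returns a value between 0 and 1. Here’s a sketch of the identity (nominal) weights used in the rest of this walkthrough, plus the standard quadratic weights often used for interval data; Gwet’s ordinal weights are left out of the sketch:

```python
# Step 3: weight functions. w(k, l) is the credit two ratings k and l get
# for agreeing: 1 for an exact match, 0 for a complete mismatch.

def nominal_weight(k, l):
    """Identity weights: only exact matches count as agreement."""
    return 1.0 if k == l else 0.0

def quadratic_weight(k, l, k_min=1, k_max=5):
    """Standard quadratic (interval) weights: partial credit that shrinks
    with the squared distance between the two star ratings."""
    return 1.0 - ((k - l) / (k_max - k_min)) ** 2

# This walkthrough treats the star ratings as nominal.
weight = nominal_weight
```

Every later step only calls weight(k, l), so switching from nominal to interval agreement is a one-line change in this sketch.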
4. Calculating pₐ
This is where the heavy lifting happens. We’re going to start off easy by calculating ṝᵢₖ₊, the weighted count of how many reviewers gave store i a rating that completely or partially matched category k (in general, ṝᵢₖ₊ = Σₗ wₖₗrᵢₗ). Since our data is categorical, we can use the identity function for our weights — in other words, we’re just not going to count anyone who gave even a slightly different rating. So, for example, the ṝᵢₖ₊ for store 1 and category 2 (“Good”) is:
Next, we’re going to find the reviewers’ percent agreement for a single store and a single rating category. The equation for this normalizes the weighted rating count by the average number of ratings per store: rᵢₖ(ṝᵢₖ₊ - 1) / (ṝ(rᵢ - 1)).
Sticking with store #1 and rating 2, we get:
Now we can just add up the percent agreement for all q rating categories to find the percent agreement for a single store: pₐ|ᵢ = Σₖ rᵢₖ(ṝᵢₖ₊ - 1) / (ṝ(rᵢ - 1)).
For the specific example of store #1, rᵢₖ will be 0 for k = 1, 4 or 5, because no ratings fell into those categories. The numerator is also 0 for k = 3, because with only one rating in category 3, the weighted observer count ṝᵢₖ₊ will be 1, and so ṝᵢₖ₊ - 1 will be 1 - 1 = 0.
Once we find pₐ|ᵢ for all the stores, it’s time to find their average: p'ₐ = (1/n) Σᵢ pₐ|ᵢ.
For this dataset, we get:
And at long last, we can use the average store-level percent agreement to calculate the overall observed percent agreement: pₐ = p'ₐ(1 - 1/(nṝ)) + 1/(nṝ).
Remember that since n is the number of stores, and ṝ is the mean number of ratings per store, the term nṝ is just the total number of ratings. We’re effectively normalizing average store-level percent agreement by the size of our data. For this specific dataset, we get:
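Setting the walkthrough’s specific numbers aside, here’s how step 4 looks in the running Python sketch, using the rᵢₖ table, ṝ, and the weight function defined earlier:

```python
# Step 4: observed percent agreement.

def weighted_count(store, k):
    """r-bar_ik+: weighted number of ratings of this store that fully or
    partially match category k (with identity weights, just r[store][k])."""
    return sum(weight(k, l) * r[store][l] for l in CATEGORIES)

def store_agreement(store):
    """p_a|i: percent agreement for a single store."""
    numerator = sum(
        r[store][k] * (weighted_count(store, k) - 1) for k in CATEGORIES
    )
    return numerator / (r_bar * (r_i[store] - 1))

# p'_a: average per-store agreement, then the final equation from step 4.
p_a_prime = sum(store_agreement(store) for store in r) / n
epsilon = 1 / (n * r_bar)                    # the 1 / (n * r-bar) term
p_a = p_a_prime * (1 - epsilon) + epsilon    # observed percent agreement
```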
5. Calculating pₑ
Calculating the percent agreement expected by chance is easy by comparison. We only need the weight function, each rating category’s classification probability, and one equation: pₑ = Σₖₗ wₖₗπₖπₗ.
The classification probability for category k — πₖ — is just the percentage of the total number of ratings (across all the stores) that fell into category k. Here are the classification probabilities for our data set:
If all the reviewers were assigning ratings randomly, the probability of one reviewer picking rating k and a second reviewer picking rating l would just be πₖ * πₗ. We say that the reviewers agree when category k at least partially matches category l — when the weight wₖₗ is greater than 0. So, if we take the sum of πₖ * πₗ over all possible values of k and l, weighting each term by how closely k and l match, we’ll find the overall percent agreement expected by chance.
Since we’re using categorical weights, wₖₗ is 0 unless the categories are an exact match, so we can just find the sum of πₖ²:
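In the running sketch, step 5 comes down to two lines: the classification probabilities πₖ and the weighted sum over all category pairs.

```python
# Step 5: percent agreement expected by chance.

total_ratings = n * r_bar   # same as sum(r_i.values())

# pi_k: share of all ratings that landed in category k.
pi = {k: sum(r[store][k] for store in r) / total_ratings for k in CATEGORIES}

# p_e: chance that two randomly drawn ratings agree, weighted by w_kl.
p_e = sum(weight(k, l) * pi[k] * pi[l] for k in CATEGORIES for l in CATEGORIES)
```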
6. Calculating 𝛼
Now we can finally plug the observed and expected percent agreement into the formula 𝛼 = (pₐ - pₑ) / (1 - pₑ) to find the value of Krippendorff’s alpha.
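The final step of the running Python sketch is just that formula:

```python
# Step 6: Krippendorff's alpha.
alpha = (p_a - p_e) / (1 - p_e)
print(f"Krippendorff's alpha = {alpha:.3f}")
```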
Interpreting Krippendorff’s Alpha
Despite the complexity of the calculations, Krippendorff’s alpha is fundamentally a Kappa-like metric. Its values range from -1 to 1, with 1 representing unanimous agreement between the raters, 0 indicating they’re guessing randomly, and negative values suggesting the raters are systematically disagreeing. (This can happen when raters value different things — for example, if rater A thinks a crowded store is a sign of success, but rater B thinks it proves understaffing and poor management).
Krippendorff’s Alpha in a Nutshell
Pros:
- Can handle incomplete data sets
- Can generalize to different sample sizes and numbers of raters
- Works for all measurement levels (nominal, ordinal, interval or ratio)
Cons:
- Computations are more complex than alternative metrics
- If the agreement expected by chance is high enough (say, because the rating falls in the same category 95% of the time), Krippendorff’s alpha will stay relatively low no matter how often the real-world raters agree
- No theoretical way to derive significance thresholds; confidence intervals have to be estimated by bootstrapping (see the sketch below)
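As a rough illustration of that last point, here’s a minimal bootstrap sketch: resample the stores with replacement, recompute alpha on each resample, and read a confidence interval off the resulting distribution. The krippendorff_alpha helper is hypothetical shorthand for the full calculation sketched above, wrapped into a single function that takes a ratings dictionary.

```python
import random

def bootstrap_alpha_ci(ratings, n_boot=1000, confidence=0.95, seed=0):
    """Bootstrap a confidence interval for Krippendorff's alpha by
    resampling stores (with replacement) and recomputing alpha each time."""
    rng = random.Random(seed)
    stores = list(ratings)
    alphas = []
    for _ in range(n_boot):
        sample = [rng.choice(stores) for _ in stores]
        # Re-key on position so duplicate stores stay distinct.
        resampled = {i: ratings[store] for i, store in enumerate(sample)}
        alphas.append(krippendorff_alpha(resampled))  # hypothetical helper
    alphas.sort()
    lower = alphas[int(n_boot * (1 - confidence) / 2)]
    upper = alphas[int(n_boot * (1 + confidence) / 2) - 1]
    return lower, upper
```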
Code Sample
Interested in a code sample? Check out a step-by-step walkthrough of the calculation on GitHub.
—
Surge AI is a data labeling platform that provides world-class data to top AI companies and researchers. We're built from the ground up to tackle the extraordinary challenges of natural language understanding — with an elite data labeling workforce, stunning quality, rich labeling tools, and modern APIs. Want to improve your model with context-sensitive data and domain-expert labelers? Schedule a demo with our team today!