The Challenge: Number of Classes and Accuracy

Raw accuracy scores can be deceptive. One of the most significant factors affecting how we interpret accuracy is the number of classes a classifier is trying to predict. This page focuses specifically on this challenge and how Plexus helps provide clarity.

Why Number of Classes Matters: A Tale of Two Games

Imagine two guessing games. In the first, you predict a coin flip (2 options: Heads or Tails). In the second, you predict the suit of a drawn card (4 options: Hearts, Diamonds, Clubs, Spades). If you guess randomly in both games, your expected accuracy is vastly different:

  • Coin Flip (2 Classes): You have a 1 in 2 chance (50%) of being correct randomly.
  • Card Suit (4 Classes): You have a 1 in 4 chance (25%) of being correct randomly.

This simple illustration highlights a core problem: a raw accuracy score (e.g., 60%) means very different things depending on the number of classes. A 60% accuracy is only slightly better than chance for a coin flip, but significantly better than chance for predicting a card suit.
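To make the baseline concrete, here is a minimal Python sketch (illustrative only, not part of Plexus) that estimates the accuracy of uniform random guessing for a given number of classes:

```python
import random

def random_guess_accuracy(num_classes: int, trials: int = 100_000) -> float:
    """Estimate the accuracy of uniform random guessing on a balanced problem."""
    classes = range(num_classes)
    hits = sum(random.choice(classes) == random.choice(classes) for _ in range(trials))
    return hits / trials

print(f"Coin flip (2 classes): ~{random_guess_accuracy(2):.2f}")  # ~0.50
print(f"Card suit (4 classes): ~{random_guess_accuracy(4):.2f}")  # ~0.25
```

The estimates hover around 1 / number-of-classes, which is exactly the shifting chance baseline described above.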

Randomly Guessing Coin Flips (2 Classes)

Achieved 48% accuracy. Chance baseline: 50%.

Labels: Binary (Heads, Tails), balanced distribution.

Confusion matrix (rows = actual, columns = predicted):

                  Heads   Tails
  Actual Heads      24      26
  Actual Tails      26      24

Accuracy gauge: 48% (segments at 50 / 70 / 80 / 90). The contextual gauge shows this is near the 50% chance level for 2 classes.
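Accuracy here is simply the diagonal of the confusion matrix divided by the total number of items. A tiny helper (an illustrative sketch, not a Plexus API) applied to the matrix above:

```python
def accuracy_from_confusion(confusion: list[list[int]]) -> float:
    """Fraction of items on the diagonal of a confusion matrix (correct predictions)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(row[i] for i, row in enumerate(confusion))
    return correct / total

coin_flips = [[24, 26],   # actual Heads: 24 predicted Heads, 26 predicted Tails
              [26, 24]]   # actual Tails: 26 predicted Heads, 24 predicted Tails
print(accuracy_from_confusion(coin_flips))  # 0.48
```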

Guessing Card Suits (4 Classes)

Achieved 23% accuracy. Chance baseline: 25%.

Labels: 4 classes (♥️, ♦️, ♣️, ♠️), balanced distribution.

Confusion matrix (rows = actual, columns = predicted):

               ♥️    ♦️    ♣️    ♠️
  Actual ♥️    12    13    13    14
  Actual ♦️    13    12    14    13
  Actual ♣️    13    14    12    13
  Actual ♠️    14    13    13    12

Accuracy gauge: 23% (contextual segments at 25 / 45 / 55 / 65). The contextual gauge shows this is near the 25% chance level for 4 classes.

Key Insight: Baseline Shifts with Class Count

The random chance baseline drops as the number of classes increases (assuming balanced classes). A 50% accuracy is poor for a 2-class problem but excellent for a 10-class problem (where chance is 10%). Without understanding this shifting baseline, raw accuracy is uninterpretable.
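One simple way to put scores from problems with different class counts on a common footing is to rescale accuracy against its chance baseline. The helper below is an illustrative sketch (the name lift_over_chance is ours, not a Plexus API) and assumes balanced classes, so chance is 1 / number-of-classes:

```python
def lift_over_chance(accuracy: float, num_classes: int) -> float:
    """Rescale accuracy so that 0.0 means chance level and 1.0 means perfect,
    assuming a balanced class distribution (chance = 1 / num_classes)."""
    chance = 1.0 / num_classes
    return (accuracy - chance) / (1.0 - chance)

print(lift_over_chance(0.50, 2))   # 0.0   -> no better than guessing coin flips
print(lift_over_chance(0.50, 10))  # ~0.44 -> well above the 10% chance level
```

When both labels and predictions are perfectly balanced, this rescaling coincides with the chance-corrected Agreement scores introduced later on this page.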

Visualizing the Impact: 65% Accuracy Across Different Class Counts

Each scenario below shows a 65% accuracy. The 'No Context' gauge uses a fixed, uncontextualized scale. The 'With Class Context' gauge dynamically adjusts its colored segments based on the number of classes (assuming a balanced distribution for this illustration), providing immediate context.

Two-Class

No Context gauge: 65% on the fixed scale (segments at 50 / 70 / 80 / 90).
With Class Context gauge: segments at 50 / 70 / 80 / 90 (chance = 50%).

Contextual: 65% is 'converging', just above the 50% chance.

Three-Class

No Context gauge: 65% on the fixed scale (segments at 50 / 70 / 80 / 90).
With Class Context gauge: segments at 33.3 / 53.3 / 63.3 / 73.3 (chance ≈ 33.3%).

Contextual: 65% is 'viable', well above the ~33% chance.

Four-Class

No Context gauge: 65% on the fixed scale (segments at 50 / 70 / 80 / 90).
With Class Context gauge: segments at 25 / 45 / 55 / 65 (chance = 25%).

Contextual: 65% is 'great', significantly above the 25% chance.

Twelve-Class

No Context gauge: 65% on the fixed scale (segments at 50 / 70 / 80 / 90).
With Class Context gauge: segments at 8.3 / 28.3 / 38.3 / 48.3 (chance ≈ 8.3%).

Contextual: 65% is outstanding, far exceeding the ~8.3% chance.

The Takeaway

The same 65% accuracy score transitions from mediocre to excellent as the number of classes increases. Fixed gauges are misleading. Contextual gauges, which adapt to the number of classes, are essential for correct interpretation.
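The contextual segment boundaries in the gauges above (50 / 70 / 80 / 90 for two classes, 33.3 / 53.3 / 63.3 / 73.3 for three, 25 / 45 / 55 / 65 for four, 8.3 / 28.3 / 38.3 / 48.3 for twelve) all follow the pattern chance, chance + 20, chance + 30, chance + 40. The sketch below reproduces that pattern for balanced classes; it is an illustration of the idea, not the actual Plexus gauge implementation, which can also account for class distribution:

```python
def contextual_gauge_segments(num_classes: int) -> list[float]:
    """Segment boundaries (in percent) for a contextual accuracy gauge, assuming
    balanced classes: bands begin at chance, chance + 20, chance + 30, chance + 40."""
    chance = 100.0 / num_classes
    return [round(chance + offset, 1) for offset in (0, 20, 30, 40)]

for k in (2, 3, 4, 12):
    print(k, contextual_gauge_segments(k))
# 2  -> [50.0, 70.0, 80.0, 90.0]
# 3  -> [33.3, 53.3, 63.3, 73.3]
# 4  -> [25.0, 45.0, 55.0, 65.0]
# 12 -> [8.3, 28.3, 38.3, 48.3]
```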

Solution: Clarity Through Context

To address the challenge of varying class counts, a two-pronged approach provides a clear and reliable understanding of classifier performance. This ensures metrics remain interpretable whether you are dealing with few classes, many classes, or class imbalance (a related challenge discussed elsewhere).

  1. Contextualized Accuracy Gauges: As demonstrated previously, Accuracy gauges should not use a fixed scale. Their colored segments dynamically adjust based on problem characteristics like the number of classes (and class distribution). This provides an immediate visual cue: is the observed accuracy good *for this specific problem*?
  2. Inherently Context-Aware Agreement Gauges: Alongside accuracy, Agreement gauges (typically Gwet's AC1) offer a mathematically robust solution. These metrics are designed to calculate a chance-corrected measure of agreement. They *internally* account for the number of classes and their distribution, yielding a standardized score (0 = chance, 1 = perfect) that reflects skill beyond random guessing. This score is directly comparable across different problems.
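For item 2, the sketch below shows one way to compute Gwet's AC1 directly from a confusion matrix, following the standard two-rater formula (observed agreement corrected by Gwet's chance-agreement term). Treat it as an illustration rather than Plexus's exact implementation:

```python
def gwet_ac1(confusion: list[list[int]]) -> float:
    """Gwet's AC1 for a classifier vs. ground truth, computed from a square
    confusion matrix (rows = actual class, columns = predicted class)."""
    n = sum(sum(row) for row in confusion)                # total items
    k = len(confusion)                                    # number of classes
    p_o = sum(confusion[i][i] for i in range(k)) / n      # observed agreement
    # Average prevalence of each class across actual and predicted labels.
    pi = [(sum(confusion[i]) + sum(row[i] for row in confusion)) / (2 * n)
          for i in range(k)]
    p_e = sum(p * (1 - p) for p in pi) / (k - 1)          # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Coin-flip example (shown in the next section): 48 correct out of 100, balanced classes.
print(round(gwet_ac1([[24, 26], [26, 24]]), 2))  # -0.04, i.e. essentially chance
```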

The contextualized Accuracy gauge helps interpret raw accuracy correctly for the current task, while the Agreement gauge provides a robust, comparable measure of skill. Let's examine how these two gauges work together:

Two-Class (Coin Flip) - Near Chance Performance

Random guessing on 100 coin flips. Expect ~50% accuracy, ~0.0 AC1.

Labels: Binary (Heads, Tails), balanced distribution.

Confusion matrix (rows = actual, columns = predicted):

                  Heads   Tails
  Actual Heads      24      26
  Actual Tails      26      24

Agreement gauge: AC1 = -0.04. Accuracy gauge: 48% (segments at 50 / 70 / 80 / 90).

Key Insight:

Here, both gauges clearly indicate performance at (or slightly below) random chance. The Agreement (AC1) is near zero. The contextualized Accuracy gauge shows 48% is at the baseline for a 2-class problem.
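To see where the -0.04 comes from: observed agreement is 48/100 = 0.48; with both classes appearing half the time, Gwet's chance-agreement term is (0.5 × 0.5 + 0.5 × 0.5) / (2 - 1) = 0.5; so AC1 = (0.48 - 0.50) / (1 - 0.50) = -0.04. (This is a hand calculation with the standard AC1 formula; it matches the value shown on the Agreement gauge.)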

Multi-Class (Article Topic Labeler) - Moderate Performance

5-class imbalanced problem. Accuracy: 62%, Gwet's AC1: 0.512.

Labels: 5 classes (News, Sports, Business, Technology, Lifestyle), imbalanced distribution.

Confusion matrix (rows = actual, columns = predicted):

                      News   Sports   Business   Technology   Lifestyle
  Actual News          28       3        3            3            3
  Actual Sports         3       9        1            1            1
  Actual Business       3       1        8            2            1
  Actual Technology     3       1        2            8            1
  Actual Lifestyle      3       1        1            1            9

Agreement gauge: AC1 ≈ 0.51. Accuracy gauge: 62% (contextual segments at 25 / 45 / 55 / 65).

Key Insight:

For this more complex 5-class imbalanced task, the Agreement gauge (AC1=0.512) shows moderate skill beyond chance. The contextualized Accuracy gauge interprets 62% as 'good' for this specific setup, confirming a performance level that's meaningfully above simple guessing strategies.

These examples illustrate how combining a contextualized Accuracy gauge with an Agreement score like Gwet's AC1 offers a much clearer and more reliable assessment of classifier performance than looking at raw accuracy in isolation, especially when the number of classes varies.

For a Comprehensive Overview

This page focuses specifically on the "number of classes" problem. For a broader understanding of how Plexus addresses various contextual factors in evaluation (including class imbalance and the full two-pronged solution strategy), please see our main guide.

Next Steps

Continue exploring our documentation for a deeper understanding of evaluation.