The Challenge: Number of Classes and Accuracy

Raw accuracy scores can be deceptive. One of the most significant factors affecting how we interpret accuracy is the number of classes a classifier is trying to predict. This page focuses specifically on this challenge and how Plexus helps provide clarity.

Why Number of Classes Matters: A Tale of Two Games

Imagine two guessing games. In the first, you predict a coin flip (2 options: Heads or Tails). In the second, you predict the suit of a drawn card (4 options: Hearts, Diamonds, Clubs, Spades). If you guess randomly in both games, your expected accuracy is vastly different:

  • Coin Flip (2 Classes): You have a 1 in 2 chance (50%) of being correct randomly.
  • Card Suit (4 Classes): You have a 1 in 4 chance (25%) of being correct randomly.

This simple illustration highlights a core problem: a raw accuracy score (e.g., 60%) means very different things depending on the number of classes. A 60% accuracy is only slightly better than chance for a coin flip, but significantly better than chance for predicting a card suit.
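To make the baseline concrete, here is a minimal Python sketch (illustrative only, not part of Plexus) that estimates the accuracy of uniform random guessing for a given number of classes:

```python
import random

def random_guess_accuracy(num_classes: int, trials: int = 100_000) -> float:
    """Estimate the accuracy of uniform random guessing on a balanced problem."""
    classes = range(num_classes)
    hits = sum(random.choice(classes) == random.choice(classes) for _ in range(trials))
    return hits / trials

print(f"Coin flip (2 classes): ~{random_guess_accuracy(2):.2f}")  # ~0.50
print(f"Card suit (4 classes): ~{random_guess_accuracy(4):.2f}")  # ~0.25
```

The estimates hover around 1 / number-of-classes, which is exactly the shifting chance baseline described above.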

Randomly Guessing Coin Flips (2 Classes)

Achieved 48% accuracy. Chance baseline: 50%.

Labels: Binary (Heads, Tails), balanced distribution.

Confusion matrix (rows = actual, columns = predicted):

                  Heads   Tails
  Actual Heads      24      26
  Actual Tails      26      24

Accuracy gauge: 48% (segments at 50 / 70 / 80 / 90). The contextual gauge shows this is near the 50% chance level for 2 classes.
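Accuracy here is simply the diagonal of the confusion matrix divided by the total number of items. A tiny helper (an illustrative sketch, not a Plexus API) applied to the matrix above:

```python
def accuracy_from_confusion(confusion: list[list[int]]) -> float:
    """Fraction of items on the diagonal of a confusion matrix (correct predictions)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(row[i] for i, row in enumerate(confusion))
    return correct / total

coin_flips = [[24, 26],   # actual Heads: 24 predicted Heads, 26 predicted Tails
              [26, 24]]   # actual Tails: 26 predicted Heads, 24 predicted Tails
print(accuracy_from_confusion(coin_flips))  # 0.48
```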

Guessing Card Suits (4 Classes)

Achieved 23% accuracy. Chance baseline: 25%.

Labels: 4 classes (♥️, ♦️, ♣️, ♠️), balanced distribution.

Confusion matrix (rows = actual, columns = predicted):

               ♥️    ♦️    ♣️    ♠️
  Actual ♥️    12    13    13    14
  Actual ♦️    13    12    14    13
  Actual ♣️    13    14    12    13
  Actual ♠️    14    13    13    12

Accuracy gauge: 23% (contextual segments at 25 / 45 / 55 / 65). The contextual gauge shows this is near the 25% chance level for 4 classes.

Key Insight: Baseline Shifts with Class Count

The random chance baseline drops as the number of classes increases (assuming balanced classes). A 50% accuracy is poor for a 2-class problem but excellent for a 10-class problem (where chance is 10%). Without understanding this shifting baseline, raw accuracy is uninterpretable.
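One simple way to put scores from problems with different class counts on a common footing is to rescale accuracy against its chance baseline. The helper below is an illustrative sketch (the name lift_over_chance is ours, not a Plexus API) and assumes balanced classes, so chance is 1 / number-of-classes:

```python
def lift_over_chance(accuracy: float, num_classes: int) -> float:
    """Rescale accuracy so that 0.0 means chance level and 1.0 means perfect,
    assuming a balanced class distribution (chance = 1 / num_classes)."""
    chance = 1.0 / num_classes
    return (accuracy - chance) / (1.0 - chance)

print(lift_over_chance(0.50, 2))   # 0.0   -> no better than guessing coin flips
print(lift_over_chance(0.50, 10))  # ~0.44 -> well above the 10% chance level
```

When both labels and predictions are perfectly balanced, this rescaling coincides with the chance-corrected Agreement scores introduced later on this page.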

Visualizing the Impact: 65% Accuracy Across Different Class Counts

Each scenario below shows a 65% accuracy. The 'No Context' gauge uses a fixed, uncontextualized scale. The 'With Class Context' gauge dynamically adjusts its colored segments based on the number of classes (assuming a balanced distribution for this illustration), providing immediate context.

Two-Class

No Context gauge: 65% on the fixed scale (segments at 50 / 70 / 80 / 90).
With Class Context gauge: segments at 50 / 70 / 80 / 90 (chance = 50%).

Contextual: 65% is 'converging', just above the 50% chance.

Three-Class

No Context gauge: 65% on the fixed scale (segments at 50 / 70 / 80 / 90).
With Class Context gauge: segments at 33.3 / 53.3 / 63.3 / 73.3 (chance ≈ 33.3%).

Contextual: 65% is 'viable', well above the ~33% chance.

Four-Class

No Context gauge: 65% on the fixed scale (segments at 50 / 70 / 80 / 90).
With Class Context gauge: segments at 25 / 45 / 55 / 65 (chance = 25%).

Contextual: 65% is 'great', significantly above the 25% chance.

Twelve-Class

No Context gauge: 65% on the fixed scale (segments at 50 / 70 / 80 / 90).
With Class Context gauge: segments at 8.3 / 28.3 / 38.3 / 48.3 (chance ≈ 8.3%).

Contextual: 65% is outstanding, far exceeding the ~8.3% chance.

The Takeaway

The same 65% accuracy score transitions from mediocre to excellent as the number of classes increases. Fixed gauges are misleading. Contextual gauges, which adapt to the number of classes, are essential for correct interpretation.
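The contextual segment boundaries in the gauges above (50 / 70 / 80 / 90 for two classes, 33.3 / 53.3 / 63.3 / 73.3 for three, 25 / 45 / 55 / 65 for four, 8.3 / 28.3 / 38.3 / 48.3 for twelve) all follow the pattern chance, chance + 20, chance + 30, chance + 40. The sketch below reproduces that pattern for balanced classes; it is an illustration of the idea, not the actual Plexus gauge implementation, which can also account for class distribution:

```python
def contextual_gauge_segments(num_classes: int) -> list[float]:
    """Segment boundaries (in percent) for a contextual accuracy gauge, assuming
    balanced classes: bands begin at chance, chance + 20, chance + 30, chance + 40."""
    chance = 100.0 / num_classes
    return [round(chance + offset, 1) for offset in (0, 20, 30, 40)]

for k in (2, 3, 4, 12):
    print(k, contextual_gauge_segments(k))
# 2  -> [50.0, 70.0, 80.0, 90.0]
# 3  -> [33.3, 53.3, 63.3, 73.3]
# 4  -> [25.0, 45.0, 55.0, 65.0]
# 12 -> [8.3, 28.3, 38.3, 48.3]
```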

Solution: Clarity Through Context

To address the challenge of varying class counts, a two-pronged approach provides a clear and reliable understanding of classifier performance. This ensures metrics remain interpretable whether you are dealing with few classes, many classes, or class imbalance (a related challenge discussed elsewhere).

  1. Contextualized Accuracy Gauges: As demonstrated previously, Accuracy gauges should not use a fixed scale. Their colored segments dynamically adjust based on problem characteristics like the number of classes (and class distribution). This provides an immediate visual cue: is the observed accuracy good *for this specific problem*?
  2. Inherently Context-Aware Agreement Gauges: Alongside accuracy, Agreement gauges (typically Gwet's AC1) offer a mathematically robust solution. These metrics are designed to calculate a chance-corrected measure of agreement. They *internally* account for the number of classes and their distribution, yielding a standardized score (0 = chance, 1 = perfect) that reflects skill beyond random guessing. This score is directly comparable across different problems.
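For item 2, the sketch below shows one way to compute Gwet's AC1 directly from a confusion matrix, following the standard two-rater formula (observed agreement corrected by Gwet's chance-agreement term). Treat it as an illustration rather than Plexus's exact implementation:

```python
def gwet_ac1(confusion: list[list[int]]) -> float:
    """Gwet's AC1 for a classifier vs. ground truth, computed from a square
    confusion matrix (rows = actual class, columns = predicted class)."""
    n = sum(sum(row) for row in confusion)                # total items
    k = len(confusion)                                    # number of classes
    p_o = sum(confusion[i][i] for i in range(k)) / n      # observed agreement
    # Average prevalence of each class across actual and predicted labels.
    pi = [(sum(confusion[i]) + sum(row[i] for row in confusion)) / (2 * n)
          for i in range(k)]
    p_e = sum(p * (1 - p) for p in pi) / (k - 1)          # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Coin-flip example (shown in the next section): 48 correct out of 100, balanced classes.
print(round(gwet_ac1([[24, 26], [26, 24]]), 2))  # -0.04, i.e. essentially chance
```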

The contextualized Accuracy gauge helps interpret raw accuracy correctly for the current task, while the Agreement gauge provides a robust, comparable measure of skill. Let's examine how these two gauges work together:

Two-Class (Coin Flip) - Near Chance Performance

Random guessing on 100 coin flips. Expect ~50% accuracy, ~0.0 AC1.

Labels: Binary (Heads, Tails), balanced distribution.

Confusion matrix (rows = actual, columns = predicted):

                  Heads   Tails
  Actual Heads      24      26
  Actual Tails      26      24

Agreement gauge: AC1 = -0.04. Accuracy gauge: 48% (segments at 50 / 70 / 80 / 90).

Key Insight:

Here, both gauges clearly indicate performance at (or slightly below) random chance. The Agreement (AC1) is near zero. The contextualized Accuracy gauge shows 48% is at the baseline for a 2-class problem.
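To see where the -0.04 comes from: observed agreement is 48/100 = 0.48; with both classes appearing half the time, Gwet's chance-agreement term is (0.5 × 0.5 + 0.5 × 0.5) / (2 - 1) = 0.5; so AC1 = (0.48 - 0.50) / (1 - 0.50) = -0.04. (This is a hand calculation with the standard AC1 formula; it matches the value shown on the Agreement gauge.)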

Multi-Class (Article Topic Labeler) - Moderate Performance

5-class imbalanced problem. Accuracy: 62%, Gwet's AC1: 0.512.

Labels: 5 classes (News, Sports, Business, Technology, Lifestyle), imbalanced distribution.

Confusion matrix (rows = actual, columns = predicted):

                      News   Sports   Business   Technology   Lifestyle
  Actual News          28       3        3            3            3
  Actual Sports         3       9        1            1            1
  Actual Business       3       1        8            2            1
  Actual Technology     3       1        2            8            1
  Actual Lifestyle      3       1        1            1            9

Agreement gauge: AC1 ≈ 0.51. Accuracy gauge: 62% (contextual segments at 25 / 45 / 55 / 65).

Key Insight:

For this more complex 5-class imbalanced task, the Agreement gauge (AC1=0.512) shows moderate skill beyond chance. The contextualized Accuracy gauge interprets 62% as 'good' for this specific setup, confirming a performance level that's meaningfully above simple guessing strategies.

These examples illustrate how combining a contextualized Accuracy gauge with an Agreement score like Gwet's AC1 offers a much clearer and more reliable assessment of classifier performance than looking at raw accuracy in isolation, especially when the number of classes varies.

For a Comprehensive Overview

This page focuses specifically on the "number of classes" problem. For a broader understanding of how Plexus addresses various contextual factors in evaluation (including class imbalance and the full two-pronged solution strategy), please see our main guide.

Next Steps

Continue exploring our documentation for a deeper understanding of evaluation.