Adding Context to Evaluation Gauges

Raw accuracy scores can be misleading without appropriate context. This platform makes evaluation gauges more interpretable by incorporating information about the number of classes and the class balance, and by using metrics that are inherently context-aware.

A Unified Approach to Evaluation Clarity

Interpreting raw accuracy scores like "75% accurate" is challenging without considering crucial context, primarily the number of classes and their distribution within the data. A unified, multi-faceted approach brings clarity to classifier performance. This approach combines strategies that work in tandem:

  1. Enhancing Raw Accuracy's Interpretability: Essential context is provided directly to the raw accuracy metric by dynamically adjusting the visual scales (colors and thresholds) of the Accuracy gauge based on both the number of classes and the class distribution. This strategy, detailed first, makes the raw accuracy number itself more immediately understandable.
  2. Employing an Inherently Context-Aware Agreement Metric: Alongside the contextualized Accuracy gauge, a distinct "Agreement" metric, such as Gwet's AC1, is introduced. This type of metric is designed to internally account for the complexities of chance agreement, the number of classes, and their distribution. This second strategy, also detailed below, provides a stable, chance-corrected perspective that is directly comparable across different evaluation scenarios.

By using these strategies together—presenting both a contextualized Accuracy gauge and a self-contextualizing Agreement gauge—Plexus offers a comprehensive and robust understanding of classifier performance. The following explanations detail each of these complementary strategies.

Strategy 1: Adding Context to the Accuracy Gauge

To address the challenges of interpreting raw accuracy, context can be added directly to the metric's visual representation. The Accuracy gauge in Plexus can dynamically adjust its segments based on the problem's characteristics. This primarily involves two types of context: the number of classes and the balance of those classes.

Context Type A: Number of Classes

The number of classes significantly impacts the baseline random-chance accuracy. A 50% accuracy means something very different for a 2-class problem than for a 12-class problem. Visualizing this context directly on the gauge makes the raw accuracy number more interpretable: by calculating a "chance level" or baseline for a specific dataset (based on the number of classes, assuming a balanced distribution for this specific adjustment), the gauge's colors can visually indicate whether the achieved accuracy is substantially better than random guessing, only slightly better, or close to what chance alone would produce.

This approach retains raw accuracy but enhances its interpretability by providing crucial context through dynamic visual scales on Accuracy gauges. The background colors and threshold markers on the Accuracy gauge adapt based on the number of classes (assuming balanced distributions for this part of the explanation).
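
To make this concrete, here is a minimal sketch (plain Python, not the actual Plexus implementation) of how class-count-aware gauge thresholds can be derived. The +20/+30/+40 offsets above the chance baseline are assumptions inferred from the example gauges shown throughout this section.

```python
def balanced_chance_baseline(num_classes: int) -> float:
    """Expected accuracy (%) of random guessing when classes are balanced."""
    return 100.0 / num_classes

def gauge_thresholds(baseline: float, offsets=(0, 20, 30, 40)) -> list:
    """Segment boundaries: the chance level, then progressively better bands.
    The +20/+30/+40 offsets are assumptions inferred from the example gauges."""
    return [round(baseline + o, 1) for o in offsets if baseline + o < 100]

for k in (2, 3, 4, 12):
    base = balanced_chance_baseline(k)
    print(f"{k:>2} classes: thresholds {gauge_thresholds(base)}")
# Output:
#  2 classes: thresholds [50.0, 70.0, 80.0, 90.0]
#  3 classes: thresholds [33.3, 53.3, 63.3, 73.3]
#  4 classes: thresholds [25.0, 45.0, 55.0, 65.0]
# 12 classes: thresholds [8.3, 28.3, 38.3, 48.3]
```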

Coin Flip Prediction (50/50)

A fair coin has a 50% chance of heads or tails. Random guessing achieves about 50% accuracy.

Labels: Binary (Heads, Tails), balanced distribution

No context for interpretation

[Accuracy gauge: 50% on a fixed 0-100 scale]

With context for interpretation

[Accuracy gauge: 50% with contextual thresholds at 50 / 70 / 80 / 90 for a balanced 2-class problem]

Key Insight:

Without context (left gauge), 50% accuracy is just a number. With proper contextual segments for a 2-class problem (right gauge), we see that 50% is exactly at the chance level, indicating no prediction skill beyond random guessing.

Card Suit Prediction (4 Balanced Classes)

A standard deck has four equally likely suits. Random guessing achieves about 25% accuracy.

Labels: 4 classes (♥️, ♦️, ♣️, ♠️), balanced distribution

No context for interpretation

[Accuracy gauge: 25% on a fixed 0-100 scale]

With context for interpretation

[Accuracy gauge: 25% with contextual thresholds at 25 / 45 / 55 / 65 for a balanced 4-class problem]

Key Insight:

Without context (left gauge), 25% accuracy appears very low. With proper contextual segments for a 4-class problem (right gauge), we see that 25% is exactly at the chance level, indicating no prediction skill beyond random guessing.

The dynamic gauges adjust their colors to match what "baseline random chance" means for each specific task based on class count (assuming balanced classes for this specific point). Instead of misleadingly suggesting that random guessing is "poor performance" in multi-class problems, the adjusted gauge shows it's exactly what you'd expect from chance. This makes it much easier to understand when a model is actually performing better than random guessing.

Visualizing Context: Impact of Number of Classes on Accuracy Interpretation (65% Accuracy Example)

Each scenario below shows a 65% accuracy. The left gauge has no context (fixed scale), while the right gauge adjusts its segments based on the number of classes (assuming balanced distribution for this visualization).

  • Two-Class: fixed-scale gauge at 65%; with class context, thresholds at 50 / 70 / 80 / 90 (chance = 50%)
  • Three-Class: fixed-scale gauge at 65%; with class context, thresholds at 33.3 / 53.3 / 63.3 / 73.3 (chance = 33.3%)
  • Four-Class: fixed-scale gauge at 65%; with class context, thresholds at 25 / 45 / 55 / 65 (chance = 25%)
  • Twelve-Class: fixed-scale gauge at 65%; with class context, thresholds at 8.3 / 28.3 / 38.3 / 48.3 (chance = 8.3%)

Key Insight: Number of Classes Drastically Alters Accuracy Perception

These examples demonstrate a critical point: you cannot interpret accuracy numbers without understanding the context of class count. A 65% accuracy score might be weak for a binary classifier (where chance is 50%) but represents strong performance for a 12-class problem (where chance is ~8.3%). Contextualizing the gauge for the number of classes (assuming balance for this step) is crucial for meaningful interpretation.

Article Topic Labeler - With Class Count Context

Our 5-class classifier (62% accuracy). The right gauge is contextualized for 5 balanced classes (20% chance baseline).

Labels: 5 classes (News, Sports, Business, Technology, Lifestyle), imbalanced distribution
Confusion matrix (rows = actual, columns = predicted):

              News  Sports  Business  Technology  Lifestyle
News           28      3        3          3          3
Sports          3      9        1          1          1
Business        3      1        8          2          1
Technology      3      1        2          8          1
Lifestyle       3      1        1          1          9

No context for interpretation

[Accuracy gauge: 62% on a fixed 0-100 scale]

With context for interpretation

[Accuracy gauge: 62% with contextual thresholds at 20 / 40 / 50 / 60 for a balanced 5-class problem]

Key Insight:

Without context (left gauge), 62% accuracy seems mediocre. With contextual segments for 5 balanced classes (right gauge), the same 62% accuracy is revealed to be excellent performance! It's far above the 20% random chance baseline and falls well into the 'great' segment of the gauge.

When accounting only for the 5-class nature (momentarily ignoring its imbalance), our Article Topic Labeler's 62% accuracy appears excellent, as it's far above the 20% random chance baseline for a balanced 5-class problem. However, this is only part of the story.

Context Type B: Class Imbalance

Adjusting the accuracy gauge for class imbalance is the next critical step. Even with the correct number of classes, if the classes themselves are not evenly distributed, the baseline for "chance" or "no skill" performance changes. Specifically, a naive strategy of always guessing the majority class can achieve deceptively high accuracy. The gauge thresholds must shift to reflect this.
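
Before looking at the examples, it helps to distinguish two "no skill" baselines. The short sketch below (hypothetical Python, not Plexus code) computes both: the accuracy of always guessing the majority class, and the expected accuracy of random guessing that matches the class priors (the sum of squared priors). The first contextual threshold in the gauges that follow lines up with the prior-matching value, while the always-majority accuracy is the figure a genuinely skilled classifier must clearly beat.

```python
def majority_baseline(priors):
    """Accuracy (%) from always predicting the most common class."""
    return 100.0 * max(priors)

def prior_matching_baseline(priors):
    """Expected accuracy (%) of random guesses drawn in proportion to the
    class priors: 100 * sum(p_i^2). Reduces to 100/k for balanced classes."""
    return 100.0 * sum(p * p for p in priors)

scenarios = {
    "75/25":    [0.75, 0.25],
    "80/10/10": [0.80, 0.10, 0.10],
    "95/5":     [0.95, 0.05],
}
for name, priors in scenarios.items():
    print(f"{name:>9}: always-majority {majority_baseline(priors):.1f}%, "
          f"prior-matching {prior_matching_baseline(priors):.1f}%")
# Output:
#     75/25: always-majority 75.0%, prior-matching 62.5%
#  80/10/10: always-majority 80.0%, prior-matching 66.0%
#      95/5: always-majority 95.0%, prior-matching 90.5%
```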

Visualizing Imbalance: Fixed vs. Class Distribution Context (75% Accuracy Example)

Consider a scenario with 75% accuracy on a binary task where classes are imbalanced (75% Class A, 25% Class B).

Without Class Distribution Context

Standard fixed gauge segments.

[Accuracy gauge: 75% with standard fixed thresholds at 50 / 70 / 80 / 90]

Potentially Misleading: 75% accuracy appears "almost viable."

With Class Distribution Context

Dynamically adjusted segments for 75%/25% imbalance.

[Accuracy gauge: 75% with contextual thresholds at 62.5 / 82.5 / 92.5 for the 75/25 distribution]

Correct Interpretation: 75% accuracy is at "chance" level (equivalent to always guessing the majority class).

Stacked Deck (75% Red) - Always Guessing Red

Scenario: deck with 75% red cards. Strategy: always guess 'Red'. Result: 75% accuracy. Right gauge contextualized for this 75/25 imbalance.

Labels: Binary (Red, Black), imbalanced distribution

No context for interpretation

[Accuracy gauge: 75% on a fixed 0-100 scale]

With context for interpretation

[Accuracy gauge: 75% with contextual thresholds at 62.5 / 82.5 / 92.5 for the 75/25 distribution]

Key Insight:

With fixed segments (left), 75% accuracy looks 'almost viable'. With segments contextualized for the 75/25 imbalance (right), it's correctly shown as chance-level performance, as always guessing 'Red' achieves this 75%.

The "Always Safe" Email Filter (97% Safe)

Scenario: 97% 'Safe' emails, 3% 'Prohibited'. Strategy: always predict 'Safe'. Result: 97% accuracy. Right gauge contextualized for 97/3 imbalance.

Labels: Binary (Safe, Prohibited), imbalanced distribution

No context for interpretation

[Accuracy gauge: 97% on a fixed 0-100 scale]

With context for interpretation

[Accuracy gauge: 97% with contextual thresholds compressed near 100 for the 97/3 distribution]

Key Insight:

With fixed segments (left), 97% accuracy looks 'great'. With segments contextualized for the 97/3 imbalance (right), it's correctly shown as chance-level. True skill requires >97% in this scenario. The colored 'good' and 'great' segments are compressed to the far right.

Key Insight: Class Imbalance Redefines "Good" Accuracy

These examples demonstrate how adding class distribution context to accuracy gauges transforms interpretation. What initially appears as good performance with fixed gauges can be revealed as merely baseline chance once class imbalance is factored in. The gauge segments must shift to show that genuinely good performance requires exceeding what simple strategies (like "always guess the majority class") would achieve.

Visualizing Context: Impact of Varying Class Imbalance (65% Accuracy Example)

Each scenario below shows a 65% accuracy. The top gauge uses fixed segments, while the bottom gauge adjusts segments based on the specified class imbalance.

  • Balanced (50/50): fixed thresholds 50 / 70 / 80 / 90; contextual thresholds 50 / 70 / 80 / 90 (unchanged)
  • Imbalanced (75/25): fixed thresholds 50 / 70 / 80 / 90; contextual thresholds 62.5 / 82.5 / 92.5
  • 3-Class Imbalanced (80/10/10): fixed thresholds 50 / 70 / 80 / 90; contextual thresholds 66.0 / 86.0
  • Highly Imbalanced (95/5): fixed thresholds 50 / 70 / 80 / 90; contextual threshold 90.5

All gauges read 65%.

Full Context: Combining Number of Classes AND Imbalance

Plexus's Accuracy gauges, when fully contextualized, account for *both* the number of classes and their distribution simultaneously. This provides the most accurate baseline against which to judge the observed accuracy.
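
Extending the hypothetical sketch from earlier, full context simply means plugging the actual imbalanced priors into the same prior-matching baseline. For the Article Topic Labeler's 40/15/15/15/15 distribution this gives a 25% chance level, matching the 25 / 45 / 55 / 65 segments on the contextualized gauge in the example below (again assuming the +20/+30/+40 offsets inferred earlier).

```python
priors = [0.40, 0.15, 0.15, 0.15, 0.15]  # News, Sports, Business, Technology, Lifestyle

# Chance baseline accounting for class count AND imbalance:
baseline = 100.0 * sum(p * p for p in priors)  # 16.0 + 4 * 2.25 = 25.0

# Segment boundaries, using the offsets inferred from the earlier examples:
segments = [round(baseline + o, 1) for o in (0, 20, 30, 40)]
print(baseline, segments)  # 25.0 [25.0, 45.0, 55.0, 65.0]
```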

Article Topic Labeler - With Full Context (Class Count & Imbalance)

Our 5-class imbalanced classifier (62% accuracy, 40% News). Right gauge contextualized for its 5-class nature AND its imbalanced distribution (40% News, 15% others).

Labels: 5 classes (News, Sports, Business, Technology, Lifestyle), imbalanced distribution
Confusion matrix (rows = actual, columns = predicted):

              News  Sports  Business  Technology  Lifestyle
News           28      3        3          3          3
Sports          3      9        1          1          1
Business        3      1        8          2          1
Technology      3      1        2          8          1
Lifestyle       3      1        1          1          9

No context for interpretation

[Accuracy gauge: 62% on a fixed 0-100 scale]

With context for interpretation

[Accuracy gauge: 62% with contextual thresholds at 25 / 45 / 55 / 65 for this 5-class imbalanced distribution]

Key Insight:

The left (fixed) gauge suggests 62% is merely 'converging'. The right (fully contextualized) gauge shows it as 'good', but not 'great'. The chance level here, accounting for a 40% majority class among 5 options, is 25%: higher than the 20% baseline for a balanced 5-class problem, though still below the 40% achievable by always guessing the majority class. This makes 62% good, but less impressive than it would be if the classes were balanced or if class count were considered without the imbalance.

Key Insight: Full Context is Nuanced

The fully contextualized accuracy gauge provides the most nuanced picture. For the Article Topic Labeler (62% accuracy, 5 classes, 40% majority class), the performance is good—significantly better than naive strategies (e.g., always guessing 'News' for 40% accuracy, or random 5-class guessing at 20%). However, the bar for 'great' is higher than if the classes were perfectly balanced, reflecting the slight advantage gained from the existing imbalance.

Strategy 2: The Agreement Gauge - Inherently Context-Aware

Rather than adding external context to interpret a raw accuracy gauge, an alternative and complementary approach is to use a metric that inherently incorporates this context. The Agreement gauge in Plexus (using Gwet's AC1 by default) does exactly this.

Standardized Interpretation Across All Scenarios

Metrics like Gwet's AC1 are designed to factor in both the number of classes and their distribution to calculate a chance-corrected agreement score. This creates a standardized scale:

  • 0.0: Performance equivalent to random chance agreement for that specific class distribution.
  • 1.0: Perfect agreement (all predictions correct).
  • -1.0: Perfect systematic disagreement (e.g., always wrong when it could be right).

A score of, for example, 0.6 on the Agreement gauge represents the same level of performance *above chance* regardless of whether it's a binary problem, a 10-class problem, or a highly imbalanced dataset.
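
For reference, here is a compact sketch of Gwet's AC1 for the two-"rater" case (ground truth vs. classifier), following Gwet's published first-order agreement coefficient. It is an illustration, not the Plexus implementation. Run on the balanced scenarios below, it reproduces the quoted values (about -0.04 for the coin flip and -0.03 for the card suits).

```python
from collections import Counter

def gwet_ac1(actual: list, predicted: list) -> float:
    """Gwet's AC1 for two 'raters' (ground truth vs. classifier).

    p_o = observed agreement (raw accuracy).
    p_e = chance agreement = (1/(K-1)) * sum_q pi_q * (1 - pi_q),
          where pi_q is the mean proportion of category q across both raters.
    AC1 = (p_o - p_e) / (1 - p_e).
    """
    n = len(actual)
    labels = set(actual) | set(predicted)
    k = len(labels)
    p_o = sum(a == p for a, p in zip(actual, predicted)) / n
    counts = Counter(actual) + Counter(predicted)
    p_e = sum((counts[q] / (2 * n)) * (1 - counts[q] / (2 * n))
              for q in labels) / (k - 1)
    return (p_o - p_e) / (1 - p_e)

# Balanced coin-flip scenario: 48 of 100 random guesses correct -> AC1 ~ -0.04
actual    = ["H"] * 50 + ["T"] * 50
predicted = ["H"] * 24 + ["T"] * 26 + ["H"] * 26 + ["T"] * 24
print(round(gwet_ac1(actual, predicted), 2))  # -0.04
```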

  • Opposite: [Agreement gauge: -1.0] Perfect systematic disagreement.
  • Random: [Agreement gauge: 0.0] No skill beyond chance.
  • Perfect: [Agreement gauge: 1.0] Perfect agreement.

Let's see how the Agreement gauge and the fully contextualized Accuracy gauge work together for various scenarios.

Random Coin Flip Prediction (50/50)

Fair coin, random guessing achieved 48% accuracy in this run.

Labels: Binary (Heads, Tails), balanced distribution

[Agreement gauge: AC1 = -0.04; thresholds at 0.2 / 0.5 / 0.8 on a -1 to 1 scale]
[Accuracy gauge: 48% with contextual thresholds at 50 / 70 / 80 / 90]

Key Insight:

Both gauges show performance slightly below chance. Agreement (AC1=-0.04) is just under 0. Contextual Accuracy (48%) is just under the 50% chance baseline for a balanced binary problem.

Random Card Suit Prediction (4 Balanced Classes)

Four equally likely suits, random guessing achieved 23% accuracy.

Labels: 4 classes (♥️, ♦️, ♣️, ♠️), balanced distribution

Confusion matrix (rows = actual, columns = predicted):

       ♥️   ♦️   ♣️   ♠️
♥️     12   13   13   14
♦️     13   12   14   13
♣️     13   14   12   13
♠️     14   13   13   12

[Agreement gauge: AC1 = -0.03; thresholds at 0.2 / 0.5 / 0.8]
[Accuracy gauge: 23% with contextual thresholds at 25 / 45 / 55 / 65]

Key Insight:

Both gauges indicate performance slightly below chance. Agreement (AC1=-0.03) is just under 0. Contextual Accuracy (23%) is just below the 25% chance baseline for a balanced 4-class problem.

Key Insight: Agreement Gauge Resists Deception from Imbalance

The true power of the Agreement gauge becomes evident in imbalanced scenarios where raw accuracy can be highly misleading. The Agreement gauge automatically adjusts for this, providing a consistent measure of skill beyond chance.

For example, a classifier that always guesses the majority class in an imbalanced dataset will have an Agreement score of 0.0 (or very close to it), clearly indicating no actual predictive ability, even if its raw accuracy is high. This exposes seemingly good performance as having no real skill.

Always Predicting Red (75/25 Stacked Deck)

Deck is 75% Red. Strategy: always guess 'Red'. Achieves 75% accuracy.

Labels: Binary (Red, Black), imbalanced distribution

Confusion matrix (rows = actual, columns = predicted):

        Red  Black
Red      75     0
Black    25     0

[Agreement gauge: AC1 = 0.00; thresholds at 0.2 / 0.5 / 0.8]
[Accuracy gauge: 75% with contextual thresholds at 62.5 / 82.5 / 92.5]

Key Insight:

Despite 75% accuracy, the Agreement gauge (AC1 = 0.00) correctly shows no predictive skill beyond chance: always guessing the majority class is effectively a chance-level strategy in this context. The contextual Accuracy gauge likewise shows that 75% is the baseline level here.

The "Always Safe" Email Filter - Both Gauges Revealing the Truth

Dataset: 97% 'Safe', 3% 'Prohibited'. Filter always predicts 'Safe', achieving 97% accuracy.

Labels: Binary (Safe, Prohibited), imbalanced distribution

Confusion matrix (rows = actual, columns = predicted):

              Safe  Prohibited
Safe           970       0
Prohibited      30       0

[Agreement gauge: AC1 = 0.00; thresholds at 0.2 / 0.5 / 0.8]
[Accuracy gauge: 97% with contextual thresholds compressed near 100 for the 97/3 distribution]

Key Insight:

The extremity of this example is powerful. Despite a 97% accuracy, both gauges reveal the truth: zero predictive skill. Agreement (AC1=0.0) and the contextual Accuracy gauge (showing 97% is the baseline) both expose this filter as useless for catching prohibited content.

Benefits of the Agreement Gauge

  • Simplified Interpretation: Users don't need to mentally factor in class distribution or number of classes; the gauge does it for them, and 0.0 always means chance-level performance.
  • Direct Comparability: Agreement scores can be directly compared across different classifiers and datasets, regardless of their underlying class structures.
  • Immediate Insight: Instantly reveals whether a classifier has actual predictive power beyond what's expected by chance for that specific problem.
  • Resistance to Deception: Exposes seemingly high accuracy numbers that actually represent no real predictive skill in imbalanced situations.

Article Topic Labeler - The Complete Picture with Both Gauges

Our 5-class imbalanced classifier (62% accuracy, 40% News). How do both gauges tell the story?

Labels: 5 classes (News, Sports, Business, Technology, Lifestyle), imbalanced distribution

Confusion matrix (rows = actual, columns = predicted):

              News  Sports  Business  Technology  Lifestyle
News           28      3        3          3          3
Sports          3      9        1          1          1
Business        3      1        8          2          1
Technology      3      1        2          8          1
Lifestyle       3      1        1          1          9

[Agreement gauge: AC1 = 0.51; thresholds at 0.2 / 0.5 / 0.8]
[Accuracy gauge: 62% with contextual thresholds at 25 / 45 / 55 / 65]

Key Insight:

The Agreement gauge (AC1 = 0.512) shows moderate agreement beyond chance, inherently accounting for the 5-class nature and the 40/15/15/15/15 distribution. This score directly tells us its skill level above baseline. The fully contextualized Accuracy gauge confirms this: 62% is 'good' for this specific setup, but not in the highest tier. Both gauges provide a consistent, nuanced view.

For the Article Topic Labeler, the Agreement score of 0.512 instantly provides a clear assessment: moderate predictive skill. This single, standardized number shows it performs notably better than chance for its specific configuration, but isn't achieving top-tier agreement. It complements the contextualized Accuracy gauge, which visually shows *why* 62% is considered "good" in this particular imbalanced multi-class scenario.

Conclusion: A Multi-Faceted View for True Insight

Plexus utilizes both contextualized Accuracy gauges and inherently context-aware Agreement gauges (like Gwet's AC1). This dual approach provides a comprehensive and reliable understanding of classifier performance:

  • Contextualized Accuracy Gauge: Helps interpret the familiar 'percent correct' metric by visually adjusting its scale (colors and thresholds) based on the specific problem's class count and imbalance. It answers: "How good is this raw accuracy *for this particular problem setup*?"
  • Agreement Gauge (e.g., Gwet's AC1): Provides a standardized, chance-corrected score. It inherently accounts for class count and imbalance. It answers: "How much skill does this classifier demonstrate *beyond random chance*, in a way that's comparable across different problems?"

Together, these gauges offer robust insights, preventing misinterpretation of raw accuracy numbers and clearly highlighting a classifier's true performance relative to baseline expectations and its inherent skill.