The Plexus Agreement Gauge

The Agreement Gauge in Plexus typically displays a chance-corrected agreement coefficient, such as Gwet's AC1. Unlike raw accuracy, which can be misleading, these metrics are designed to measure concordance (or model performance) while accounting for agreement that could occur purely by chance. This provides a more reliable assessment of a classifier's true skill.

Why Use a Chance-Corrected Metric?

As discussed on our Accuracy Gauge page, raw accuracy figures can be easily misinterpreted due to factors like the number of classes and class imbalance. A high accuracy score might not always indicate good performance if chance agreement is also high.

Agreement coefficients, like Gwet's AC1, address this by factoring out the expected chance agreement. The resulting score reflects the extent to which the observed agreement (or model accuracy) exceeds what would be expected by chance, given the specific characteristics of the data (including class distributions and number of classes).
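
For readers who want to see the arithmetic behind such a coefficient, here is a minimal sketch of a two-rater Gwet's AC1 calculation. The function name gwets_ac1 and its interface are illustrative rather than the Plexus implementation, and the sketch assumes at least two categories appear in the data:

```python
from collections import Counter

def gwets_ac1(labels_a, labels_b):
    """Illustrative two-rater Gwet's AC1: (p_o - p_e) / (1 - p_e)."""
    assert len(labels_a) == len(labels_b) and labels_a, "need paired, non-empty label lists"
    n = len(labels_a)
    categories = sorted(set(labels_a) | set(labels_b))
    k = len(categories)  # assumes k >= 2

    # Observed agreement: fraction of items where the two label sources match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Average prevalence of each category across both label sources.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    prevalence = {c: (counts_a[c] + counts_b[c]) / (2 * n) for c in categories}

    # AC1's chance-agreement term, spread across the k categories.
    p_e = sum(p * (1 - p) for p in prevalence.values()) / (k - 1)

    # How far observed agreement exceeds chance, relative to the headroom above chance.
    return (p_o - p_e) / (1 - p_e)
```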

Learn More About Contextual Challenges

For a comprehensive discussion of how Plexus addresses various contextual factors in evaluation, see the related documentation.

How the Plexus Agreement Gauge Works

The Agreement Gauge displays the calculated agreement coefficient, typically Gwet's AC1. This score usually ranges from -1 to +1:

  • +1.0: Perfect agreement.
  • 0.0: Agreement is exactly what would be expected by chance. The model shows no skill beyond random guessing relative to the class distribution.
  • -1.0: Perfect systematic disagreement (meaning the model is consistently wrong in a patterned way).
  • Values between 0 and 1 indicate varying degrees of agreement better than chance; the short sketch below reproduces the two endpoints with toy data.
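
The two endpoints are easy to reproduce with the illustrative gwets_ac1 sketch from earlier (the exact label mix needed to land precisely at 0.0 depends on the data, so only the extremes are shown):

```python
# Perfect agreement: every prediction matches the reference label -> AC1 = 1.0
print(gwets_ac1(["Safe", "Prohibited", "Safe", "Safe"],
                ["Safe", "Prohibited", "Safe", "Safe"]))

# Perfect systematic disagreement: the labels are always swapped -> AC1 = -1.0
print(gwets_ac1(["Safe", "Prohibited"], ["Prohibited", "Safe"]))
```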

The visual segments on the Agreement Gauge (colors like red, yellow, green) are generally based on established benchmarks for interpreting the strength of agreement coefficients (like those proposed by Landis & Koch for Kappa, which are often adapted for AC1). Because the AC1 score already has the context of class distribution and chance agreement baked into its calculation, these visual segments tend to be fixed. The interpretation of an AC1 score of, say, 0.7 (substantial agreement) is consistent regardless of whether it's a 2-class or 10-class problem, or whether the data is balanced or imbalanced.

Example: Gwet's AC1 Gauge

[Gauge visualization: a Gwet's AC1 gauge on a fixed -1 to +1 scale, with segment boundaries at 0.2, 0.5, and 0.8]

An AC1 score of 0.65 indicates 'Substantial' agreement, according to common benchmarks. This interpretation is generally stable across different problem contexts because the metric itself is context-adjusted.

Agreement Gauge in Action: Exposing Misleading Accuracy

One of the most powerful uses of the Agreement Gauge is its ability to reveal situations where high raw accuracy is deceptive. Consider the "Always Safe" Email Filter example:

The 'Always Safe' Email Filter - Agreement View

Strategy: Label ALL emails as 'Safe'. Raw Accuracy: 97%. Gwet's AC1: 0.0.

Labels: Binary (Safe / Prohibited), imbalanced distribution
Predicted classes: Safe only

Confusion matrix (rows = Actual, columns = Predicted):

                     Predicted Safe    Predicted Prohibited
  Actual Safe              970                  0
  Actual Prohibited         30                  0

[Agreement gauge: Gwet's AC1 = 0.0 on a fixed -1 to +1 scale]
[Accuracy gauge: 97% on a 0-100% scale]

Key Insight:

Despite a 97% raw accuracy (which looks great on a fixed scale), the Gwet's AC1 score of 0.0 immediately signals that the filter has ZERO predictive skill beyond what's expected by chance for this highly imbalanced dataset. It's no better than a random process that respects the 97/3 base rates.

In this scenario, the Agreement Gauge cuts through the illusion. While the (uncontextualized) accuracy might suggest high performance, the AC1 score of 0.0 tells the true story: the model has learned nothing useful for distinguishing prohibited content.
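
To see why the corrected score collapses to zero here, it helps to look at the correction itself. The sketch below uses the generic chance-corrected form (p_o - p_e) / (1 - p_e) with a marginal-based chance term, which is the chance model used by kappa-style statistics; different coefficients define the chance term differently, but the mechanics are the same. The helper name chance_corrected is illustrative:

```python
def chance_corrected(p_observed, p_chance):
    """Generic chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    return (p_observed - p_chance) / (1 - p_chance)

# 'Always Safe' filter: observed agreement with the reference labels is 970/1000.
p_observed = 970 / 1000

# Marginal-based chance agreement: the filter says 'Safe' 100% of the time and the
# data is 97% 'Safe', so matching by chance alone is expected 0.97 of the time.
p_chance = 1.0 * 0.97 + 0.0 * 0.03

print(chance_corrected(p_observed, p_chance))  # 0.0 -- no skill beyond chance
```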

Interpreting Agreement Score Ranges

While context is embedded in the score, it is still helpful to have general benchmarks for interpreting the strength of agreement indicated by Gwet's AC1, much as is done for kappa statistics. The following ranges are widely used (adapted from Landis & Koch, 1977, who originally proposed them for Kappa):

AC1 Value        Strength of Agreement
< 0.00           Poor (Worse than chance)
0.00 – 0.20      Slight
0.21 – 0.40      Fair
0.41 – 0.60      Moderate
0.61 – 0.80      Substantial
0.81 – 1.00      Almost Perfect

Note: These are general guidelines. The practical significance of an agreement score also depends on the specific application and consequences of misclassification.
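
If you want these benchmark bands in code, a simple lookup that mirrors the table above might look like the following (the function name agreement_strength is illustrative, not part of the Plexus API):

```python
def agreement_strength(ac1: float) -> str:
    """Map an AC1 value to the Landis & Koch-style band from the table above."""
    if ac1 < 0.00:
        return "Poor (worse than chance)"
    if ac1 <= 0.20:
        return "Slight"
    if ac1 <= 0.40:
        return "Fair"
    if ac1 <= 0.60:
        return "Moderate"
    if ac1 <= 0.80:
        return "Substantial"
    return "Almost Perfect"

print(agreement_strength(0.65))  # "Substantial"
```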

A Complementary Perspective

The Agreement Gauge offers a powerful, standardized way to assess performance corrected for chance. It's best used alongside the contextualized Accuracy Gauge.

  • The Accuracy Gauge (with its dynamic segments) helps you understand what a raw percentage means *for your specific problem's class structure and imbalance*.
  • The Agreement Gauge tells you how much better than chance your model is performing, in a way that's *comparable across different problems and datasets*.

Together, they provide a comprehensive view, helping you avoid misinterpretations and gain true insight into your classifier's performance.

Key Takeaways

  • The Plexus Agreement Gauge typically displays a chance-corrected metric like Gwet's AC1.
  • This metric inherently accounts for the number of classes and class imbalance, providing a score of "skill beyond chance."
  • A score of 0.0 means performance is no better than random chance for that context; +1.0 is perfect agreement.
  • The visual segments on the Agreement Gauge are generally fixed, as the metric value itself is already context-normalized.
  • It's a powerful tool for unmasking misleadingly high raw accuracy scores, especially in imbalanced datasets.
  • Use it alongside the contextualized Accuracy Gauge for a complete understanding.