Examples

Let's explore a variety of classifier scenarios to see how the Agreement (AC1) and Accuracy gauges represent different performance levels across different data distributions.

Balanced Distributions

When dealing with balanced distributions, where each class has an equal (or nearly equal) number of instances, the number of classes itself becomes a critical factor in interpreting raw accuracy. A 65% accuracy score, for instance, means something very different for a 2-class problem (where chance is 50%) compared to a 4-class problem (where chance is 25%). The dynamically colored segments on the Accuracy gauge are designed to help with this: they visually adjust the 'chance', 'okay', 'good', and 'great' regions based on the number of classes, providing immediate visual context for how the achieved accuracy compares to the baseline random chance performance for that specific number of classes.
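As a rough sketch of how such dynamic segments can be derived, the snippet below computes a chance baseline from the class distribution (the sum of squared class proportions, which reduces to 1/K for K balanced classes) and places band boundaries at fixed offsets above it. The helper names and the +20/+30/+40-point offsets are illustrative assumptions consistent with the examples in this section, not a specification of the gauge.

```python
def chance_baseline(class_proportions):
    """Expected accuracy of a random guesser whose predictions mirror the
    actual class distribution: the sum of squared class proportions.
    For K balanced classes this reduces to 1/K."""
    return sum(p * p for p in class_proportions)

def gauge_boundaries(class_proportions):
    """Boundaries (in percent) between the 'chance', 'okay', 'good', and
    'great' regions. The +20/+30/+40-point offsets above the chance
    baseline are an illustrative assumption."""
    base = 100 * chance_baseline(class_proportions)
    return [round(base + offset, 1) for offset in (0, 20, 30, 40)]

print(gauge_boundaries([0.5, 0.5]))   # [50.0, 70.0, 80.0, 90.0]
print(gauge_boundaries([0.25] * 4))   # [25.0, 45.0, 55.0, 65.0]
```

With two balanced classes the regions start at 50%, so a 65% accuracy sits just above chance; with four balanced classes they start at 25%, so the same 65% clears a much larger gap.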

Gwet's AC1 Agreement gauge, on the other hand, adapts to the number of classes in a different but equally powerful way. The AC1 calculation inherently accounts for chance agreement based on the number of classes and their distribution. This means Gwet's AC1 itself (ranging from -1 to 1) can be interpreted consistently: a score of 0.0 always indicates performance no better than chance, 1.0 indicates perfect agreement, and values in between (e.g., 0.2-0.4 for fair, 0.4-0.6 for moderate, 0.6-0.8 for substantial, 0.8-1.0 for almost perfect agreement) carry a similar meaning regardless of whether you have two, three, or ten classes. This consistency makes the Agreement gauge a very reliable indicator of true classifier skill, corrected for chance.
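For reference, the standard two-rater AC1 computation can be sketched directly from a confusion matrix. `gwet_ac1` is a hypothetical helper name, and the matrix orientation (rows = actual, columns = predicted) is an assumption.

```python
import numpy as np

def gwet_ac1(cm):
    """Gwet's AC1 from a confusion matrix (rows = actual, cols = predicted).

    po:   observed agreement (mass on the diagonal)
    pi_k: mean of the actual and predicted marginal proportions per class
    pe:   Gwet's chance-agreement term, sum of pi_k*(1 - pi_k) / (K - 1)
    """
    cm = np.asarray(cm, dtype=float)
    n, k = cm.sum(), cm.shape[0]
    po = np.trace(cm) / n
    pi = (cm.sum(axis=0) + cm.sum(axis=1)) / (2 * n)
    pe = (pi * (1 - pi)).sum() / (k - 1)
    return (po - pe) / (1 - pe)

# A classifier that agrees no better than chance lands at 0.0:
print(round(gwet_ac1([[25, 25], [25, 25]]), 2))  # 0.0
```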

Binary Classifier, Balanced (65% Accuracy)

Two classes ('Yes', 'No'), 50 items each. Classifier achieves 65/100 correct.

Labels: Binary ('Yes', 'No'), balanced distribution.

Confusion matrix (rows = actual, columns = predicted):

                Pred Yes   Pred No
    Actual Yes        32        18
    Actual No         17        33

Agreement gauge: AC1 = 0.30 (scale -1 to 1; markers at 0.2 / 0.5 / 0.8)
Accuracy gauge: 65% (segment boundaries at 50 / 70 / 80 / 90)

Key Insight:

With 65% accuracy on a balanced binary task (chance = 50%), the AC1 is 0.30. This is only 15 points above chance, suggesting mediocre performance. The contextual Accuracy gauge shows 65% sitting only modestly above the 50% chance baseline.

Ternary Classifier, Balanced (65% Accuracy)

Three classes ('Yes', 'No', 'NA'), roughly equal distribution. Classifier achieves 65/100 correct.

Labels: 3 classes ('Yes', 'No', 'NA'), balanced distribution.

Confusion matrix (rows = actual, columns = predicted):

                Pred Yes   Pred No   Pred NA
    Actual Yes        22         6         6
    Actual No          6        22         5
    Actual NA          6         6        21

Agreement gauge: AC1 = 0.47 (scale -1 to 1; markers at 0.2 / 0.5 / 0.8)
Accuracy gauge: 65% (segment boundaries at 33.3 / 53.3 / 63.3 / 73.3)

Key Insight:

With 65% accuracy on a balanced 3-class task (chance approx 33.3%), the AC1 is ~0.475. This is ~31.7 points above chance, indicating fairly good performance. The gauges reflect this improvement over the binary case.

Four-Class Classifier, Balanced (65% Accuracy)

Four classes ('A', 'B', 'C', 'D'), 25 items each. Classifier achieves 65/100 correct.

Labels: 4 classes ('A', 'B', 'C', 'D'), balanced distribution.

Confusion matrix (rows = actual, columns = predicted):

                Pred A   Pred B   Pred C   Pred D
    Actual A        16        3        3        3
    Actual B         3       16        3        3
    Actual C         3        3       16        3
    Actual D         3        2        3       17

Agreement gauge: AC1 = 0.53 (scale -1 to 1; markers at 0.2 / 0.5 / 0.8)
Accuracy gauge: 65% (segment boundaries at 25 / 45 / 55 / 65)

Key Insight:

With 65% accuracy on a balanced 4-class task (chance = 25%), the AC1 is ~0.533. This is a substantial 40 points above chance, representing good performance. The gauges clearly show this as better than the binary and ternary examples with the same accuracy.

Key Insight: Same Accuracy, Different Meanings

These three examples all showcase a classifier achieving 65% accuracy. However, the interpretation of this performance shifts dramatically when we consider the problem context, especially the number of classes and the baseline chance agreement. Gwet's AC1 and the contextualized Accuracy gauge help clarify these differences:

  • For the binary classifier (2 classes, 50% chance baseline), 65% accuracy yields an AC1 of 0.30. This is only 15 percentage points above chance, indicating somewhat mediocre performance.
  • For the ternary classifier (3 classes, ~33.3% chance baseline), the same 65% accuracy results in an AC1 of approximately 0.475. This is about 31.7 points above chance, suggesting fairly good performance.
  • For the four-class classifier (4 classes, 25% chance baseline), 65% accuracy gives an AC1 of approximately 0.533. This is a substantial 40 points above chance, representing good performance.

This clearly demonstrates that a raw accuracy score like 65% can be misleading on its own. Its true meaning heavily depends on the context of the task. Gwet's AC1 provides a more robust and comparable measure of agreement, highlighting how the same accuracy can correspond to vastly different levels of skill relative to chance.
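The three results above can be reproduced from the confusion matrices in this section with a short sketch of the standard AC1 formula (hypothetical helper name; rows = actual, columns = predicted is an assumption):

```python
import numpy as np

def gwet_ac1(cm):
    """Gwet's AC1: (po - pe) / (1 - pe), with pe from the mean marginals."""
    cm = np.asarray(cm, dtype=float)
    n, k = cm.sum(), cm.shape[0]
    po = np.trace(cm) / n
    pi = (cm.sum(axis=0) + cm.sum(axis=1)) / (2 * n)
    pe = (pi * (1 - pi)).sum() / (k - 1)
    return (po - pe) / (1 - pe)

binary  = [[32, 18], [17, 33]]                    # 65/100 correct
ternary = [[22, 6, 6], [6, 22, 5], [6, 6, 21]]    # 65/100 correct
four    = [[16, 3, 3, 3], [3, 16, 3, 3],
           [3, 3, 16, 3], [3, 2, 3, 17]]          # 65/100 correct

for name, cm in [("binary", binary), ("ternary", ternary), ("4-class", four)]:
    print(f"{name}: accuracy 65%, AC1 = {gwet_ac1(cm):.3f}")
```

The same 65% raw accuracy yields AC1 values of roughly 0.300, 0.475, and 0.533 as the class count grows and the chance baseline falls.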

Imbalanced Distributions

Interpreting classifier performance becomes even more challenging with imbalanced distributions, where one or more classes are significantly over- or underrepresented. Raw accuracy, in particular, can be dangerously misleading. A classifier might achieve a high accuracy score simply by predicting the majority class most of the time, while completely failing on minority classes. The Accuracy gauge, with its dynamically adjusting segments, remains a key tool. It calculates a baseline chance agreement level that considers the skewed distribution. This means the 'good' performance regions on the gauge will shift, often to higher accuracy values, reflecting that a higher raw accuracy is needed to demonstrate skill beyond merely guessing the dominant class. The upcoming "Always No" strategy example will starkly illustrate this: high apparent accuracy, but the gauge will reveal it as unimpressive once contextualized against the skewed baseline.

Gwet's AC1 Agreement gauge proves especially invaluable for imbalanced datasets. Because it inherently corrects for chance agreement that arises from the specific class distribution (no matter how skewed), Gwet's AC1 provides a stable and reliable measure of a classifier's ability to agree with true labels beyond what random chance would produce for that particular imbalance. For instance, if a model achieves a high Gwet's AC1 on an imbalanced dataset, it indicates genuine skill in distinguishing between classes, including the rarer ones. Conversely, as we will see in the "Always No" example, a strategy that yields high raw accuracy by ignoring the minority class will correctly result in a Gwet's AC1 of 0.0, exposing its lack of true predictive power across the full spectrum of classes.
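A minimal sketch makes the danger concrete: a degenerate majority-class strategy on a 5%/95% split scores 95% raw accuracy, yet the distribution-aware chance baseline already sits at 90.5%. The sum-of-squared-proportions baseline is an assumption about how the gauge derives its chance level, consistent with the gauge values in the examples that follow.

```python
from collections import Counter

actual = ["Yes"] * 5 + ["No"] * 95     # 5% 'Yes', 95% 'No'
predicted = ["No"] * 100               # the 'Always No' strategy

# Raw accuracy: fraction of items where prediction matches the actual label.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Chance baseline: accuracy of a random guesser whose predictions mirror
# the actual class distribution (sum of squared class proportions).
props = [count / len(actual) for count in Counter(actual).values()]
baseline = sum(p * p for p in props)

print(f"accuracy = {accuracy:.0%}, chance baseline = {baseline:.1%}")
# accuracy = 95%, chance baseline = 90.5%
```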

Binary Classifier, Imbalanced (5% 'Yes'), 90% Accuracy

Actual: 5 'Yes', 95 'No'. Classifier gets 90/100 correct.

Labels: Binary ('Yes', 'No'), imbalanced distribution (5 'Yes', 95 'No').

Confusion matrix (rows = actual, columns = predicted):

                Pred Yes   Pred No
    Actual Yes         4         1
    Actual No          9        86

Agreement gauge: AC1 = 0.40 (scale -1 to 1; markers at 0.2 / 0.5 / 0.8)
Accuracy gauge: 90% (chance baseline ~90.5%)

Key Insight:

Despite 90% raw accuracy, the severe imbalance (only 5% 'Yes') means this performance is not as strong as it seems. Gwet's AC1 of ~0.401 indicates moderate agreement above chance. The dynamic accuracy gauge also reflects that beating the ~90.5% chance level (Pe for accuracy) is harder here.

Binary Classifier, Imbalanced (5% 'Yes'), 'Always No' Strategy

Actual: 5 'Yes', 95 'No'. Classifier always predicts 'No'.

Labels: Binary ('Yes', 'No'), imbalanced distribution (5 'Yes', 95 'No').

Confusion matrix (rows = actual, columns = predicted):

                Pred Yes   Pred No
    Actual Yes         0         5
    Actual No          0        95

Agreement gauge: AC1 = 0.0 (scale -1 to 1; markers at 0.2 / 0.5 / 0.8)
Accuracy gauge: 95% (chance baseline ~90.5%)

Key Insight:

This 'cheating' classifier achieves 95% accuracy simply by predicting the majority class ('No') every time. However, Gwet's AC1 is 0.0, correctly exposing that it has zero predictive skill for the minority 'Yes' class and offers no value despite the high accuracy.

Ternary Classifier, Imbalanced (90% Accuracy)

Classes A:5, B:45, C:50. Classifier gets 90/100 correct.

Labels: 3 classes, imbalanced distribution (A: 5, B: 45, C: 50).

Confusion matrix (rows = actual, columns = predicted):

                Pred A   Pred B   Pred C
    Actual A         3        1        1
    Actual B         2       41        2
    Actual C         2        2       46

Agreement gauge: AC1 = 0.82 (scale -1 to 1; markers at 0.2 / 0.5 / 0.8)
Accuracy gauge: 90% (segment boundaries at 45.5 / 65.5 / 75.5 / 85.5)

Key Insight:

Even with imbalanced classes (one very rare), 90% accuracy yields a strong Gwet's AC1 of ~0.819. This indicates genuine predictive skill well above what chance agreement (Pe ~0.4465) would suggest for this distribution.
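The quoted chance level can be reproduced from the matrix marginals. The sketch below computes Pe as the product-of-marginals chance-agreement term, which matches the ~0.4465 cited above; treating this as the gauge's Pe is an inference from the numbers, not a documented formula.

```python
import numpy as np

cm = np.array([[3,  1,  1],    # actual A (5 items)
               [2, 41,  2],    # actual B (45 items)
               [2,  2, 46]])   # actual C (50 items)

n = cm.sum()
po = np.trace(cm) / n               # observed agreement: 0.90
p_actual = cm.sum(axis=1) / n       # actual marginals:    [0.05, 0.45, 0.50]
p_pred = cm.sum(axis=0) / n         # predicted marginals: [0.07, 0.44, 0.49]
pe = (p_actual * p_pred).sum()      # product-of-marginals chance term
score = (po - pe) / (1 - pe)        # chance-corrected agreement

print(f"Po = {po:.2f}, Pe = {pe:.4f}, chance-corrected score = {score:.3f}")
# Po = 0.90, Pe = 0.4465, chance-corrected score = 0.819
```

Both the Pe of 0.4465 and the chance-corrected score of ~0.819 line up with the values quoted in the insight.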

Four-Class Classifier, Imbalanced (90% Accuracy)

Classes A:5, B:15, C:30, D:50. Classifier gets 90/100 correct.

Labels: 4 classes, imbalanced distribution (A: 5, B: 15, C: 30, D: 50).

Confusion matrix (rows = actual, columns = predicted):

                Pred A   Pred B   Pred C   Pred D
    Actual A         3        1        1        0
    Actual B         1       12        1        1
    Actual C         1        1       27        1
    Actual D         1        1        0       48

Agreement gauge: AC1 = 0.84 (scale -1 to 1; markers at 0.2 / 0.5 / 0.8)
Accuracy gauge: 90% (segment boundaries at 36.5 / 56.5 / 66.5 / 76.5)

Key Insight:

With a more complex 4-class imbalanced scenario, 90% accuracy achieves a Gwet's AC1 of ~0.843. This is excellent agreement, demonstrating robust performance significantly above the chance level (Pe ~0.3625) for this specific distribution.

Next Steps

Now that you understand how to interpret these agreement gauges, explore related concepts to get the most out of your evaluation data.