Scores
Scores are the fundamental building blocks of evaluation in Plexus. They define what you want to measure or assess about your content.
What are Scores?
Scores are individual evaluation criteria, each defining a specific aspect you want to assess in your content. A score is often phrased as a question, but it isn't limited to questions: it can be any type of evaluation point that helps analyze your content.
Types of Scores
Scores can take many forms, including:
- Questions: "Did the agent introduce themselves by stating the company name?"
- Sentiment Analysis: Evaluating whether the content's tone is positive, negative, or neutral
- Compliance Checks: Verifying that specific required elements are present
- Metrics: Quantitative measurements like response time or word count
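To make the list concrete, the sketch below pairs each score type with a sample criterion and the kind of result it might produce. These are purely illustrative examples and do not reflect the actual Plexus score configuration format.

```python
# Illustrative only: hypothetical examples pairing each score type with a
# sample criterion and the kind of result it could produce. Not the actual
# Plexus configuration format.
example_scores = [
    {"type": "question",   "criterion": "Did the agent state the company name?", "result_format": "yes/no"},
    {"type": "sentiment",  "criterion": "Overall tone of the conversation",      "result_format": "positive/negative/neutral"},
    {"type": "compliance", "criterion": "Required disclosure statement present", "result_format": "yes/no"},
    {"type": "metric",     "criterion": "Total word count of the response",      "result_format": "numeric"},
]

for score in example_scores:
    print(f"{score['type']:>10}: {score['criterion']} -> {score['result_format']}")
```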
Score Components
Each score consists of:
- Criteria: What exactly is being evaluated
- Evaluation Method: How the score should be determined
- Response Format: The type of result (yes/no, numeric, categorical, etc.)
- Instructions: Guidelines for consistent evaluation
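As a rough sketch, these components could be modeled as a simple data structure. The class and field names below are hypothetical, chosen only to mirror the list above; they are not the Plexus API.

```python
from dataclasses import dataclass

# Hypothetical sketch mirroring the components listed above; not the Plexus API.
@dataclass
class ScoreDefinition:
    name: str               # e.g. "Agent Introduction"
    criteria: str           # what exactly is being evaluated
    evaluation_method: str  # how the score should be determined (e.g. "llm", "classifier")
    response_format: str    # type of result: "yes/no", "numeric", "categorical", ...
    instructions: str       # guidelines for consistent evaluation

intro_score = ScoreDefinition(
    name="Agent Introduction",
    criteria="Did the agent introduce themselves by stating the company name?",
    evaluation_method="llm",
    response_format="yes/no",
    instructions="Answer 'yes' only if the company name is stated explicitly.",
)
```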
Score Results
Plexus standardizes all score results around a common structure, ensuring consistency and enabling powerful analysis capabilities across different types of evaluations.
Result Structure
Every score result in Plexus contains these core components:
- Result Value: The actual outcome of the evaluation (e.g., "yes"/"no", a numeric score, or a category)
- Explanation: A detailed description of why this result was chosen, providing transparency into the decision-making process. For LLM-based scores, this often includes chain-of-thought reasoning similar to what you might see from reasoning models like OpenAI's o1/o3, Google's "thinking" models, or DeepSeek R1.
- Confidence Level: For applicable scores (like machine learning classifiers or LLM-based evaluations), this indicates how certain the system is about the result. This can be used for filtering, quality control, or triggering human review when needed.
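A minimal sketch of that common structure might look like the following. The class and field names are assumptions made for illustration, not the actual Plexus result schema.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical sketch of the standardized result structure described above;
# field names are illustrative, not the actual Plexus schema.
@dataclass
class ScoreResult:
    value: Union[str, float]            # e.g. "yes"/"no", a numeric score, or a category
    explanation: str                    # why this result was chosen (often chain-of-thought for LLM scores)
    confidence: Optional[float] = None  # 0.0-1.0 when the scorer can estimate certainty

result = ScoreResult(
    value="yes",
    explanation="The agent opened with 'Thank you for calling Acme Corp', which states the company name.",
    confidence=0.92,
)
```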
Using Result Components
The standardized result structure enables powerful workflows:
- Use confidence levels to filter results or trigger additional review for low-confidence evaluations
- Leverage explanations for quality assurance, training, and understanding model decision-making
- Build consistent interfaces and analysis tools that work across all score types
- Create automated workflows based on result values while maintaining full explainability
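For example, a routing step built on the ScoreResult sketch above could flag low-confidence results for human review. The 0.7 threshold and the routing labels are illustrative assumptions, not Plexus features.

```python
# Continuing the hypothetical ScoreResult sketch above.
# The threshold and routing labels are assumptions, not Plexus features.
REVIEW_THRESHOLD = 0.7

def route(result: ScoreResult) -> str:
    """Send low-confidence results to human review; accept the rest automatically."""
    if result.confidence is not None and result.confidence < REVIEW_THRESHOLD:
        return "human_review"
    return "auto_accept"

print(route(result))  # -> "auto_accept" for the 0.92-confidence example above
```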
Using Scores
Scores are organized into scorecards, which group related evaluation criteria together. When you run an evaluation, each score in the scorecard is applied to your content, building a comprehensive assessment.
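Conceptually, running an evaluation is a loop over the scorecard's scores against the same piece of content. Continuing the hypothetical sketches above, the helper below stands in for whatever evaluation method each score is configured to use; none of these names are part of the actual Plexus API.

```python
# Conceptual sketch only: a scorecard as a collection of score definitions,
# each applied to the same content. Not the actual Plexus API.
def run_score(score: ScoreDefinition, content: str) -> ScoreResult:
    """Placeholder evaluator: a real implementation would apply the score's
    configured method (LLM prompt, classifier, rule) to the content."""
    return ScoreResult(value="n/a", explanation=f"Placeholder result for '{score.name}'.")

def evaluate(scorecard: list[ScoreDefinition], content: str) -> dict[str, ScoreResult]:
    """Apply every score in the scorecard to the same content."""
    return {score.name: run_score(score, content) for score in scorecard}

report = evaluate([intro_score], "Thank you for calling Acme Corp...")
```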
Score Versions
Scores in Plexus support versioning, allowing you to track changes to score configurations over time. Each version represents a different configuration of the score, with one version designated as the "champion" (active) version that's used for evaluations.
- Champion Version: The currently active version used for evaluations
- Featured Versions: Versions that are highlighted for importance or reference
- Configuration: Each version contains its own configuration, including prompts, parameters, and other settings
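As a rough illustration of how versioning might be modeled, the sketch below tracks multiple configurations with one champion. The field names are assumptions for illustration, not the actual Plexus data model.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical sketch of score versioning as described above; not the Plexus data model.
@dataclass
class ScoreVersion:
    created_at: datetime
    configuration: dict        # prompts, parameters, and other settings
    is_champion: bool = False  # the active version used for evaluations
    is_featured: bool = False  # highlighted for importance or reference

versions = [
    ScoreVersion(datetime(2024, 1, 5), {"prompt": "v1 prompt"}, is_featured=True),
    ScoreVersion(datetime(2024, 3, 12), {"prompt": "v2 prompt"}, is_champion=True),
]

champion = next(v for v in versions if v.is_champion)
```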
You can view score versions using the CLI command:

```
plexus scores info --scorecard "Example Scorecard" --score "Example Score"
```
This command displays up to 10 versions in reverse chronological order (newest first), showing which version is the champion and which versions are featured.