Scores
Scores are the fundamental building blocks of evaluation in Plexus. They define what you want to measure or assess about your content.
What are Scores?
Scores are individual evaluation criteria, each defining a specific aspect you want to assess in your content. A score is often phrased as a question, but it isn't limited to questions: it can be any type of evaluation point that helps analyze your content.
Types of Scores
Scores can take many forms, including:
- Questions: "Did the agent introduce themselves by stating the company name?"
- Sentiment Analysis: Evaluating whether the content's tone is positive, negative, or neutral
- Compliance Checks: Verifying that specific required elements are present
- Metrics: Quantitative measurements like response time or word count
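To make the list concrete, the sketch below pairs each score type with a sample criterion and the kind of result it might produce. These are purely illustrative examples and do not reflect the actual Plexus score configuration format.

```python
# Illustrative only: hypothetical examples pairing each score type with a
# sample criterion and the kind of result it could produce. Not the actual
# Plexus configuration format.
example_scores = [
    {"type": "question",   "criterion": "Did the agent state the company name?", "result_format": "yes/no"},
    {"type": "sentiment",  "criterion": "Overall tone of the conversation",      "result_format": "positive/negative/neutral"},
    {"type": "compliance", "criterion": "Required disclosure statement present", "result_format": "yes/no"},
    {"type": "metric",     "criterion": "Total word count of the response",      "result_format": "numeric"},
]

for score in example_scores:
    print(f"{score['type']:>10}: {score['criterion']} -> {score['result_format']}")
```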
Score Components
Each score consists of:
- Criteria: What exactly is being evaluated
- Evaluation Method: How the score should be determined
- Response Format: The type of result (yes/no, numeric, categorical, etc.)
- Instructions: Guidelines for consistent evaluation
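As a rough sketch, these components could be modeled as a simple data structure. The class and field names below are hypothetical, chosen only to mirror the list above; they are not the Plexus API.

```python
from dataclasses import dataclass

# Hypothetical sketch mirroring the components listed above; not the Plexus API.
@dataclass
class ScoreDefinition:
    name: str               # e.g. "Agent Introduction"
    criteria: str           # what exactly is being evaluated
    evaluation_method: str  # how the score should be determined (e.g. "llm", "classifier")
    response_format: str    # type of result: "yes/no", "numeric", "categorical", ...
    instructions: str       # guidelines for consistent evaluation

intro_score = ScoreDefinition(
    name="Agent Introduction",
    criteria="Did the agent introduce themselves by stating the company name?",
    evaluation_method="llm",
    response_format="yes/no",
    instructions="Answer 'yes' only if the company name is stated explicitly.",
)
```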
Score Results
Plexus standardizes all score results around a common structure, ensuring consistency and enabling powerful analysis capabilities across different types of evaluations.
Result Structure
Every score result in Plexus contains these core components:
- Result Value: The actual outcome of the evaluation (e.g., "yes"/"no", a numeric score, or a category)
- Explanation: A detailed description of why this result was chosen, providing transparency into the decision-making process. For LLM-based scores, this often includes chain-of-thought reasoning similar to what you might see from reasoning models like OpenAI's o1/o3, Google's "thinking" models, or DeepSeek R1.
- Confidence Level: For applicable scores (like machine learning classifiers or LLM-based evaluations), this indicates how certain the system is about the result. This can be used for filtering, quality control, or triggering human review when needed.
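A minimal sketch of that common structure might look like the following. The class and field names are assumptions made for illustration, not the actual Plexus result schema.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical sketch of the standardized result structure described above;
# field names are illustrative, not the actual Plexus schema.
@dataclass
class ScoreResult:
    value: Union[str, float]            # e.g. "yes"/"no", a numeric score, or a category
    explanation: str                    # why this result was chosen (often chain-of-thought for LLM scores)
    confidence: Optional[float] = None  # 0.0-1.0 when the scorer can estimate certainty

result = ScoreResult(
    value="yes",
    explanation="The agent opened with 'Thank you for calling Acme Corp', which states the company name.",
    confidence=0.92,
)
```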
Using Result Components
The standardized result structure enables powerful workflows:
- Use confidence levels to filter results or trigger additional review for low-confidence evaluations
- Leverage explanations for quality assurance, training, and understanding model decision-making
- Build consistent interfaces and analysis tools that work across all score types
- Create automated workflows based on result values while maintaining full explainability
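For example, a routing step built on the ScoreResult sketch above could flag low-confidence results for human review. The 0.7 threshold and the routing labels are illustrative assumptions, not Plexus features.

```python
# Continuing the hypothetical ScoreResult sketch above.
# The threshold and routing labels are assumptions, not Plexus features.
REVIEW_THRESHOLD = 0.7

def route(result: ScoreResult) -> str:
    """Send low-confidence results to human review; accept the rest automatically."""
    if result.confidence is not None and result.confidence < REVIEW_THRESHOLD:
        return "human_review"
    return "auto_accept"

print(route(result))  # -> "auto_accept" for the 0.92-confidence example above
```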
Using Scores
Scores are organized into scorecards, which group related evaluation criteria together. When you run an evaluation, each score in the scorecard is applied to your content, building a comprehensive assessment.
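Conceptually, running an evaluation is a loop over the scorecard's scores against the same piece of content. Continuing the hypothetical sketches above, the helper below stands in for whatever evaluation method each score is configured to use; none of these names are part of the actual Plexus API.

```python
# Conceptual sketch only: a scorecard as a collection of score definitions,
# each applied to the same content. Not the actual Plexus API.
def run_score(score: ScoreDefinition, content: str) -> ScoreResult:
    """Placeholder evaluator: a real implementation would apply the score's
    configured method (LLM prompt, classifier, rule) to the content."""
    return ScoreResult(value="n/a", explanation=f"Placeholder result for '{score.name}'.")

def evaluate(scorecard: list[ScoreDefinition], content: str) -> dict[str, ScoreResult]:
    """Apply every score in the scorecard to the same content."""
    return {score.name: run_score(score, content) for score in scorecard}

report = evaluate([intro_score], "Thank you for calling Acme Corp...")
```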
Score Versions
Scores in Plexus support versioning, allowing you to track changes to score configurations over time. Each version represents a different configuration of the score, with one version designated as the "champion" (active) version that's used for evaluations.
- Champion Version: The currently active version used for evaluations
- Featured Versions: Versions that are highlighted for importance or reference
- Configuration: Each version contains its own configuration, including prompts, parameters, and other settings
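As a rough illustration of how versioning might be modeled, the sketch below tracks multiple configurations with one champion. The field names are assumptions for illustration, not the actual Plexus data model.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical sketch of score versioning as described above; not the Plexus data model.
@dataclass
class ScoreVersion:
    created_at: datetime
    configuration: dict        # prompts, parameters, and other settings
    is_champion: bool = False  # the active version used for evaluations
    is_featured: bool = False  # highlighted for importance or reference

versions = [
    ScoreVersion(datetime(2024, 1, 5), {"prompt": "v1 prompt"}, is_featured=True),
    ScoreVersion(datetime(2024, 3, 12), {"prompt": "v2 prompt"}, is_champion=True),
]

champion = next(v for v in versions if v.is_champion)
```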
You can view score versions using the CLI command:

```
plexus scores info --scorecard "Example Scorecard" --score "Example Score"
```
This command displays up to 10 versions in reverse chronological order (newest first), showing which version is the champion and which versions are featured.