Terminology
Dataset
An Arena evaluation runs against a pre-prepared Dataset as its input.
A Dataset is a collection of Samples, the smallest unit of evaluation.
- Sample: Data processed from an LLM request log used in production, adapted for arena evaluation.
Samples serve as shared inputs for comparing multiple models under identical conditions. They are typically centered around an input prompt and are designed to ensure reproducibility across evaluations.
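As a rough sketch of the Dataset/Sample relationship (the class and field names below are assumptions for illustration, not the Arena's actual schema):

```python
from dataclasses import dataclass, field

# Hypothetical sketch only: field names are assumptions, not the
# Arena's actual schema.
@dataclass(frozen=True)
class Sample:
    sample_id: str
    prompt: str                                    # input prompt shared by all models
    metadata: dict = field(default_factory=dict)   # e.g. source request-log info

@dataclass
class Dataset:
    name: str
    samples: list[Sample]

ds = Dataset(name="prod-logs", samples=[Sample("s1", "Summarize this ticket.")])
print(len(ds.samples))  # → 1
```

Freezing `Sample` reflects the reproducibility goal: the same immutable inputs are fed to every model under comparison.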
Dataset
└─ Sample

Evaluation
An Evaluation is the process of comparing multiple models using Samples as input, then aggregating and quantifying the results. In the Arena, this evaluation process is broken down into several conceptual units.
Relationship Between Match / Trial / Model
- Match: A single 1-on-1 contest comparing a pair of models (Candidate Model A and B).
- Trial: A single inference execution where one model generates a response for one Sample.
- CandidateModel: The model being evaluated.
- JudgeModel: The model that judges and scores responses.
- BaselineModel: An optional model used as a recurring reference point for comparisons.
A Match always compares a pair of models: within it, each model executes one Trial, and the generated outputs are then compared and judged by one or more JudgeModels.
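These relationships could be modeled as follows; all class and field names here are illustrative assumptions, not an actual Arena API:

```python
from dataclasses import dataclass

# Illustrative sketch of Match / Trial / Judgment; names are assumptions.
@dataclass
class Trial:
    model: str    # the CandidateModel that ran the inference
    output: str   # the generated response

@dataclass
class Judgment:
    judge_model: str
    winner: str   # which candidate this JudgeModel preferred

@dataclass
class Match:
    sample_id: str
    trial_a: Trial              # Candidate Model A's Trial
    trial_b: Trial              # Candidate Model B's Trial
    judgments: list[Judgment]   # one entry per JudgeModel (N = 1–3)

match = Match(
    sample_id="s1",
    trial_a=Trial("model-a", "answer A"),
    trial_b=Trial("model-b", "answer B"),
    judgments=[Judgment("judge-1", winner="model-a")],
)
```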
Match (a single 1-on-1 contest comparing a pair of models)
├─ Trial (Candidate Model A)
├─ Trial (Candidate Model B)
└─ Judge (Judge Model) × N (N = 1–3)

Step and Rating
An evaluation doesn't conclude with a single Match. Aggregating the results of many Matches yields a stable estimate of the relative strength between models.
- Step: A unit that runs multiple matches together and updates Ratings based on the win/loss outcomes.
- Rating: A relative performance metric for a model, calculated by aggregating match results. Updated based on wins, losses, scores, and comparison outcomes, and represents the relative strength between models.
One Step contains Matches across multiple Samples. The results of each Match are aggregated, and Ratings are updated at the end of the Step.
Step (unit of Rating update)
└─ Match × N
├─ Sample #1 → Match → Result
├─ Sample #2 → Match → Result
├─ Sample #3 → Match → Result
└─ ...
↓
Rating Update

This structure enables stable evaluation that doesn't over-depend on any particular Sample.
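The document does not specify the rating algorithm, so as one illustration, here is an Elo-style update applied at the end of a Step. The K-factor of 32 and the starting rating of 1500 are conventional Elo assumptions, not Arena specifics:

```python
# Elo-style Rating update, purely as an illustration; the Arena's actual
# rating formula is not specified, and K = 32 is an assumed constant.
K = 32

def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under an Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def run_step(ratings: dict[str, float],
             results: list[tuple[str, str, float]]) -> dict[str, float]:
    """One Step: aggregate Match results, then update Ratings at the end.

    Each result is (model_a, model_b, score_a), where score_a is
    1.0 for an A win, 0.5 for a draw, and 0.0 for a B win.
    """
    deltas = {m: 0.0 for m in ratings}
    for a, b, score_a in results:
        e_a = expected_score(ratings[a], ratings[b])
        deltas[a] += K * (score_a - e_a)
        deltas[b] += K * ((1.0 - score_a) - (1.0 - e_a))
    # Ratings are only touched once all Matches in the Step are aggregated.
    for m, d in deltas.items():
        ratings[m] += d
    return ratings

ratings = run_step({"model-a": 1500.0, "model-b": 1500.0},
                   [("model-a", "model-b", 1.0)])
print(ratings["model-a"])  # → 1516.0 (A won; its expected score was 0.5)
```

Accumulating deltas and applying them only at the end mirrors the Step semantics above: all Matches in a Step are aggregated before Ratings change.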
Cost Metrics
In addition to performance evaluation, the Arena places importance on cost visibility.
Total evaluation cost is tracked separately for candidate (inference) models and judge models.
- TotalCost: The sum of all model response costs incurred during the evaluation. Includes inference costs for candidate models and judgment costs for the Judge.
- TrialCost: The total inference cost incurred for responses generated by candidate models in each Trial.
- JudgeCost: The total inference cost incurred for outputs generated by the JudgeModel to judge the results of each Trial.
These metrics allow you to clearly distinguish between "high-performance but expensive models" and "models with a good performance-to-cost balance."
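The breakdown above can be sketched in a few lines; the helper name and the cost values are illustrative assumptions:

```python
# TotalCost = candidate inference costs (TrialCost) + judge costs (JudgeCost).
# Function name and figures are illustrative, not an actual Arena API.
def total_cost(trial_costs: list[float], judge_costs: list[float]) -> float:
    return sum(trial_costs) + sum(judge_costs)

trial_costs = [0.004, 0.006]   # TrialCost: one entry per candidate Trial
judge_costs = [0.002]          # JudgeCost: one entry per judgment
print(round(total_cost(trial_costs, judge_costs), 6))  # → 0.012
```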
TopRatingModel
- TopRatingModel: The model with the highest Rating.
TopRatingModel indicates the model currently ranked highest in the Arena.
This is strictly the result of relative evaluation and is expected to be updated as the Dataset or evaluation conditions change.
Coverage
Coverage indicates how completely the evaluation succeeded. It represents the degree of evaluation completion and execution resilience, independently of Rating (strength).
Session Coverage
The proportion of Matches that actually completed out of all planned Matches in the evaluation session.
Session Coverage = Completed Matches / Planned Matches

Indicates the overall reliability of the evaluation.
Model Coverage
The proportion of Matches a specific model completed out of all Matches it participated in.
Model Coverage = Completed Matches / Participated Matches

Indicates the execution stability of a model.
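Both ratios follow the same formula; a minimal sketch (the counts are illustrative):

```python
# Coverage = Completed Matches / denominator, per the formulas above.
# The denominator is Planned Matches for Session Coverage and
# Participated Matches for Model Coverage. Counts are illustrative.
def coverage(completed: int, total: int) -> float:
    return completed / total if total else 0.0

print(coverage(97, 100))  # Session Coverage → 0.97 (97 of 100 planned Matches completed)
print(coverage(48, 50))   # Model Coverage  → 0.96 (one model completed 48 of its 50 Matches)
```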
Relationship to Rating
- Rating is only updated from completed Matches
- Incomplete Matches (No Contest) do not affect Rating, only Coverage
This allows strength (Rating) and execution resilience (Coverage) to be evaluated separately.