Skip to content

Rating System

The Arena continuously estimates the relative strength of each model based on multiple match results.

This estimation is performed by the rating system.

Two systems are currently implemented — Elo and Glicko-2 — and either can be selected when configuring an evaluation.

Why Ratings Are Needed

Match results alone cannot stably compare the strength of models:

  • The meaning of a win or loss depends on the opponent's strength
  • A small number of matches is heavily influenced by chance
  • Directly comparing all models would require an enormous number of matches

Using a rating system enables indirect comparison (if A beats B and B beats C, we can infer A > C), allowing stable rankings to be built with fewer matches.

Elo Rating System

The classic rating system used in chess. Simple, easy to understand, and computationally inexpensive.

How It Works

Each model has a rating value (default 1500).

Before each match, the expected win probability is calculated, and ratings shift based on the difference from the actual outcome:

Expected score = 1 / (1 + 10^((opponent rating - own rating) / 400))
  • Expected outcome (stronger model wins) → small rating change
  • Upset (weaker model wins) → large rating change

Elo is a zero-sum game: points gained by the winner equal points lost by the loser.

K-Factor

A parameter that controls the magnitude of rating changes. Models with less match experience get a higher K-factor, resolving initial uncertainty more quickly.

Matches PlayedK-FactorMeaning
0–932 (initial)Unknown models move a lot
10–2916 (default)Normal range of change
30+10 (stable)Stable models move little

Elo's Limitations

Elo only tracks a single rating value, so it carries no information about how reliable that rating is. Two models with the same rating — one with few matches and one with many — should have different uncertainty levels, but Elo cannot distinguish between them.

Glicko-2 Rating System

An extension of Elo by Mark Glickman (2013). Achieves more accurate evaluation by tracking uncertainty and volatility alongside the rating.

Three Parameters

Glicko-2 represents each model with three values:

ParameterSymbolMeaningInitial Value
Ratingμ (mu)Estimated model strength1500
Rating Deviationφ (phi)Rating uncertainty. Larger means less reliable350
Volatilityσ (sigma)Performance consistency. Larger means more erratic results0.06

φ (RD: Rating Deviation) is the core of Glicko-2.

  • New model (few matches): large φ → low confidence in rating
  • Experienced model (many matches): small φ → high confidence in rating

After matches, φ shrinks (uncertainty decreases). During periods without matches, φ grows (performance may have changed).

Expected Score and the g Function

Glicko-2's expected score accounts for the opponent's RD:

g(φ_j) = 1 / √(1 + 3φ_j² / π²)        ← attenuation function for opponent RD
E(μ, μ_j, φ_j) = 1 / (1 + exp(-g(φ_j) × (μ - μ_j)))   ← expected score

Matches against opponents with high RD (low reliability) have a weaker influence on rating updates. This prevents results against unreliable opponents from being over-weighted.

Batch Update (Glicko-2)

Arena evaluation progresses in Step units, with each Step containing multiple Matches (see Terminology).

Per the Glicko-2 specification, all Matches within a single Step are treated as one rating period and processed together. This is called a batch update.

Why Batch Updates Matter

Sequential updates (updating after each Match in order) have these problems:

  • RD over-shrinkage: Models participating in multiple Matches have their RD shrunk after each Match, leading to overconfident estimates
  • Order dependency: Results vary depending on which Match is processed first, preventing deterministic outcomes

With batch updates, all Match opponent information is built using pre-batch ratings (ratings before the Step starts), and all models are then updated together.

Step begins

Snapshot pre-batch ratings for all models

Build opponent information for each Match using pre-batch ratings

Update all models simultaneously using their full opponent lists

Step ends

This ensures deterministic results independent of the order in which Matches complete.

Note: Elo uses sequential updates by default. The difference between sequential and batch updates is small for Elo, so this consideration is specific to Glicko-2.

Pair Selection Algorithms

Which models are matched against each other also significantly affects evaluation efficiency. JuryArena provides pair selection algorithms matched to each rating system.

BaselineStarSelector (for Elo)

A strategy that centers matches around a baseline model, pitting it against all other models.

  • Combines baseline vs. all matches with non-baseline vs. non-baseline matches
  • The nonbaseline_ratio parameter adjusts the proportion of non-baseline matches (0.0–1.0)
  • Prefers pairs with similar ratings (prefer_close_ratings)
  • Avoids repeated recent match-ups (avoid_recent_duplicates)

Glicko2Selector (for Glicko-2)

An information-theoretic selection algorithm that leverages φ (RD). Selects pairs based on Expected Variance Reduction:

Fisher information:           I_ij = g(φ_j)² × E × (1 - E)
Expected variance reduction:  gain = φ⁴ × I / (1 + φ² × I)
Pair score:                   score = gain_i + gain_j

This score naturally reflects:

PropertyReason
Models with larger φ are prioritizedφ⁴ factor
Pairs with similar ratings are more informativeE×(1−E) is maximized when E ≈ 0.5
Matches against reliable opponents are more informativeSmaller opponent RD → higher quality information

The baseline_weight parameter adjusts the ratio between pure Glicko-2-based selection and baseline matches.

Choosing Between Elo and Glicko-2

AspectEloGlicko-2
Uncertainty trackingNoneYes (φ)
Pair selection optimizationBasicInformation-maximizing via φ
Computational costLowMedium (uses Illinois algorithm)
Accuracy with few matchesLowHigh
InterpretabilityHigh (single value)Medium (3 parameters)

Glicko-2 is recommended for most use cases. Its advantage is especially pronounced in the early stages of evaluation when you want to quickly distinguish between models, or when the number of matches is limited.

References