SupraBench — Community-Driven AI Model Rankings

Every public AI leaderboard tells you something — but each one tells you something different, and many tell you almost nothing useful. Three failure modes are everywhere:

Contamination — the test set leaked into the training data, so the score doesn't measure capability, it measures memorisation.
Saturation — every frontier model scores 99 %, so the bench has no resolving power left, but it still dominates aggregate scores out of habit.
Bench-maxing — a model is tuned for a small set of popular benches and looks great there while being unusable in practice.

SupraBench gives every bench a community-voted quality and difficulty score, and adds an automatic saturation penalty on top. The result: a single number per model that respects how trustworthy and how informative each underlying benchmark actually is.

The Bench Score is the headline number shown on every bench page. It answers a single question: "how much should this benchmark count when ranking models?" A high-quality, hard, unsaturated bench that the community endorses should count more than a one-rater vanity bench on a trivial task.

It's a single number on $[0, 100]$ built from four independent factors. We define each variable first, then put them together.

Variables

$b$ — a single benchmark.
$Q(b) \in [0, 100]$ — the bench's quality: how trustworthy the score is. Mean of four community ratings (relevance, contamination resistance, discriminability, reproducibility) on a 1–5 scale, multiplied by 20. Defaults to $50$ when nobody has rated yet. (Full breakdown in "What exactly are the five bench dimensions?" below.)
$D(b) \in [0, 1]$ — the bench's difficulty: how much general intelligence it actually probes. Median of community 1–5 ratings, linearly mapped to $[0, 1]$ (a vote of $1$ becomes $0$, a vote of $5$ becomes $1$). Defaults to $0.5$ when un-rated.
$H(b) \in [0, 1]$ — the bench's headroom: how much measurement signal it has left before it saturates. Computed automatically from the top-$K$ frontier mean on $b$ — fully open ($H = 1$) until at least 3 models are evaluated, shrinks toward a floor of $0.1$ as the frontier mean climbs from $50$ toward $100$. (Full formula in "What is headroom and why does it exist?" below.)
$u_b \in \mathbb{Z}_{\geq 0}$ — the bench's net upvotes: number of distinct upvoting accounts minus number of distinct downvoting accounts on the bench's existence-vote, floored at $0$.
$U^\star = \max_{b'} u_{b'}$ — the maximum $u_b$ across every non-hidden bench in the system. Bootstrap: if $U^\star = 0$ (brand-new deployment, no votes anywhere), this factor is disabled for everyone.

The formula

\operatorname{BenchScore}(b) \;=\; \underbrace{Q(b)}_{\text{quality}} \;\cdot\; \underbrace{D(b)}_{\text{difficulty}} \;\cdot\; \underbrace{H(b)}_{\text{headroom}} \;\cdot\; \underbrace{u_b / U^\star}_{\text{upvote share}}

Quality is on $[0, 100]$; the three other factors are all on $[0, 1]$, so the Bench Score stays naturally on $[0, 100]$ — the same scale as the per-model scores.
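The product above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the site's actual code: the `BenchFactors` shape and the function name are hypothetical, and treating the "disabled" bootstrap case as an upvote share of 1 is an assumption.

```typescript
// Hypothetical input shape; field names are illustrative, not the real schema.
interface BenchFactors {
  quality: number;    // Q(b) in [0, 100]
  difficulty: number; // D(b) in [0, 1]
  headroom: number;   // H(b) in [0, 1]
  netUpvotes: number; // u_b, already floored at 0
}

function benchScore(b: BenchFactors, maxNetUpvotes: number): number {
  // Bootstrap rule: with no votes anywhere (U* = 0) the upvote factor is
  // disabled, modelled here (assumption) as a share of 1 for every bench.
  const upvoteShare = maxNetUpvotes > 0 ? b.netUpvotes / maxNetUpvotes : 1;
  return b.quality * b.difficulty * b.headroom * upvoteShare;
}

// An established bench versus an identically rated 1-upvote vanity bench.
const leader = benchScore({ quality: 75, difficulty: 1, headroom: 0.7, netUpvotes: 100 }, 100);
const vanity = benchScore({ quality: 75, difficulty: 1, headroom: 0.7, netUpvotes: 1 }, 100);
console.log(leader, vanity);
```

Because the four factors multiply, any single factor at zero removes the bench from the ranking entirely, and the vanity bench ends up at one hundredth of the leader's score.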

What each factor does

Quality $Q(b)$. If the community thinks the bench is contaminated, trivia-grade, or non-reproducible, $Q$ shrinks and the bench's whole weight shrinks with it. A bench rated 1/5 across the board has $Q = 20$, fivefold less weight than a unanimous 5/5.
Difficulty $D(b)$. A bench rated 1/5 in difficulty has $D = 0$ and is silently filtered out — scoring 100 % on a trivial bench earns the model nothing. A bench rated 5/5 has $D = 1$ and is counted at full weight.
Headroom $H(b)$. Once a bench is solved (top frontier models all near 100), $H$ shrinks toward the $0.1$ floor, so the bench stops dominating the leaderboard the moment it loses resolving power. This makes ARC-AGI 3 → 4 hand-offs automatic, no manual retirement needed.
Upvote share $u_b / U^\star$. Without this, anyone could mint a brand-new bench, self-rate it to the maximum ($Q = 100$, $D = 1$, with $H = 1$ by default while fewer than 3 models are evaluated) and immediately appear at #1 on the bench leaderboard. With it, a 1-upvote vanity bench at $u_b = 1$ against an established $U^\star = 100$ is worth $1\,\%$ of an equally well-rated leader. The most-upvoted bench always has share = 1, so there's no self-penalty at the top.

The upvote share is linear on purpose: user trust should be able to dominate model coverage. Distinct-model coverage is still used later as evidence confidence, but it does not directly reduce the bench's ability weight.

A model's SupraScore is the headline number on every model page and the primary leaderboard sort key. It builds on top of the Bench Score from the previous section — so if you haven't read that one yet, start there. Conceptually, it answers: "how well does this model perform on the benchmarks the community trusts, adjusted by how much evidence we have?"

The computation runs in five steps: three preparatory steps turn raw submissions into a usable per-(model, bench) score, then two scoring steps combine those into the final SupraScore.

Foundations — submission to per-pair score

Step 1 — Normalise each submission. Every bench has its own scale (e.g. 0–100, 0–10, 0–1). To compare scores across benches we map every raw submission to a common $[0, 100]$ scale using the bench's declared min and max.

$s_{\text{raw}}$ — a single raw submission, on the bench's native scale.
$s_{\min}, s_{\max}$ — the bench's declared lower and upper bounds.
$s_{\text{norm}} \in [0, 100]$ — the same submission, rescaled.

s_{\text{norm}} \;=\; \frac{s_{\text{raw}} - s_{\min}}{s_{\max} - s_{\min}} \cdot 100

In plain words: "where does this raw number sit between the bench's min and max, expressed as a percentage."
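As a rough sketch, the rescaling is a single expression. The function name is hypothetical, and clamping out-of-range submissions to $[0, 100]$ is an assumption; the text does not say how a raw value outside the declared bounds is handled.

```typescript
// Map a raw submission onto the common [0, 100] scale.
function normalise(raw: number, min: number, max: number): number {
  if (!(max > min)) {
    throw new RangeError("bench bounds must satisfy max > min");
  }
  const pct = ((raw - min) / (max - min)) * 100;
  // Assumption: values outside the declared bounds are clamped.
  return Math.min(100, Math.max(0, pct));
}

console.log(normalise(0.5, 0, 1));  // a 0-1 bench
console.log(normalise(5, 0, 10));   // a 0-10 bench
console.log(normalise(50, 0, 100)); // a 0-100 bench
```

All three calls describe the same relative position on their bench, so they land on the same normalised value.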

Step 2 — Validate via community votes. A submission only counts toward any score if its net vote is positive. The submitter gets an implicit $+1$, so a fresh submission starts at $1/0$. Wrong, duplicate, or fraudulent submissions get downvoted out of the calculation and stop contributing.

Step 3 — Take the median per (model, bench) pair. The same model is often submitted multiple times on the same bench (different runs, different prompts, different setups). We collapse all valid normalised submissions for a given $(m, b)$ pair into one number using the median.

$\mathcal{S}_{m,b}$ — the set of all community-validated normalised submissions ($s_{\text{norm}}$ values) for model $m$ on bench $b$.
$\mu_{m,b} \in [0, 100]$ — the median of that set. In plain words: the score of model $m$ on bench $b$. One number per (model, bench) cell.

\mu_{m,b} \;=\; \operatorname{median}\bigl(\mathcal{S}_{m,b}\bigr)

Median, not mean — one outlier submission (a fluke run, a misconfigured prompt, an outright fake that hasn't been downvoted yet) cannot shift the result.
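Steps 2 and 3 can be sketched together: drop every submission whose net vote is not positive, then take the median of what remains. The `VotedSubmission` shape is hypothetical, and averaging the two middle values for an even count is an assumption about the median convention.

```typescript
// Hypothetical shape: `up` already includes the submitter's implicit +1.
interface VotedSubmission {
  norm: number; // normalised score in [0, 100]
  up: number;
  down: number;
}

function median(values: number[]): number {
  if (values.length === 0) throw new RangeError("median of an empty set");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Assumption: an even count averages the two middle values.
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Returns null when no submission survives validation, so the
// (model, bench) cell simply does not exist.
function pairScore(subs: VotedSubmission[]): number | null {
  const valid = subs.filter((s) => s.up - s.down > 0).map((s) => s.norm);
  return valid.length > 0 ? median(valid) : null;
}

const cell = pairScore([
  { norm: 70, up: 1, down: 0 },
  { norm: 72, up: 3, down: 0 },
  { norm: 74, up: 2, down: 1 },
  { norm: 100, up: 1, down: 4 }, // downvoted out, ignored
]);
console.log(cell); // 72
```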

Variables — model side

$m$ — a single model.
$\mathcal{B}_m$ — the set of benches on which $m$ has at least one community-validated score (i.e. at least one $\mu_{m,b}$ exists).
$\mu_{m,b} \in [0, 100]$ — the score of model $m$ on bench $b$, defined above.
$\operatorname{BenchScore}(b) \in [0, 100]$ — the user-trust-adjusted weight from the previous section. Used here as the ability weight for bench $b$ in the average.
$\bar{\mu}_m \in [0, 100]$ — the model's weighted mean across all its evaluated benches (computed in step 4 below).
$N_b$ — how many distinct models have at least one community-validated score on bench $b$.
$N^\star = \max_{b'} N_{b'}$ — the largest distinct-model count among non-hidden benches.
$E_m \;=\; \displaystyle\sum_{b \in \mathcal{B}_m} \operatorname{BenchScore}(b) \cdot \sqrt{N_b/N^\star}$ — the model's evidence weight. In plain words: how much trusted measurement, adjusted for comparative breadth, has been thrown at this model.
$E^\star \;=\; \max_{m'} E_{m'}$ — the maximum $E_m$ across every non-hidden model. Defines the highest-evidence reference point.

Step 4 — The weighted mean

In plain words: for every bench the model has been tested on, multiply the model's score on that bench by how much that bench is worth (its Bench Score), sum it all up, then divide by the total Bench-Score weight the model has accumulated.

\bar{\mu}_m \;=\; \frac{\displaystyle \sum_{b \in \mathcal{B}_m} \mu_{m,b} \cdot \operatorname{BenchScore}(b)}{\displaystyle \sum_{b \in \mathcal{B}_m} \operatorname{BenchScore}(b)}

Why weighted and not a simple mean? A simple mean (sum the scores, divide by the number of benches) treats every bench equally. So a saturated trivia bench at $\operatorname{BenchScore} = 1$ would count just as much as a fresh hard frontier bench at $\operatorname{BenchScore} = 50$. That gives a model-runner a free path to the top: pad the model with high scores on dozens of cheap benches and watch the average climb. The weighted mean kills that — a bench worth $1$ point of weight contributes $50$× less to both numerator and denominator than a bench worth $50$ points, so its influence on the average is exactly its share of the total weight.

And why does the denominator divide by the sum of weights, not by the bench count? Because that's what makes the result land back on the same $[0, 100]$ scale as the individual $\mu_{m,b}$ values. A weighted average of numbers in $[0, 100]$ is itself in $[0, 100]$ only if you divide by the sum of the weights. Dividing by the count would give you a number that scales with how much weight you've accumulated — but that's a separate question (covered in step 5), not the question this step is answering.

Think of step 4 as "how good is the model on the benches it actually was tested on" — pure performance intensity, normalised to $[0, 100]$. Step 5 below adds the second axis: "how confident are we in that estimate?" Two orthogonal pieces of information, separated by design.

Benches with $\operatorname{BenchScore}(b) = 0$ (zero difficulty, zero quality, etc.) are excluded from both numerator and denominator. They contribute nothing in either direction — neither help nor hurt.
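A minimal sketch of step 4, including the zero-weight exclusion. The `ScoredCell` shape is hypothetical, and returning `null` for a model with no usable weight is an assumption.

```typescript
// One entry per bench the model has a validated score on.
interface ScoredCell {
  score: number;  // mu_{m,b} in [0, 100]
  weight: number; // BenchScore(b) in [0, 100]
}

function weightedMean(cells: ScoredCell[]): number | null {
  // Zero-weight benches drop out of numerator and denominator alike.
  const used = cells.filter((c) => c.weight > 0);
  const totalWeight = used.reduce((acc, c) => acc + c.weight, 0);
  if (totalWeight === 0) return null; // assumption: no usable bench, no mean
  const weightedSum = used.reduce((acc, c) => acc + c.score * c.weight, 0);
  return weightedSum / totalWeight;
}

// Padding with a cheap bench barely moves the result:
// the simple mean would be 75, the weighted mean stays near 60.
const padded = weightedMean([
  { score: 90, weight: 1 },  // saturated trivia bench
  { score: 60, weight: 50 }, // hard frontier bench
]);
console.log(padded);
```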

Step 5 — Evidence confidence

In plain words: shrink the weighted mean from step 4 toward the neutral 50-point midpoint according to the model's evidence share. The model with the highest evidence keeps its full weighted mean; a sparse model is treated as uncertain, not bad.

\operatorname{SupraScore}(m) \;=\; 50 \;+\; \sqrt{\dfrac{E_m}{E^\star}} \cdot \left(\bar{\mu}_m - 50\right)

If $E_m = E^\star$ the confidence factor is exactly $1$ (the highest-evidence model has no self-penalty). If $E_m = \tfrac{1}{4} E^\star$ the factor is $\sqrt{0.25} = 0.5$ — the estimate moves halfway from neutral 50 toward the model's weighted mean.

Why does this exist on top of step 4? Without it, a model with one extremely good score on one favourable bench could beat a well-covered rival too easily. But pushing sparse models toward $0$ is too harsh: if a model is #1 on the most trusted benches, missing scores on weak benches should not be treated as negative evidence. Shrinking toward $50$ means low evidence stays cautious without erasing the signal.

The $\sqrt{\cdot}$ shape mirrors the statistical $1/\sqrt{N}$ standard-error falloff: halving your evidence doesn't halve the distance from neutral, it shrinks it by a factor of $\sqrt{2} \approx 1.41$. Hidden models are excluded from the $E^\star$ maximum so a mothballed flagship can't permanently squash fresh entrants.

A consequence to be aware of: a model's SupraScore can move when other models gain evidence. A well-tested new flagship that raises $E^\star$ shifts everyone else's confidence share proportionally. This is deliberate — evidence is relative to what else has been tested. See the worked example below for concrete numbers.
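The evidence weight and the shrinkage toward 50 can be sketched as follows. Type and function names are hypothetical, and falling back to the neutral 50 when no model has any evidence is an assumption. The last lines illustrate the consequence described above: the same model scores lower once another model raises $E^\star$.

```typescript
// Hypothetical per-bench input for the evidence sum.
interface BenchEvidence {
  benchScore: number; // BenchScore(b)
  modelCount: number; // N_b, distinct models with a validated score
}

// E_m = sum of BenchScore(b) * sqrt(N_b / N*)
function evidenceWeight(benches: BenchEvidence[], maxModelCount: number): number {
  return benches.reduce(
    (acc, b) => acc + b.benchScore * Math.sqrt(b.modelCount / maxModelCount),
    0,
  );
}

function supraScore(mean: number, evidence: number, maxEvidence: number): number {
  if (maxEvidence <= 0) return 50; // assumption: no evidence anywhere
  const confidence = Math.sqrt(evidence / maxEvidence);
  return 50 + confidence * (mean - 50);
}

console.log(supraScore(80, 25, 25));  // 80: the most-evidenced model keeps its mean
console.log(supraScore(80, 25, 100)); // 65: a quarter of the evidence, half the distance
console.log(supraScore(80, 25, 400)); // 57.5: E* rose, the same model moves toward 50
```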

What exactly are the five bench dimensions?

Each bench is rated on a 1–5 scale on five orthogonal axes. The first four feed quality; the fifth is difficulty.

Relevance — does it measure something useful for real-world model use? High = meaningful task. Low = trivia, toy puzzles.
Contamination Resistance — how sure are we the test set wasn't in the training data? High = held-out, rotating, freshly generated. Low = scraped from the public web before model release.
Discriminability — does it separate weak from strong models? High = wide score spread. Low = everyone scores 95–100.
Reproducibility — can two independent runs reach the same conclusion? High = deterministic, well-specified. Low = vague setup, judge-LLM, hand-graded.
Difficulty — how much general intelligence does it actually probe? 1 = trivial. 5 = approaches frontier capability.

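Quality is the mean of the first four ratings (relevance $R$, contamination resistance $C$, discriminability $S$, reproducibility $P$), multiplied by 20: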

Q(b) \;=\; \frac{R + C + S + P}{4} \cdot 20 \quad \in [0,\,100]

Difficulty uses the median across raters (more robust to a single inflated vote), then linearly scaled to $[0,1]$:

D(b) \;=\; \max\!\left(0,\; \min\!\left(1,\; \frac{\operatorname{median}(d_i) - 1}{4}\right)\right)

If nobody has rated a bench yet, it defaults to neutral: $Q = 50$, $D = 0.5$.
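Both ratings can be sketched in TypeScript. The `QualityRating` shape is hypothetical, and the sketch assumes every rater scores all four quality dimensions, so averaging per rater and then across raters equals averaging per dimension.

```typescript
// One rater's 1-5 votes on the four quality dimensions.
interface QualityRating {
  relevance: number;
  contamination: number;
  discriminability: number;
  reproducibility: number;
}

function quality(ratings: QualityRating[]): number {
  if (ratings.length === 0) return 50; // un-rated default
  const perRater = ratings.map(
    (r) => (r.relevance + r.contamination + r.discriminability + r.reproducibility) / 4,
  );
  const mean = perRater.reduce((a, b) => a + b, 0) / perRater.length;
  return mean * 20;
}

function difficulty(votes: number[]): number {
  if (votes.length === 0) return 0.5; // un-rated default
  const sorted = [...votes].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Assumption: an even count averages the two middle votes.
  const med = sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  return Math.min(1, Math.max(0, (med - 1) / 4));
}

// A single inflated 5-star vote on a trivial bench does not move the median.
console.log(difficulty([1, 1, 1, 5])); // 0
```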

What is headroom and why does it exist?

Imagine ARC-AGI 3 just got solved: every frontier model scores 99 %. The bench has stopped measuring intelligence; it now just measures "is your model in the frontier club". A score on it should contribute almost nothing to a model's SupraScore.

The naïve fix would be: ask the community to retroactively lower its quality. Nobody does that consistently. So we automate it — every bench has a headroom factor that shrinks as the bench saturates:

The frontier-mean approach

Top-1 alone is too noisy: one freakishly good model would tank the bench's weight for everyone after a single submission. So we use the mean of the top-K models, which is robust to outliers:

N \;=\; \bigl|\{\, m : \exists \text{ valid score for } (m,b)\}\bigr|, \qquad K \;=\; \min(10,\, N)

f(b) \;=\; \frac{1}{K} \sum_{m \in \operatorname{top}_K(b)} \mu_{m,b}

where $\operatorname{top}_K(b)$ are the $K$ models with the highest per-model median on $b$.

From frontier-mean to headroom

Two regimes. With fewer than 3 evaluated models we don't have enough signal to claim saturation, so we don't penalise:

H(b) \;=\; \begin{cases} 1.0 & \text{if } N < 3 \\[6pt] \max\!\left(0.1,\; \dfrac{100 - \max(f(b),\, 50)}{50}\right) & \text{otherwise} \end{cases}

The pivot at 50 means a bench stays "fully open" until the mean of its top-K models rises above mid-range. The floor at 0.1 keeps historic benches alive: they never disappear, they just stop dominating.

Trajectory

Scenario	$N$	$f(b)$	$H(b)$
Brand-new bench, 1 model @ 92	1	—	1.00
3 models, top-3 mean 70	3	70	0.60
10 models, top-10 mean 84	10	84	0.32
30 models, top-10 mean 91	30	91	0.18
50 models, top-10 mean 99	50	99	0.10 (floor)
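The headroom rule and the trajectory table can be reproduced with a short sketch. The function name and its input (one per-model median per evaluated model) are hypothetical.

```typescript
// modelMedians: one mu_{m,b} per model with a valid score on the bench.
function headroom(modelMedians: number[]): number {
  const n = modelMedians.length;
  if (n < 3) return 1.0; // too few models to claim saturation
  const k = Math.min(10, n);
  const topK = [...modelMedians].sort((a, b) => b - a).slice(0, k);
  const frontierMean = topK.reduce((a, b) => a + b, 0) / k;
  // Pivot at 50, floor at 0.1.
  return Math.max(0.1, (100 - Math.max(frontierMean, 50)) / 50);
}

const repeat = (value: number, count: number): number[] =>
  Array.from({ length: count }, () => value);

console.log(headroom([92]));           // 1: brand-new bench, no penalty yet
console.log(headroom(repeat(70, 3)));  // about 0.60
console.log(headroom(repeat(84, 10))); // about 0.32
console.log(headroom(repeat(99, 50))); // 0.1: the floor
```

Only the top ten medians enter the frontier mean, so a long tail of weak models cannot keep a solved bench artificially open.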

This is what makes the ARC-AGI 3 → 4 hand-off automatic. Nobody has to "retire" the old bench manually.

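Take a hypothetical model "Claude X" with valid medians on three benches. To keep the arithmetic short, assume every bench has upvote share $u_b / U^\star = 1$, so the weight is $w = Q \cdot D \cdot H$:

Bench	$\mu_{m,b}$	$Q$	$D$ (median vote)	$N$	$f(b)$	$H$	$w$	$w \cdot \mu$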
MMLU (saturated)	88	80	0.50 (3/5)	50	96	0.10	4.0	352
ARC-AGI 4 (fresh)	72	75	1.00 (5/5)	8	65	0.70	52.5	3 780
TrivialBench	99	60	0.00 (1/5)	20	99	0.10	0	0
Σ							56.5	4 132

\bar{\mu}_{\text{Claude X}} \;=\; \frac{4\,132}{56.5} \;\approx\; 73.1

That's the weighted mean. Now apply the evidence-confidence step. Suppose the most-evidenced model in the system has $E^\star = 100$ and the largest bench has $N^\star = 50$ models. Claude X's evidence weight is $E_m = 4.0 \cdot \sqrt{50/50} + 52.5 \cdot \sqrt{8/50} = 4.0 + 21.0 = 25.0$ (TrivialBench adds nothing because its weight is 0), so its confidence factor is $\sqrt{25/100} = 0.5$:

\operatorname{SupraScore}(\text{Claude X}) \;=\; 50 + 0.5 \cdot (73.1 - 50) \;\approx\; 61.6

If Claude X were the most-evidenced model ($E^\star = E_m = 25.0$), the factor would be $1$ and the SupraScore would equal the weighted mean $73.1$ exactly.

Four takeaways from this:

The frontier bench dominates ($w = 52.5$) because it's hard, trustworthy, and not yet saturated.
The saturated bench still counts a little ($w = 4$) but contributes less than 10 % of the total weight.
The trivial bench is silently filtered out by $D = 0$ — scoring 99 % on it earns the model exactly nothing.
The evidence-confidence step means a model that's been tested on only a small fraction of the bench pool is treated as uncertain, but missing weak benches no longer pushes it toward 0.

Anything the community can get wrong, the community can correct via voting. Five layers:

Submission votes — every individual score is up/downvoted. Only net-positive submissions count toward the SupraScore.
Tag votes — each tag on a model or bench is voted independently. Net positive keeps it in the canonical tag set.
Existence votes — fakes, duplicates, low-quality entries can be downvoted into a hidden state (see formula below).
Quality & difficulty ratings — anyone signed in can rate a bench on all five dimensions. Averaged across raters (mean for trust, median for difficulty).
Per-bench scaling — a bench with one corrupt rater barely moves; the more raters, the harder it is to game.

When does an entity get hidden?

The hide threshold scales with engagement, so a small mob can't kick a well-established bench, but spam still goes away fast:

\text{hide}(e) \;\Leftrightarrow\; \operatorname{down}(e) \;\geq\; \max\bigl(5,\, \lceil 0.6 \cdot (\operatorname{up}(e) + \operatorname{down}(e)) \rceil\bigr) \;\;\wedge\;\; \operatorname{down}(e) > \operatorname{up}(e)

A 5-downvote floor protects against drive-by spam. The 60 % ratio means that to remove a bench with 100 upvotes, you'd need at least 150 downvotes (the smallest $d$ satisfying $d \geq 0.6 \cdot (100 + d)$), which is much harder than mustering 4 sock puppets.
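The hide rule is easy to check in code. The sketch below uses a hypothetical function name and writes the 60 % ratio as the integer expression `3 * total / 5` to avoid floating-point rounding inside the ceiling; that rewrite is an implementation choice of this sketch, not something the text specifies.

```typescript
function isHidden(up: number, down: number): boolean {
  // ceil(0.6 * (up + down)), computed with integers.
  const ratioThreshold = Math.ceil((3 * (up + down)) / 5);
  const threshold = Math.max(5, ratioThreshold);
  return down >= threshold && down > up;
}

console.log(isHidden(0, 4));     // false: below the 5-downvote floor
console.log(isHidden(0, 5));     // true: fresh spam disappears quickly
console.log(isHidden(100, 149)); // false: threshold is ceil(0.6 * 249) = 150
console.log(isHidden(100, 150)); // true: 150 >= ceil(0.6 * 250) = 150
```

For an entry with `up` upvotes the ratio condition solves to `down >= 1.5 * up`, so the cost of a takedown grows linearly with the support the entry already has.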

Anti-resurrection

If you submit a model or bench under your name and it gets community-removed, you cannot re-submit it under the same name. Other users can re-submit it (with a numeric suffix on the slug), and the community votes on the new version independently.

Adversarial robustness is built into the formula at every layer:

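One submission can't anchor a score: the per-(model, bench) median becomes robust to a single outlier once honest submissions outnumber it. With $n = 2$ and the usual midpoint convention the median sits halfway between the two values, so an attacker's number is only diluted; from $n \ge 3$ (two honest submissions plus one attacker) the median lands inside the honest range.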
One bench can only carry a model if users trust it — the SupraScore is averaged across all benches a model has scores on, weighted by quality × difficulty × headroom × community upvote share. Sparse evidence is then pulled toward the neutral 50-point midpoint instead of being treated as a full-strength claim.
One user can't carry a bench — every bench's contribution to its own headline score and to any model's SupraScore is multiplied by $u_b/U^\star$ where $u_b$ is its net upvote count and $U^\star$ is the leader's. A self-rated 100/100 vanity bench from a single account is worth $1/U^\star$ of an established bench at the same Q·D·H — you'd need $U^\star$ separate accounts upvoting it to even tie. Same defence works on the bench leaderboard and in the SupraScore aggregate, so you can't spawn a bench just to pump one model.
Single-model benches stay provisional — distinct model count is not allowed to override user trust in the Bench Score, but it still enters the evidence-confidence term. A high-upvote specialist bench can matter immediately; a one-account, one-model vanity bench stays low-evidence until other models and users engage with it.
Difficulty uses median, not mean — a single 5-star rater on a trivial bench can't fake difficulty.
Saturation auto-detected — pumping a saturated bench gives diminishing returns by construction.
Engagement-aware hide threshold — small voting cliques can't take down established entries.
Anti-resurrection — re-submitting your own removed entries under the same name is blocked.
Rate limiting — submissions are capped at 30 individual scores per 24 h per user.

None of these is bulletproof on its own. Together they make systemic gaming expensive enough that legitimate contribution is the cheaper path. Every defensive claim above is encoded as an executable invariant or attack scenario in tests/convex/adversarial-robustness.test.ts. The harness has three layers (invariants, an attack catalog, and a deterministic-PRNG fuzzer) and runs on every CI build, so a regression in the math fails a test instead of going unnoticed. One scenario (`A3-extreme`, an industrial-scale 8+ vanity bench farm) explicitly documents an attack that the pure math does not defend against; it is blocked operationally by the rate-limit, downvote, anti-resurrection and moderation rules above.

Any URL is a valid source — but only well-known academic, lab, and dedicated leaderboard hosts get an "Official source" badge. Everything else (YouTube, Substack, X/Twitter, personal blogs) is a valid community source and gets a "Community" badge instead. The submission is otherwise treated identically; the badge is a transparency signal, not a gatekeeper.

The whitelist lives in convex/urls.ts in the public source repository on GitLab — suggestions welcome via issues.

The full source code of SupraBench, including the ranking math you just read, is published under the Business Source License 1.1 (BSL):

gitlab.com/florian-fischer-group/suprabench ↗

Source-available, not OSI-open-source. The BSL is a "source-available" licence: you can read, audit, fork for research, learning and non-commercial use, patch and redistribute — but the licence does not meet the OSI Open Source Definition because it carves out commercial competing-service use until the Change Date of 2029-01-01. On that date the entire codebase auto-converts to Apache License 2.0 and becomes plain open source under that name. We use this exact wording instead of the unqualified "open source" label because the OSI / FSF community is — rightly — strict about that distinction.

Issues, feature requests and merge requests are welcome on GitLab. Imprint, privacy and terms in the sidebar footer.

A family is one specific lab release — not a vendor, a generation or an architecture line. "Claude Opus 4.6" and "Claude Opus 4.7" are separate families. "Claude" is not a family at all; it's a brand.

The reason is practical: a 0.1 version bump at a frontier lab usually means a different training run, different parameter count and measurably different benchmark behaviour. Grouping them defeats the point of a family ranking.

Variants go in the model name, not the family

When a lab ships the same release at different effort levels — sampling temperature, reasoning mode, context window, fine-tune — we keep them in the same family and disambiguate with a parenthetical suffix on the model's display name:

Family	Variants (model name)	What the suffix means
`Claude Opus 4.7`	`Claude Opus 4.7`, `Claude Opus 4.7 (max)`	Lab-branded "best-effort" sampling
`GPT-5.3 Codex`	`… (low)`, `(med)`, `(high)`, `(xhigh)`	Reasoning-effort level
`Gemini 3.1`	`Gemini 3.1`, `Gemini 3.1 (thinking)`	Explicit reasoning mode on/off

Common suffixes you'll see: (low) / (med) / (high) / (xhigh), (thinking), (max), (128k) / (200k) / (1M) for context-length SKUs, (instruct) / (chat) / (base) for open-weight post-training variants.

Family ranking ≠ average of member models

When you click a family tag in the rankings table, we rebuild the SupraScore from the union of the family's member models per bench: for each bench any member scored on, we take the family's median-of-medians and weight it by the bench's own weight, exactly as we do for individual models. The evidence-confidence step then uses the max over all families (not over all models) as denominator so family scores stay on the same $[0,100]$ scale. A family with more variants doesn't automatically win — the per-bench median is still robust to one outlier variant.
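A minimal sketch of the family-level weighted mean, under stated assumptions: the `FamilyBench` shape is hypothetical, the even-count median averages the two middle values, and the evidence-confidence step is left out.

```typescript
// For one bench: its weight and the per-model medians of every
// family member that has a validated score on it.
interface FamilyBench {
  weight: number;          // BenchScore(b)
  memberMedians: number[]; // mu_{m,b} for each member m
}

function medianOf(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function familyWeightedMean(benches: FamilyBench[]): number | null {
  let weightedSum = 0;
  let totalWeight = 0;
  for (const b of benches) {
    if (b.weight <= 0 || b.memberMedians.length === 0) continue;
    // Median-of-medians: one outlier variant cannot drag the family score.
    weightedSum += medianOf(b.memberMedians) * b.weight;
    totalWeight += b.weight;
  }
  return totalWeight > 0 ? weightedSum / totalWeight : null;
}

const family = familyWeightedMean([
  { weight: 50, memberMedians: [70, 74, 90] }, // three variants, median 74
  { weight: 10, memberMedians: [60] },         // only one variant scored here
]);
console.log(family);
```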

What if the community disagrees with a grouping?

Anyone signed in can edit a model's familyTag via the model-detail page. Rankings refresh on the next tick. There's no canonical list and no admin curation — if the community thinks "GPT-5.3 Codex (high)" is actually a separate model from "(low)", they can split it off by assigning a new family tag to the "(high)" variant.

A one-off release with no variants and no expected follow-ups is fine to have familyTag == name — the family ranking then shows it with modelCount: 1.

Tags exist on two different things on SupraBench, and they play two different roles:

Bench tags describe what the bench measures — reasoning, code, multilingual, vision, safety. These are structural: they decide which benches feed a tag-scoped score.
Model tags describe what the model is or claims to be — multimodal, open-weights, moe, frontier, agentic. These are descriptive: they help you find a model in search but don't enter the math.

Why the asymmetry?

The "Filtered Score" column reweighs a model's SupraScore using only the benches that match the active tag. That math only makes sense if the tag actually selects a non-empty set of benches. A pure model tag like multimodal — useful for search ("show me models that claim to do vision") — would select zero benches and produce a null filtered score for every row. So we hide model-only tags from the chip bar entirely.
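The null result for a model-only tag follows directly from how a tag-scoped mean has to be computed. The sketch below is an assumption about the mechanics, with hypothetical names, and it covers only the weighted mean; whether the evidence-confidence step is re-applied to the filtered set is not stated in the text.

```typescript
// A model's score on one bench, plus that bench's tags.
interface TaggedCell {
  score: number;      // mu_{m,b}
  weight: number;     // BenchScore(b)
  benchTags: string[];
}

function filteredMean(cells: TaggedCell[], activeTag: string): number | null {
  const matching = cells.filter(
    (c) => c.weight > 0 && c.benchTags.some((t) => t === activeTag),
  );
  const totalWeight = matching.reduce((acc, c) => acc + c.weight, 0);
  if (totalWeight === 0) return null; // the tag selects no bench
  return matching.reduce((acc, c) => acc + c.score * c.weight, 0) / totalWeight;
}

const cells: TaggedCell[] = [
  { score: 80, weight: 50, benchTags: ["code"] },
  { score: 60, weight: 10, benchTags: ["reasoning", "code"] },
  { score: 90, weight: 40, benchTags: ["vision"] },
];
console.log(filteredMean(cells, "code"));       // mean over the two code benches
console.log(filteredMean(cells, "multimodal")); // null: no bench carries this tag
```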

What this means in practice

The tag-filter chip bar (top of the models / benches list) shows bench tags only, sorted by how many benches carry them.
The search box on either list matches against everything — model name, provider, bench tags, model tags. Type multimodal there and Gemini surfaces immediately, even though no bench is tagged that way (yet).
The tag picker popup ("+ N more" pill) lists every bench tag with its bench count, so you can drill into any structural slice without scrolling the chip strip.
Bench tags also show up in the benches table — clicking one there toggles the same shared filter as the chip bar.

Can a tag be both?

Yes. code is the obvious one: HumanEval is a code bench, GPT-5.3 Codex is a code-tuned model. The tag exists once globally; it just appears in a chip bar if at least one bench carries it, and surfaces in search regardless of which side it lives on. The tag-counts API tracks both sides separately so this remains true even after extensive recategorisation.

Implementation: the tagCounts Convex table keeps {benches, models} per tag; the chip bar reads tags.listForBenches and the autocomplete reads tags.listAll. See convex/tags.ts.

Yes, partially. The /v1/* endpoints are live and answering requests today, but the only keys we mint right now are free Partner keys for non-profit, research and open-source projects we explicitly approve. Paid self-serve tiers (Starter / Pro / Enterprise) stay on a waitlist until enough developers are actually queued for them, because running Stripe plus a dedicated edge cache for an API nobody uses isn't free. Pricing is TBD until that launch; the waitlist is how we figure out what each tier should actually cost. Enterprise plans are always custom. Billing will be handled by Stripe (EU VAT collected automatically).

Read the full API documentation — every endpoint, error code, rate-limit and example is already there. For paid tiers, join the waitlist for the one you'd actually subscribe to: Profile → API & Billing. When the queue hits launch threshold we ship and email everyone in signup order.

Running a non-profit, research or open-source project that could genuinely use the API today? Pitch us for a Partner key — free, negotiated quota, live right now. The Apply to become a partner button at the bottom of the pricing grid opens a pre-filled form with the bits we need to evaluate.

Have a specific use case (dashboard, leaderboard mirror, evaluation tooling)? Tell us via a GitLab issue or the partner mailto — high-signal asks weigh more than passive signups.
