Contribute.
Submit benchmark scores. The community validates through voting.
About.
A community-driven, editorially honest ranking of AI models — and the math behind it.
Every public AI leaderboard tells you something — but each one tells you something different, and many tell you almost nothing useful. Three failure modes are everywhere:
- Contamination — the test set leaked into the training data, so the score doesn't measure capability, it measures memorisation.
- Saturation — every frontier model scores 99 %, so the bench has no resolving power left, but it still dominates aggregate scores out of habit.
- Bench-maxing — a model is tuned for a small set of popular benches and looks great there while being unusable in practice.
SupraBench gives every bench a community-voted quality and difficulty score, and adds an automatic saturation penalty on top. The result: a single number per model that respects how trustworthy and how informative each underlying benchmark actually is.
The Bench Score is the headline number shown on every bench page. It answers a single question: "how much should this benchmark count when ranking models?" A high-quality, hard, unsaturated bench that many models and many community members have engaged with should count more than a one-rater vanity bench on a trivial task.
It's a single number on $[0, 100]$ built from five independent factors. We define each variable first, then put them together.
Variables
- $b$ — a single benchmark.
- $Q(b) \in [0, 100]$ — the bench's quality: how trustworthy the score is. Mean of four community ratings (relevance, contamination resistance, discriminability, reproducibility) on a 1–5 scale, multiplied by 20. Defaults to $50$ when nobody has rated yet. (Full breakdown in "What exactly are the five bench dimensions?" below.)
- $D(b) \in [0, 1]$ — the bench's difficulty: how much general intelligence it actually probes. Median of community 1–5 ratings, linearly mapped to $[0, 1]$ (a vote of $1$ becomes $0$, a vote of $5$ becomes $1$). Defaults to $0.5$ when un-rated.
- $H(b) \in [0, 1]$ — the bench's headroom: how much measurement signal it has left before it saturates. Computed automatically from the top-$K$ frontier mean on $b$ — fully open ($H = 1$) until at least 3 models are evaluated, shrinks toward a floor of $0.1$ as the frontier mean climbs from $50$ toward $100$. (Full formula in "What is headroom and why does it exist?" below.)
- $u_b \in \mathbb{Z}_{\geq 0}$ — the bench's net upvotes: number of distinct upvoting accounts minus number of distinct downvoting accounts on the bench's existence-vote, floored at $0$.
- $U^\star = \max_{b'} u_{b'}$ — the maximum $u_b$ across every non-hidden bench in the system. Bootstrap: if $U^\star = 0$ (brand-new deployment, no votes anywhere), this factor is disabled for everyone.
- $N_b \in \mathbb{Z}_{\geq 0}$ — the bench's distinct-model count: how many different models have at least one net-positive (community-validated) score on $b$.
- $N^\star = \max_{b'} N_{b'}$ — the maximum $N_b$ across every non-hidden bench. Bootstrap: if $N^\star = 0$, this factor is disabled for everyone.
The formula
$$\operatorname{BenchScore}(b) \;=\; \underbrace{Q(b)}_{\text{quality}} \;\cdot\; \underbrace{D(b)}_{\text{difficulty}} \;\cdot\; \underbrace{H(b)}_{\text{headroom}} \;\cdot\; \underbrace{\sqrt{u_b / U^\star}}_{\text{upvote share}} \;\cdot\; \underbrace{\sqrt{N_b / N^\star}}_{\text{model-count share}}$$
Quality is on $[0, 100]$; the four other factors are all on $[0, 1]$, so the Bench Score stays naturally on $[0, 100]$ — the same scale as the per-model scores.
What each factor does
- Quality $Q(b)$. If the community thinks the bench is contaminated, trivia-grade, or non-reproducible, $Q$ shrinks and the bench's whole weight shrinks with it. A bench rated 1/5 across the board has $Q = 20$, fivefold less weight than a unanimous 5/5.
- Difficulty $D(b)$. A bench rated 1/5 in difficulty has $D = 0$ and is silently filtered out — scoring 100 % on a trivial bench earns the model nothing. A bench rated 5/5 has $D = 1$ and is counted at full weight.
- Headroom $H(b)$. Once a bench is solved (top frontier models all near 100), $H$ shrinks toward the $0.1$ floor, so the bench stops dominating the leaderboard the moment it loses resolving power. This makes ARC-AGI 3 → 4 hand-offs automatic, no manual retirement needed.
- Upvote share $\sqrt{u_b / U^\star}$. Without this, anyone could mint a brand-new bench, self-rate it to $Q = 100$ and $D = 1$ (with $H = 1$ by default while fewer than 3 models are evaluated) and immediately appear at #1 on the bench leaderboard. With it, a 1-upvote vanity bench at $u_b = 1$ against an established $U^\star = 100$ is worth $\sqrt{1/100} = 10\,\%$ of an equally well-rated leader. The most-upvoted bench always has share = 1, so there's no self-penalty at the top.
- Model-count share $\sqrt{N_b / N^\star}$. A bench tested by only 1 model gives almost no comparative information. This factor shrinks "spawn a bench, test only your own model on it" attacks: $N_b = 1$ vs $N^\star = 20$ gives $\sqrt{1/20} \approx 22\,\%$, on top of whatever the upvote share already costs. Modality asymmetry (image benches naturally cover fewer models than text benches) is intentional — they genuinely tell us less about the broader model population.
The two $\sqrt{\cdot}$ shapes are deliberate: they mirror the statistical $1/\sqrt{N}$ standard-error falloff. Halving a bench's upvote count doesn't halve its weight; it divides it by $\sqrt{2} \approx 1.41$. This keeps the penalty real without collapsing a bench toward zero over a single missing vote.
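The five-factor product can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the interface and function names are hypothetical, and treating a "disabled" share factor as exactly 1 is an assumption about how the bootstrap case behaves.

```typescript
// Inputs for one bench. Field names are illustrative, not the project's schema.
interface BenchInputs {
  quality: number;    // Q(b) on [0, 100]
  difficulty: number; // D(b) on [0, 1]
  headroom: number;   // H(b) on [0, 1]
  netUpvotes: number; // u_b, already floored at 0
  modelCount: number; // N_b
}

// Square-root share against the system-wide maximum.
// Assumption: a maximum of 0 (bootstrap) disables the factor, i.e. it counts as 1.
function sqrtShare(value: number, max: number): number {
  if (max <= 0) return 1;
  return Math.sqrt(Math.min(value, max) / max);
}

function benchScore(b: BenchInputs, maxUpvotes: number, maxModelCount: number): number {
  return (
    b.quality *
    b.difficulty *
    b.headroom *
    sqrtShare(b.netUpvotes, maxUpvotes) *
    sqrtShare(b.modelCount, maxModelCount)
  );
}

// A perfectly rated vanity bench: 1 upvote vs U* = 100, 1 model vs N* = 20.
const vanity = benchScore(
  { quality: 100, difficulty: 1, headroom: 1, netUpvotes: 1, modelCount: 1 },
  100,
  20,
);
console.log(vanity.toFixed(2)); // "2.24", i.e. 100 * sqrt(1/100) * sqrt(1/20)
```

Because every factor is multiplicative, a zero in any one of them (for example $D = 0$) zeroes the whole Bench Score.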
A model's SupraScore is the headline number on every model page and the primary leaderboard sort key. It builds on top of the Bench Score from the previous section — so if you haven't read that one yet, start there. Conceptually, it answers: "averaging across every benchmark this model has been tested on, how well does it perform — adjusted for how broadly it's actually been measured?"
Three preparatory steps turn raw submissions into a usable per-(model, bench) score; two scoring steps then combine those into the final SupraScore.
Foundations — submission to per-pair score
Step 1 — Normalise each submission. Every bench has its own scale (e.g. 0–100, 0–10, 0–1). To compare scores across benches we map every raw submission to a common $[0, 100]$ scale using the bench's declared min and max.
- $s_{\text{raw}}$ — a single raw submission, on the bench's native scale.
- $s_{\min}, s_{\max}$ — the bench's declared lower and upper bounds.
- $s_{\text{norm}} \in [0, 100]$ — the same submission, rescaled.
$$s_{\text{norm}} \;=\; \frac{s_{\text{raw}} - s_{\min}}{s_{\max} - s_{\min}} \cdot 100$$
In plain words: "where does this raw number sit between the bench's min and max, expressed as a percentage."
Step 2 — Validate via community votes. A submission only counts toward any score if its net vote is positive. The submitter gets an implicit $+1$, so a fresh submission starts at $1/0$. Wrong, duplicate, or fraudulent submissions get downvoted out of the calculation and stop contributing.
Step 3 — Take the median per (model, bench) pair. The same model is often submitted multiple times on the same bench (different runs, different prompts, different setups). We collapse all valid normalised submissions for a given $(m, b)$ pair into one number using the median.
- $\mathcal{S}_{m,b}$ — the set of all community-validated normalised submissions ($s_{\text{norm}}$ values) for model $m$ on bench $b$.
- $\mu_{m,b} \in [0, 100]$ — the median of that set. In plain words: the score of model $m$ on bench $b$. One number per (model, bench) cell.
$$\mu_{m,b} \;=\; \operatorname{median}\bigl(\mathcal{S}_{m,b}\bigr)$$
Median, not mean — one outlier submission (a fluke run, a misconfigured prompt, an outright fake that hasn't been downvoted yet) cannot shift the result.
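Steps 1 to 3 can be sketched as a small pipeline. The `Submission` shape and the function names below are hypothetical, and the even-length median (mean of the two middle values) is an assumption; the text above does not say which convention is used.

```typescript
interface Submission {
  raw: number;       // score on the bench's native scale
  upvotes: number;   // includes the submitter's implicit +1
  downvotes: number;
}

// Step 1: rescale a raw score to [0, 100] using the bench's declared bounds.
function normalise(raw: number, min: number, max: number): number {
  if (max <= min) throw new Error("bench bounds must satisfy min < max");
  return ((raw - min) * 100) / (max - min);
}

// Median of a non-empty list; mean of the two middle values for even length.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of empty list");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Steps 2 and 3: keep net-positive submissions, then take the median.
// Returns null when no submission is currently valid.
function pairScore(subs: Submission[], min: number, max: number): number | null {
  const valid = subs
    .filter((s) => s.upvotes - s.downvotes > 0)
    .map((s) => normalise(s.raw, min, max));
  return valid.length > 0 ? median(valid) : null;
}

// A bench scored on 0..50: two honest runs, one inflated run that has not
// been downvoted yet, and one submission already voted out.
const example = pairScore(
  [
    { raw: 35, upvotes: 3, downvotes: 0 },
    { raw: 36, upvotes: 2, downvotes: 0 },
    { raw: 49, upvotes: 1, downvotes: 0 },
    { raw: 10, upvotes: 1, downvotes: 4 },
  ],
  0,
  50,
);
console.log(example); // 72
```

The inflated run (98 after normalisation) leaves the result at 72, whereas a mean over the same three valid values would have been 80.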
Variables — model side
- $m$ — a single model.
- $\mathcal{B}_m$ — the set of benches on which $m$ has at least one community-validated score (i.e. at least one $\mu_{m,b}$ exists).
- $\mu_{m,b} \in [0, 100]$ — the score of model $m$ on bench $b$, defined above.
- $\operatorname{BenchScore}(b) \in [0, 100]$ — the five-factor weight from the previous section. Used here as the weight for bench $b$ in the average.
- $\bar{\mu}_m \in [0, 100]$ — the model's weighted mean across all its evaluated benches (computed in step 4 below).
- $W_m \;=\; \displaystyle\sum_{b \in \mathcal{B}_m} \operatorname{BenchScore}(b)$ — the model's accumulated bench-score weight. In plain words: how much trustworthy, hard, well-engaged measurement has been thrown at this model in total.
- $W^\star \;=\; \max_{m'} W_{m'}$ — the maximum $W_m$ across every non-hidden model. Defines the "fully covered" reference point against which everyone else's coverage is measured.
Step 4 — The weighted mean
In plain words: for every bench the model has been tested on, multiply the model's score on that bench by how much that bench is worth (its Bench Score), sum it all up, then divide by the total Bench-Score weight the model has accumulated.
$$\bar{\mu}_m \;=\; \frac{\displaystyle \sum_{b \in \mathcal{B}_m} \mu_{m,b} \cdot \operatorname{BenchScore}(b)}{\displaystyle \sum_{b \in \mathcal{B}_m} \operatorname{BenchScore}(b)}$$
Why weighted and not a simple mean? A simple mean (sum the scores, divide by the number of benches) treats every bench equally. So a saturated trivia bench at $\operatorname{BenchScore} = 1$ would count just as much as a fresh hard frontier bench at $\operatorname{BenchScore} = 50$. That gives a model-runner a free path to the top: pad the model with high scores on dozens of cheap benches and watch the average climb. The weighted mean kills that — a bench worth $1$ point of weight contributes $50$× less to both numerator and denominator than a bench worth $50$ points, so its influence on the average is exactly its share of the total weight.
And why does the denominator divide by the sum of weights, not by the bench count? Because that's what makes the result land back on the same $[0, 100]$ scale as the individual $\mu_{m,b}$ values. A weighted average of numbers in $[0, 100]$ is itself in $[0, 100]$ only if you divide by the sum of the weights. Dividing by the count would give you a number that scales with how much weight you've accumulated — but that's a separate question (covered in step 5), not the question this step is answering.
Think of step 4 as "how good is the model on the benches it actually was tested on" — pure performance intensity, normalised to $[0, 100]$. Step 5 below adds the second axis: "and how much of the bench landscape was that?" Two orthogonal pieces of information, separated by design.
Benches with $\operatorname{BenchScore}(b) = 0$ (zero difficulty, zero quality, etc.) are excluded from both numerator and denominator. They contribute nothing in either direction — neither help nor hurt.
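The padding argument is easy to check numerically. The sketch below uses hypothetical names and made-up scores: one hard bench at weight 50 where the model scores 60, plus twenty cheap benches at weight 1 where it scores 100.

```typescript
interface BenchResult {
  score: number;  // mu_{m,b} on [0, 100]
  weight: number; // BenchScore(b) on [0, 100]
}

// Weighted mean over a model's benches; zero-weight benches are skipped.
// Returns null when the model has no bench with positive weight.
function weightedMean(results: BenchResult[]): number | null {
  let numerator = 0;
  let denominator = 0;
  for (const r of results) {
    if (r.weight <= 0) continue; // neither helps nor hurts
    numerator += r.score * r.weight;
    denominator += r.weight;
  }
  return denominator > 0 ? numerator / denominator : null;
}

const results: BenchResult[] = [{ score: 60, weight: 50 }];
for (let i = 0; i < 20; i++) results.push({ score: 100, weight: 1 });

const simpleMean = results.reduce((sum, r) => sum + r.score, 0) / results.length;
const weighted = weightedMean(results);

console.log(simpleMean.toFixed(1));                          // "98.1"
console.log(weighted === null ? "n/a" : weighted.toFixed(1)); // "71.4"
```

Twenty padded benches lift the simple mean to about 98, but together they carry only 20 of the 70 weight units, so the weighted mean stays near the hard bench's score.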
Step 5 — The coverage-share shrinkage
In plain words: shrink the weighted mean from step 4 by the square root of the model's share of the largest weight any model in the system has accumulated. A model that's been tested broadly keeps its full weighted mean; a model tested on only a small slice of the bench landscape gets pulled down in proportion.
$$\operatorname{SupraScore}(m) \;=\; \bar{\mu}_m \cdot \sqrt{\dfrac{W_m}{W^\star}}$$
If $W_m = W^\star$ the shrinkage factor is exactly $1$ (the most-covered model has no self-penalty). If $W_m = \tfrac{1}{4} W^\star$ the factor is $\sqrt{0.25} = 0.5$ — quarter coverage costs you half your score.
Why does this exist on top of step 4? Without it, a model with one extremely good score on one favourable bench could beat a well-covered rival who has been tested on dozens of benches at slightly lower scores. The weighted mean alone is intensity-only; the $\sqrt{W_m/W^\star}$ factor adds the breadth dimension. Together they say "to rank highly, you need to score well and be widely measured."
The $\sqrt{\cdot}$ shape mirrors the statistical $1/\sqrt{N}$ standard-error falloff: halving your coverage doesn't halve your score, it divides it by $\sqrt{2} \approx 1.41$. Hidden models are excluded from the $W^\star$ maximum so a mothballed flagship can't permanently squash fresh entrants.
A consequence to be aware of: a model's SupraScore can move when other models gain coverage. A well-tested new flagship that raises $W^\star$ shifts everyone else's share proportionally. This is deliberate — "coverage" is a relative concept and only exists in comparison to what else has been tested. See the worked example below for concrete numbers.
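A short sketch of the shrinkage step, with illustrative numbers, shows both the quarter-coverage case and the effect of a rival raising $W^\star$. The function name is hypothetical, and returning the unshrunk mean when $W^\star = 0$ is an assumption (the text does not define that case).

```typescript
// SupraScore = weighted mean * sqrt(W_m / W*).
function supraScore(weightedMean: number, modelWeight: number, maxWeight: number): number {
  if (maxWeight <= 0) return weightedMean; // assumption: no reference point, no shrinkage
  const share = Math.min(modelWeight, maxWeight) / maxWeight;
  return weightedMean * Math.sqrt(share);
}

// Same model, same scores (weighted mean 80, W_m = 25), three different W*.
console.log(supraScore(80, 25, 25));  // 80: the most-covered model keeps its full mean
console.log(supraScore(80, 25, 100)); // 40: quarter coverage costs half the score
console.log(supraScore(80, 25, 400)); // 20: a new flagship quadrupled W*
```

The last two lines differ only in $W^\star$: nothing about the model changed, yet its SupraScore halved because the reference point moved.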
What exactly are the five bench dimensions?
Each bench is rated on a 1–5 scale on five orthogonal axes. The first four feed quality; the fifth is difficulty.
- Relevance — does it measure something useful for real-world model use? High = meaningful task. Low = trivia, toy puzzles.
- Contamination Resistance — how sure are we the test set wasn't in the training data? High = held-out, rotating, freshly generated. Low = scraped from the public web before model release.
- Discriminability — does it separate weak from strong models? High = wide score spread. Low = everyone scores 95–100.
- Reproducibility — can two independent runs reach the same conclusion? High = deterministic, well-specified. Low = vague setup, judge-LLM, hand-graded.
- Difficulty — how much general intelligence does it actually probe? 1 = trivial. 5 = approaches frontier capability.
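Quality is the mean of the first four × 20, where $R$, $C$, $S$ and $P$ denote the ratings for relevance, contamination resistance, discriminability and reproducibility, in that order. With ratings on 1–5, a rated bench lands between $20$ and $100$: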
$$Q(b) \;=\; \frac{R + C + S + P}{4} \cdot 20 \quad \in [0,\,100]$$
Difficulty uses the median across raters (more robust to a single inflated vote), then linearly scaled to $[0,1]$:
$$D(b) \;=\; \max\!\left(0,\; \min\!\left(1,\; \frac{\operatorname{median}(d_i) - 1}{4}\right)\right)$$
If nobody has rated a bench yet, it defaults to neutral: $Q = 50$, $D = 0.5$.
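The two aggregation rules can be sketched as follows. The names are hypothetical, and averaging each quality axis across raters before taking the mean of the four axes is one reading of "mean of four community ratings"; the even-length median convention is likewise an assumption.

```typescript
// One rater's quality votes, each on 1..5. Field names are illustrative.
interface QualityRating {
  relevance: number;
  contaminationResistance: number;
  discriminability: number;
  reproducibility: number;
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function median(xs: number[]): number {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Q(b): mean of the four axis means, times 20. Defaults to 50 when unrated.
function quality(ratings: QualityRating[]): number {
  if (ratings.length === 0) return 50;
  const axes = [
    mean(ratings.map((r) => r.relevance)),
    mean(ratings.map((r) => r.contaminationResistance)),
    mean(ratings.map((r) => r.discriminability)),
    mean(ratings.map((r) => r.reproducibility)),
  ];
  return mean(axes) * 20;
}

// D(b): median vote mapped from 1..5 to [0, 1]. Defaults to 0.5 when unrated.
function difficulty(votes: number[]): number {
  if (votes.length === 0) return 0.5;
  return Math.max(0, Math.min(1, (median(votes) - 1) / 4));
}

console.log(difficulty([1, 1, 5])); // 0: one inflated vote does not move the median
console.log(difficulty([3]));       // 0.5
console.log(quality([]));           // 50
```

With a mean instead of a median, the `[1, 1, 5]` example would have produced a non-zero difficulty and let the trivial bench back into the weighting.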
What is headroom and why does it exist?
Imagine ARC-AGI 3 just got solved: every frontier model scores 99 %. The bench has stopped measuring intelligence; it now just measures "is your model in the frontier club". A score on it should contribute almost nothing to a model's SupraScore.
The naïve fix would be: ask the community to retroactively lower its quality. Nobody does that consistently. So we automate it — every bench has a headroom factor that shrinks as the bench saturates:
The frontier-mean approach
Top-1 alone is too noisy: one freakishly good model would tank the bench's weight for everyone after a single submission. So we use the mean of the top-K models, which is robust to outliers:
$$N \;=\; \bigl|\{\, m : \exists \text{ valid score for } (m,b)\}\bigr|, \qquad K \;=\; \min(10,\, N)$$
$$f(b) \;=\; \frac{1}{K} \sum_{m \in \operatorname{top}_K(b)} \mu_{m,b}$$
where $\operatorname{top}_K(b)$ are the $K$ models with the highest per-model median on $b$.
From frontier-mean to headroom
Two regimes. With fewer than 3 evaluated models we don't have enough signal to claim saturation, so we don't penalise:
$$H(b) \;=\; \begin{cases} 1.0 & \text{if } N < 3 \\[6pt] \max\!\left(0.1,\; \dfrac{100 - \max(f(b),\, 50)}{50}\right) & \text{otherwise} \end{cases}$$
The pivot at 50 means a bench stays "fully open" until the mean of its top-$K$ models rises above mid-range. The floor at 0.1 keeps historic benches alive: they never disappear, they just stop dominating.
Trajectory
| Scenario | $N$ | $f(b)$ | $H(b)$ |
|---|---|---|---|
| Brand-new bench, 1 model @ 92 | 1 | — | 1.00 |
| 3 models, top-3 mean 70 | 3 | 70 | 0.60 |
| 10 models, top-10 mean 84 | 10 | 84 | 0.32 |
| 30 models, top-10 mean 91 | 30 | 91 | 0.18 |
| 50 models, top-10 mean 99 | 50 | 99 | 0.10 (floor) |
This is what makes the ARC-AGI 3 → 4 hand-off automatic. Nobody has to "retire" the old bench manually.
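The headroom rule can be written as one small function. This is a sketch with a hypothetical name; its input is the list of per-model medians $\mu_{m,b}$ on a single bench.

```typescript
// H(b) from the per-model medians on one bench.
function headroom(modelMedians: number[]): number {
  const n = modelMedians.length;
  if (n < 3) return 1; // too few models to claim saturation

  // Frontier mean: average of the top K = min(10, N) medians.
  const k = Math.min(10, n);
  const topK = [...modelMedians].sort((a, b) => b - a).slice(0, k);
  const frontierMean = topK.reduce((sum, x) => sum + x, 0) / k;

  // Fully open up to a frontier mean of 50, then linear down to the 0.1 floor.
  return Math.max(0.1, (100 - Math.max(frontierMean, 50)) / 50);
}

console.log(headroom([92]));         // 1: only one model evaluated
console.log(headroom([70, 70, 70])); // 0.6
console.log(headroom([99, 99, 99])); // 0.1: floor reached
console.log(headroom([40, 30, 20])); // 1: frontier mean still below 50
```

Only the top ten medians enter the mean, so adding weaker models to a bench does not restore headroom once the frontier has saturated it.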
Take a hypothetical model "Claude X" with valid medians on three benches. The weights in the table are $w = Q \cdot D \cdot H$; the upvote-share and model-count-share factors are left out of this example (equivalently, both set to $1$) to keep the arithmetic short:
| Bench | $\mu_{m,b}$ | $Q$ | $D$ (median) | $N$ | $f(b)$ | $H$ | $w$ | $w \cdot \mu$ |
|---|---|---|---|---|---|---|---|---|
| MMLU (saturated) | 88 | 80 | 0.50 (3/5) | 50 | 96 | 0.10 | 4.0 | 352 |
| ARC-AGI 4 (fresh) | 72 | 75 | 1.00 (5/5) | 8 | 65 | 0.70 | 52.5 | 3 780 |
| TrivialBench | 99 | 60 | 0.00 (1/5) | 20 | 99 | 0.10 | 0 | 0 |
| Σ | | | | | | | 56.5 | 4 132 |
$$\bar{\mu}_{\text{Claude X}} \;=\; \frac{4\,132}{56.5} \;\approx\; 73.1$$
That's the weighted mean. Now apply the coverage-share step. Suppose the best-covered model in the table has $W^\star = 100$; Claude X has $W_m = 56.5$, so its share is $0.565$ and the shrinkage factor is $\sqrt{0.565} \approx 0.752$:
$$\operatorname{SupraScore}(\text{Claude X}) \;=\; 73.1 \cdot \sqrt{\tfrac{56.5}{100}} \;\approx\; 55.0$$
If Claude X were the most-covered model ($W^\star = W_m = 56.5$), the factor would be $1$ and the SupraScore would equal the weighted mean $73.1$ exactly.
Four takeaways from this:
- The frontier bench dominates ($w = 52.5$) because it's hard, trustworthy, and not yet saturated.
- The saturated bench still counts a little ($w = 4$) but contributes less than 10 % of the total weight.
- The trivial bench is silently filtered out by $D = 0$ — scoring 99 % on it earns the model exactly nothing.
- The coverage-share step means a model that's been tested on only a small fraction of the bench pool can't beat a well-covered rival just by cherry-picking one favourable bench.
Anything the community can get wrong, the community can correct via voting. Five layers:
- Submission votes — every individual score is up/downvoted. Only net-positive submissions count toward the SupraScore.
- Tag votes — each tag on a model or bench is voted independently. Net positive keeps it in the canonical tag set.
- Existence votes — fakes, duplicates, low-quality entries can be downvoted into a hidden state (see formula below).
- Quality & difficulty ratings — anyone signed in can rate a bench on all five dimensions. Averaged across raters (mean for trust, median for difficulty).
- Per-bench scaling — a bench with one corrupt rater barely moves; the more raters, the harder it is to game.
When does an entity get hidden?
The hide threshold scales with engagement, so a small mob can't kick a well-established bench, but spam still goes away fast:
$$\text{hide}(e) \;\Leftrightarrow\; \operatorname{down}(e) \;\geq\; \max\bigl(5,\, \lceil 0.6 \cdot (\operatorname{up}(e) + \operatorname{down}(e)) \rceil\bigr) \;\;\wedge\;\; \operatorname{down}(e) > \operatorname{up}(e)$$
A 5-downvote floor protects against drive-by spam. The 60 % ratio means that to remove a bench with 100 upvotes, you'd need at least 150 downvotes (the smallest $d$ with $d \geq 0.6 \cdot (100 + d)$), which is much harder than 4 sock puppets.
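The hide rule translates directly into a predicate. The sketch below uses hypothetical function names and writes the 60 % ratio as the integer fraction 3/5 so that the ceiling is not affected by floating-point rounding.

```typescript
// hide(e): down >= max(5, ceil(0.6 * (up + down))) and down > up.
function isHidden(up: number, down: number): boolean {
  const threshold = Math.max(5, Math.ceil((3 * (up + down)) / 5));
  return down >= threshold && down > up;
}

// Smallest number of downvotes that hides an entity with `up` upvotes.
// Terminates because down >= 0.6 * (up + down) holds once down >= 1.5 * up.
function minDownvotesToHide(up: number): number {
  let down = 0;
  while (!isHidden(up, down)) down++;
  return down;
}

console.log(minDownvotesToHide(0));  // 5: the absolute floor
console.log(minDownvotesToHide(10)); // 15
console.log(isHidden(10, 14));       // false: 14 < ceil(0.6 * 24) = 15
```

Above the floor, the required downvotes grow as 1.5 times the upvotes, which follows from solving $d \geq 0.6 \cdot (u + d)$ for $d$.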
Anti-resurrection
If you submit a model or bench under your name and it gets community-removed, you cannot re-submit it under the same name. Other users can re-submit it (with a numeric suffix on the slug), and the community votes on the new version independently.
Adversarial robustness is built into the formula at every layer:
- One submission can't anchor a score: the per-(model, bench) median becomes robust to a single outlier once honest submissions outnumber it, i.e. from $n \ge 3$ submissions. With only two submissions the median is just their midpoint; the moment a second honest submission lands next to a single attacker number, the median falls back within the honest range.
- One bench can't carry a model: the SupraScore is averaged across all benches a model has scores on, weighted by each bench's Bench Score, and then further shrunk by the coverage-share factor $\sqrt{W_m/W^\star}$. A model tested on only one bench keeps just $\sqrt{W_m/W^\star}$ of its weighted mean, where $W_m$ is that single bench's weight and $W^\star$ is the total accumulated by the best-covered rival, so bench-maxing your way to the top with a single vanity bench is heavily penalised.
- One user can't carry a bench — the same $\sqrt{\cdot}$ shape is also applied to the bench side: every bench's contribution to its own headline score and to any model's SupraScore is shrunk by $\sqrt{u_b/U^\star}$ where $u_b$ is its net upvote count and $U^\star$ is the leader's. A self-rated 100/100 vanity bench from a single account is worth $\sqrt{1/U^\star}$ of an established bench at the same Q·D·H — you'd need $U^\star$ separate accounts upvoting it to even tie. Same defence works on the bench leaderboard and in the SupraScore aggregate, so you can't spawn a bench just to pump one model.
- Single-model vanity benches don't count — every Bench Score is additionally shrunk by $\sqrt{N_b/N^\star}$, where $N_b$ is the number of distinct models that have actually been scored on the bench and $N^\star$ is the maximum across non-hidden benches. A "community bench" used to test only the attacker's own model has $N_b = 1$ and so contributes $\sqrt{1/N^\star}$ of the weight a well-used bench would. Combined with the upvote-share, a 1-rater + 1-model vanity bench is worth $\sqrt{1/(U^\star \!\cdot\! N^\star)}$ of an established peer.
- Difficulty uses median, not mean — a single 5-star rater on a trivial bench can't fake difficulty.
- Saturation auto-detected — pumping a saturated bench gives diminishing returns by construction.
- Engagement-aware hide threshold — small voting cliques can't take down established entries.
- Anti-resurrection — re-submitting your own removed entries under the same name is blocked.
- Rate limiting — submissions are capped at 30 individual scores per 24 h per user.
None of these is bulletproof on its own. Together they make systemic gaming expensive enough that legitimate contribution is the cheaper path. Every defensive claim above is encoded as an executable invariant or attack scenario in tests/convex/adversarial-robustness.test.ts. The harness has three layers (invariants, an attack catalog, and a deterministic-PRNG fuzzer) and runs on every CI build, so a regression in the math fails a test instead of going unnoticed. One scenario (`A3-extreme`, an industrial-scale 8+ vanity bench farm) explicitly documents an attack that the pure math does not defend against; it is blocked operationally by the rate-limit, downvote, anti-resurrection and moderation rules above.
Any URL is a valid source — but only well-known academic, lab, and dedicated leaderboard hosts get an "Official source" badge. Everything else (YouTube, Substack, X/Twitter, personal blogs) is a valid community source and gets a "Community" badge instead. The submission is otherwise treated identically; the badge is a transparency signal, not a gatekeeper.
The whitelist lives in convex/urls.ts in the public source repository on GitLab — suggestions welcome via issues.
The full source code of SupraBench, including the ranking math you just read, is published under the Business Source License 1.1 (BSL):
gitlab.com/florian-fischer-group/suprabench ↗
Source-available, not OSI-open-source. The BSL is a "source-available" licence: you can read, audit, fork for research, learning and non-commercial use, patch and redistribute — but the licence does not meet the OSI Open Source Definition because it carves out commercial competing-service use until the Change Date of 2029-01-01. On that date the entire codebase auto-converts to Apache License 2.0 and becomes plain open source under that name. We use this exact wording instead of the unqualified "open source" label because the OSI / FSF community is — rightly — strict about that distinction.
Issues, feature requests and merge requests are welcome on GitLab. Imprint, privacy and terms in the sidebar footer.
A family is one specific lab release — not a vendor, a generation or an architecture line. "Claude Opus 4.6" and "Claude Opus 4.7" are separate families. "Claude" is not a family at all; it's a brand.
The reason is practical: a 0.1 version bump at a frontier lab usually means a different training run, different parameter count and measurably different benchmark behaviour. Grouping them defeats the point of a family ranking.
Variants go in the model name, not the family
When a lab ships the same release at different effort levels — sampling temperature, reasoning mode, context window, fine-tune — we keep them in the same family and disambiguate with a parenthetical suffix on the model's display name:
| Family | Variants (model name) | What the suffix means |
|---|---|---|
| Claude Opus 4.7 | Claude Opus 4.7, Claude Opus 4.7 (max) | Lab-branded "best-effort" sampling |
| GPT-5.3 Codex | … (low), (med), (high), (xhigh) | Reasoning-effort level |
| Gemini 3.1 | Gemini 3.1, Gemini 3.1 (thinking) | Explicit reasoning mode on/off |
Common suffixes you'll see: (low) / (med) / (high) / (xhigh), (thinking), (max), (128k) / (200k) / (1M) for context-length SKUs, (instruct) / (chat) / (base) for open-weight post-training variants.
Family ranking ≠ average of member models
When you click a family tag in the rankings table, we rebuild the SupraScore from the union of the family's member models per bench: for each bench any member scored on, we take the family's median-of-medians and weight it by the bench's own weight, exactly as we do for individual models. The coverage-share step then uses the max over all families (not over all models) as denominator so family scores stay on the same $[0,100]$ scale. A family with more variants doesn't automatically win — the per-bench median is still robust to one outlier variant.
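The per-bench part of the family rebuild might look like the sketch below. All names are hypothetical, the even-length median convention is an assumption, and the coverage-share step over families is left out; only the median-of-medians and the weighting are shown.

```typescript
interface MemberScore {
  model: string; // variant display name
  bench: string;
  score: number; // that member's per-(model, bench) median, on [0, 100]
}

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Weighted mean of the family's per-bench median-of-medians.
// benchWeight maps a bench name to its Bench Score.
function familyWeightedMean(
  members: MemberScore[],
  benchWeight: Record<string, number>,
): number | null {
  const byBench: Record<string, number[]> = {};
  for (const m of members) {
    (byBench[m.bench] = byBench[m.bench] || []).push(m.score);
  }
  let numerator = 0;
  let denominator = 0;
  for (const bench of Object.keys(byBench)) {
    const w = benchWeight[bench] ?? 0;
    if (w <= 0) continue;
    numerator += median(byBench[bench]) * w;
    denominator += w;
  }
  return denominator > 0 ? numerator / denominator : null;
}

const family = familyWeightedMean(
  [
    { model: "X", bench: "A", score: 70 },
    { model: "X (max)", bench: "A", score: 72 },
    { model: "X (thinking)", bench: "A", score: 95 },
    { model: "X (max)", bench: "B", score: 60 },
  ],
  { A: 50, B: 10 },
);
console.log(family); // 70, from (72 * 50 + 60 * 10) / 60
```

On bench A the outlier variant at 95 does not lift the family's value above 72, which is the robustness property the paragraph above describes.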
What if the community disagrees with a grouping?
Anyone signed in can edit a model's familyTag via the model-detail page. Rankings refresh on the next tick. There's no canonical list and no admin curation — if the community thinks "GPT-5.3 Codex (high)" is actually a separate model from "(low)", they can split it off by assigning a new family tag to the "(high)" variant.
A one-off release with no variants and no expected follow-ups is fine to have familyTag == name — the family ranking then shows it with modelCount: 1.
Tags exist on two different things on SupraBench, and they play two different roles:
- Bench tags describe what the bench measures: `reasoning`, `code`, `multilingual`, `vision`, `safety`. These are structural: they decide which benches feed a tag-scoped score.
- Model tags describe what the model is or claims to be: `multimodal`, `open-weights`, `moe`, `frontier`, `agentic`. These are descriptive: they help you find a model in search but don't enter the math.
Why the asymmetry?
The "Filtered Score" column reweighs a model's SupraScore using only the benches that match the active tag. That math only makes sense if the tag actually selects a non-empty set of benches. A pure model tag like multimodal — useful for search ("show me models that claim to do vision") — would select zero benches and produce a null filtered score for every row. So we hide model-only tags from the chip bar entirely.
What this means in practice
- The tag-filter chip bar (top of the models / benches list) shows bench tags only, sorted by how many benches carry them.
- The search box on either list matches against everything: model name, provider, bench tags, model tags. Type `multimodal` there and Gemini surfaces immediately, even though no bench is tagged that way (yet).
- The tag picker popup ("+ N more" pill) lists every bench tag with its bench count, so you can drill into any structural slice without scrolling the chip strip.
- Bench tags also show up in the benches table — clicking one there toggles the same shared filter as the chip bar.
Can a tag be both?
Yes. code is the obvious one: HumanEval is a code bench, GPT-5.3 Codex is a code-tuned model. The tag exists once globally; it just appears in a chip bar if at least one bench carries it, and surfaces in search regardless of which side it lives on. The tag-counts API tracks both sides separately so this remains true even after extensive recategorisation.
Implementation: the tagCounts Convex table keeps {benches, models} per tag; the chip bar reads tags.listForBenches and the autocomplete reads tags.listAll. See convex/tags.ts.
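The split between the two readers can be illustrated with plain functions over an in-memory list. This is not the Convex code from convex/tags.ts: the `TagCount` shape only mirrors the `{benches, models}` counters mentioned above, and both function names are hypothetical stand-ins for what `tags.listForBenches` and `tags.listAll` are described as returning.

```typescript
interface TagCount {
  tag: string;
  benches: number; // how many benches carry the tag
  models: number;  // how many models carry the tag
}

// Chip bar: bench tags only, sorted by how many benches carry them.
function chipBarTags(counts: TagCount[]): string[] {
  return counts
    .filter((c) => c.benches > 0)
    .sort((a, b) => b.benches - a.benches)
    .map((c) => c.tag);
}

// Search autocomplete: every tag in use, whichever side it lives on.
function searchableTags(counts: TagCount[]): string[] {
  return counts.filter((c) => c.benches + c.models > 0).map((c) => c.tag);
}

const counts: TagCount[] = [
  { tag: "multimodal", benches: 0, models: 7 },
  { tag: "code", benches: 4, models: 3 },
  { tag: "reasoning", benches: 9, models: 0 },
];

console.log(chipBarTags(counts));    // ["reasoning", "code"]
console.log(searchableTags(counts)); // ["multimodal", "code", "reasoning"]
```

Keeping the two counters separate is what lets `code` show up in both places while `multimodal` stays out of the chip bar.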
Yes, partially. The /v1/* endpoints are live and answering requests today, but the only keys we mint right now are free Partner keys for non-profit, research and open-source projects we explicitly approve. Paid self-serve tiers (Starter / Pro / Enterprise) stay on a waitlist until enough developers are actually queued for them; running Stripe + a dedicated edge cache for an API nobody uses isn't free. Pricing is TBD until that launch: the waitlist is how we figure out what each tier should actually cost. Enterprise+ plans are always custom. Billing will be handled by Stripe (EU VAT collected automatically).
Read the full API documentation: every endpoint, error code, rate-limit and example is already there. For paid tiers, join the waitlist for the one you'd actually subscribe to under Profile → API & Billing. When the queue hits launch threshold we ship and email everyone in signup order.
Running a non-profit, research or open-source project that could genuinely use the API today? Pitch us for a Partner key — free, negotiated quota, live right now. The Apply to become a partner button at the bottom of the pricing grid opens a pre-filled form with the bits we need to evaluate.
Have a specific use case (dashboard, leaderboard mirror, evaluation tooling)? Tell us via a GitLab issue or the partner mailto — high-signal asks weigh more than passive signups.
Plans
Starter
TBD at launch
10 000 requests / month
- Read access to all public endpoints
- 60 req/min rate limit
- 1 API key
- Community support
Pro
TBD at launch
100 000 requests / month
- Everything in Starter
- 300 req/min rate limit
- 3 API keys
- Email support
- Bulk export endpoint
Enterprise
TBD at launch
1 000 000 requests / month
- Everything in Pro
- 1 200 req/min rate limit
- 10 API keys
- Priority support (best-effort, no formal SLA)
Enterprise+
Custom
Custom quota, optional contractual SLA, on-prem mirror
- Everything in Enterprise
- Custom rate limits
- 50+ keys
- Dedicated Slack channel; SLA negotiable per contract
- Custom data agreements
Partner
Free (sponsored by the project)
Custom quota & key count
- Full API access (read + bulk export)
- For non-profit, research & open-source projects
- Hobby / friend-of-the-project sites welcome too
- Quota, rate limit & key count set per partner
- Attribution expected (Powered by SupraBench footer link)
Pricing is intentionally TBD until launch — the waitlist is how we figure out what each tier should actually cost. Read the full API documentation →. Billing will be handled by Stripe (EU VAT collected automatically, B2B reverse-charge supported via VAT-ID at checkout). See Terms § API for cancellation, refund and uptime details.
Simulator.
Calculate the SupraScore your unreleased model would land at if you submitted these scores against existing benches. Nothing is saved — runs are purely a what-if for your decks.