April 29, 2026 · 8 min read

Inside the ML alt-text-quality pass

axe-core checks whether alt exists. We added a second pass that checks whether it's any good. This is the deep dive on how that works, what we measured, and where it breaks. Companion to “axe-core finds the violation. We suggest the fix.”

Run axe-core on a typical SaaS landing page. It tells you whether every <img> has an alt attribute. It doesn't tell you whether that attribute means anything.

Three patterns that show up repeatedly on real production sites:

<img src="/hero.jpg"          alt="image" />
<img src="/team-photo.jpg"    alt="DSC_4823.jpg" />
<img src="/cat-food.jpg"      alt="Buy our premium cat food today!" />

All three pass image-alt, axe-core's rule for non-text content under WCAG success criterion 1.1.1. All three fail real screen reader users. The first is an empty word that wastes the user's time. The second is a filename leak — somebody dragged a photo into a CMS without typing anything. The third is marketing copy where descriptive alt should be.

This is the most common gap in axe-core's coverage we've measured. AccessPulse's alt-text-quality pass is the second-opinion check that catches it.

For the broader story of how this fits alongside our Tier 1/2/3 fix suggestions, see the thesis post. This piece is the technical deep dive.

Why automated testing only catches 57%

Deque has run the same study three times: how much of WCAG can a static analyzer catch automatically? Each time the answer comes back near 57%. The other 43% needs human judgment. This isn't a missing feature in the tools — it's a structural property of static analysis.

What lives in the 43%:

  • Alt text quality: a vision-language problem. axe-core can't see the image, so it can't tell if the alt is right.
  • Reading order: a rendering problem. Screen readers walk the DOM; the visual layout walks something else.
  • Cognitive load: a judgment problem. Is the navigation consistent enough across pages?
  • Custom widget keyboard interaction: an interaction problem. Static analysis can't press Tab.
  • Video caption accuracy: a speech recognition problem. Captions can be present and wrong.

Each gap requires a different kind of analysis. We picked alt text quality first because it's the gap most amenable to multimodal language models today, and it's the highest-frequency one.

Why alt text quality, specifically

Three reasons we built the alt-text-quality pass first.

Frequency. Look at any page with images and you'll find this gap. WebAIM's Million study finds alt text issues on a majority of homepages: generic placeholders, filename leaks, mismatched descriptions, decorative images with content alt. Accessibility consultants we've talked to cite it as the most-found issue in manual audits.

Multimodal models can now be evaluated against ground truth. Gemini 2.5 Flash, Claude 3.5 Sonnet, and GPT-4o all accept (image, text) pairs and return structured judgments. We can build a held-out test set of human-labeled image-alt pairs, measure model agreement against it, and ship only if accuracy clears a threshold. That's a real engineering process, not “AI for accessibility.”

The detection problem is bounded. “Does this alt text describe this image?” is closer to a yes/no question than “Is this page's reading order coherent?” Yes/no questions are easier to evaluate against ground truth, easier to ship behind a quality gate, and easier to communicate to users when they're wrong.

The mechanism, plainly

After axe-core finishes scanning a page, AccessPulse runs a second pass on each <img> element with a non-empty alt attribute that we can fetch (we skip inline data: URLs and unsupported image formats; we handle PNG, JPEG, WebP, HEIC, and HEIF). We send the image and the alt text to Gemini 2.5 Flash with a structured prompt that evaluates against five criteria: accuracy, length, usefulness for screen reader users, absence of redundant prefixes (“image of”, “photo of”), and context-appropriate use of empty alt for decorative images.
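The eligibility rule above can be sketched in a few lines. This is a minimal illustration, not AccessPulse's actual implementation: the names `is_eligible`, `eligible_images`, and `SUPPORTED_EXTENSIONS` are hypothetical, and checking the file extension is an assumption (a real pipeline might inspect the response's content type instead).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import posixpath

# Assumption: extensions standing in for the formats the post lists
# (PNG, JPEG, WebP, HEIC, HEIF). A real check may look at content type.
SUPPORTED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".heic", ".heif"}


def is_eligible(src: str, alt: str) -> bool:
    """True if an <img> qualifies for the quality pass as described above."""
    if not alt:
        return False  # missing or empty alt is outside the quality pass
    if not src or src.lower().startswith("data:"):
        return False  # inline data: URLs are skipped
    ext = posixpath.splitext(urlparse(src).path)[1].lower()
    return ext in SUPPORTED_EXTENSIONS


class ImgCollector(HTMLParser):
    """Collects (src, alt) pairs for every eligible <img> in a document."""

    def __init__(self):
        super().__init__()
        self.eligible = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        # Valueless attributes arrive as None, so normalize to "".
        src = (attr.get("src") or "").strip()
        alt = (attr.get("alt") or "").strip()
        if is_eligible(src, alt):
            self.eligible.append((src, alt))


def eligible_images(html: str):
    parser = ImgCollector()
    parser.feed(html)
    return parser.eligible
```

Each returned (src, alt) pair is what would then be sent, together with the fetched image, to the model.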

The model returns one of three verdicts:

  • good — alt text is accurate, concise, and useful. We suppress these from your scan report. A page with 50 well-described images and 3 issues should look like a 3-issue report, not a 53-line list.
  • marginal — alt text is accurate but could be sharper (vague, missing context, or includes redundant phrasing). We surface the model's reasoning so you can decide whether to revise. We don't suggest a replacement here, because the existing alt isn't broken — it's improvable.
  • poor — alt text likely fails screen reader users. We surface the model's reasoning AND a suggested alternative as a starting point.
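The three verdicts map to three reporting behaviors, sketched below. The `Finding` dataclass and `to_report_rows` function are hypothetical names for illustration; the post does not describe the real data model.

```python
from dataclasses import dataclass
from typing import Optional

VERDICTS = ("good", "marginal", "poor")


@dataclass
class Finding:
    src: str
    alt: str
    verdict: str                        # one of VERDICTS
    reasoning: str
    suggested_alt: Optional[str] = None


def to_report_rows(findings):
    """Apply the reporting rules: suppress good, explain marginal, fix poor."""
    rows = []
    for f in findings:
        if f.verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {f.verdict!r}")
        if f.verdict == "good":
            continue  # suppressed so the report lists only actionable items
        row = {
            "src": f.src,
            "alt": f.alt,
            "verdict": f.verdict,
            "reasoning": f.reasoning,
        }
        if f.verdict == "poor":
            # Only poor findings carry a suggested alternative.
            row["suggested_alt"] = f.suggested_alt
        rows.append(row)
    return rows
```

Under this sketch, a page with 50 good images and 3 flagged ones yields exactly 3 rows, and a marginal row never contains a suggested replacement even if the model produced one.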

What it doesn't do

The alt-text-quality pass doesn't catch missing alt. That's image-alt, axe-core's job. The two layer together. If a content image has no alt attribute, axe-core catches it. If a decorative image has content alt, the quality pass catches it (and suggests alt=""). If alt is present but bad, the quality pass catches it. If alt is present and accurate, both tools agree.

The quality pass doesn't make WCAG conformance assertions. “Likely fails screen reader users” is not “fails WCAG 1.1.1.” Conformance is a formal evaluation; we surface suggestions for a human to confirm.

The quality pass doesn't auto-apply fixes. The suggested alt text is a starting point, not a replacement. The accessiBe overlay model — auto-applying ML-generated accessibility fixes that got the company fined $1M by the FTC — is exactly what AccessPulse is not.

In V1, ML is used in two narrow places: evaluating the quality of existing alt text (the subject of this post), and — conditionally on a July 1 accuracy gate — generating starting-point text for a small set of axe-core violation types (alt text content for missing-alt images, form labels, button names). All other WCAG checks remain axe-core's deterministic rule engine. The deliberate scope: ML where it adds something axe-core can't do, deterministic rules everywhere else.

How we measured

Building an ML feature without an evaluation set is how you ship something that's right 30% of the time and don't notice.

We evaluate the model against a held-out test set of 47 image-alt pairs drawn from four sources: WebAIM canonical examples (third-party-verified ground truth, stable URLs over a decade old), Wikimedia Commons stable assets, NOAA scientific charts (for the complex-image bucket), and constructed edge cases targeting the patterns where alt text most commonly fails. Ground truth is assigned by human review with per-case rationale, including disagreements between reviewers.

We track three things: verdict-level accuracy (does the model agree with the human label?), poor precision (when the model says poor, is it actually poor?), and per-verdict precision and recall. Poor precision matters most because false positives erode trust faster than false negatives: a user who sees “your alt is bad” on alt that isn't bad will stop trusting the tool.

V1 ships only when verdict accuracy clears 80% and poor precision clears 70% on this set.
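For readers who want the metrics spelled out, here is a small sketch of how verdict accuracy, per-verdict precision and recall, and the ship gate could be computed from (human label, model verdict) pairs. The function names are hypothetical, and two choices are assumptions: "clears" is treated as greater-than-or-equal, and a run with no poor predictions is treated as failing the gate.

```python
from collections import Counter

VERDICTS = ("good", "marginal", "poor")


def evaluate(pairs):
    """pairs: list of (human_label, model_verdict) tuples."""
    total = len(pairs)
    correct = sum(1 for human, model in pairs if human == model)
    predicted = Counter(model for _, model in pairs)
    actual = Counter(human for human, _ in pairs)
    true_pos = Counter(model for human, model in pairs if human == model)

    per_verdict = {}
    for v in VERDICTS:
        # Precision: of the cases the model called v, how many really are v.
        precision = true_pos[v] / predicted[v] if predicted[v] else None
        # Recall: of the cases that really are v, how many the model found.
        recall = true_pos[v] / actual[v] if actual[v] else None
        per_verdict[v] = {"precision": precision, "recall": recall}

    accuracy = correct / total if total else 0.0
    return {"accuracy": accuracy, "per_verdict": per_verdict}


def clears_gate(metrics, min_accuracy=0.80, min_poor_precision=0.70):
    """Ship gate from the post: 80% verdict accuracy, 70% poor precision."""
    poor_precision = metrics["per_verdict"]["poor"]["precision"]
    if poor_precision is None:
        return False  # assumption: gate cannot pass with no poor predictions
    return (metrics["accuracy"] >= min_accuracy
            and poor_precision >= min_poor_precision)
```

Poor precision is the fraction of model-issued poor verdicts that a human also labeled poor, which is why it tracks false positives directly.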

If those numbers don't hold, V1 doesn't ship. You'll see a different post here explaining why.

The eval set itself ships open-source on the day this post becomes the Show HN companion piece. Audit the methodology, the ground truth, every disagreement we flagged. If you think we got a case wrong, file an issue.

Live results

V1 cleared the gates. Held-out 47-case eval, locked at v1.0.0:

  • Verdict accuracy: 90.5%. The model agrees with the human label across the held-out set. Threshold to ship was 80%.
  • poor precision: 100% — every time the model says poor, the human reviewer agreed it's actually poor. Zero false positives on this set. Threshold was 70%.
  • Mean latency: 3.3 seconds per image, end-to-end (Gemini 2.5 Flash called from Modal).
  • Cost per evaluation: $0.0004 – $0.0008, depending on image size (Gemini 2.5 Flash paid tier).
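To put the per-image figures in page-level terms, the arithmetic below scales them to a whole page. It assumes one evaluation per eligible image and sequential calls; the post does not say whether evaluations run concurrently, so the time figure is an upper-bound style estimate, and `estimate_page` is a hypothetical helper.

```python
# Per-image figures quoted above.
MEAN_LATENCY_S = 3.3
COST_LOW_USD = 0.0004
COST_HIGH_USD = 0.0008


def estimate_page(images: int) -> dict:
    """Scale per-image cost and latency to a page with `images` evaluations."""
    return {
        "cost_low_usd": round(images * COST_LOW_USD, 4),
        "cost_high_usd": round(images * COST_HIGH_USD, 4),
        # Assumes calls run one after another; concurrency would reduce this.
        "sequential_seconds": round(images * MEAN_LATENCY_S, 1),
    }


if __name__ == "__main__":
    # The 53-image page used as an example earlier in the post.
    print(estimate_page(53))
```

For a 53-image page this works out to roughly two to four cents, and a little under three minutes if every image were evaluated in sequence.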

The 100% poor precision is the load-bearing number: it's the one users would punish us for getting wrong. False positives (“your alt is bad” on alt that isn't bad) erode trust faster than false negatives, which is why we biased the prompt to under-grade rather than over-grade. We expect that bias to relax as we ship corrections from real-world feedback (every finding has a 👍/👎 button; see “What's next”).

The eval set + ground truth + per-case rationale ships open-source on the day this post becomes the Show HN companion piece. Audit the methodology. Every disagreement between model and human is flagged with reasoning. If you think we got a case wrong, file an issue.

We'll run V1 against ~100 SaaS landing pages in a follow-up — see the companion data piece when it's live.

What ships alongside the alt-text pass

The alt-text quality pass is the headline. V1 ships alongside it:

  • Hosted dashboard with scan history, trend charts, and team seats. Free tier scans persist 30 days; Developer 90 days; Team 1 year; Agency unlimited.
  • Three-tier suggested fixes: deterministic templates for mechanical rules (html-has-lang, decorative alt=""), computed values for color contrast and target-size, and model-generated starting points for the cases that need judgment (missing alt, form labels, button names), to be reviewed before applying. Full breakdown in the thesis post.
  • PDF reports on every scan (Agency tier removes the watermark) and CSV violations exports (free, no tier gating). Agency tier also exports an EAA accessibility statement: the documentary artifact European public-sector procurement asks for. Not a compliance claim. The document, not the attestation.

V1 ships with the exact scope above and nothing wider. We didn't pad the launch with features we couldn't ship to a quality bar.

What's next

We chose alt text quality first because it's the highest-frequency gap and multimodal models are good at it. Other gaps in the 43% sit at different difficulty levels. Video caption accuracy is a speech recognition problem with reasonable tooling; reading order needs visual rendering analysis; cognitive load needs multi-page judgment that we don't think any current model does well. We'll evaluate each on the same axes — can we measure it against a held-out set, can we ship if accuracy clears a threshold — and decide based on the data.

Closer in: the GitHub Action (AccessPulse/scan@v1) posts suggested fixes as inline PR review comments. CodeRabbit-cadence — review the fix where you're already reading the diff. We're shipping that with V1.

If you scan a site with AccessPulse and the alt-text-quality pass flags an image, every finding has a 👍 / 👎 button. That feedback is our highest-signal input for what to fix in the prompt and what gaps to attempt next.


axe-core for the structural checks, plus ML to read what your alt text actually says. Free scan, no signup: accesspulse.dev.

Frequently asked questions

How does AccessPulse evaluate alt text quality?

After axe-core finishes scanning a page, AccessPulse runs a second pass on each <img> element with a non-empty alt attribute. We send the image and the alt text to Gemini 2.5 Flash with a structured prompt that evaluates against five criteria: accuracy, length, usefulness for screen reader users, absence of redundant prefixes (“image of”, “photo of”), and context-appropriate use of empty alt for decorative images. The model returns one of three verdicts: good, marginal, or poor.

What model does the alt-text-quality pass use?

Gemini 2.5 Flash, called from Modal. Mean latency is 3.3 seconds per image end-to-end. Cost per evaluation is $0.0004 – $0.0008 depending on image size, using the Gemini 2.5 Flash paid tier.

Does AccessPulse catch missing alt text?

Missing alt text (an image with no alt attribute) is caught by axe-core's image-alt rule under WCAG success criterion 1.1.1, which is deterministic static analysis. The alt-text-quality pass is a second, separate layer: it checks whether existing alt text is any good. The two layer together. For AccessPulse's broader three-tier fix suggestion model, see the methodology page and the companion thesis post.

Can I see the eval set you use?

Yes. The eval set ships open-source on the day this post becomes the Show HN companion piece. It includes all 47 image-alt pairs, ground truth labels, and per-case reviewer rationale — including disagreements between reviewers. If you think we got a case wrong, file an issue.

Is alt-text-quality scoring a WCAG audit?

No. We surface findings as suggestions for a human to confirm, never as auto-applied fixes. “Likely fails screen reader users” is not “fails WCAG 1.1.1.” Conformance is a formal evaluation. ML evaluation is a signal to help you prioritize. The deliberate scope: ML where it adds something axe-core can't do, deterministic rules everywhere else.