How AccessPulse works

The scan pipeline

When you submit a URL, here is exactly what happens:

  1. Browser launch. We spin up an isolated Chromium instance in a sandboxed container. No state is shared between scans.
  2. Page load. Chromium navigates to your URL and waits for DOMContentLoaded, then pauses for 2 seconds to let client-side JavaScript render. This matters for SPAs — a static HTML scraper would miss dynamically rendered content.
  3. axe-core injection. We inject axe-core v4.10, the open-source accessibility testing engine created and maintained by Deque Systems. axe-core is used by Google Lighthouse, Microsoft Accessibility Insights, and most enterprise accessibility platforms.
  4. Rule execution. axe-core runs its full ruleset against the rendered DOM. We test against WCAG 2.0 A/AA, WCAG 2.1 A/AA, WCAG 2.2 AA, and additional best practices — over 90 rules covering color contrast, ARIA usage, form labels, heading structure, keyboard navigation, and more.
  5. Results processing. We parse axe-core's output, compute a weighted accessibility score, and structure the violations by severity (critical, serious, moderate, minor).
  6. Cleanup. The browser instance is destroyed. We do not store page content, screenshots, or cookies — only the accessibility report.
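The results-processing step can be sketched in a few lines. This is a minimal illustration, not AccessPulse's actual code: it assumes the standard axe-core results shape (a `violations` array whose entries carry `id` and `impact`), and the function name `group_by_severity` and the sample data are hypothetical.

```python
# Severity order used in the report, most severe first.
SEVERITY_ORDER = ["critical", "serious", "moderate", "minor"]


def group_by_severity(axe_results: dict) -> dict:
    """Group axe-core violation rule IDs by their `impact` value."""
    grouped = {level: [] for level in SEVERITY_ORDER}
    for violation in axe_results.get("violations", []):
        # Assumption: a violation with no impact value is treated as minor.
        impact = violation.get("impact") or "minor"
        grouped.setdefault(impact, []).append(violation["id"])
    return grouped


# Hand-written sample in the shape axe-core returns (not real scan output).
sample = {
    "violations": [
        {"id": "color-contrast", "impact": "serious", "nodes": [{}, {}]},
        {"id": "label", "impact": "critical", "nodes": [{}]},
        {"id": "heading-order", "impact": "moderate", "nodes": [{}]},
    ]
}
print(group_by_severity(sample))
```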

What we test

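AccessPulse evaluates your page against WCAG 2.2 Level AA success criteria that can be verified programmatically. This includes the rule categories named in the scan pipeline above: color contrast, ARIA usage, form labels, heading structure, and keyboard navigation.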

What we can't test

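Automated testing has real limits. According to Deque's own research, automated tools catch approximately 57% of WCAG violations. The remaining 43% require human judgment, for example reading order, cognitive load, video caption accuracy, and custom widget keyboard interaction.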

We are transparent about this because the alternative, claiming full compliance from automated scans, is how overlay vendors got fined by the FTC. AccessPulse runs axe-core 4.10 against WCAG 2.2 Level AA on every scan and layers two capabilities on top: (1) ML evaluation of existing alt text quality, and (2) suggested fixes for common WCAG violations. The suggested fixes range from deterministic templates (adding a missing lang attribute), through heuristic computation (the nearest WCAG-passing color contrast), to conditionally shipped ML-generated content (a starting-point alt for an image missing alt text). Together these address several specific gaps inside the 43%, but not the 43% in full. For the rest, we recommend pairing automated scans with periodic manual audits.

Alt text quality scoring

axe-core checks WCAG success criterion 1.1.1 by verifying that an alt attribute is present. It can't see the image, so it can't verify that the alt text actually describes what's on screen. alt="image", alt="DSC_4823.jpg", and a marketing tagline on a photo of a cat all pass axe-core. They all fail real screen reader users.

After axe-core finishes, AccessPulse runs a second pass on each image with a non-empty alt attribute that we can fetch (skipping inline data URLs and unsupported formats; we handle PNG, JPEG, WebP, HEIC, HEIF). We send the image and its alt to a multimodal model (Gemini 2.5 Flash) with a prompt that evaluates against five criteria: accuracy, appropriate length, usefulness for screen reader users, absence of redundant prefixes ("image of", "photo of"), and context-appropriate use of empty alt for decorative images.
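The eligibility check for the second pass can be sketched as a small filter. This is an illustration under stated assumptions: the function name is hypothetical, and it detects the format from the URL's file extension, whereas a production check might inspect the response's Content-Type or the image bytes instead.

```python
from typing import Optional
from urllib.parse import urlparse

# Extensions for the formats listed above: PNG, JPEG, WebP, HEIC, HEIF.
SUPPORTED_EXTENSIONS = (".png", ".jpg", ".jpeg", ".webp", ".heic", ".heif")


def is_eligible_for_alt_review(src: str, alt: Optional[str]) -> bool:
    """Return True if an image should be sent to the alt-quality pass."""
    if alt is None or not alt.strip():
        return False  # missing or empty alt is not a quality question
    if src.startswith("data:"):
        return False  # inline data URLs are skipped
    # Assumption: judge the format by the path extension, ignoring the query string.
    path = urlparse(src).path.lower()
    return path.endswith(SUPPORTED_EXTENSIONS)


print(is_eligible_for_alt_review("https://example.com/cat.jpg", "A tabby cat"))
print(is_eligible_for_alt_review("data:image/png;base64,AAAA", "logo"))
```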

The model returns one of three verdicts: good, marginal, or poor.

Honest limits. ML evaluation is not a WCAG audit. We surface findings as suggestions for a human to confirm, never as auto-applied fixes. A portion of suggested replacements will be plausible-but-suboptimal — the model can describe what's visible but can't see the surrounding page context the author has in mind. The 👍/👎 feedback on each finding tunes future iterations.

How we measure accuracy. We evaluate the model against a held-out test set of 47 image-alt pairs drawn from WebAIM canonical examples, Wikimedia Commons, NOAA scientific charts, and constructed edge cases. Ground truth is assigned by human review with per-case rationale, including disagreements between reviewers. We track three things: verdict-level accuracy (does the model agree with the human label), "poor" precision (when the model says poor, is it really poor; false positives erode trust faster than false negatives), and per-verdict precision and recall. V1 cleared the gates at lock (v1.0.0): 90.5% verdict accuracy, 100% "poor" precision (zero false positives on the held-out set), 3.3s mean latency per image, and $0.0004–$0.0008 per evaluation (Gemini 2.5 Flash paid tier). The thresholds to ship were 80% accuracy and 70% "poor" precision. The eval set is published as open source at our public launch: every case, ground-truth label, and reviewer rationale, including disagreements.
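For readers who want the metric definitions spelled out, here is a minimal sketch. The function names and the five toy pairs are made up for illustration; they are not drawn from the real eval set and do not reproduce the reported numbers.

```python
ACCURACY_THRESHOLD = 0.80        # ship threshold stated above
POOR_PRECISION_THRESHOLD = 0.70  # ship threshold stated above


def verdict_accuracy(pairs):
    """Share of cases where the model verdict equals the human label."""
    return sum(1 for model, human in pairs if model == human) / len(pairs)


def precision(pairs, verdict):
    """Of the cases the model labelled `verdict`, the share humans agree with."""
    predicted = [human for model, human in pairs if model == verdict]
    if not predicted:
        return None  # the model never produced this verdict
    return sum(1 for human in predicted if human == verdict) / len(predicted)


def recall(pairs, verdict):
    """Of the cases humans labelled `verdict`, the share the model found."""
    actual = [model for model, human in pairs if human == verdict]
    if not actual:
        return None  # no ground-truth cases with this verdict
    return sum(1 for model in actual if model == verdict) / len(actual)


# Toy (model_verdict, human_label) pairs, invented for this example.
toy_pairs = [
    ("good", "good"),
    ("poor", "poor"),
    ("marginal", "good"),
    ("poor", "poor"),
    ("good", "good"),
]

accuracy = verdict_accuracy(toy_pairs)
poor_precision = precision(toy_pairs, "poor")
passes_gate = (
    accuracy >= ACCURACY_THRESHOLD and poor_precision >= POOR_PRECISION_THRESHOLD
)
print(accuracy, poor_precision, passes_gate)
```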

Free-tier scans analyze the first 5 images on a page. Paid tiers analyze more (Developer: 100, Team: 500, Agency: 2,000) — these caps are subject to revision after our May 5 cost analysis with fix suggestions factored in. Verdicts are cached for 30 days keyed on the image, alt text, and prompt version — so re-scanning the same site weekly costs nothing on unchanged images.
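One way to build such a cache key is to hash the three inputs together. This is a sketch, not the production scheme: it assumes "the image" means the raw image bytes (it could equally be a URL or a perceptual hash), and the function names are hypothetical.

```python
import hashlib
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(days=30)  # verdicts are cached for 30 days


def verdict_cache_key(image_bytes: bytes, alt_text: str, prompt_version: str) -> str:
    """Derive a stable key from the image, its alt text, and the prompt version."""
    digest = hashlib.sha256()
    for part in (image_bytes, alt_text.encode("utf-8"), prompt_version.encode("utf-8")):
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


def is_fresh(cached_at: datetime, now: datetime) -> bool:
    """True while a cached verdict is still inside the 30-day window."""
    return now - cached_at < CACHE_TTL


key = verdict_cache_key(b"\x89PNG...", "A tabby cat on a sofa", "v1.0.0")
print(key[:16], is_fresh(datetime.now(timezone.utc), datetime.now(timezone.utc)))
```

Changing any one of the three inputs, including a prompt version bump, produces a new key, so stale verdicts from an older prompt are never reused.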

Suggested fixes

Finding a violation is half the work. The other half is fixing it — and for most of axe-core's rules, the fix is bounded enough that we can suggest it. AccessPulse pairs each violation with one of three kinds of suggestion, depending on how the right answer is computed.

Tier 1 — Deterministic template fixes

Some axe-core rules have exactly one right answer. A <button> without a type attribute defaults to type="submit" inside a form, which is almost never what you want. The fix is type="button". An <html> element missing the lang attribute fails WCAG success criterion 3.1.1. The fix is lang="en" (or the document's actual language).

For these rule types we ship deterministic templates: a single string-template fix per rule ID, no model in the loop, no accuracy bar to clear. Examples include button-name (when icon context permits a deterministic label), html-has-lang, html-lang-valid, meta-viewport, decorative image-alt (where the image is recognizably decorative and the suggested fix is empty alt=""), and similar. The exact rule coverage is documented alongside the eval set.

Tier 1 fixes are deterministic by construction — they're the rule's definition expressed as a fix. We surface them as copy-pasteable code suggestions next to each violation.
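A per-rule template lookup might look like the sketch below. The rule IDs are real axe-core rule IDs, but the table contents, the template strings, and the `tier1_fix` helper are hypothetical and only illustrate the "one string template per rule ID" idea.

```python
from string import Template
from typing import Optional

# Hypothetical template table, keyed by axe-core rule ID.
TIER1_TEMPLATES = {
    "html-has-lang": Template('<html lang="$lang">'),
    # Only applicable when the image has already been judged decorative.
    "image-alt": Template('<img src="$src" alt="">'),
}


def tier1_fix(rule_id: str, **context: str) -> Optional[str]:
    """Return the copy-pasteable fix for a rule, or None if no template exists."""
    template = TIER1_TEMPLATES.get(rule_id)
    if template is None:
        return None
    return template.substitute(**context)


print(tier1_fix("html-has-lang", lang="en"))
print(tier1_fix("color-contrast"))  # no template: this rule needs a computed fix
```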

Tier 2 — Computed fixes

Some axe-core rules need a specific computed value, not a template. color-contrast fails when the foreground and background colors don't hit WCAG 1.4.3 (4.5:1 for normal text, 3:1 for large text). The fix is a color that does hit the ratio, ideally one close to the existing color so the design intent isn't broken.

For these rules AccessPulse runs a small algorithm: pick the nearest passing color in HSL space from your existing CSS palette, or fall back to a darker/lighter shade of the offending color. Same approach for target-size (WCAG 2.5.8) — the fix is a specific minimum dimension (24×24px) computed from the existing element's bounding box.
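The fallback branch (a darker or lighter shade of the offending color) can be sketched with the WCAG relative-luminance and contrast-ratio formulas. This is an assumption-laden illustration, not the shipped algorithm: it skips the palette search entirely, walks lightness in fixed steps, and the function names are hypothetical.

```python
import colorsys


def _linear(channel: float) -> float:
    """Linearize one sRGB channel in [0, 1], per the WCAG 2.x definition."""
    if channel <= 0.03928:
        return channel / 12.92
    return ((channel + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    r, g, b = (_linear(c / 255) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)


def nearest_passing_shade(fg, bg, target=4.5, steps=100):
    """Move the foreground's HSL lightness outward until the ratio passes.

    Hue and saturation are kept, so the result stays close to the original.
    Returns None if no shade of this hue reaches the target.
    """
    h, l, s = colorsys.rgb_to_hls(*(c / 255 for c in fg))
    for i in range(steps + 1):
        delta = i / steps
        for candidate_l in (l - delta, l + delta):  # try darker, then lighter
            if 0.0 <= candidate_l <= 1.0:
                rgb = colorsys.hls_to_rgb(h, candidate_l, s)
                candidate = tuple(round(c * 255) for c in rgb)
                if contrast_ratio(candidate, bg) >= target:
                    return candidate
    return None


grey, white = (150, 150, 150), (255, 255, 255)
print(contrast_ratio(grey, white), nearest_passing_shade(grey, white))
```

The candidate is checked after rounding to 8-bit channels, so the returned color passes as it would actually be written in CSS.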

Tier 2 fixes are bounded computations. They're right by construction for the math (a 4.5:1 ratio is a 4.5:1 ratio) but they may not match your design system perfectly. We surface the computed value with the original side-by-side so you can decide whether to accept it or tweak.

Tier 3 — ML-generated suggestions

Some axe-core rules can't be fixed without judgment about content the model can see but a deterministic algorithm can't. A button with only an icon and no accessible name needs a name that describes what the button does — not a generic "button". A form input without a label needs a label drawn from the input's context. A content image without alt text needs alt text that actually describes what's in the image.

When axe-core flags one of three specific violations (a content image without alt text, a form input without a label, or a button without an accessible name), AccessPulse calls Gemini 2.5 Flash with task-specific context. For missing alt, the model receives image bytes and never page-wide HTML, because the image is the context. For missing form labels, it reads the input's attributes plus DOM-local text within roughly three hops. For unnamed buttons, it reads existing ARIA, the contents of any child icon, and adjacent text. In V1, Tier 3 covers exactly those three task types; others (link-name content, ARIA descriptions, table captions) are V2 work.

We never auto-apply Tier 3 suggestions. Model-generated content is a starting point for a human to confirm, not a final answer. The model can describe what's visible and inferable from local context, but it can't read the surrounding page copy or the design intent the author has in mind. Every suggestion ships with a copy button, the model's one-sentence rationale, and a "review before applying" disclaimer.

Tier 3 ships in V1 only if the model passes a binary workable-starting-point bar across a held-out eval set. Each generated suggestion is rated by a second reviewer against the criterion "would a developer accept this with at most minor edits (typo, brevity, single word swap)?" Edits that change the meaning count as a No. The bar is a second-reviewer Yes-rate of at least 70% across the held-out set on July 1. Below that bar, the violation is shown without a Tier 3 suggestion: the user gets axe-core's finding plus Tier 1 and Tier 2 fixes, which ship regardless.
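The ship gate itself is simple arithmetic. A minimal sketch, with hypothetical names and made-up ratings:

```python
WORKABLE_THRESHOLD = 0.70  # minimum second-reviewer Yes-rate


def workable_rate(ratings):
    """Share of suggestions the second reviewer rated Yes (True)."""
    if not ratings:
        raise ValueError("the eval set must contain at least one rating")
    return sum(1 for rating in ratings if rating) / len(ratings)


def tier3_ships(ratings) -> bool:
    return workable_rate(ratings) >= WORKABLE_THRESHOLD


# Invented ratings: True = "acceptable with at most minor edits".
ratings = [True, True, True, False, True, True, False, True, True, False]
print(workable_rate(ratings), tier3_ships(ratings))
```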

Tier 3 is non-blocking by design in two ways. The model can decline a specific suggestion by returning no string (a low-confidence refusal) — AccessPulse surfaces axe-core's finding without a Tier 3 starting point, and the rest of the scan is unaffected. If the Tier 3 service itself is unavailable (API outage, rate limit exhausted past retry budget), the entire Tier 3 pass is skipped for that scan and the user sees axe-core's findings plus Tier 1 and Tier 2 fixes — never an error, never a blocked report.
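Both non-blocking paths can be expressed in one small function. This is a sketch under assumptions: `Tier3Unavailable`, `attach_tier3`, and the `suggest` callable are hypothetical names, and `suggest` stands in for the model call (returning None models a low-confidence refusal).

```python
from typing import Callable, List, Optional


class Tier3Unavailable(Exception):
    """Hypothetical error for an API outage or an exhausted retry budget."""


def attach_tier3(
    violations: List[dict],
    suggest: Callable[[dict], Optional[str]],
) -> List[dict]:
    """Add Tier 3 suggestions where available without ever failing the report."""
    report = [dict(v) for v in violations]  # copy so the input is untouched
    try:
        # Collect everything first: if the service fails partway through,
        # the whole Tier 3 pass is skipped for this scan.
        suggestions = [suggest(v) for v in violations]
    except Tier3Unavailable:
        return report
    for entry, suggestion in zip(report, suggestions):
        if suggestion:  # None or "" means the model declined this one
            entry["tier3_suggestion"] = suggestion
    return report


def declining_model(violation: dict) -> Optional[str]:
    # Stand-in model: answers for image-alt, declines everything else.
    return "A tabby cat on a sofa" if violation["id"] == "image-alt" else None


print(attach_tier3([{"id": "image-alt"}, {"id": "label"}], declining_model))
```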

We publish two eval sets at HN launch. alt-text-v1.json covers the alt-text-quality classifier (existing alt judged against good / marginal / poor verdicts). fix-generation-v1.json covers Tier 3 generation (new alt / label / button text judged against the workable-starting-point criterion). Different tasks, different ground-truth methodologies, different bars — ≥80% verdict accuracy for the classifier, ≥70% workable rate for generation. Both publish open-source same day, with every case, label, and reviewer rationale.

Honest scope

We don't generate fixes for everything. The 43% of WCAG issues axe-core misses (reading order, cognitive load, video caption accuracy, custom widget keyboard interaction) still require human review or, in specific cases, future ML work. AccessPulse covers the alt-text-quality slice with the scoring described above and the suggested-fix slice for the rules where the right answer is computable, bounded, or generative-with-eval. Manual audits remain the right tool for the rest.

Scoring methodology

The AccessPulse score (0–100) is computed as a weighted ratio of passing checks to total checks:

score = passes / (passes + weighted_violations) * 100

Severity weights:
  critical  = 10x (e.g., missing form labels, no alt text)
  serious   = 5x  (e.g., low color contrast)
  moderate  = 2x  (e.g., heading order skipped)
  minor     = 1x  (e.g., redundant ARIA role)

This means a single critical violation (like an unlabeled form on a login page) impacts the score more than several minor issues. We weight this way because critical violations directly block users from completing tasks.
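A worked example of the formula, as a sketch. Two points are assumptions rather than documented behavior: that `weighted_violations` is the sum of weight times violation count per severity, and that a page with no checks at all scores 100. The function name is hypothetical.

```python
SEVERITY_WEIGHTS = {"critical": 10, "serious": 5, "moderate": 2, "minor": 1}


def accesspulse_score(passes: int, violations: dict) -> float:
    """score = passes / (passes + weighted_violations) * 100"""
    # Assumption: each violation counts once, multiplied by its severity weight.
    weighted = sum(SEVERITY_WEIGHTS[sev] * count for sev, count in violations.items())
    total = passes + weighted
    if total == 0:
        return 100.0  # assumption: nothing to check means nothing failed
    return passes / total * 100


one_critical = accesspulse_score(40, {"critical": 1})  # 40 / (40 + 10) * 100
five_minor = accesspulse_score(40, {"minor": 5})       # 40 / (40 + 5) * 100
print(round(one_critical, 1), round(five_minor, 1))
```

With 40 passing checks, one critical violation lowers the score further than five minor ones, which matches the weighting rationale above.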

Infrastructure

Open source foundation

AccessPulse is built on axe-core (MPL-2.0 license), the same engine used by Google, Microsoft, and the US government for accessibility testing. We chose axe-core over alternatives like HTML_CodeSniffer (used by Pa11y) or WAVE because of its active maintenance, broad rule coverage, and industry adoption. AccessPulse is not affiliated with or endorsed by Deque Systems.