How URL Checking Works¶
check_broken_urls (and the other URL-based checks) can end up sending
thousands of requests across a large repository. This page explains the
pipeline that makes that practical without hammering any single host, and
why some failures are reported as warnings rather than errors.
The life of a URL¶
Extraction. Every markdown link destination is extracted per file (see How Link Detection Works) and classified as a URL if it starts with
http:///https://(protocol-relative//host/...links are normalized tohttps://).Normalization and dedupe. Before checking, each URL is normalized (scheme and host lowercased, trailing slash and fragment dropped; the query string is kept). If the same normalized URL appears in fifty files, it is only ever requested once - every other occurrence shares the same in-flight result.
Per-host queueing and pacing. Requests are grouped by host and paced with a minimum delay between two requests to the same host (
--per-host-delay), so a link-heavy file doesn’t burst a host with many simultaneous requests just because it happens to link to it repeatedly.The request itself. Each URL is checked with a
HEADrequest first; if that isn’t a success, aGETis attempted as a fallback, since some servers rejectHEADbut allowGET. A successful (2xx) response on either is treated asalive.Hard-failure retries. If neither request succeeds and there’s no rate-limit or auth signal, the attempt is retried with exponential backoff up to
--retriestimes before being reported asbroken.Rate limiting and auth errors return immediately. A
429(or any non-success response carryingRetry-After) is reported asrate_limitedright away - it is not retried in place. Likewise, a401/403response is reported asunverifiableimmediately: the server is reachable but blocking automated access, and retrying won’t change that. Retrying either case would only generate more traffic against a host that’s already unhappy with us.Host circuit breaker. When a host produces two rate-limited responses in a row, it’s treated as persistently throttled: every other URL already queued for that host is resolved as
rate_limitedwithout spending a network request on each, and the host is blocked from new requests until itsRetry-Afterwindow elapses.Backpressure. The number of URL checks in flight at once is bounded (
max_pendingin the underlying configuration), so checking a huge repository doesn’t grow memory unboundedly - submitting more work simply waits for a slot to free up.
Why warnings don’t fail the build¶
rate_limited, unverifiable, and transient_error all mean the
checker couldn’t determine whether the link is actually broken - the host
is either throttling, blocking bots, or momentarily unreachable. Treating
these the same as a confirmed 404 would make CI fail on
infrastructure noise rather than real broken links, so they’re reported as
warnings instead: visible, but never blocking. Only broken (a
confirmed non-2xx response with no rate-limit/auth signal) is an
error. See Exit Codes and Issue Severity for the full status table.
Overlapping file processing¶
Separately from URL-level concurrency, files themselves are processed through an adaptive window: a file’s links are submitted for checking as soon as they’re extracted, without waiting for earlier files’ checks to finish first. This overlaps link extraction and I/O wait time across files, while still bounding how many files can be “in flight” simultaneously so memory use stays predictable regardless of how many links a single file happens to contain.