Hypothesis: If we build a scalable way to identify known clients, we can allow exceptions to general rate limits for bots of verified origin and move towards systematic enforcement of our rules.
This will integrate with work done as part of T398161: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization, particularly around upstream rule evaluation once a request has been attributed to a known client (e.g., by matching a User-Agent / IP pair).
High-level requirements (draft)
Administrative: administrative control over known-client identities, including management operations (CRUD) and the underlying data model.
- Must support management of known-client identities (CRUD) with the following (minimum) attributes:
  - Unique known-client name (e.g., “mybotname”)
  - One or more pattern references corresponding to known-client User-Agent strings
  - One or more pattern references corresponding to X-Provenance key=value pairs that identify known-client src IP ranges
  - Administrative state (enable | disable) for both known-client attribution and impersonation enforcement
  - Administrative comment (internal, free-form text)
- Must support management of known-client-associated IP block ingestion sources with the following (minimum) attributes:
  - Unique known-client IP block name, i.e., the ipblock to write to (e.g., known-clients/mybot_ipblock)
  - Ingestion source URL
  - Ingestion source format
  - Administrative comment (internal, free-form text)
- Should offer on-demand IP block ingestion (e.g., at creation), so that it is impossible to commit an identity (e.g., one with impersonation enforcement enabled) that references an unusable known-client IP block.
- Should separate “static” ipblock objects from “managed” (ingested) ipblock objects (i.e., it should be impossible to define an ingestion source that clobbers an existing static ipblock).
- May provide policy enforcement to ensure the ingestion source domain is controlled by the client (e.g., it matches the contact info in the User-Agent).
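The administrative requirements above could be sketched roughly as follows. This is a minimal illustration, not a proposed implementation: the class names (`IngestionSource`, `KnownClient`, `Registry`), the `ipblocks` attribute linking an identity to its IP blocks, and the `fetch` callable standing in for on-demand ingestion are all assumptions made for the example, and the ingestion format names are left open because the draft does not enumerate them.

```python
from dataclasses import dataclass, field


@dataclass
class IngestionSource:
    ipblock_name: str   # unique, e.g. "known-clients/mybot_ipblock"
    source_url: str
    source_format: str  # format names are not specified in the draft
    comment: str = ""


@dataclass
class KnownClient:
    name: str                       # unique, e.g. "mybotname"
    user_agent_patterns: list[str]  # one or more pattern references
    provenance_patterns: list[str]  # one or more pattern references
    ipblocks: list[str]             # hypothetical: ipblocks behind those patterns
    attribution_enabled: bool = False
    impersonation_enforced: bool = False
    comment: str = ""


@dataclass
class Registry:
    static_ipblocks: set[str]
    sources: dict[str, IngestionSource] = field(default_factory=dict)
    usable_ipblocks: set[str] = field(default_factory=set)
    clients: dict[str, KnownClient] = field(default_factory=dict)

    def add_source(self, source, fetch):
        # A managed source must never overwrite a static ipblock.
        if source.ipblock_name in self.static_ipblocks:
            raise ValueError(f"{source.ipblock_name} would clobber a static ipblock")
        if source.ipblock_name in self.sources:
            raise ValueError(f"{source.ipblock_name} already exists")
        # On-demand ingestion at creation: reject sources that yield nothing.
        if not fetch(source):
            raise ValueError(f"{source.source_url} returned no usable networks")
        self.sources[source.ipblock_name] = source
        self.usable_ipblocks.add(source.ipblock_name)

    def add_client(self, client):
        if client.name in self.clients:
            raise ValueError(f"{client.name} already exists")
        if not client.user_agent_patterns or not client.provenance_patterns:
            raise ValueError("at least one pattern reference of each kind is required")
        # Enforcement must not be committed against an unusable IP block.
        if client.impersonation_enforced:
            usable = self.static_ipblocks | self.usable_ipblocks
            missing = [b for b in client.ipblocks if b not in usable]
            if missing:
                raise ValueError(f"unusable ipblocks: {missing}")
        self.clients[client.name] = client
```

In this sketch, committing an identity with impersonation enforcement fails until its IP block has been ingested successfully at least once, which is one way to satisfy the on-demand ingestion requirement.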
Functional: rendering managed resources to HAProxy DSL, integration into HAProxy and Varnish, etc.
- Must isolate known-client HAProxy DSL and Varnish VCL from normal requestctl actions.
- Must support impersonation denial, i.e., denying requests whose User-Agent matches a known-client pattern but whose src IP does not match that client's IP ranges.
- Must provide observability into impersonation denial rates.
- Should provide impersonation denial rates on a per-known-client basis.
- Must integrate with X-Trusted-Request by applying score B (T399057) for subsequent decision making.
- Must enrich X-Provenance with additional metadata (i.e., a key-value pair) indicating that the User-Agent has matched (e.g., client=mybot_ipblock;id=mybotname).
- Must skip normal requestctl rule processing for X-Trusted-Request score B or higher (currently limited to score A) and, in the Varnish case, shunt such requests to per-{X-Provenance} x {cache-text, cache-upload} rate limits.
- May provide the capability to configure rate limits on a per-known-client basis (in which case, this will need to exist somewhere in the data model, e.g., as an attribute of the known-client identity object).
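The attribution and impersonation-denial requirements above can be illustrated with a small decision function. This is a sketch under stated assumptions, not HAProxy DSL or VCL: the `KnownClient` fields, the `classify` function, the decision labels, and the module-level `denials` counter are hypothetical, and the semicolon-separated X-Provenance layout is inferred only from the example value client=mybot_ipblock;id=mybotname. The src IP match is assumed to be already reflected in X-Provenance as a key=value pair.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownClient:
    name: str                # e.g. "mybotname"
    ua_pattern: str          # regex matching the client's User-Agent
    provenance_pair: str     # e.g. "client=mybot_ipblock" (src IP matched)
    enforce_impersonation: bool = True


# Per-known-client impersonation denial counts (observability requirement).
denials = Counter()


def classify(client, user_agent, provenance):
    """Return (decision, x_provenance, trust_score) for one request."""
    if not re.search(client.ua_pattern, user_agent):
        # User-Agent does not claim to be this client: nothing to do.
        return ("unmatched", provenance, None)
    pairs = [p for p in provenance.split(";") if p]
    if client.provenance_pair in pairs:
        # UA and src IP both match: enrich X-Provenance and apply score B.
        enriched = ";".join(pairs + [f"id={client.name}"])
        return ("attributed", enriched, "B")
    if client.enforce_impersonation:
        # UA matches but src IP does not: treat as impersonation.
        denials[client.name] += 1
        return ("denied", provenance, None)
    return ("unmatched", provenance, None)


bot = KnownClient("mybotname", r"^MyBot/", "client=mybot_ipblock")
print(classify(bot, "MyBot/1.0", "client=mybot_ipblock"))
print(classify(bot, "MyBot/1.0", ""))
```

A request classified as "attributed" would then skip normal requestctl rule processing and fall under the known-client rate limits, while the `denials` counter gives the per-known-client denial counts from which rates could be derived.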