
FY 25/26 WE 5.4.2: Known bots / clients
Open, High, Public

Description

Hypothesis: If we build a scalable way to identify known clients, we can allow exceptions to general rate-limits for bots of verified origin, and move towards systematic enforcement of our rules.

This will integrate with work done as part of T398161: FY 25/26 WE 5.4.3: CDN (text) filtering rationalization, particularly around upstream rule evaluation once a request has been attributed to a known client (e.g., User-Agent / IP pair).


High-level requirements (draft)

Administrative: administrative control over known client identities, including management operations (CRUD) and underlying data model.

  1. Must support management of known-client identities (CRUD) with the following (minimum) attributes:
    • Unique known-client name (e.g., “mybotname”)
    • One or more pattern references corresponding to known-client User-Agent strings
    • One or more pattern references corresponding to X-Provenance key=value pairs that identify known-client source IP ranges
    • Administrative state (enable | disable) for both known-client attribution and impersonation enforcement
    • Administrative comment (internal, free-form text)
  2. Must support management of known-client-associated IP block ingestion sources with the following (minimum) attributes:
    • Unique known-client IP block name to write to (e.g., known-clients/mybot_ipblock)
    • Ingestion source URL
    • Ingestion source format
    • Administrative comment (internal, free-form text)
  3. Should offer on-demand IP block ingestion (e.g., at creation), so it’s impossible to commit an identity (e.g., with impersonation enforcement) referencing an unusable known-client IP block.
  4. Should offer separation between “static” and “managed” ingestion ipblock objects (i.e., it should be impossible to define an ingestion source that clobbers an existing static ipblock).
  5. May provide policy enforcement to ensure the ingestion source domain is controlled by the client (e.g., it matches contact info in the User-Agent).
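
The administrative requirements above could be modeled roughly as follows. This is a minimal sketch, not a proposed implementation: the class names (`KnownClient`, `IngestionSource`, `Registry`), the field names, and the `static/example` ipblock are all hypothetical, and persistence, update, and delete operations are left out. It only illustrates the minimum attributes from items 1 and 2 and the static-versus-managed separation from item 4.

```python
from dataclasses import dataclass
from enum import Enum


class AdminState(Enum):
    ENABLE = "enable"
    DISABLE = "disable"


@dataclass
class KnownClient:
    """Hypothetical known-client identity (requirement 1)."""
    name: str                       # unique name, e.g. "mybotname"
    ua_patterns: list[str]          # references to User-Agent patterns
    provenance_patterns: list[str]  # references to X-Provenance key=value patterns
    attribution: AdminState = AdminState.DISABLE
    impersonation_enforcement: AdminState = AdminState.DISABLE
    comment: str = ""               # internal, free-form

    def __post_init__(self):
        # "One or more" pattern references of each kind are required.
        if not self.ua_patterns or not self.provenance_patterns:
            raise ValueError("need at least one pattern reference of each kind")


@dataclass
class IngestionSource:
    """Hypothetical managed IP block ingestion source (requirement 2)."""
    ipblock_name: str  # ipblock to write to, e.g. "known-clients/mybot_ipblock"
    url: str
    fmt: str           # ingestion source format
    comment: str = ""


class Registry:
    """In-memory stand-in for whatever store ends up holding these objects."""

    def __init__(self, static_ipblocks=()):
        self.static_ipblocks = set(static_ipblocks)
        self.clients = {}
        self.sources = {}

    def add_source(self, source):
        # Requirement 4: a managed source must never clobber a static ipblock.
        if source.ipblock_name in self.static_ipblocks:
            raise ValueError(f"{source.ipblock_name} is a static ipblock")
        if source.ipblock_name in self.sources:
            raise ValueError(f"{source.ipblock_name} already has a source")
        self.sources[source.ipblock_name] = source

    def add_client(self, client):
        # Requirement 1: known-client names are unique.
        if client.name in self.clients:
            raise ValueError(f"{client.name} already exists")
        self.clients[client.name] = client
```

With this shape, registering a source for `known-clients/mybot_ipblock` succeeds, while registering one that targets an ipblock listed in `static_ipblocks` raises `ValueError` before anything is written.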

Functional: rendering managed resources to HAProxy DSL, integration into HAProxy and Varnish, etc.

  1. Must isolate known-client HAProxy DSL and Varnish VCL from normal requestctl actions.
  2. Must support impersonation denial when the User-Agent pattern matches but the source IP does not.
  3. Must provide observability into impersonation denial rates.
  4. Should provide impersonation denial rates on a per-known-client basis.
  5. Must integrate with X-Trusted-Request by applying score B (T399057) for subsequent decision making.
  6. Must enrich X-Provenance with additional metadata (i.e., a key-value pair) indicating that the User-Agent has matched (e.g., client=mybot_ipblock;id=mybotname).
  7. Must skip normal requestctl rule processing for X-Trusted-Request score B or higher (currently limited to score A), and in the Varnish case, shunt to per-{X-Provenance} x {cache-text, cache-upload} rate limits.
  8. May provide the capability to configure rate limits on a per-known-client basis (in which case, this will need to exist somewhere in the data model - e.g., as an attribute of the known-client identity object).
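
The attribution and impersonation logic in items 2, 4, and 6 can be illustrated with a small decision function. This is a sketch under stated assumptions: in the requirements the source IP match is expressed through X-Provenance key=value pairs and rendered to HAProxy DSL and Varnish VCL, whereas here the networks are checked directly in Python for readability. `ClientRule`, `evaluate`, the verdict strings, and the `denials` counter are hypothetical names, and the IP ranges are reserved documentation ranges.

```python
import ipaddress
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class ClientRule:
    name: str            # known-client name, e.g. "mybotname"
    ipblock: str         # ipblock label, e.g. "mybot_ipblock"
    ua_pattern: str      # regex for the client's User-Agent
    networks: list[str]  # CIDR ranges the client is verified to use
    enforce_impersonation: bool = True


# Per-known-client denial counts (items 3 and 4); a real deployment would
# export these as metrics rather than keep them in process memory.
denials = Counter()


def evaluate(rule, user_agent, src_ip):
    """Return (verdict, provenance_metadata) for one request."""
    if not re.search(rule.ua_pattern, user_agent):
        return ("unmatched", None)

    addr = ipaddress.ip_address(src_ip)
    in_block = any(addr in ipaddress.ip_network(n) for n in rule.networks)

    if in_block:
        # Item 6: enrich X-Provenance once User-Agent and source IP agree.
        return ("known-client", f"client={rule.ipblock};id={rule.name}")

    if rule.enforce_impersonation:
        # Item 2: User-Agent matched, source IP did not.
        denials[rule.name] += 1
        return ("deny", None)

    return ("unmatched", None)
```

For a rule with `ua_pattern=r"^mybot/"` and `networks=["192.0.2.0/24"]`, a request with User-Agent `mybot/1.0` from `192.0.2.10` is attributed with `client=mybot_ipblock;id=mybotname`, while the same User-Agent from `203.0.113.5` is denied and counted against `mybotname`.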

Details

Other Assignee
Scott_French

Event Timeline

Scott_French updated Other Assignee, added: Scott_French; removed: JMeybohm.