
FY 24/25 WE 4.3.11 Define a policy for maintenance of requestctl rules
Closed, ResolvedPublic

Description

Requestctl rules are stacking up: we currently have 101 Varnish rules, of which 68 are enabled (that is, actively banning traffic), and a further 18 are in log-only mode.

While it makes sense for some requestctl rules to stay in place long-term, I suspect most of them can be disabled. This is why we want to define a policy and also enforce it.

My initial idea for the policy is:

  • Every month the DDoS response WG reviews new rules and decides which, if any, should become permanent
  • Unless a rule is marked as permanent, one month after being enabled it will be set to log-matching only
  • If a rule stays in log-matching-only mode for two months, it will be disabled completely
  • Rules that have been disabled for 6 months are removed (their git history is still preserved)

We can of course do a first pass over the rules now, and write a script that, based on the git history of the conftool2git dump, performs the remaining housekeeping automatically.
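The lifecycle described in the bullets above can be sketched as a pure state-transition function. This is a minimal illustration of the proposed policy, not the actual requestctl implementation; the state names, the `permanent` flag, and the day thresholds (30/60/180) are assumptions taken from the bullet points.

```python
from datetime import datetime, timedelta
from enum import Enum

class State(Enum):
    ENABLED = "enabled"
    LOG_ONLY = "log_only"
    DISABLED = "disabled"
    REMOVED = "removed"

def next_state(state: State, permanent: bool,
               last_change: datetime, now: datetime) -> State:
    """Apply the proposed lifecycle policy to a single rule.

    `last_change` is when the rule entered its current state,
    recoverable from the git history of the conftool2git dump.
    """
    if permanent:
        return state  # permanent rules are exempt from the policy
    age = now - last_change
    if state is State.ENABLED and age >= timedelta(days=30):
        return State.LOG_ONLY    # one month enabled -> log-matching only
    if state is State.LOG_ONLY and age >= timedelta(days=60):
        return State.DISABLED    # two months log-only -> disabled
    if state is State.DISABLED and age >= timedelta(days=180):
        return State.REMOVED     # six months disabled -> removed
    return state
```

A housekeeping script would run this over every rule and act on the transitions it returns.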

Details

Related Changes in GitLab:
  • Add support for action metadatas, and cleanup (repos/sre/hiddenparma!42, oblivian, cleanup_support → main)
  • requestctl: add cleanup function to disable/stop logging/remove old untouched rules (repos/sre/conftool!70, oblivian, action_metadata → main)

Event Timeline

Joe triaged this task as High priority.May 7 2025, 9:44 AM

I've decided to implement a command in requestctl to enforce the above rules. It's part of a larger MR that will cause a schema change in production.

oblivian updated https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/70

requestctl: add cleanup function to disable/stop logging/remove old untouched rules

oblivian merged https://gitlab.wikimedia.org/repos/sre/conftool/-/merge_requests/70

requestctl: add cleanup function to disable/stop logging/remove old untouched rules

Mentioned in SAL (#wikimedia-operations) [2025-05-14T08:06:22Z] <oblivian@cumin2002> START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "T393381 - oblivian@cumin2002"

Mentioned in SAL (#wikimedia-operations) [2025-05-14T08:06:26Z] <oblivian@cumin2002> START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: T393381 - oblivian@cumin2002

Mentioned in SAL (#wikimedia-operations) [2025-05-14T08:06:59Z] <oblivian@cumin2002> END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: T393381 - oblivian@cumin2002

Mentioned in SAL (#wikimedia-operations) [2025-05-14T08:07:02Z] <oblivian@cumin2002> END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "T393381 - oblivian@cumin2002"

I've come to the realization that part of the logic should also inform the priority of rules, that decides the order in which they're processed.

I came up with the following:

Prio    Name         Description
0-9     MOAT         A rule declared in an emergency situation that we want to be short-term AND executed early
10-19   DEPRECATION  A rule meant to be kept long-term that deprecates some URL. These might eventually be moved to the public repository
20-29   GENERAL      Rules that broadly limit what can be done, for instance throttling of public clouds, rules that apply to all users without a specific cookie, impersonation of a known bot, etc. These are meant to be kept long-term and will NOT be moved to a public repository
30-49   GENERIC      Rules that apply to generic patterns, like the use of randomized query strings or headers. These rules are meant to be kept and will NOT be moved to a public repository
50-69   SPECIFIC     Rules that apply to specific usage of our sites and are meant to be kept long-term, such as rules about a specific query parameter that needs to be rate-limited from clouds
70-89   HI-PRIORITY  Despite the relatively high priority, these are ephemeral rules for specific abusive clients or traffic patterns that should NOT be kept long-term
90-100  DEFAULT      All ephemeral rules that don't match one of the above groups go here. They're not supposed to be kept long-term

I think overall the idea of priorities is sound, although I have some questions about the finer points.

  • We need better names for general and generic (or at least names much less similar to each other)
    • Is the main difference that generic is allowed to inspect the URL and that general isn't?
  • hi-priority being towards the end of the list is confusing

I also expect that most first responders will put rules right in MOAT... and that's probably okay? It could maybe be primarily the WG's responsibility to move rules out of there?

> I think overall the idea of priorities is sound, although I have some questions about the finer points.
>
>   • We need better names for general and generic (or at least names much less similar to each other)
>     • Is the main difference that generic is allowed to inspect the URL and that general isn't?

We can codify something in that direction. I agree. Also, +1 to better names. Do you have any?

>   • hi-priority being towards the end of the list is confusing

I'm open to better naming :)

> I also expect that most first responders will put rules right in MOAT... and that's probably okay? It could maybe be primarily the WG's responsibility to move rules out of there?

I expect most rules to be created with the default priority of 100, actually, unless we change that. But yes, it would make sense to create emergency rules at MOAT priority and move them to default/high priority afterwards.

Prio Range  Label           Purpose
0–9         EMERGENCY       Urgent, high-impact rules to immediately mitigate active abuse or incidents. Must be reviewed regularly and removed quickly.
10–29       EPHEMERAL       Short-lived rules to mitigate temporary abuse, bad bots, scrapers, etc. May be specific or general, but not intended to be kept long-term.
30–59       PERMANENT       Long-term rules that reflect stable policy decisions, such as rate-limiting public clouds, enforcing well-behaved agent headers, or deprecating old API paths.
60–79       SPECIFIC        Long-term rules targeting specific behaviors or parameters, like abuse of a particular query string or a user agent misbehaving from a known source. Often complex but stable.
80–99       LOW-PRECEDENCE  Catch-all rules applied only when no higher-priority rules match. Good for general sanitation, edge-case rate limits, or low-confidence heuristics.
100         DEFAULT         Rules not explicitly prioritized fall here. Used primarily as fallback logic.

I've tried to use clearer names (MOAT was tricky, at least for me). We should make it very clear in the interface that rules follow the priority per layer, so rules on HAProxy will be evaluated before rules on Varnish.


Mostly agree; it would simplify things even more to join PERMANENT and SPECIFIC, which look pretty similar to me.

> Mostly agree, would simplify even more joining PERMANENT and SPECIFIC that looks pretty similar to me

Yes, you could argue they are the same thing, and the lower priorities in PERMANENT are exactly the same as SPECIFIC.

Prio Range  Label           Purpose
0–9         EMERGENCY       Urgent, high-impact rules to immediately mitigate active abuse or incidents. Must be reviewed regularly and removed quickly.
10–29       EPHEMERAL       Short-lived rules to mitigate temporary abuse, bad bots, scrapers, etc. May be specific or general, but not intended to be kept long-term.
30–69       PERMANENT       Long-term rules that reflect stable policy decisions, such as rate-limiting public clouds, enforcing well-behaved agent headers, or deprecating old API paths.
70–89       LOW-PRECEDENCE  Catch-all rules applied only when no higher-priority rules match. Good for general sanitation, edge-case rate limits, or low-confidence heuristics.
90–100      DEFAULT         Rules not explicitly prioritized fall here. Used primarily as fallback logic.
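The band layout above is just a lookup from a numeric priority to its label. A minimal sketch, assuming the ranges from this latest revision of the table (the `band_for` helper is hypothetical, not part of requestctl):

```python
# Bands per the table above; ranges are inclusive on both ends.
BANDS = [
    (0, 9, "EMERGENCY"),
    (10, 29, "EPHEMERAL"),
    (30, 69, "PERMANENT"),
    (70, 89, "LOW-PRECEDENCE"),
    (90, 100, "DEFAULT"),
]

def band_for(priority: int) -> str:
    """Return the label of the band a priority value falls into."""
    for low, high, label in BANDS:
        if low <= priority <= high:
            return label
    raise ValueError(f"priority {priority} outside the 0-100 range")
```

Keeping the ranges contiguous and inclusive means every value in 0-100 maps to exactly one band, which the interface could display next to the raw number.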

I would also propose, as I did in the meeting, to clarify priorities between HAProxy and Varnish rules, so that users are very aware that even the highest-priority Varnish rule will always be placed below HAProxy ones, by using (e.g.) the 0-1000 range for HAProxy and 1001 and up for Varnish.

This would eventually also help when constructing a map (graphical or just logical) of rule priorities.
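The offset scheme proposed above can be sketched as a per-layer base added to the rule's own priority, so one combined sort always evaluates HAProxy rules ahead of any Varnish rule. The offsets and the `effective_priority` helper are illustrations of the proposal, not existing code:

```python
# Hypothetical per-layer offsets per the proposal:
# HAProxy occupies 0-1000, Varnish starts at 1001,
# so no Varnish rule can ever sort ahead of an HAProxy rule.
LAYER_OFFSET = {"haproxy": 0, "varnish": 1001}

def effective_priority(layer: str, priority: int) -> int:
    """Combine layer and rule priority into one global ordering key."""
    return LAYER_OFFSET[layer] + priority

rules = [("varnish", 0), ("haproxy", 100), ("haproxy", 5)]
ordered = sorted(rules, key=lambda r: effective_priority(*r))
# even priority-0 varnish rules land after every haproxy rule
```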

That effectively duplicates the priorities instead of having the same ones in both layers. I think it's easier to reason about a single set of priorities applied to two different layers.

> that effectively duplicates the priorities instead of having the same ones in both layers. I think it's easier to think about a single set of priorities applied to two different layers.

I agree FWIW

I will resolve this task because the policy is established and we did the first round of cleanups. I will still work on defining the priorities better, although I suspect this work will be superseded by the changes we're going to do soon to filtering at the CDN layer.