We should see if we can re-use parts of the wikiprod WAF stack to reduce the impact of scrapers on Toolforge tools. In particular, the tools-infrastructure-team meeting pointed out that the requestctl interface for managing rules as well as the traffic (cloud) source classification data could be useful.
Description
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Invalid | Feature | None | T16720 robots.txt (tracking) |
| Resolved | | bd808 | T127206 provide a more strict robots.txt at Tool Labs |
| Open | | None | T226688 Block web crawlers from accessing Cloud Services |
| Open | | None | T409759 See if we can borrow parts of the wikiprod WAF for Toolforge |
| Open | | taavi | T410721 Stand up an etcd cluster for testing requestctl in toolsbeta |
Event Timeline
One of the big things that makes hiddenparma useful on the prod edge is the webrequest-live traffic analysis dashboard on superset.wikimedia.org. I'm sure that we can benefit from the etcd managed rule sets without that traffic visibility, but having it really makes the whole prod edge management system work.
> the requestctl interface for managing rules as well as the traffic (cloud) source classification data could be useful
This was discussed during the recent WE5+6 offsite. IIUC, @Joe thinks that requestctl and hiddenparma are too tailored to production use cases to be easily reusable, but that other parts of the production tool set, namely the Lua rules used to filter traffic in HAProxy, might be easier to reuse.
> the webrequest-live traffic analysis dashboard on superset.wikimedia.org
This was also discussed, and it should be possible to replicate the Kafka-based log ingestion in cloud to get a similar level of observability for cloud traffic. A more lightweight alternative could be GoAccess: https://goaccess.io/
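To make the "lightweight visibility" idea a bit more concrete, here is a rough sketch of the kind of summary we would want from proxy logs: request counts per client IP and per user agent. This is not how webrequest-live or GoAccess work internally. It assumes the front proxy writes access logs in the common "combined" format, and the `summarize` function, the regex, and the sample lines are all hypothetical names and data made up for illustration.

```python
import re
from collections import Counter

# Assumption: access log lines use the common "combined" log format.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize(lines, top=5):
    """Count requests per client IP and per user agent (hypothetical helper)."""
    by_ip = Counter()
    by_agent = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if match is None:
            continue  # skip lines that do not look like combined-format entries
        by_ip[match.group("ip")] += 1
        by_agent[match.group("agent")] += 1
    return {
        "top_ips": by_ip.most_common(top),
        "top_agents": by_agent.most_common(top),
    }


# Made-up sample data using documentation-reserved IP ranges.
SAMPLE = [
    '203.0.113.7 - - [10/Nov/2025:12:00:00 +0000] "GET /tool/a HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
    '203.0.113.7 - - [10/Nov/2025:12:00:01 +0000] "GET /tool/b HTTP/1.1" 200 128 "-" "ExampleBot/1.0"',
    '198.51.100.2 - - [10/Nov/2025:12:00:02 +0000] "GET /tool/a HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    'this line is not a valid log entry',
]

if __name__ == "__main__":
    for name, rows in summarize(SAMPLE).items():
        print(name, rows)
```

Something like this only gives after-the-fact counts from a log file. It would not replace a live dashboard, but it might be enough to tell whether a given tool is being hit by a small number of heavy clients.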