Use poolcounter to refuse service to users who are making too many requests at a time (DoS attacks).
Description
Details
Subject | Repo | Branch | Lines +/-
---|---|---|---
ores: add PoolCounter nodes settings | operations/puppet | production | +14 -1
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Ladsgroup | T148997 Implement parallel connection limit for querying ORES
Resolved | | Ladsgroup | T137962 [Spec] Tracking and blocking specific IP/user-agent combinations
Resolved | | Ladsgroup | T160692 Use poolcounter to limit number of connections to ores uwsgi
Resolved | | Ladsgroup | T201823 Implement PoolCounter support in ORES
Resolved | | akosiaris | T203465 Site: 4 VM request for ORES poolcounter
Resolved | | akosiaris | T201824 Spin up a new poolcounter node for ores
Resolved | | Ladsgroup | T201825 Test poolcounter support for ores in beta cluster
Resolved | | Ladsgroup | T201826 Implement support for whitelisting and proxy requests for poolcounter in ORES
Declined | | Ladsgroup | T204897 Add Wiki Education Dashboard and Programs & Events Dashboard to ORES connection whitelist
Event Timeline
@akosiaris, we're considering using Redis to track open connections to ORES so we can more effectively do rate limiting. What do you think?
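For illustration, a minimal sketch of what the Redis idea could look like with redis-py; the key scheme, the limit, and the helper names here are hypothetical, not ORES code:

```python
# Hypothetical sketch: track open connections per client in Redis and refuse
# service above a threshold. Names and numbers are assumptions for illustration.
from contextlib import contextmanager

import redis

r = redis.Redis(host="localhost", port=6379)
MAX_OPEN = 4       # assumed per-client concurrency limit
COUNTER_TTL = 60   # expire counters so crashed workers don't leak slots


class TooManyRequests(Exception):
    pass


@contextmanager
def track_connection(client_id):
    key = "ores:open:%s" % client_id
    open_now = r.incr(key)
    r.expire(key, COUNTER_TTL)
    try:
        if open_now > MAX_OPEN:
            raise TooManyRequests(client_id)
        yield
    finally:
        r.decr(key)


# Usage: wrap the scoring work for one request.
# with track_connection(ip_and_user_agent):
#     score_revision(...)
```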
@Halfak, overall the idea sounds OK. It would be great if we could build rate limiting at higher levels so that ORES didn't really have to care about it, but we are not there yet (nor will we be for the next 6-12 months).
As far as lock managers go, we already have poolcounter: https://wikitech.wikimedia.org/wiki/PoolCounter. And thumbor (also Python) has support for it, from what I see in https://github.com/wikimedia/operations-debs-python-thumbor-wikimedia/blob/master/wikimedia_thumbor/poolcounter/__init__.py (T151066 was the tracking task), so there isn't much need to create a library (nor is the library complicated in any way).
Now, while redis can definitely be used as a lock manager, poolcounter has the added benefit of being written for this role specifically and of actually NOT persisting the state, which can be a bonus as far as lock managers go (fewer bugs to run into, easier to ditch the state by just restarting the service, and so on). Note that we've had exactly 0 architectural/software issues with poolcounter in the last 5 years. It does have the downside of one more moving part being added to the mix, so if code distribution is important that might cause issues.
I would encourage evaluating it before committing to using redis, without of course dismissing the redis solution either. Should you end up choosing redis, I would strongly suggest setting it up in a non-persisting mode and not reusing the already present redis databases (jobqueue, cache).
I've boldly changed the title and description because poolcounter seems to get a strong recommendation. I'll look into that.
Just found out that poolcounter was developed in-house and has some pretty poor documentation. It might not even be in version control. @akosiaris, are you sure we should invest in this?
Aha! Looks like it exists here: https://github.com/wikimedia/mediawiki-extensions-PoolCounter/tree/master/daemon
Where would we run the daemon? Maybe we could use one of the redis nodes. What is the memory performance of poolcounter like? Do you think running it next to our two redis daemons would be a good idea?
I think PoolCounter is reasonably well documented at https://wikitech.wikimedia.org/wiki/PoolCounter - is there something specific you see as missing?
Yeah, the link to the server code was broken (just fixed it). No description of how to build and invoke the daemon (I'd have to dig through puppet code). No discussion of memory performance. No discussion of how to configure it if not running from mediawiki code. Incomplete description of the message passing protocol (but that can be reverse engineered from something like https://github.com/wikimedia/operations-debs-python-thumbor-wikimedia/blob/master/wikimedia_thumbor/poolcounter/__init__.py). It seems this was designed for a very specific use-case and the documentation is written minimally to that. It's OK. I don't mean to blindly critique, but we should refactor *a lot* if we're going to use it for ORES and expect future devs to know what's up.
I'm comparing this documentation to using mutexes in Redis, so it's pretty hard to measure up.
I should mention that I'm also looking at this as a SOFIXIT thing. If we're going to use this, we should FIXIT the documentation.
`cd daemon && make && ./poolcounterd` (I also documented this on https://www.mediawiki.org/wiki/Extension:PoolCounter)
> No discussion of memory performance. No discussion of how to configure it if not running from mediawiki code.
There isn't anything that needs configuring from the poolcounter side if that's what you mean. Tim could probably tell you about the memory performance.
> Incomplete description of the message passing protocol (but that can be reverse engineered from something like https://github.com/wikimedia/operations-debs-python-thumbor-wikimedia/blob/master/wikimedia_thumbor/poolcounter/__init__.py). It seems this was designed for a very specific use-case and the documentation is written minimally to that. It's OK. I don't mean to blindly critique, but we should refactor *a lot* if we're going to use it for ORES and expect future devs to know what's up.
Sure, help on improving the documentation is always welcome :)
We already run 2 instances for mediawiki/search and thumbor. Those would be poolcounter1001.eqiad.wmnet and poolcounter1002.eqiad.wmnet, and despite all the traffic they are handling they are faring quite well: https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=poolcounter1001&var-network=eth0 and https://grafana.wikimedia.org/dashboard/file/server-board.json?var-server=poolcounter1002&var-network=eth0 respectively. From the same graphs you can see that the memory requirements are minimal.
I highly doubt there is a need for an ORES-dedicated poolcounter installation. I would suggest reusing the current infrastructure.
> No discussion of memory performance.
That's probably because it's so minimal that it's almost not worth talking about.
> Incomplete description of the message passing protocol (but that can be reverse engineered from something like https://github.com/wikimedia/operations-debs-python-thumbor-wikimedia/blob/master/wikimedia_thumbor/poolcounter/__init__.py).
I would argue that the protocol is adequately documented. It's extremely simple, of course, so half a page on Wikitech seems enough. But I agree it could be done better (the documentation living next to the code, for example).
> It seems this was designed for a very specific use-case and the documentation is written minimally to that. It's OK.
Yes, clearly; it was written for avoiding the Michael Jackson problem. It has, however, already been reused in other projects as well, like thumbor.
> I don't mean to blindly critique, but we should refactor *a lot* if we're going to use it for ORES and expect future devs to know what's up.
I am not sure what locking primitives we are looking at for ORES. By refactor, do you mean refactor the ORES code, or the poolcounter code? If it is the latter, it's probably worth evaluating whether it will turn out to be more effective or not.
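For reference, the protocol discussed above really is small. Here is a minimal sketch of a client, based on my reading of the wikitech page and the thumbor module; the exact command and response strings, the default port (7531), and the PoolCounterClient name are assumptions for illustration, not ORES's actual implementation:

```python
# Sketch of a minimal PoolCounter client (assumed protocol details, see lead-in).
import socket


class PoolCounterClient:
    def __init__(self, host="localhost", port=7531):
        self.conn = socket.create_connection((host, port))

    def _command(self, line):
        # Commands and responses are newline-terminated ASCII lines (assumed).
        self.conn.sendall(line.encode("utf-8"))
        return self.conn.recv(4096).decode("utf-8").strip()

    def acq4me(self, key, workers, maxqueue, timeout):
        # "Give me the lock for <key>"; expected replies include
        # LOCKED, QUEUE_FULL or TIMEOUT (assumed).
        return self._command("ACQ4ME %s %d %d %d\n" % (key, workers, maxqueue, timeout))

    def release(self):
        # Release the lock held on this connection.
        return self._command("RELEASE\n")


# Usage sketch: allow at most 2 concurrent workers per key, queue up to 10,
# wait at most 3 seconds.
# client = PoolCounterClient("poolcounter1001.eqiad.wmnet")
# if client.acq4me("ores:enwiki:damaging", 2, 10, 3) == "LOCKED":
#     try:
#         do_work()
#     finally:
#         client.release()
```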
Expanded the documentation a bit, based on what I could make out of the source. If someone who understands PoolCounter better wants to improve it, an explanation of exactly what LOCK_HELD does would be nice; also a better explanation of the difference between ACQ4ANY and ACQ4ME and the exact logic for wakeup. My guess is that ACQ4ANY locks send a DONE to everyone who is waiting upon successful finish, and ACQ4ME locks send a LOCKED to one process instead, but it's not clearly documented and the C code is hard to follow.
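To illustrate the guess above (a sketch only; the wakeup behaviour, the port, and the helper names are assumptions, not confirmed semantics): with ACQ4ANY, whichever waiter gets LOCKED does the expensive work and caches it, and waiters that wake up with DONE just read the cache.

```python
# Sketch of the guessed ACQ4ANY pattern: many workers want the same expensive
# result; one computes it, the rest reuse it. Protocol strings are assumed.
import socket


def acq4any(conn, key, workers, maxqueue, timeout):
    conn.sendall(("ACQ4ANY %s %d %d %d\n" % (key, workers, maxqueue, timeout)).encode())
    return conn.recv(4096).decode().strip()


def get_expensive_result(key, cache, compute):
    conn = socket.create_connection(("localhost", 7531))  # assumed poolcounterd port
    try:
        reply = acq4any(conn, key, workers=1, maxqueue=50, timeout=10)
        if reply == "LOCKED":
            # We won the lock: compute and cache; per the guess, RELEASE then
            # wakes everyone waiting on the same key with DONE.
            value = cache[key] = compute()
            conn.sendall(b"RELEASE\n")
            conn.recv(4096)
            return value
        if reply == "DONE":
            # Someone else finished the work while we waited; use the cache.
            return cache[key]
        raise RuntimeError("poolcounter: %s" % reply)
    finally:
        conn.close()
```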
Change 457422 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] ores: add PoolCounter nodes settings
Change 457422 merged by Alexandros Kosiaris:
[operations/puppet@production] ores: add PoolCounter nodes settings
> I highly doubt there is a need for an ORES-dedicated poolcounter installation. I would suggest reusing the current infrastructure.
For posterity's sake, and to justify this comment contradicting T203465: the comment and recommendation were based solely on load and did not take into account the (granted, uncalculated) chance that ores might end up interfering with mediawiki and thumbor key-wise, or all 3 services exhibiting issues in case of an infrastructure failure. The former is of course something that could be prevented, and the latter mostly as well (e.g. via distributing locks to multiple machines, already implemented in ORES), but after some discussions it did seem more prudent after all to split the infra. That decision was also driven by the fact that the cost for those 4 VMs is minimal compared to obtaining hardware for that.
Mentioned in SAL (#wikimedia-operations) [2018-09-17T20:15:47Z] <ladsgroup@deploy1001> Started deploy [ores/deploy@ae96071]: PoolCounter support: Let's get the party started (T160692)
Mentioned in SAL (#wikimedia-operations) [2018-09-17T20:44:05Z] <ladsgroup@deploy1001> Finished deploy [ores/deploy@ae96071]: PoolCounter support: Let's get the party started (T160692) (duration: 28m 19s)