Before raising the number of workers for ORES:
- Has anyone done an analysis of where this additional traffic comes from?
- Is logging of both the MW extension and the application adequate to understand where this additional traffic is coming from?
I would rather block the abuse or fix the logging, so that we can understand what is going on, than throw more resources at it (resources we don't really have, given that the scb cluster already has a couple of servers in the red RAM-wise).
After taking a quick look at ORES's logs: around 70% of requests come from changepropagation for "precaching".
In my opinion we should instead:
- Turn precaching off now while we weather the storm
- Make the MW extension send the client IP and UA to ORES as headers, and log those in ORES so that we can debug the origin of the issues we have.
- Having the MW extension talk to ORES directly, rather than via Varnish, would be a nice, unrelated plus.
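To make the header/logging point concrete, here is a rough sketch of what the ORES side could do. The header names `X-Client-IP` and `X-Client-User-Agent` are placeholders I made up for illustration, and `describe_origin` / `log_request` are hypothetical helpers, not existing ORES code:

```python
import logging

logger = logging.getLogger("ores.requests")

# Hypothetical header names; which headers the extension would
# actually send has not been decided.
CLIENT_IP_HEADER = "X-Client-IP"
CLIENT_UA_HEADER = "X-Client-User-Agent"


def describe_origin(headers, remote_addr):
    """Return the (ip, user_agent) pair to log for one request."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    # Prefer the forwarded values; fall back to what the connection shows.
    ip = lowered.get(CLIENT_IP_HEADER.lower(), remote_addr)
    ua = lowered.get(
        CLIENT_UA_HEADER.lower(), lowered.get("user-agent", "unknown")
    )
    return ip, ua


def log_request(headers, remote_addr, path):
    """Log the originating client for one request and return it."""
    ip, ua = describe_origin(headers, remote_addr)
    logger.info("request path=%s client_ip=%s client_ua=%s", path, ip, ua)
    return ip, ua
```

Requests that arrive without the forwarded headers (e.g. direct callers) still get logged with the connection address and their own User-Agent, so both kinds of traffic end up in the same log format.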
I'm ok with turning precaching back on when we have found and resolved the source of this enormous traffic surge we are seeing (if it's not precaching itself).
https://github.com/wikimedia/change-propagation/pull/161 will reduce the load by disabling CP on wikis where the extension is enabled (so it won't hurt, because the extension does the precaching too).
From my further analysis of logs:
- there is one API heavy hitter, whose rate of consumption didn't change significantly in the last few days
- there is, at the same time, a surge in the number registered at ores.*.score_processed.count that doesn't seem to have anything to do with that
Until we can understand what is causing that surge, I don't see a good reason to increase the number of workers we have.
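For anyone who wants to repeat this kind of check, a minimal sketch of the tally is below. It assumes the source of each request (user agent, IP, or similar) has already been extracted from the logs into a list; `top_sources` is a hypothetical helper, not something that exists in ORES:

```python
from collections import Counter


def top_sources(sources, n=3):
    """Return the n busiest sources as (source, count, share) triples."""
    counts = Counter(sources)
    total = sum(counts.values())
    # most_common() yields nothing for empty input, so total is never
    # used as a divisor when it is zero.
    return [
        (source, count, count / total)
        for source, count in counts.most_common(n)
    ]
```

Running it on two different time windows and comparing the shares shows whether a heavy hitter's consumption actually changed or whether the growth is coming from somewhere else.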
Graphing ores.*.scores_request.*.count shows that most requests seem to come from etwiki; investigating this further. RecentChanges suggests this is not coming from any form of bot activity.
Looking into it more closely, the API user wasn't a red herring after all; I am going to ban the use of oresscores from the MW API since:
- AIUI it's not "officially released"
- it has only one user, the abuser
- we don't really want to raise the number of workers for this use; we could not keep up with the MW API capacity anyway.