ORES Overloaded (particularly 2017-02-05 02:25-02:30)
Closed, Resolved · Public

JustBerry created this task. · Feb 5 2017, 2:41 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript · Feb 5 2017, 2:41 AM
JustBerry added a subscriber: Halfak.
Restricted Application added a project: Scoring-platform-team. · View Herald Transcript · Feb 5 2017, 2:42 AM
Ladsgroup triaged this task as Unbreak Now! priority. · Feb 5 2017, 2:42 AM
Restricted Application added subscribers: Jay8g, TerraCodes. · View Herald Transcript · Feb 5 2017, 2:42 AM

Number of workers needs to be increased to reduce the load.

Temporary patch being created by @Ladsgroup needs to be reviewed by ops.
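
(For illustration only, not the actual puppet change in Gerrit 336048: ORES scores revisions on Celery workers, so "increasing capacity" roughly means raising the per-host worker count. The app name, broker URL, and numbers below are assumptions.)

```
# Illustrative sketch only; not the actual change under review.
from celery import Celery

# App name and broker URL are placeholders.
app = Celery("ores_scoring", broker="redis://localhost:6379/0")

# Hypothetical bump of per-host worker processes; the real value has to fit in
# the hosts' available RAM, which is why ops review was requested first.
app.conf.worker_concurrency = 90

# Roughly equivalent to starting the worker as:
#   celery -A ores_scoring worker --concurrency=90
```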

Change 336048 had a related patch set uploaded (by Ladsgroup):
ores: Increase capacity

https://gerrit.wikimedia.org/r/336048

JustBerry added a comment (edited). · Feb 5 2017, 3:00 AM

Email sent to ops-l. Awaiting patch review from ops.

Change 336048 merged by Madhuvishy:
ores: Increase capacity

https://gerrit.wikimedia.org/r/336048

JustBerry added a subscriber: madhuvishy (edited). · Feb 5 2017, 3:24 AM

Still ~100 errors. Monitoring...

Mentioned in SAL (#wikimedia-operations) [2017-02-05T03:28:53Z] <Amir1> ladsgroup@scb100[1-4]:~$ sudo service celery-ores-worker restart (T157206)
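
(For context, the restart above was issued by hand on scb1001-scb1004. A minimal sketch of scripting the same restart follows, assuming plain ssh and the host names shown; production restarts would normally go through the standard tooling.)

```
# Hedged sketch of the restart logged above; host names and direct ssh are assumptions.
import subprocess

hosts = [f"scb100{i}.eqiad.wmnet" for i in range(1, 5)]  # scb1001-scb1004

for host in hosts:
    subprocess.run(
        ["ssh", host, "sudo", "service", "celery-ores-worker", "restart"],
        check=True,
    )
```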

Change 336176 had a related patch set uploaded (by Ladsgroup):
ores: increase capacity

https://gerrit.wikimedia.org/r/336176

greg added a subscriber: greg. · Feb 6 2017, 7:07 AM

Whatever it was hasn't subsided yet.

Joe added a subscriber: Joe. · Feb 6 2017, 7:24 AM

Before raising the number of workers for ORES:

  • Has anyone done an analysis of where this additional traffic comes from?
  • Is the logging of both the MW extension and the application adequate to understand where this additional traffic is coming from?

I would rather block the abuse or fix the logging so that we can understand what is going on than throw more resources at it (resources we don't really have, given that the scb cluster already has a couple of servers in the red RAM-wise). A rough sketch of that kind of log analysis follows below.
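
(As a rough sketch of the analysis being asked for: aggregating the service's access log by client IP and user agent is usually enough to spot a single heavy hitter. The log path and line format below are assumptions, not ORES's real log layout.)

```
# Rough sketch: count requests per (client IP, user agent) from an access log.
# LOG_PATH and LINE_RE describe a hypothetical log format, not ORES's actual one.
import re
from collections import Counter

LOG_PATH = "/var/log/ores/main.log"                         # hypothetical path
LINE_RE = re.compile(r'"(?P<ua>[^"]*)"\s+(?P<ip>\S+)\s*$')  # hypothetical format

clients = Counter()
with open(LOG_PATH) as f:
    for line in f:
        m = LINE_RE.search(line)
        if m:
            clients[(m.group("ip"), m.group("ua"))] += 1

# The heaviest clients by request count.
for (ip, ua), n in clients.most_common(10):
    print(f"{n:8d}  {ip}  {ua}")
```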

Joe added a comment. · Feb 6 2017, 7:45 AM

So after taking a quick look at ORES's logs: around 70% of requests come from changepropagation for "precaching".

My opinion is we should rather:

  • Turn precaching off now while we weather the storm.
  • Make the MW extension send the client IP and UA to ORES as headers, and log those in ORES so that we can debug the origin of the issues we have (sketched below).
  • Making the MW extension not go via Varnish to talk to ORES would be a nice, unrelated plus.

I'm ok with turning precaching back on when we have found and resolved the source of this enormous traffic surge we are seeing (if it's not precaching itself).
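
(A minimal sketch of the second bullet above, assuming a Flask-style web tier and hypothetical forwarded-header names X-Client-IP / X-Client-UA: the extension passes the original client's IP and user agent through, and ORES logs them so overloads can be traced back to a requester.)

```
# Minimal sketch, assuming a Flask-style web tier and hypothetical header names.
import logging
from flask import Flask, request

app = Flask(__name__)
log = logging.getLogger("ores.request_origin")
logging.basicConfig(level=logging.INFO)

@app.before_request
def log_request_origin():
    # Headers the MW extension would forward; X-Client-IP / X-Client-UA are assumed names.
    client_ip = request.headers.get("X-Client-IP", request.remote_addr)
    client_ua = request.headers.get("X-Client-UA",
                                    request.headers.get("User-Agent", "-"))
    log.info("score request path=%s client_ip=%s client_ua=%s",
             request.path, client_ip, client_ua)
```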

https://github.com/wikimedia/change-propagation/pull/161 will reduce the load by disabling CP on wikis where the extension is enabled (so it won't hurt, because the extension does the precaching too).

Change 336176 abandoned by Ladsgroup:
ores: increase capacity

Reason:
Let's pursue other approaches and if it didn't work out, we go with this.

https://gerrit.wikimedia.org/r/336176

Ladsgroup added a comment (edited). · Feb 6 2017, 8:52 AM

I'm pretty sure someone external is putting pressure on the service too. This correlates perfectly with the overload errors:

Change 336197 had a related patch set uploaded (by Giuseppe Lavagetto):
ORES: reduce concurrency, disable various wikis

https://gerrit.wikimedia.org/r/336197

Change 336197 merged by Giuseppe Lavagetto:
ORES: reduce concurrency, disable various wikis

https://gerrit.wikimedia.org/r/336197

Joe added a comment. · Feb 6 2017, 11:21 AM

From my further analysis of logs:

  • there is one API heavy hitter, whose rate of consumption didn't change significantly in the last few days
  • there is, at the same time, a surge in ores.*.score_processed.count that doesn't seem to have anything to do with that heavy hitter

Until we can understand what is causing that surge, I don't see a good reason to increase the number of workers we have.

Joe added a comment. · Feb 6 2017, 11:41 AM

So, graphing ores.*.scores_request.*.count shows that most requests seem to come from etwiki; I'm investigating this further. RecentChanges suggests this is not coming from any form of bot activity.

Joe added a comment. · Feb 6 2017, 11:47 AM

Scratch what I said; the counter for etwiki is most likely broken.

The surge in requests comes from enwiki.
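
(For reference, the per-wiki breakdown described in the last two comments can be sketched against Graphite's render API. The metric pattern follows the comment above; the endpoint and the position of the wiki name inside the dotted metric are assumptions.)

```
# Sketch only: total ores.*.scores_request.*.count per wiki via Graphite's render API.
# The endpoint and which dotted component holds the wiki name are assumptions.
import requests

GRAPHITE = "https://graphite.wikimedia.org/render"
params = {
    "target": "ores.*.scores_request.*.count",
    "from": "-24h",
    "format": "json",
}
series = requests.get(GRAPHITE, params=params, timeout=30).json()

totals = {}
for s in series:
    wiki = s["target"].split(".")[-2]  # assumes ...scores_request.<wiki>.count
    total = sum(v for v, _ in s["datapoints"] if v is not None)
    totals[wiki] = totals.get(wiki, 0) + total

for wiki, count in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{wiki:15s} {count:12.0f}")
```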

Joe added a comment. · Feb 6 2017, 12:08 PM

Looking into it further, the API user wasn't a red herring after all; I am going to ban the use of oresscores from the MW API since:

  • AIUI it's not "officially released"
  • it has only one user, the abuser
  • we don't really want to raise the number of workers for this use; we could not keep up with the MW API's capacity anyway.

Okay, let's block them for now, until we find a way to hold out only the abuser.

Change 336215 had a related patch set uploaded (by Addshore):
Remove all (except meta) API funcationality hooks

https://gerrit.wikimedia.org/r/336215

Change 336215 merged by jenkins-bot:
Remove all (except meta) API funcationality hooks

https://gerrit.wikimedia.org/r/336215

Mentioned in SAL (#wikimedia-operations) [2017-02-06T14:09:13Z] <addshore@tin> Synchronized php-1.29.0-wmf.10/extensions/ORES/extension.json: T157206 [[gerrit:336215|ORES - Remove all (except meta) API funcationality hooks]] (duration: 00m 51s)

Mentioned in SAL (#wikimedia-operations) [2017-02-06T15:10:18Z] <addshore@tin> Synchronized php-1.29.0-wmf.10/extensions/ORES/extension.json: T157206 [[gerrit:336215|ORES - Remove all (except meta) API funcationality hooks]] (take2) (duration: 00m 54s)

Joe closed this task as Resolved. · Feb 6 2017, 4:42 PM
Aklapper renamed this task from ORES Overloaded (particularly 02/05/17 2:25-2:30) to ORES Overloaded (particularly 2017-02-05 02:25-02:30). · Feb 7 2017, 12:55 AM

Change 349955 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Revert "Remove all (except meta) API funcationality hooks"

https://gerrit.wikimedia.org/r/349955