
[Investigate] ORES time out errors in logs
Closed, ResolvedPublic

Description

It happens in about 1% of cases but still needs investigating. The link to Logstash. The first step is to add the revision ID to the logs so we can see which edits are getting the issue and try to reproduce it.
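A minimal sketch of the idea in Python, not the actual extension change (the merged patch is in the PHP ORES extension): attach the revision ID to any error logged for a failed scoring request, so the affected edit can be found and replayed later. The `fetch_score` helper and the revision number are made up for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ores")

def fetch_score(rev_id):
    """Stand-in for the real ORES request; here it always times out."""
    raise TimeoutError("ORES request timed out")

def score_revision(rev_id):
    """Request a score, logging any failure together with the revision ID."""
    try:
        return fetch_score(rev_id)
    except TimeoutError:
        # With rev_id in the message, the failing edit can be located and retried.
        logger.error("Failed to score revision %d: timed out", rev_id)
        return None

score_revision(123456789)
```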

Details

Related Gerrit Patches:
mediawiki/services/ores/deploy : master: Increase workers from 16 to 24
mediawiki/extensions/ORES : wmf/1.28.0-wmf.12: Add revision_id to log for errors
mediawiki/extensions/ORES : master: Add revision_id to log for errors
mediawiki/extensions/ORES : wmf/1.28.0-wmf.11: Add revision_id to log for errors

Event Timeline

Restricted Application added a subscriber: Aklapper. Jul 26 2016, 3:08 PM

Change 301135 had a related patch set uploaded (by Ladsgroup):
Add revision_id to log for errors

https://gerrit.wikimedia.org/r/301135

Change 301135 abandoned by Ladsgroup:
Add revision_id to log for errors

Reason:
Really wrong branch

https://gerrit.wikimedia.org/r/301135

Change 301136 had a related patch set uploaded (by Ladsgroup):
Add revision_id to log for errors

https://gerrit.wikimedia.org/r/301136

Change 301136 merged by jenkins-bot:
Add revision_id to log for errors

https://gerrit.wikimedia.org/r/301136

Change 301516 had a related patch set uploaded (by Ladsgroup):
Add revision_id to log for errors

https://gerrit.wikimedia.org/r/301516

Change 301516 merged by jenkins-bot:
Add revision_id to log for errors

https://gerrit.wikimedia.org/r/301516

Mentioned in SAL [2016-07-27T23:21:31Z] <dereckson@tin> Synchronized php-1.28.0-wmf.12/extensions/ORES/includes/Cache.php: Add revision_id to log for errors (T141368, 1/2) (duration: 00m 31s)

Mentioned in SAL [2016-07-27T23:22:38Z] <dereckson@tin> Synchronized php-1.28.0-wmf.12/extensions/ORES/maintenance/PopulateDatabase.php: Add revision_id to log for errors (T141368, 2/2, no-op) (duration: 00m 29s)

We usually have around 300-500 cases, but today it was around 1K. I still don't know what caused today's spikes.

Okay, by checking the logs it seems we had that spike because the API didn't return a proper answer to the ORES service, and as a consequence it returned a "Revision not found" error for requests for revisions that actually existed. It happened twice today and only for Wikidata (link to log).

Here are some facts (see the sketch after this list):

  • The ORES Celery queue holds 100 revisions; once it gets more than 100 revisions, we start to send 503 errors.
  • The number of workers is 16 per node (32 overall).
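A minimal sketch of that overload behaviour, with a plain list standing in for the Celery queue and a hypothetical `enqueue_scoring_request` entry point; the real service tracks its queue differently, but the 503-on-overload logic is the point:

```python
from http import HTTPStatus

MAX_QUEUE_SIZE = 100  # revisions the service is willing to hold at once
pending = []          # stand-in for the Celery queue

def enqueue_scoring_request(rev_id):
    """Accept a scoring request only while the queue is below its limit."""
    if len(pending) >= MAX_QUEUE_SIZE:
        # Overload: the caller gets an explicit 503 instead of a timeout.
        return HTTPStatus.SERVICE_UNAVAILABLE, "overloaded"
    pending.append(rev_id)
    return HTTPStatus.OK, "queued"
```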

Here are my hypotheses. It happens when:

  • We haven't reached the queue size of 100, but we are close to it.
  • Our workers are busy and can't get to the revision in time, so ORES itself starts to abandon the process and sends out a timeout error instead of an overload error (illustrated below).

The solution in that case is to increase the number of workers or reduce the queue size (obviously the former is more desirable). Let's wait to get the refactor into production so we have some free memory to use.
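A small, self-contained illustration of the second hypothesis, with made-up pool size and timings: the queue never comes close to 100 items, yet the caller's deadline expires while its request waits for a busy worker, so it sees a timeout rather than an overload error.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

def score(rev_id):
    time.sleep(0.5)  # pretend scoring a revision takes a while
    return rev_id, "scored"

# Far fewer workers than pending requests, but the backlog stays well under 100.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(score, rev_id) for rev_id in range(20)]
    for future in futures:
        try:
            print(future.result(timeout=0.2))  # caller's deadline
        except FuturesTimeout:
            # The request was never rejected as overload; it simply ran out of time.
            print("timed out waiting for a busy worker")
```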

Change 301750 had a related patch set uploaded (by Ladsgroup):
Increase workers from 16 to 24

https://gerrit.wikimedia.org/r/301750

Change 301750 merged by Ladsgroup:
Increase workers from 16 to 24

https://gerrit.wikimedia.org/r/301750
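For context, a sketch of the kind of setting that patch adjusts, assuming the per-node worker count maps onto Celery's worker concurrency; the actual deploy repository may express this differently (for example in its own config file):

```python
from celery import Celery

app = Celery("ores_scoring")  # hypothetical app name

# Raise per-node worker concurrency from 16 to 24 so busy periods drain the
# queue before waiting requests hit their timeout.
app.conf.worker_concurrency = 24
```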

Steps to fix this:

Ladsgroup closed this task as Resolved. Aug 8 2016, 9:59 PM
Pavanaja removed a subscriber: Pavanaja.
mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:11 PM