
Failed executing job: ORESFetchScoreJob
Closed, Resolved · Public

Description

Error

Request ID: W6FMIQrAADcAAErE9RcAAABY

message
Failed executing job: ORESFetchScoreJob ### models=["damaging","goodfaith","wp10"] originalRequest={"ip":"##.##.##.##","userAgent":"Mozilla/5.0 ### Firefox/62.0"} precache=1 requestId=W6FMIQrAADcAAErE9RcAAABY revid=860163077

Notes

Started with wmf.20 deployment (regression). The job had 0 failures in the 20 days before that.

4,690 failed jobs since 12 September.

Event Timeline

Krinkle triaged this task as High priority. · Sep 18 2018, 7:14 PM
Krinkle created this task.
Restricted Application added a project: Scoring-platform-team. · View Herald Transcript · Sep 18 2018, 7:14 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
awight added a subscriber: awight. · Edited · Sep 18 2018, 7:30 PM

Some of the errors point to a service failure,

Exception executing job: ORESFetchScoreJob Discipline_(academia) models=["damaging","goodfaith","wp10"] originalRequest={"ip":"<redacted>","userAgent":"ClueBot/2.0"} precache=1 requestId=W6FP8ArAEHQAACKHyRAAAAAG revid=860164910 : RuntimeException: Failed to make ORES request to [http://ores.discovery.wmnet:8081/v3/scores/enwiki/?models=damaging%7Cgoodfaith%7Cwp10&revids=860164910&precache=1&format=json], There was a problem during the HTTP request: 500 INTERNAL SERVER ERROR

This is probably related to the datacenter switchover. Score processing load on our CODFW cluster doubled on Sept 11, and we've had unusual levels of errors during high traffic. Prior to the switchover, our error rate was consistently zero; afterward it has exhibited spikes of up to c. 30 errors per minute (out of c. 2,500 scores processed per minute).

Here's something odd grabbed from the ORES service at the same second as one of the job failure messages:

ores1001:
[pid: 5573] <redacted IP> (-) {30 vars in 418 bytes} [Tue Sep 18 20:08:59 2018] POST /v3/precache => generated 0 bytes in 0 msecs (HTTP/1.1 204) 7 headers in 252 bytes (0 switches on core 0) user agent "ChangePropagation/WMF"

There are 5-10 of these empty responses per second. According to the ORES code, this should only happen if the wiki context name is unexpected or if the model names are missing or unexpected.

According to the ORES code, this should only happen if the wiki context name is unexpected or if the model names are missing or unexpected.

CP have switched to POSTing all new revision events to ORES and relying on ORES to find the appropriate models to precache, so I'm not surprised there are some POSTs with unexpected wikis. It shouldn't be an error on the ORES side; this was done deliberately so that ORES completely owns its precaching config.


That's right, thanks. It's completely unrelated and normal.
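For illustration, the precache filtering behavior described above can be sketched roughly as follows. This is a minimal sketch, not the actual ORES service code; the `PRECACHE_CONFIG` mapping and function name are hypothetical:

```python
# Hypothetical sketch: ChangeProp POSTs every new-revision event to
# /v3/precache, and ORES itself decides whether anything needs scoring.
# Events for wikis (or models) outside the precache config get an empty
# 204 response rather than an error.

PRECACHE_CONFIG = {
    "enwiki": ["damaging", "goodfaith", "wp10"],
    "wikidatawiki": ["damaging", "goodfaith"],
}

def handle_precache_event(wiki, requested_models=None):
    """Return (http_status, models_to_score) for a precache POST."""
    models = PRECACHE_CONFIG.get(wiki)
    if models is None:
        # Unexpected wiki context: nothing to precache, but not an error.
        return 204, []
    if requested_models:
        models = [m for m in models if m in requested_models]
        if not models:
            # Missing/unexpected model names: also an empty response.
            return 204, []
    return 200, models
```

Under this sketch, a POST for a wiki that isn't in the precache config gets an empty 204, matching the 0-byte responses in the uwsgi log above.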

The URLs we're getting 500 errors for are ones that we do have models for, and they return healthy responses after the fact, for example:
https://ores.wikimedia.org/v3/scores/enwiki/?models=damaging%7Cgoodfaith%7Cwp10&revids=860164910&precache=1&format=json
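The request URL format can be reproduced with standard URL encoding; a minimal sketch (parameter names taken from the URLs quoted in this task, helper name hypothetical):

```python
from urllib.parse import urlencode

def build_scores_url(base, wiki, models, revids, precache=False):
    """Build an ORES v3 scores URL like the ones quoted in this task."""
    params = {
        # Multiple values are pipe-separated; urlencode emits "|" as %7C,
        # which is why the logged URLs contain %7C.
        "models": "|".join(models),
        "revids": "|".join(str(r) for r in revids),
    }
    if precache:
        params["precache"] = 1
    params["format"] = "json"
    return f"{base}/v3/scores/{wiki}/?{urlencode(params)}"
```

Calling it with the enwiki models and revid 860164910 reproduces the example URL above byte for byte.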

It seems strange because Grafana says we didn't have any overload errors except one spike (that was because of a deployment; it's fine): https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=9&fullscreen&orgId=1&from=now-30d&to=now-1m

It also rules out the possibility of the metrics being broken, so I need to check why we returned 500s then but are not now. PoolCounter could make us return 500, but precache requests should bypass it. Will double check.

This glitch predates PoolCounter by a few days, luckily.

One action item we should take is to make the "scores errored" graph take MAX across each time sample slice, because the errors disappear when zooming out: anything more than a 7-day window hides them.
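Assuming the panel is backed by Graphite (and the metric path below is hypothetical), one way to do that is to force max consolidation on the series, so spikes survive downsampling at wide time ranges instead of being averaged away:

```
consolidateBy(ores.scores_errored.rate, 'max')
```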

Why do we not have any errors before the switchover? Some low level of errors should be normal, from bad requests and requests for invalid revisions, etc.

Ladsgroup added a comment. · Edited · Sep 19 2018, 2:56 PM

I reviewed all the graphs and can't find what you're describing. The job failure rate is around 0.4% and it has never been zero: https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1&from=now-30d&to=now
This also doesn't show any jumps after the switchover or the deployment of wmf.20: https://logstash.wikimedia.org/goto/85e622ce496419c9cd87483793aeabf5
Do you have a link to logstash handy?

I made an adjustment to the graph, and it's now clear that there was no change in the overall error rate around switchover, it just shifted from one data center to the other as anticipated.

Should the following error have its own task, or should it be tracked here as well?

message
[{exception_id}] {exception_url}   RuntimeException from line 95 of /srv/mediawiki/php-1.32.0-wmf.22/extensions/ORES/includes/ORESService.php: Failed to make ORES request to [http://ores.discovery.wmnet:8081/v3/scores/wikidatawiki/?models=damaging%7Cgoodf
trace
Failed to make ORES request to [http://ores.discovery.wmnet:8081/v3/scores/wikidatawiki/?models=damaging%7Cgoodfaith&revids=###%7C###&format=json], HTTP request timed out.

#0 /srv/mediawiki/php-1.32.0-wmf.22/extensions/ORES/includes/ScoreFetcher.php(55): ORES\ORESService->request(array, NULL)
#1 /srv/mediawiki/php-1.32.0-wmf.22/extensions/ORES/includes/Hooks/ApiHooksHandler.php(273): ORES\ScoreFetcher->getScores(array)
#2 /srv/mediawiki/php-1.32.0-wmf.22/extensions/ORES/includes/Hooks/ApiHooksHandler.php(227): ORES\Hooks\ApiHooksHandler::loadScoresForRevisions(array)
#3 /srv/mediawiki/php-1.32.0-wmf.22/includes/Hooks.php(174): ORES\Hooks\ApiHooksHandler::onApiQueryBaseAfterQuery(ApiQueryRecentChanges, Wikimedia\Rdbms\ResultWrapper, array)
#4 /srv/mediawiki/php-1.32.0-wmf.22/includes/Hooks.php(202): Hooks::callHook(string, array, array, NULL)
#5 /srv/mediawiki/php-1.32.0-wmf.22/includes/api/ApiQueryBase.php(381): Hooks::run(string, array)
#6 /srv/mediawiki/php-1.32.0-wmf.22/includes/api/ApiQueryRecentChanges.php(426): ApiQueryBase->select(string, array, array)
#7 /srv/mediawiki/php-1.32.0-wmf.22/includes/api/ApiQueryRecentChanges.php(133): ApiQueryRecentChanges->run()
#8 /srv/mediawiki/php-1.32.0-wmf.22/includes/api/ApiQuery.php(249): ApiQueryRecentChanges->execute()
#9 /srv/mediawiki/php-1.32.0-wmf.22/includes/api/ApiMain.php(1587): ApiQuery->execute()
#10 /srv/mediawiki/php-1.32.0-wmf.22/includes/api/ApiMain.php(531): ApiMain->executeAction()
#11 /srv/mediawiki/php-1.32.0-wmf.22/includes/api/ApiMain.php(502): ApiMain->executeActionWithErrorHandling()
#12 /srv/mediawiki/php-1.32.0-wmf.22/api.php(87): ApiMain->execute()
#13 /srv/mediawiki/w/api.php(3): include(string)
Ladsgroup removed a project: Regression. · Edited · Sep 20 2018, 4:43 PM

Should the following error have its own task, or should it be tracked here as well?

We have now T204961

I'm removing Regression as the failed jobs are basically happening all the time: https://logstash.wikimedia.org/goto/36bce7ef50772aaf6e66cb0e9a50bc24 The current number of failed jobs is 2k per day. It's not much and it's expected, but the job should just retry instead of erroring out.

Change 461710 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/ORES@master] Retry once when ores service is timed out

https://gerrit.wikimedia.org/r/461710

Change 461710 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Retry once when ores service is timed out

https://gerrit.wikimedia.org/r/461710

Krinkle added a comment. · Edited · Sep 20 2018, 7:38 PM

It's not much and it's expected but it should just retry instead of erroring out.

Cool. There shouldn't be uncaught exceptions in a Job, though. So if this is normal, the job should catch it and return. Maybe log it directly in the ORES channel as INFO if we're interested in monitoring its frequency.
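The merged fix does this on the PHP side, in the ORES extension's FetchScoreJob; a rough sketch of the catch-and-retry shape (hypothetical names, not the actual extension code):

```python
class OresTimeoutError(Exception):
    """Stand-in for the extension's 'HTTP request timed out' failure."""

def run_fetch_score_job(fetch, retries_left=1, log=print):
    """Run a score fetch; on a service timeout, catch the exception and
    signal the job queue to retry instead of failing the job outright."""
    try:
        return fetch()
    except OresTimeoutError as e:
        # Log at INFO in the ORES channel so frequency stays visible.
        log(f"INFO: ORES fetch timed out: {e}")
        # Tell the runner to re-queue the job, or give up after retries.
        return "retry" if retries_left > 0 else "abandon"
```

The key point matches Krinkle's comment above: a transient timeout becomes a logged, retried job rather than an uncaught exception.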

Change 461815 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/ORES@master] When service fails to respond, retry the job

https://gerrit.wikimedia.org/r/461815

Change 461815 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] When service fails to respond, retry the job

https://gerrit.wikimedia.org/r/461815

Change 465775 had a related patch set uploaded (by Krinkle; owner: Amir Sarabadani):
[mediawiki/extensions/ORES@wmf/1.32.0-wmf.24] When service fails to respond, retry the job

https://gerrit.wikimedia.org/r/465775

Change 465775 merged by jenkins-bot:
[mediawiki/extensions/ORES@wmf/1.32.0-wmf.24] When service fails to respond, retry the job

https://gerrit.wikimedia.org/r/465775

Mentioned in SAL (#wikimedia-operations) [2018-10-10T22:41:09Z] <krinkle@deploy1001> Synchronized php-1.32.0-wmf.24/extensions/ORES/includes/FetchScoreJob.php: T204753 - Icc28230585bc (duration: 00m 49s)

Ladsgroup closed this task as Resolved. · Oct 11 2018, 8:57 AM
Ladsgroup claimed this task.
Restricted Application added a project: User-Ladsgroup. · View Herald Transcript · Oct 11 2018, 8:58 AM
mmodell changed the subtype of this task from "Task" to "Production Error". · Aug 28 2019, 11:09 PM