
Failed executing job: ORESFetchScoreJob
Closed, Resolved · Public

Description

Error

Request ID: W6FMIQrAADcAAErE9RcAAABY

message
Failed executing job: ORESFetchScoreJob ### models=["damaging","goodfaith","wp10"] originalRequest={"ip":"##.##.##.##","userAgent":"Mozilla/5.0 ### Firefox/62.0"} precache=1 requestId=W6FMIQrAADcAAErE9RcAAABY revid=860163077

Notes

Started with wmf.20 deployment (regression). The job had 0 failures in the 20 days before that.

4,690 failed jobs since 12 September.

Event Timeline

Krinkle triaged this task as High priority. · Sep 18 2018, 7:14 PM
Krinkle created this task.
Restricted Application added a project: Scoring-platform-team. · View Herald Transcript · Sep 18 2018, 7:14 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
awight added a subscriber: awight. · Edited · Sep 18 2018, 7:30 PM

Some of the errors point to a service failure,

Exception executing job: ORESFetchScoreJob Discipline_(academia) models=["damaging","goodfaith","wp10"] originalRequest={"ip":"<redacted>","userAgent":"ClueBot/2.0"} precache=1 requestId=W6FP8ArAEHQAACKHyRAAAAAG revid=860164910 : RuntimeException: Failed to make ORES request to [http://ores.discovery.wmnet:8081/v3/scores/enwiki/?models=damaging%7Cgoodfaith%7Cwp10&revids=860164910&precache=1&format=json], There was a problem during the HTTP request: 500 INTERNAL SERVER ERROR

This is probably related to the datacenter switchover. Score processing load on our CODFW cluster doubled on Sept 11, and we've had unusual levels of errors during high traffic. Prior to the switchover, our error rate was consistently zero; afterward it has exhibited spikes of up to c. 30 errors per minute (out of c. 2,500 scores processed per minute).

Here's something odd grabbed from the ORES service at the same second as one of the job failure messages:

ores1001:
[pid: 5573] <redacted IP> (-) {30 vars in 418 bytes} [Tue Sep 18 20:08:59 2018] POST /v3/precache => generated 0 bytes in 0 msecs (HTTP/1.1 204) 7 headers in 252 bytes (0 switches on core 0) user agent "ChangePropagation/WMF"

There are 5-10 of these empty responses per second. According to the ORES code, this should only happen if the wiki context name is unexpected or if the model names are missing or unexpected.

According to the ORES code, this should only happen if the wiki context name is unexpected or if the model names are missing or unexpected.

CP have switched to POSTing all new revision events to ORES and relying on ORES to find the appropriate models to precache, so I'm not surprised there are some POSTs with unexpected wikis. It shouldn't be an error on the ORES side; this was done deliberately so that ORES completely owns its precaching config.


That's right, thanks. It's completely unrelated and normal.
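For illustration, the precache filtering behavior described above can be sketched roughly as follows. This is a minimal sketch, not the actual ORES service code; the `PRECACHE_CONFIG` mapping and function name are hypothetical:

```python
# Hypothetical sketch: ChangeProp POSTs every new-revision event to
# /v3/precache, and ORES itself decides whether anything needs scoring.
# Events for wikis (or models) outside the precache config get an empty
# 204 response rather than an error.

PRECACHE_CONFIG = {
    "enwiki": ["damaging", "goodfaith", "wp10"],
    "wikidatawiki": ["damaging", "goodfaith"],
}

def handle_precache_event(wiki, requested_models=None):
    """Return (http_status, models_to_score) for a precache POST."""
    models = PRECACHE_CONFIG.get(wiki)
    if models is None:
        # Unexpected wiki context: nothing to precache, but not an error.
        return 204, []
    if requested_models:
        models = [m for m in models if m in requested_models]
        if not models:
            # Missing/unexpected model names: also an empty response.
            return 204, []
    return 200, models
```

Under this sketch, a POST for a wiki that isn't in the precache config gets an empty 204, matching the 0-byte responses in the uwsgi log above.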

The URLs we're getting 500 errors for are ones that we do have models for, and they return healthy responses after the fact, for example:
https://ores.wikimedia.org/v3/scores/enwiki/?models=damaging%7Cgoodfaith%7Cwp10&revids=860164910&precache=1&format=json
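The request URL format can be reproduced with standard URL encoding; a minimal sketch (parameter names taken from the URLs quoted in this task, helper name hypothetical):

```python
from urllib.parse import urlencode

def build_scores_url(base, wiki, models, revids, precache=False):
    """Build an ORES v3 scores URL like the ones quoted in this task."""
    params = {
        # Multiple values are pipe-separated; urlencode emits "|" as %7C,
        # which is why the logged URLs contain %7C.
        "models": "|".join(models),
        "revids": "|".join(str(r) for r in revids),
    }
    if precache:
        params["precache"] = 1
    params["format"] = "json"
    return f"{base}/v3/scores/{wiki}/?{urlencode(params)}"
```

Calling it with the enwiki models and revid 860164910 reproduces the example URL above byte for byte.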

It seems strange because Grafana says we didn't have any overload errors except one spike (that was because of a deployment; it's fine): https://grafana.wikimedia.org/dashboard/db/ores?refresh=1m&panelId=9&fullscreen&orgId=1&from=now-30d&to=now-1m

It also rules out the possibility of the metrics being broken, so I need to check why we returned 500s then but are not now. PoolCounter could make us return 500, but precache requests should bypass it. Will double check.

This glitch predates PoolCounter by a few days, luckily.

One action item we should take is to make the "scores errored" graph take MAX across each time sample slice, because the errors disappear when zooming out: anything more than a 7-day window hides them.
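Assuming the panel is backed by Graphite (and the metric path below is hypothetical), one way to do that is to force max consolidation on the series, so spikes survive downsampling at wide time ranges instead of being averaged away:

```
consolidateBy(ores.scores_errored.rate, 'max')
```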

Why do we not have any errors before the switchover? Some low level of errors should be normal, from bad requests and requests for invalid revisions, etc.

Ladsgroup added a comment. · Edited · Sep 19 2018, 2:56 PM

I reviewed all the graphs and can't find what you're describing. The job failure rate is around 0.4% and it has never been zero: https://grafana.wikimedia.org/dashboard/db/ores-extension?orgId=1&from=now-30d&to=now
This also doesn't show any jumps after the switchover or the deployment of wmf.20: https://logstash.wikimedia.org/goto/85e622ce496419c9cd87483793aeabf5
Do you have a link to logstash handy?

I made an adjustment to the graph, and it's now clear that there was no change in the overall error rate around switchover, it just shifted from one data center to the other as anticipated.

Should the following error have its own task, or should it be tracked here as well?

message
[{exception_id}] {exception_url}   RuntimeException from line 95 of /srv/mediawiki/php-1.32.0-wmf.22/extensions/ORES/includes/ORESService.php: Failed to make ORES request to [http://ores.discovery.wmnet:8081/v3/scores/wikidatawiki/?models=damaging%7Cgoodf
trace
Failed to make ORES request to [http://ores.discovery.wmnet:8081/v3/scores/wikidatawiki/?models=damaging%7Cgoodfaith&revids=###%7C###&format=json], HTTP request timed out.

#0 /srv/mediawiki/php-1.32.0-wmf.22/extensions/ORES/includes/ScoreFetcher.php(55): ORES\ORESService->request(array, NULL)
#1 /srv/mediawiki/php-1.32.0-wmf.22/extensions/ORES/includes/Hooks/ApiHooksHandler.php(273): ORES\ScoreFetcher->getScores(array)
#2 /srv/mediawiki/php-1.32.0-wmf.22/extensions/ORES/includes/Hooks/ApiHooksHandler.php(227): ORES\Hooks\ApiHooksHandler::loadScoresForRevisions(array)
#3 /srv/mediawiki/php-1.32.0-wmf.22/includes/Hooks.php(174): ORES\Hooks\ApiHooksHandler::onApiQueryBaseAfterQuery(ApiQueryRecentChanges, Wikimedia\Rdbms\ResultWrapper, array)
#4 /srv/mediawiki/php-1.32.0-wmf.22/includes/Hooks.php(202): Hooks::callHook(string, array, array, NULL)
#5 /srv/mediawiki/php-1.32.0-wmf.22/includes/api/ApiQueryBase.php(381): Hooks::run(string, array)
#6 /srv/mediawiki/php-1.32.0-wmf.22/includes/api/ApiQueryRecentChanges.php(426): ApiQueryBase->select(string, array, array)
#7 /srv/mediawiki/php-1.32.0-wmf.22/includes/api/ApiQueryRecentChanges.php(133): ApiQueryRecentChanges->run()
#8 /srv/mediawiki/php-1.32.0-wmf.22/includes/api/ApiQuery.php(249): ApiQueryRecentChanges->execute()
#9 /srv/mediawiki/php-1.32.0-wmf.22/includes/api/ApiMain.php(1587): ApiQuery->execute()
#10 /srv/mediawiki/php-1.32.0-wmf.22/includes/api/ApiMain.php(531): ApiMain->executeAction()
#11 /srv/mediawiki/php-1.32.0-wmf.22/includes/api/ApiMain.php(502): ApiMain->executeActionWithErrorHandling()
#12 /srv/mediawiki/php-1.32.0-wmf.22/api.php(87): ApiMain->execute()
#13 /srv/mediawiki/w/api.php(3): include(string)
Ladsgroup removed a project: Regression. · Edited · Sep 20 2018, 4:43 PM

Should the following error have its own task, or should it be tracked here as well?

We have now T204961

I'm removing Regression as the failed jobs are basically happening all the time: https://logstash.wikimedia.org/goto/36bce7ef50772aaf6e66cb0e9a50bc24 The current number of failed jobs is 2k per day. It's not much and it's expected, but the job should just retry instead of erroring out.

Change 461710 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/ORES@master] Retry once when ores service is timed out

https://gerrit.wikimedia.org/r/461710

Change 461710 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Retry once when ores service is timed out

https://gerrit.wikimedia.org/r/461710

Krinkle added a comment. · Edited · Sep 20 2018, 7:38 PM

It's not much and it's expected but it should just retry instead of erroring out.

Cool. There shouldn't be uncaught exceptions in a Job, though. So if this is normal, the job should catch it and return. Maybe log it directly in the ORES channel as INFO if we're interested in monitoring its frequency.
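The merged fix does this on the PHP side, in the ORES extension's FetchScoreJob; a rough sketch of the catch-and-retry shape (hypothetical names, not the actual extension code):

```python
class OresTimeoutError(Exception):
    """Stand-in for the extension's 'HTTP request timed out' failure."""

def run_fetch_score_job(fetch, retries_left=1, log=print):
    """Run a score fetch; on a service timeout, catch the exception and
    signal the job queue to retry instead of failing the job outright."""
    try:
        return fetch()
    except OresTimeoutError as e:
        # Log at INFO in the ORES channel so frequency stays visible.
        log(f"INFO: ORES fetch timed out: {e}")
        # Tell the runner to re-queue the job, or give up after retries.
        return "retry" if retries_left > 0 else "abandon"
```

The key point matches Krinkle's comment above: a transient timeout becomes a logged, retried job rather than an uncaught exception.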

Change 461815 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/ORES@master] When service fails to respond, retry the job

https://gerrit.wikimedia.org/r/461815

Change 461815 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] When service fails to respond, retry the job

https://gerrit.wikimedia.org/r/461815

Change 465775 had a related patch set uploaded (by Krinkle; owner: Amir Sarabadani):
[mediawiki/extensions/ORES@wmf/1.32.0-wmf.24] When service fails to respond, retry the job

https://gerrit.wikimedia.org/r/465775

Change 465775 merged by jenkins-bot:
[mediawiki/extensions/ORES@wmf/1.32.0-wmf.24] When service fails to respond, retry the job

https://gerrit.wikimedia.org/r/465775

Mentioned in SAL (#wikimedia-operations) [2018-10-10T22:41:09Z] <krinkle@deploy1001> Synchronized php-1.32.0-wmf.24/extensions/ORES/includes/FetchScoreJob.php: T204753 - Icc28230585bc (duration: 00m 49s)

Ladsgroup closed this task as Resolved. · Oct 11 2018, 8:57 AM
Ladsgroup claimed this task.
Restricted Application added a project: User-Ladsgroup. · View Herald Transcript · Oct 11 2018, 8:58 AM
mmodell changed the subtype of this task from "Task" to "Production Error". · Aug 28 2019, 11:09 PM