
ORESFetchScoreJob fails quite a lot
Closed, Resolved · Public

Description

The ORESFetchScoreJob fails quite a lot and gets retried, however ORES itself fails much more rarely than the job does. It seems the job reports failure in some cases where it actually succeeded, for example when a page was created and then rapidly deleted, or when a certain model legitimately couldn't be computed because the revision doesn't have a parent.

The job should return false only if it wants to be retried, for example in case of a timeout or some unexpected error.
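To make that contract concrete, here is a minimal sketch of the rule, assuming a MediaWiki Job subclass; the exception names and the fetchScores helper are illustrative placeholders, not the extension's actual code:

```php
class ORESFetchScoreJob extends Job {
	public function run() {
		try {
			// Hypothetical helper that calls ORES for this revision.
			$this->fetchScores( $this->params['revid'] );
		} catch ( RevisionNotFoundException $e ) {
			// Expected condition (e.g. the page was rapidly deleted):
			// retrying can never succeed, so report success.
			return true;
		} catch ( OresTimeoutException $e ) {
			// Transient failure: returning false asks the queue to retry.
			$this->setLastError( $e->getMessage() );
			return false;
		}
		return true;
	}
}
```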

Event Timeline

Pchelolo created this task.
Nuria raised the priority of this task from Medium to Needs Triage. May 31 2018, 4:39 PM
Nuria moved this task from Incoming to Radar on the Analytics board.

Change 436785 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/ORES@master] Do not retry RevisionNotFound job failures

https://gerrit.wikimedia.org/r/436785

Change 436785 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Do not retry RevisionNotFound job failures

https://gerrit.wikimedia.org/r/436785

The rate has gone down, but there is still a significant number of ORESFetchScoreJobs that fail, e.g.

Failed executing job: ORESFetchScoreJob List_of_songs_by_Lata_Mangeshkar models=["damaging","goodfaith","wp10"] originalRequest={"ip":"XXXXX","userAgent":"XXX"} precache=1 requestId=WzhzvQpAICMAAJSUCW8AAACY revid=848321115

Most of these seem to be for the goodfaith or wp10 models. @Ladsgroup, is there something else that can be done, or are these legitimate?

I checked, and those are timeout errors, which are better retried; they usually pass on the second or third try. We can reduce the maximum number of retries if it's still too high.
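For reference, a hedged sketch of one way the retry ceiling could be lowered, assuming the job runs through MediaWiki's built-in job queue, where maxTries is a per-queue option defaulting to 3. (In Wikimedia production the jobs run through the change-propagation service, whose retry limit lives in that service's config instead.)

```php
// LocalSettings.php — illustrative only, not the production setup.
$wgJobTypeConf['ORESFetchScoreJob'] = [
	'class' => JobQueueDB::class,
	'claimTTL' => 3600,
	'maxTries' => 2, // retry once, then give up
];
```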

We have automatic retries set up (up to 3). But, just to understand this better: you say all of them are timeouts? For example, in the last 24h I see ~100 failed jobs, but only a handful failed because of a 503 received from ORES or a timeout (explicitly stated in the error message). Most of the other failures look like the one posted above, i.e. without a reason.

So there are different types of timeouts. This one returns 200, but it's a timeout in scoring: https://ores.wikimedia.org/v3/scores/enwiki/848321115. That makes things complicated, because some models might time out while others don't. Do you think ORES should return 503 instead?
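To illustrate what that looks like to a client: the v3 scores endpoint can answer 200 OK while individual models inside the body carry a per-model error object. A sketch of detecting that case follows; the exact field names are a best-effort reading of the API response, not authoritative:

```php
// Sketch: find per-model errors hidden inside a 200 response.
$body = file_get_contents( 'https://ores.wikimedia.org/v3/scores/enwiki/848321115' );
$data = json_decode( $body, true );

$revScores = $data['enwiki']['scores']['848321115'] ?? [];
foreach ( $revScores as $model => $result ) {
	if ( isset( $result['error'] ) ) {
		// e.g. "wp10 failed: TimeoutError" — a timeout inside a 200 response
		echo "$model failed: {$result['error']['type']}\n";
	} else {
		echo "$model scored OK\n";
	}
}
```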

Ah, I see. Thank you for the background info.

If there is a timeout somewhere internally, then yes, a 503 would be an appropriate return code.

Note that returning 503 also implies sending the Retry-After header in the response. The reason is that our load balancers (which sit in front of ORES) honour 503 responses and depool a node in that case. By supplying a Retry-After header, you instruct the load balancer to repool the node automatically after that time.
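Purely as an illustration of the proposed behaviour (ORES itself is not written in PHP, and the 5-second value is an arbitrary example, not a recommendation from this thread), the response could look like:

```php
// Sketch of a 503 carrying a Retry-After hint for the load balancer.
http_response_code( 503 );
header( 'Retry-After: 5' ); // tells the LB when it may repool this node
header( 'Content-Type: application/json' );
echo json_encode( [ 'error' => [
	'type' => 'TimeoutError',
	'message' => 'Timed out computing score',
] ] );
```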

Hmm, yeah. I think we should make another Phabricator ticket, because that's about the ORES service and not about how the extension handles unorthodox responses.

I agree. Would you mind opening one, @Ladsgroup? We could then close this one and continue there.

Thank you! Closing this one now.