
ORESFetchScoreJob fails quite a lot
Closed, Resolved · Public

Description

The ORESFetchScoreJob fails quite a lot and gets retried, however ORES itself fails much more rarely than the job does. It seems the job reports failure in some cases where it actually succeeded, for example when a page was created and then rapidly deleted, or when a certain model legitimately couldn't be computed because the revision doesn't have a parent.

The job should return false only if it wants to be retried, for example in case of a timeout or some unexpected error.
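To make that contract concrete, here is a minimal sketch of the rule, assuming a MediaWiki Job subclass; the exception names and the fetchScores helper are illustrative placeholders, not the extension's actual code:

```php
class ORESFetchScoreJob extends Job {
	public function run() {
		try {
			// Hypothetical helper that calls ORES for this revision.
			$this->fetchScores( $this->params['revid'] );
		} catch ( RevisionNotFoundException $e ) {
			// Expected condition (e.g. the page was rapidly deleted):
			// retrying can never succeed, so report success.
			return true;
		} catch ( OresTimeoutException $e ) {
			// Transient failure: returning false asks the queue to retry.
			$this->setLastError( $e->getMessage() );
			return false;
		}
		return true;
	}
}
```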

Event Timeline

Pchelolo created this task.
Nuria raised the priority of this task from Medium to Needs Triage. May 31 2018, 4:39 PM
Nuria moved this task from Incoming to Radar on the Analytics board.

Change 436785 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[mediawiki/extensions/ORES@master] Do not retry RevisionNotFound job failures

https://gerrit.wikimedia.org/r/436785

Change 436785 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Do not retry RevisionNotFound job failures

https://gerrit.wikimedia.org/r/436785

The rate has gone down, but there is still a significant number of ORESFetchScoreJobs that fail, e.g.

Failed executing job: ORESFetchScoreJob List_of_songs_by_Lata_Mangeshkar models=["damaging","goodfaith","wp10"] originalRequest={"ip":"XXXXX","userAgent":"XXX"} precache=1 requestId=WzhzvQpAICMAAJSUCW8AAACY revid=848321115

Most of these seem to be for the goodfaith or wp10 models. @Ladsgroup, is there something else that can be done, or are these legitimate?

I checked, and those are timeout errors, which are better retried; they usually pass on the second or third try. We can reduce the maximum number of retries if it's still too high.
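For reference, a hedged sketch of one way the retry ceiling could be lowered, assuming the job runs through MediaWiki's built-in job queue, where maxTries is a per-queue option defaulting to 3. (In Wikimedia production the jobs run through the change-propagation service, whose retry limit lives in that service's config instead.)

```php
// LocalSettings.php — illustrative only, not the production setup.
$wgJobTypeConf['ORESFetchScoreJob'] = [
	'class' => JobQueueDB::class,
	'claimTTL' => 3600,
	'maxTries' => 2, // retry once, then give up
];
```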

We have automatic retries set up (up to 3). But, just to understand this better: you say all of them are timeouts? For example, in the last 24h I see ~100 failed jobs, but only a handful failed because of a 503 received from ORES or a timeout (explicitly stated in the error message). Most of the other failures look like the one posted above, i.e. without a reason.

So there are different types of timeouts. This one returns 200, but it's a timeout in scoring: https://ores.wikimedia.org/v3/scores/enwiki/848321115. That makes things complicated, because some models might time out while others don't. Do you think ORES should return 503 instead?
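To illustrate what that looks like to a client: the v3 scores endpoint can answer 200 OK while individual models inside the body carry a per-model error object. A sketch of detecting that case follows; the exact field names are a best-effort reading of the API response, not authoritative:

```php
// Sketch: find per-model errors hidden inside a 200 response.
$body = file_get_contents( 'https://ores.wikimedia.org/v3/scores/enwiki/848321115' );
$data = json_decode( $body, true );

$revScores = $data['enwiki']['scores']['848321115'] ?? [];
foreach ( $revScores as $model => $result ) {
	if ( isset( $result['error'] ) ) {
		// e.g. "wp10 failed: TimeoutError" — a timeout inside a 200 response
		echo "$model failed: {$result['error']['type']}\n";
	} else {
		echo "$model scored OK\n";
	}
}
```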

Ah, I see. Thank you for the background info.

If there is a timeout somewhere internally, then yes, a 503 would be an appropriate return code.

Note that returning 503 also implies sending the Retry-After header in the response. The reason is that our load balancers (which sit in front of ORES) honour 503 responses and depool a node in that case. By supplying a Retry-After header, you instruct the load balancer to repool the node automatically after that time.
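Purely as an illustration of the proposed behaviour (ORES itself is not written in PHP, and the 5-second value is an arbitrary example, not a recommendation from this thread), the response could look like:

```php
// Sketch of a 503 carrying a Retry-After hint for the load balancer.
http_response_code( 503 );
header( 'Retry-After: 5' ); // tells the LB when it may repool this node
header( 'Content-Type: application/json' );
echo json_encode( [ 'error' => [
	'type' => 'TimeoutError',
	'message' => 'Timed out computing score',
] ] );
```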

Hmm, yeah. I think we should make another Phabricator ticket, because that's about the ORES service and not about how the extension handles unorthodox responses.

I agree. Would you mind opening one, @Ladsgroup? We could then close this one and continue there.

Thank you! Closing this one now.