
Beta Cluster ORES celery worker dies
Closed, Resolved · Public · Production Error

Description

Beta ORES reports an overload condition: https://ores-beta.wmflabs.org/v3/scores/enwiki/123456

The Celery worker has been dying. Diagnose and correct for long-term stability.


According to logstash-beta, there are several exceptions being thrown on ORES.

[Wk@HawpEFhUAAEprrgMAAAAH] /rpc/RunSingleJob.php   RuntimeException from line 96 of /srv/mediawiki/php-master/extensions/ORES/includes/Api.php: Failed to make ORES request to [https://ores-beta.wmflabs.org/v3/scores/enwiki/?models=damaging%7Cgoodfaith%7Cdraftquality&revids=375138&precache=1&format=json], There was a problem during the HTTP request: 503 SERVICE UNAVAILABLE

Event Timeline

Restricted Application added a subscriber: Aklapper.

@MarcoAurelio Thanks for the report!

Our celery worker died three days ago, probably because it ran out of memory. It's not the first time we've seen this with beta ORES. I'll repurpose this task to find a long-term fix.

awight renamed this task from "Flood of ORES errors at Beta Cluster" to "Beta Cluster ORES celery worker dies". (Jan 5 2018, 2:26 PM)
awight updated the task description.

Dear @awight; thanks for your quick response. Yesterday @Krenair was discussing at -releng that there were a number of Puppet errors on all Beta Cluster machines such as full disks, missing or wrong hieradata, etc. Maybe worth having a look at those as well? Nota bene: I hardly know puppet stuff so apologies if I'm wrong. Regards.

It looks like we might need more memory on sca03 (or whichever Beta Cluster node we're deploying to). Maybe it's time to make our beta node look like ores-staging, which has 16 GB of memory.

Alternatively, we could reduce the number of workers from 8 to 4. I think we could still handle Beta-level traffic with that.
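
To make the memory-versus-workers trade-off concrete, here is a rough sizing sketch. Every figure in it is a placeholder: this thread does not say how much RAM the beta node has or how much each worker uses, so `per_worker_mb` and `reserved_mb` are hypothetical values that would need to be measured on the host. Only the 16 GB total is taken from the ores-staging comparison above.

```python
def max_workers(total_mb: int, reserved_mb: int, per_worker_mb: int) -> int:
    """Return how many workers fit in memory after reserving headroom.

    total_mb:      physical memory on the node
    reserved_mb:   memory kept free for the OS and other services (assumed)
    per_worker_mb: resident memory of one worker with models loaded (assumed)
    """
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    usable_mb = total_mb - reserved_mb
    # Integer division: a fractional worker does not fit.
    return max(0, usable_mb // per_worker_mb)


# Hypothetical node with 8 GB, 2 GB reserved, 1500 MB per worker.
print(max_workers(8192, 2048, 1500))   # 4

# Same assumptions on a 16 GB node, as on ores-staging.
print(max_workers(16384, 2048, 1500))  # 9
```

In Celery itself the pool size is set through the `worker_concurrency` setting (or the `--concurrency` command-line option); how the ORES deployment config maps onto that is not shown in this thread.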

Looking at /srv/log/ores/app.log, we've been down for at least 2 weeks. Any useful evidence has been rotated out of logs at this point. Let's let it crash again and try to check the diagnostics sooner?
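
One way to check the diagnostics sooner, and to confirm or rule out the out-of-memory theory, is to scan the kernel log for OOM-killer entries before they rotate out. The sketch below assumes the kernel writes lines of the form `Out of memory: Kill process <pid> (<name>)` or, on newer kernels, `Out of memory: Killed process <pid> (<name>)`. The log path in the usage comment is also an assumption and may differ on the beta hosts.

```python
import re
from typing import Iterable, List, Tuple

# Matches both the older "Kill process" and newer "Killed process" wording.
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (pid, process name) for every OOM-killer entry in the lines."""
    hits = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


# Usage (path is an assumption; adjust for the host):
# with open("/var/log/kern.log", errors="replace") as log:
#     for pid, name in find_oom_kills(log):
#         print(f"OOM killer terminated {name} (pid {pid})")
```

If a celery process shows up in the results, that would support the memory explanation; an empty result would suggest looking elsewhere, for example at the Puppet and disk issues mentioned earlier.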

Mentioned in SAL (#wikimedia-releng) [2018-01-09T08:51:51Z] <Amir1> ladsgroup@deployment-tin:~$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=enwiki (T184276)

mmodell changed the subtype of this task from "Task" to "Production Error". (Aug 28 2019, 11:09 PM)