
Beta Cluster ORES celery worker dies
Closed, Resolved · Public


Beta ORES reports an overload condition:

The Celery worker has been dying. Diagnose and correct for long-term stability.

According to logstash-beta, there are several exceptions being thrown on ORES.

[Wk@HawpEFhUAAEprrgMAAAAH] /rpc/RunSingleJob.php   RuntimeException from line 96 of /srv/mediawiki/php-master/extensions/ORES/includes/Api.php: Failed to make ORES request to [], There was a problem during the HTTP request: 503 SERVICE UNAVAILABLE

That is one of them, but you should check the fatalmonitor board for more.

Event Timeline

Restricted Application added a project: Scoring-platform-team. Jan 5 2018, 2:19 PM
Restricted Application added a subscriber: Aklapper.

awight added a subscriber: awight. Jan 5 2018, 2:25 PM

@MarcoAurelio Thanks for the report!

Our Celery worker died three days ago, probably because it ran out of memory. It's not the first time we've seen this with beta ORES. I'll repurpose this task to find a long-term fix.

awight renamed this task from Flood of ORES errors at Beta Cluster to Beta Cluster ORES celery worker dies. Jan 5 2018, 2:26 PM
awight updated the task description.

Dear @awight; thanks for your quick response. Yesterday @Krenair was discussing at -releng that there were a number of Puppet errors on all Beta Cluster machines such as full disks, missing or wrong hieradata, etc. Maybe worth having a look at those as well? Nota bene: I hardly know puppet stuff so apologies if I'm wrong. Regards.

Halfak added a subscriber: Halfak. Jan 5 2018, 2:31 PM

It looks like we might need more memory on sca03 (or whatever beta cluster node we're deploying to). Maybe it's time to make our beta node look like ores-staging (which has 16 GB of memory).

Halfak added a comment. Jan 5 2018, 2:37 PM

Alternatively, we could reduce the number of workers from 8 to 4. I think we could still handle beta-capacity with that.
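A rough way to sanity-check the trade-off between adding memory and cutting workers, as a sketch only: if each worker process holds its own copy of the models, worst-case memory grows linearly with the worker count. The per-worker footprint and reserved-memory figures below are hypothetical placeholders, not measurements from the beta node.

```python
def worst_case_mb(workers: int, per_worker_mb: int) -> int:
    """Upper bound on worker memory if every worker is fully loaded."""
    return workers * per_worker_mb


def max_workers(total_mem_mb: int, per_worker_mb: int, reserved_mb: int = 0) -> int:
    """Largest worker count whose combined footprint fits in usable memory."""
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    usable_mb = total_mem_mb - reserved_mb
    return max(usable_mb // per_worker_mb, 0)


if __name__ == "__main__":
    PER_WORKER_MB = 900  # hypothetical footprint per worker, measure before relying on it
    RESERVED_MB = 512    # hypothetical headroom for the OS and other services

    for workers in (8, 4):
        print(workers, "workers ->", worst_case_mb(workers, PER_WORKER_MB), "MB worst case")

    # 16 GB matches the ores-staging figure mentioned above.
    print("fits in 16 GB:", max_workers(16 * 1024, PER_WORKER_MB, RESERVED_MB), "workers")
```

With real per-worker numbers in hand, the resulting count could be applied through Celery's `--concurrency` worker option; how ORES wires that setting into its own configuration is not shown here.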

awight added a comment. Jan 5 2018, 2:37 PM

Looking at /srv/log/ores/app.log, we've been down for at least 2 weeks. Any useful evidence has been rotated out of logs at this point. Let's let it crash again and try to check the diagnostics sooner?
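To catch the evidence before logs rotate next time, one option is to scan kernel log output for OOM-killer messages soon after a crash. The sketch below assumes the common Linux kernel wording ("Out of memory: Kill process ..." or "Out of memory: Killed process ...", depending on kernel version); the sample lines are illustrative, not taken from the beta node.

```python
import re
from typing import Iterable, List, Tuple

# Matches the OOM-killer announcement; the wording varies slightly by kernel version.
OOM_PATTERN = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (pid, process name) for every OOM-killer line found."""
    hits = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


# Illustrative sample only; real input would come from the kernel log.
SAMPLE_LINES = [
    "Jan  2 03:14:07 host kernel: Out of memory: Kill process 1234 (celery) score 912 or sacrifice child",
    "Jan  2 03:14:08 host kernel: some unrelated message",
]

if __name__ == "__main__":
    for pid, name in find_oom_kills(SAMPLE_LINES):
        print(f"OOM kill: pid={pid} process={name}")
```

Feeding it the output of `dmesg` or the host's kernel log file (the location varies by distribution) would show whether a celery process was killed for memory, which would confirm or rule out the out-of-memory theory.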

Mentioned in SAL (#wikimedia-releng) [2018-01-09T08:51:51Z] <Amir1> ladsgroup@deployment-tin:~$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=enwiki (T184276)

Restricted Application added a project: User-Ladsgroup. Jan 9 2018, 9:37 AM
Ladsgroup moved this task from Incoming to Done on the User-Ladsgroup board. Jan 12 2018, 12:15 PM
Halfak closed this task as Resolved.