Grid Engine down
Closed, Resolved · Public

Description

I get errors like:

tools.giftbot@tools-bastion-01:~$ qstat 
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error

Event Timeline

Giftpflanze raised the priority of this task to Unbreak Now!
Giftpflanze updated the task description.
Giftpflanze added a project: Toolforge.
Giftpflanze added a subscriber: Giftpflanze.
Restricted Application added a project: Cloud-Services. · Jan 24 2016, 8:18 AM
Restricted Application added a subscriber: Aklapper.
valhallasw closed this task as Resolved. · Jan 24 2016, 9:59 AM
valhallasw claimed this task.
valhallasw added a subscriber: valhallasw.

Last information in messages:

01/24/2016 04:00:36| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/24/2016 04:00:36| timer|tools-grid-master|E|error checkpointing berkeley db: (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
01/24/2016 04:00:36| timer|tools-grid-master|E|trigger function of rule "default rule" in context "berkeleydb spooling" failed
01/24/2016 04:00:39|worker|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/24/2016 04:00:39|worker|tools-grid-master|E|error starting a transaction: (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
01/24/2016 04:00:39|worker|tools-grid-master|E|transaction function of rule "default rule" in context "berkeleydb spooling" failed
01/24/2016 04:00:39|worker|tools-grid-master|W|scheduler sent an order for a changed user/project "tools.hashtags" (version: old 38292) new 38293

but there is nothing to indicate a shutdown. The master is not running:

valhallasw@tools-grid-master:~$ ps aux | grep sge
valhall+ 22410  0.0  0.0  10432   672 pts/0    S+   09:57   0:00 grep --color=auto sge

So the restarts we disabled in T122638 (GridEngine down due to bdb issues) might not actually be the cause of the problem but rather an effect: SGE gets restarted after it dies from database corruption.

In any case, running service gridengine-master start seems to have solved the issue for now.
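The two manual checks above (is the master process alive, and does the messages log show Berkeley DB corruption) could be wrapped in a pair of small helpers. This is only a rough sketch: the function names are made up for illustration, the process name sge_qmaster is assumed to be what the master daemon runs as, and the location of the messages file is not given in this task, so it has to be passed in.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Grid Engine or the Toolforge tooling.

# Print the number of lines in the given messages file that mention
# DB_RUNRECOVERY (the Berkeley DB error seen in the log excerpt above).
# The file path is an argument because its location is not stated here.
count_bdb_errors() {
    grep -c 'DB_RUNRECOVERY' "$1"
}

# Succeed if a process with exactly the given name is running.
# Defaults to "sge_qmaster", which is an assumption about the daemon name.
# Unlike "ps aux | grep sge", this does not match the grep process itself.
master_running() {
    pgrep -x "${1:-sge_qmaster}" >/dev/null 2>&1
}
```

With these defined, something like `master_running || count_bdb_errors /path/to/messages` would print the number of corruption entries only when the master is down; the path there is a placeholder. Whether to then run the start command from this task is left as a manual decision, since restarting on top of a corrupted database may just repeat the failure.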