SGE master down
Closed, ResolvedPublic
Actions

Assigned To

None

Authored By

	valhallasw
	Jan 7 2016, 10:06 AM

Description

valhallasw@tools-bastion-01:~/ldap$ qstat
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error

valhallasw@tools-bastion-01:~/ldap$ ping tools-grid-master
PING tools-grid-master.tools.eqiad.wmflabs (10.68.20.158) 56(84) bytes of data.
64 bytes from tools-grid-master.tools.eqiad.wmflabs (10.68.20.158): icmp_seq=1 ttl=64 time=0.381 ms

Related Objects
Search...

Status	Assigned	Task
Resolved	None	T123034 SGE master down
Resolved	None	T122638 GridEngine down due to bdb issues
Resolved	valhallasw	T124586 Grid Engine down

Event Timeline

valhallasw created this task.Jan 7 2016, 10:06 AM

valhallasw raised the priority of this task from to Unbreak Now!.

valhallasw updated the task description. (Show Details)

valhallasw added a project: Toolforge.

valhallasw subscribed.

Restricted Application added a project: Cloud-Services. · View Herald TranscriptJan 7 2016, 10:06 AM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

BDB issues again. From /data/project/.system/gridengine/spool/qmaster/messages:

01/07/2016 09:48:45|worker|tools-grid-master|E|The job -j of user(s) tools.toolschecker does not exist
01/07/2016 09:48:45|worker|tools-grid-master|W|unable to find job 2050995 from the scheduler order package
01/07/2016 09:48:45|worker|tools-grid-master|W|Skipping remaining 0 orders
01/07/2016 09:48:45|schedu|tools-grid-master|E|unable to find job 2050995 from the scheduler order package
01/07/2016 09:48:53|worker|tools-grid-master|W|job 2005690.1 failed on host tools-exec-1407.eqiad.wmflabs rescheduling because: 01/07/2016 09:48:53 [600:18080]: exit_status of job start = 99
01/07/2016 09:48:53|worker|tools-grid-master|W|rescheduling job 2005690.1
01/07/2016 09:49:50|worker|tools-grid-master|W|unable to find job 2051014 from the scheduler order package
01/07/2016 09:49:50|worker|tools-grid-master|W|Skipping remaining 0 orders
01/07/2016 09:49:50|schedu|tools-grid-master|E|unable to find job 2051014 from the scheduler order package
01/07/2016 09:51:05|worker|tools-grid-master|E|error writing object with key "USER:tools.zkbot" into berkeley database: (22) Invalid argument
01/07/2016 09:51:05|worker|tools-grid-master|W|aborting transaction (rollback)
01/07/2016 09:51:34|worker|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/07/2016 09:51:34| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/07/2016 09:51:34|worker|tools-grid-master|E|error starting a transaction: (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
01/07/2016 09:51:34| timer|tools-grid-master|E|error checkpointing berkeley db: (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
01/07/2016 09:51:34|worker|tools-grid-master|W|rule "default rule" in spooling context "berkeleydb spooling" failed writing an object
01/07/2016 09:51:34| timer|tools-grid-master|E|trigger function of rule "default rule" in context "berkeleydb spooling" failed
01/07/2016 09:51:34|worker|tools-grid-master|E|error writing object "2051074.1" to spooling database

Issues started early this morning:

01/07/2016 04:05:50|worker|tools-grid-master|W|rescheduling job 2005690.1
01/07/2016 04:06:11|worker|tools-grid-master|E|error writing object with key "USER:tools.zkbot" into berkeley database: (22) Invalid argument
01/07/2016 04:06:11|worker|tools-grid-master|W|aborting transaction (rollback)
01/07/2016 04:06:34| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/07/2016 04:06:34| timer|tools-grid-master|E|error checkpointing berkeley db: (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
01/07/2016 04:06:34| timer|tools-grid-master|E|trigger function of rule "default rule" in context "berkeleydb spooling" failed

but I'm unsure what the recovery steps would be at the moment.

zhuyifei1999 subscribed.Jan 7 2016, 10:19 AM

I tried the following:

make a backup of the database,
run db_recover on that,
move the recovered database in place

but this did not solve the issue. However, moving the original data files back and restarting gridengine-master somehow did bring it back online, although not in a completely sane state (e.g. '01/07/2016 10:20:17| timer|tools-grid-master|E|error checkpointing berkeley db: (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery'). Nonetheless, it's working again for now. Hopefully @yuvipanda can take a look once the sun rises in California.

Augh. And <3 to @valhallasw for dealing with it!

I did a bunch of tests and everything seems ok. I'll check back on it tomorrow to see if... anything seems amiss.

Not all things are ok,

qstat -->

1958791 0.30662 lighttpd-p tools.phetoo dr 01/04/2016 23:48:09 webgrid-lighttpd@tools-webgrid 1

it runs on tools-webgrid-lighttpd-1205 but that task no longer exists on this server, webserver restart doesn't work so I've no way to get a working webservice, qdel -f 1958791 does nothing too.

task 2052506 on the same server (dplbot) seems to have the same trouble

Force-deleted job 1958791.

Dplbots webservice is running:

valhallasw@tools-bastion-01:~$ ssh tools-webgrid-lighttpd-1205.eqiad.wmflabs ps aux | grep dpl
51290    13527  0.0  0.0  52160  2844 ?        Ss   11:14   0:01 /usr/sbin/lighttpd -f /var/run/lighttpd/dplbot.conf -D

valhallasw mentioned this in T123072: tools-proxy lost dplbot.Jan 7 2016, 4:22 PM

but marked as Deleting on toollabs status page since some hours:

2052506 lighttpd-dplbot dplbot Webgrid-lighttpd / Deleting 2016-01-07 11:14:53 28s 1/3

I am still missing the execution of some diverse cronjobs last days, and I am not very lucky about it, because fellow Wikipedians rely on me and the working of my bots to set time as in the crontab stated.

(No error outputs in .out files at all.)

@doctaxon: You have been a Wikipedian now for more than ten years (congratulations and thanks!), and I have seen you around the Toolserver and Toolforge for probably half that time at the very least. With that experience, do you think that "I am still missing the execution of some diverse cronjobs last days" is in any way, shape or form a helpful comment to identify an issue? Do you believe that anyone believes that in all that time you never came across https://www.mediawiki.org/wiki/How_to_report_a_bug or any other similar essay (I personally find http://www.chiark.greenend.org.uk/~sgtatham/bugs.html more appealing)?

But there is nothing more I can say about it.

/data/project/taxonbot

The cronjob beta-newday.tcl should start 0:00 UTC today, but it did not without any error report in beta-newday.out
The cronjob lkday.tcl should start 23:00 UTC January 6th and 7th, but it did not without any error report in lkday.out
- but same time, 23:00 UTC, two other cronjobs did start: autoarchiv0.tcl and autoarchiv-voy.tcl
Running jobs didn't give back an error report this time in their .out log.

I hope, that is better, but I have very less informations to help you. I really know, that is very difficult, sorry for that!
Maybe it's not SGE but instable crontab?

doctaxon set Security to None.Jan 10 2016, 10:42 AM

doctaxon added a subscriber: Giftpflanze.

Not sure if it's tracked or not, but there was a similar outage yesterday (Jan 9), between 17:25:05 UTC and 17:50:07 UTC:

tools-exec-1206: Sat Jan 9 17:25:05 UTC 2016
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error
error: commlib error: got select error (Connection refused)
Unable to run job: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error.
Exiting.
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error
error: commlib error: got select error (Connection refused)
Unable to run job: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error.
Exiting.
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error
error: commlib error: got select error (Connection refused)
Unable to run job: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error.
Exiting.
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error
error: commlib error: got select error (Connection refused)
Unable to run job: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error.
Exiting.
tools-exec-1217: Sat Jan 9 17:50:07 UTC 2016

The previous one is on Jan 7 when this bug was reported:

tools-exec-1210: Thu Jan 7 09:40:08 UTC 2016
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error
error: commlib error: got select error (Connection refused)
Unable to run job: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error.
Exiting.
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error
error: commlib error: got select error (Connection refused)
Unable to run job: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error.
Exiting.
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error
error: commlib error: got select error (Connection refused)
Unable to run job: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error.
Exiting.
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error
error: commlib error: got select error (Connection refused)
Unable to run job: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error.
Exiting.
error: commlib error: got select error (Connection refused)
error: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error
error: commlib error: got select error (Connection refused)
Unable to run job: unable to send message to qmaster using port 6444 on host "tools-grid-master.tools.eqiad.wmflabs": got send error.
Exiting.
tools-exec-1215: Thu Jan 7 10:20:09 UTC 2016

In T123034#1923240, @doctaxon wrote:

But there is nothing more I can say about it.

/data/project/taxonbot

The cronjob beta-newday.tcl should start 0:00 UTC today, but it did not without any error report in beta-newday.out

The cronjob lkday.tcl should start 23:00 UTC January 6th and 7th, but it did not without any error report in lkday.out

but same time, 23:00 UTC, two other cronjobs did start: autoarchiv0.tcl and autoarchiv-voy.tcl

Running jobs didn't give back an error report this time in their .out log.

I hope, that is better, but I have very less informations to help you.

Yes, this is already much more useful: it tells us which project we're talking about, which cronjob (although it would have been even better if you had added the actual command from the cronjob), and which files should contain information about these jobs. Finally, it would be useful to know whether you can submit them to the queue outside of cron successfully, and whether these commands can be run safely at other times of the day.

For what it's worth: I tried running beta-newday with the same command as in the crontab, which started without any issues. That suggests either something transient with the grid (but then you should have received an e-mail with an error, I think), or something with cron. From the cron logs, it indeed seems the job has not fired:

Jan  9 23:58:01 tools-submit CRON[23166]: (tools.taxonbot) CMD (jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g -N aan ^I^I^I^Iautoarchiv0.tcl)
Jan 10 00:00:31 tools-submit CRON[25797]: (tools.taxonbot) CMD (jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g^I^I^I^I^I^I^Ilog-log.tcl)
Jan 10 00:00:32 tools-submit CRON[25846]: (tools.taxonbot) CMD (jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g ^I^I^I^I^I^I^Iaa-vm.tcl)
Jan 10 00:00:32 tools-submit CRON[25856]: (tools.taxonbot) CMD (jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g -N aan ^I^I^I^Iautoarchiv0.tcl)
Jan 10 00:00:32 tools-submit CRON[25860]: (tools.taxonbot) CMD (jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g ^I^I^I^I^I^I^Ic-uncat.tcl)
Jan 10 00:02:01 tools-submit CRON[26393]: (tools.taxonbot) CMD (jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g^I^I^I^I^I^I^Ilog-log.tcl)

So: not an SGE issue, but another issue altogether.

Maybe crontab is instable?

running the cronjob at 10.01.2016 12:39:35 UTC
but the cronjob should run 10.01.2016 00:00:00 UTC

crontab:
JSUB_OPTIONS=-once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g
0 0 * * * jsub -once -j y -quiet -v LC_ALL=en_US.UTF-8 -mem 1g beta-newday.tcl

beta-newday.out:
10.01.2016 12:39:35 +0000
edit {result Success pageid 63277 title {User talk:Doc Taxon} contentmodel wikitext oldrevid 164971 newrevid 164972 newtimestamp 2016-01-1
0T12:39:45Z}

(all timestamps are UTC)

No, that was me running it manually.

In T123034#1923254, @zhuyifei1999 wrote:

Not sure if it's tracked or not, but there was a similar outage yesterday (Jan 9), between 17:25:05 UTC and 17:50:07 UTC:

Thanks, that's super useful information. From looking at the logs:

The entire day, the scheduler had been confused about 'changed order/project':

01/09/2016 17:29:59|worker|tools-grid-master|W|scheduler sent an order for a changed user/project "tools.ws-cat-browser" (version: old 118957) new 118958
01/09/2016 17:29:59|worker|tools-grid-master|W|scheduler sent an order for a changed user/project "tools.oojs-ui" (version: old 119479) new 119480
01/09/2016 17:29:59|worker|tools-grid-master|W|scheduler sent an order for a changed user/project "tools.ib2test" (version: old 119916) new 119917
01/09/2016 17:30:04|worker|tools-grid-master|W|scheduler sent an order for a changed user/project "tools.ws-cat-browser" (version: old 118957) new 118958
01/09/2016 17:30:04|worker|tools-grid-master|W|scheduler sent an order for a changed user/project "tools.oojs-ui" (version: old 119479) new 119480
01/09/2016 17:30:04|worker|tools-grid-master|W|scheduler sent an order for a changed user/project "tools.ib2test" (version: old 119916) new 119917

Then at 17:30:

01/09/2016 17:30:04|worker|tools-grid-master|E|error writing object with key "USER:tools.huji" into berkeley database: (22) Invalid argument
01/09/2016 17:30:04|worker|tools-grid-master|W|aborting transaction (rollback)
01/09/2016 17:30:05| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/09/2016 17:30:05| timer|tools-grid-master|E|error checkpointing berkeley db: (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
01/09/2016 17:30:05| timer|tools-grid-master|E|trigger function of rule "default rule" in context "berkeleydb spooling" failed
[... nothing for 15 minutes. Does it get restarted here? ...]
01/09/2016 17:46:22|  main|tools-grid-master|W|local configuration tools-grid-master.tools.eqiad.wmflabs not defined - using global configuration
01/09/2016 17:46:22|  main|tools-grid-master|W|can't resolve host name "tools-master.tools.eqiad.wmflabs": undefined commlib error code
01/09/2016 17:46:22|  main|tools-grid-master|W|can't resolve host name "tools-shadow.tools.eqiad.wmflabs": undefined commlib error code
01/09/2016 17:46:23|  main|tools-grid-master|I|read job database with 964 entries in 1 seconds
01/09/2016 17:46:23|  main|tools-grid-master|W|removing reference to no longer existing job 2129060 of user "aka"
[... more references being removed ...]
01/09/2016 17:46:23|  main|tools-grid-master|I|qmaster hard descriptor limit is set to 8192
01/09/2016 17:46:23|  main|tools-grid-master|I|qmaster soft descriptor limit is set to 8192
01/09/2016 17:46:23|  main|tools-grid-master|I|qmaster will use max. 8172 file descriptors for communication
01/09/2016 17:46:23|  main|tools-grid-master|I|qmaster will accept max. 99 dynamic event clients
01/09/2016 17:46:23|  main|tools-grid-master|I|starting up GE 6.2u5 (lx26-amd64)
01/09/2016 17:46:24|worker|tools-grid-master|E|error writing object with key "USER:tools.anomiebot" into berkeley database: (22) Invalid argument
01/09/2016 17:46:24|worker|tools-grid-master|W|aborting transaction (rollback)
01/09/2016 17:46:37|worker|tools-grid-master|W|job 2005690.1 failed on host tools-exec-1406.eqiad.wmflabs rescheduling because: 01/09/2016 17:33:35 [600:16461]: exit_status of job start = 99
01/09/2016 17:46:37|worker|tools-grid-master|W|rescheduling job 2005690.1
01/09/2016 17:46:37|worker|tools-grid-master|W|scheduler tries to change tickets of a non running job 2005690 task 1(state 0)
01/09/2016 17:46:37|schedu|tools-grid-master|W|scheduler tries to change tickets of a non running job 2005690 task 1(state 0)
01/09/2016 17:47:23| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/09/2016 17:47:23| timer|tools-grid-master|E|error checkpointing berkeley db: (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
01/09/2016 17:47:23| timer|tools-grid-master|E|trigger function of rule "default rule" in context "berkeleydb spooling" failed

and then everything seems OK again...

valhallasw mentioned this in T123186: tools.taxonbot and tools.giftbot cronjobs not firing.Jan 10 2016, 12:58 PM

Luke081515 subscribed.Jan 12 2016, 5:54 PM

valhallasw mentioned this in T122638: GridEngine down due to bdb issues.Jan 12 2016, 8:03 PM

Any news? Still an issue?

valhallasw closed this task as Resolved.Feb 28 2016, 5:25 PM

valhallasw added a subtask: T122638: GridEngine down due to bdb issues.

• Phabricator_maintenance removed a subscriber: yuvipanda.Jun 7 2017, 6:45 PM