GridEngine down due to bdb issues
Closed, Resolved · Public

Description

Investigation ongoing, but an explosion of jobs submitted to gridengine caused it to crash.

Event Timeline

yuvipanda raised the priority of this task to Unbreak Now!.
yuvipanda updated the task description.
yuvipanda added a subscriber: yuvipanda.

So it looks like the tool hcclab accidentally submitted about... 12,000,000 jobs (12 million jobs). This kills the gridengine. We are probably going to lose all job state info and start with a clean slate.

FWIW, from the DB transaction logs, the master seems to have valiantly struggled for about 90 minutes before it finally went down (and I expect the survival was made harder by the simultaneous LDAP and NFS issues). It might be worthwhile to add something to monitor really unusual growth of the BDB and alert.
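
Something along these lines (just a sketch, not something we run today -- the threshold, state file and mail alert are all placeholders) could be cronned to catch that kind of growth:

#!/bin/bash
# Sketch: alert when the gridengine BDB spool grows unusually fast.
# Threshold, state file and the mail alert are placeholders.
SPOOL=/var/spool/gridengine/spooldb
STATE=/var/tmp/gridengine-bdb-size.last
THRESHOLD_KB=$((100 * 1024))    # alert on >100 MB growth since the last run

current=$(du -sk "$SPOOL" | cut -f1)
previous=$(cat "$STATE" 2>/dev/null || echo "$current")
echo "$current" > "$STATE"

if [ $((current - previous)) -gt "$THRESHOLD_KB" ]; then
    echo "spooldb grew from ${previous}K to ${current}K since the last check" \
        | mail -s "gridengine BDB growth alert" root
fi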

I'm also wondering if more up-to-date backups of the job db would help.

For backups to be useful for recovery of actually running jobs, it'd have to be very frequent indeed - given the very dynamic nature of the job DB (it's mostly a running scoreboard) I expect anything over 5-6 minutes would be of limited value.
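
If we did want something on that cadence, a 5-minute cron along these lines could work (a sketch only -- the db_hotbackup utility name varies by distro, and the destination path is just an example):

#!/bin/bash
# Sketch: 5-minute hot backup of the BDB spool with Berkeley DB's db_hotbackup
# (the binary may be called db5.1_hotbackup or similar depending on the distro).
# Paths are examples, not the current production layout.
SPOOL=/var/spool/gridengine/spooldb
DEST=/data/project/.system/gridengine/backups/spooldb-$(date +%Y%m%dT%H%M)

mkdir -p "$DEST"
db_hotbackup -h "$SPOOL" -b "$DEST"

# keep only the last 12 snapshots (~1 hour)
ls -dt /data/project/.system/gridengine/backups/spooldb-* | tail -n +13 | xargs -r rm -rf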

Apparently hcclab has caused issues before: T92614: Network port saturated for labstore1001, but it seems Ebekebe and Lbenedix were not contacted about this at that time.

http://wm-bot.wmflabs.org/logs/%23wikimedia-labs/20150313.txt

[11:28:34] <wikibugs>	 Labs, operations: Network port saturated for labstore1001 - https://phabricator.wikimedia.org/T92614#1116297 (yuvipanda) NEW
[11:57:51] <paravoid>	 YuviPanda: biggest user is 10.68.17.144, which has no reverse DNS
[11:58:28] <YuviPanda>	 paravoid: so I’ve been poking around nethogs for a while. looks like it’s coming from the tools exec nodes but not all of them
[11:58:47] <paravoid>	 yes, 11 & 14
[11:59:24] <YuviPanda>	 paravoid: it flips if you look long enough (used to be 07 and 08 when I started looking)
[11:59:29] <YuviPanda>	 I’m trying to find the offending script
[12:00:10] <paravoid>	 hcclab?
[12:01:01] <YuviPanda>	 paravoid: on which host?
[12:01:05] <paravoid>	 tools.hcclab
[12:01:18] <paravoid>	 11 & 14
[12:01:22] <YuviPanda>	 holy shit
[12:01:22] <paravoid>	 tons of
[12:01:23] <paravoid>	 52455    29445  0.5  0.0  32164  1400 ?        D    12:01   0:00 cat /data/project/hcclab/data/chunks/chunk-389.xml
[12:01:25] <YuviPanda>	 yeah
[12:01:30] <YuviPanda>	 let me kill them all
[12:01:36] <paravoid>	 that being 500M
[12:01:57] <paravoid>	 fwiw, iftop on labstore to find the network offender
[12:02:02] <paravoid>	 then iotop on the VMs themselves
[12:02:04] <paravoid>	 then ps
[12:02:09] <paravoid>	 that's all I used, it took me ~5mins
[12:03:03] <YuviPanda>	 paravoid: hmm, so I used nethogs, and then I looked at -07 and -08, but then there was nothing, and I went back and it had moved.
[12:03:15] <YuviPanda>	 paravoid: so that person had submitted a *lot* of jobs, and they were being scheduled everywhere...
[12:03:16] <shinken-wm>	 PROBLEM - Puppet failure on tools-exec-15 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [0.0]  
[12:03:31] <YuviPanda>	 like a thousand or more jobs...
[12:03:39] <paravoid>	 ouch
[12:04:42] <YuviPanda>	 paravoid: I’ve killed them all
[12:06:53] <YuviPanda>	 paravoid: I didn’t know about iftop, thanks :) I was installing and using nethogs...
[12:07:14] <YuviPanda>	 paravoid: and network traffic back to normal

What is unclear to me at this point is what commands were issued when and by whom. Was this a cronjob gone wrong? The tool owners haven't logged in to any tools-bastion recently.

Sadly, that's information I was not able to parse from the transaction logs; the records in the BDB are in an undocumented binary format. While I could tell that over 99.9% of the more than 12 million operations over those 90 minutes were attached to the tools.hcclab user, I could not see what the operations actually were.

The issue seems to have coincided with a weird jump in job ids. This has happened before (at the beginning of December), and it looks roughly like this. (I'm using lighttpd-wikiviewstats because it gets restarted every 20 seconds and fails each time -- which is a bug in webservicemonitor, but quite useful for this):

0 2015-12-29 20:16:02 tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs tools.wikiviewstats lighttpd-wikiviewstats 547728
0 2015-12-29 20:16:16 tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs tools.wikiviewstats lighttpd-wikiviewstats 547746
0 2015-12-29 20:16:31 tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs tools.wikiviewstats lighttpd-wikiviewstats 547748
0 2015-12-29 20:16:48 tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs tools.wikiviewstats lighttpd-wikiviewstats 547749
0 2015-12-29 20:18:45 tools-webgrid-lighttpd-1403.eqiad.wmflabs tools.wikiviewstats lighttpd-wikiviewstats 9999897
0 2015-12-29 20:18:59 tools-webgrid-lighttpd-1403.eqiad.wmflabs tools.wikiviewstats lighttpd-wikiviewstats 9999899
0 2015-12-29 20:19:13 tools-webgrid-lighttpd-1403.eqiad.wmflabs tools.wikiviewstats lighttpd-wikiviewstats 9999901
(...)
0 2015-12-29 20:42:53 tools-webgrid-lighttpd-1403.eqiad.wmflabs tools.wikiviewstats lighttpd-wikiviewstats 9999999
0 2015-12-29 20:43:13 tools-webgrid-lighttpd-1403.eqiad.wmflabs tools.wikiviewstats lighttpd-wikiviewstats 4

which actually shows another interesting tidbit: there's 20 seconds between restarts, but between 547749 and 9999897 there's a two-minute gap.

That is consistent with something eating up job ids at a very high pace; anything putting that high a load on the master would delay every other attempt to start jobs.

So.. good news and bad news. The good news: it seems we have a job explosion more often, and it doesn't typically bring down SGE. The bad news: it seems we have a job explosion more often.

This is a graph of the job id versus time of the last three months:

pasted_file (277×387 px, 10 KB)

The normal job id increase rate is fairly constant, but we have very short bursts where the job id jumps to ~9M, then resets to 0.

I'm not 100% certain the job id increase also means there are actual submissions, but it's definitely odd.
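
A crude way to catch these bursts automatically would be to scan the accounting file for consecutive records whose job numbers jump by a large amount (a sketch: the accounting path and the field numbers follow the standard accounting(5) layout and would need checking against our setup):

#!/bin/bash
# Sketch: flag large jumps (or resets) in job numbers between consecutive
# accounting records. Per accounting(5), field 6 of the colon-separated file
# is job_number and field 9 the submission time -- adjust if ours differs.
ACCT=/var/lib/gridengine/default/common/accounting

awk -F: '
    /^#/ { next }
    NR > 1 && prev > 0 {
        delta = $6 - prev
        if (delta > 100000 || delta < -100000)
            printf "jump of %d at submission time %s (job %d -> %d)\n", delta, $9, prev, $6
    }
    { prev = $6 }
' "$ACCT"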

I'm still finding pieces of information that don't make a lot of sense on their own, but maybe they'll start making sense together. This is the network tx/rx for the master and shadow host for the last month:

pasted_file (308×586 px, 37 KB)

which suggests the shadow host (brown line) was actually handling all the traffic, rather than the master (purple line).

One thing is odd, however: there does not seem to be any odd traffic related to the earlier job-id spikes:

pasted_file (308×586 px, 41 KB)

but this could also be an effect of averaging in the graphs; I'm not completely sure at what precision this data is available.

In-between note: the first occurrence of 'Corrupted' in the messages file is

12/30/2015 02:47:55| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.

and, even though we've started with a clean slate, we've had issues ever since (T123034 and https://wikitech.wikimedia.org/wiki/Incident_documentation/20160112-20160111-toollabs-SGE), and the log file is still full of corruption warnings.
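
Given that, something that alerts when new 'Corrupted database' lines appear in the qmaster messages file might be worth having (again only a sketch; the mail alert is a placeholder):

#!/bin/bash
# Sketch: alert when the number of "Corrupted database" lines in the qmaster
# messages file has grown since the previous run. The mail alert is a
# placeholder; a rotated messages file would reset the count.
MESSAGES=/var/spool/gridengine/qmaster/messages
STATE=/var/tmp/gridengine-corruption.count

current=$(grep -c 'Corrupted database' "$MESSAGES")
previous=$(cat "$STATE" 2>/dev/null || echo 0)
echo "$current" > "$STATE"

if [ "$current" -gt "$previous" ]; then
    echo "$((current - previous)) new 'Corrupted database' messages in $MESSAGES" \
        | mail -s "gridengine BDB corruption" root
fi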

Did anything change in terms of NFS/storage/... around 30 Dec?

Actually, there was something happening before that. Initially, all messages are from tools-grid-shadow, but suddenly tools-grid-master shows up:

12/29/2015 20:17:54|worker|tools-grid-shadow|W|scheduler sent an order for a changed user/project "tools.ftl" (version: old 1007392) new 1007393
12/29/2015 20:17:54|  main|tools-grid-master|W|can't resolve host name "tools-exec-1201.tools.eqiad.wmflabs": undefined commlib error code 1007391

[...a bunch of mixed messages like these...]

12/29/2015 20:17:54|worker|tools-grid-shadow|W|scheduler sent an order for a changed user/project "tools.refswikipedia" (version: old 105422) new 105423
12/29/2015 20:17:56|  main|tools-grid-master|I|read job database with 887 entries in 2 seconds
12/29/2015 20:17:56|  main|tools-grid-master|W|removing reference to no longer existing job 512548 of user "tools.alessiobot"

[... master goes ahead and removes 72 references in total ...]

12/29/2015 20:17:56|  main|tools-grid-master|I|qmaster hard descriptor limit is set to 8192
12/29/2015 20:17:56|  main|tools-grid-master|I|qmaster soft descriptor limit is set to 8192
12/29/2015 20:17:56|  main|tools-grid-master|I|qmaster will use max. 8172 file descriptors for communication
12/29/2015 20:17:56|  main|tools-grid-master|I|qmaster will accept max. 99 dynamic event clients
12/29/2015 20:17:56|  main|tools-grid-master|I|starting up GE 6.2u5 (lx26-amd64)
12/29/2015 20:17:57|listen|tools-grid-master|E|commlib error: can't resolve host name (can't resolve rdata hostname "tools-exec-1210.eqiad.wmflabs")
12/29/2015 20:17:57|listen|tools-grid-master|E|commlib error: local host name error (remote rdata host name "tools-exec-1210.eqiad.wmflabs" is not equal to local resolved host name "(HOST_NOT_RESOLVABLE)")
12/29/2015 20:17:58| timer|tools-grid-shadow|W|failed to deliver job 166557.1 to queue "task@tools-exec-1216.eqiad.wmflabs"

[... some more messages like that ...]

12/29/2015 20:17:58| timer|tools-grid-shadow|E|got max. unheard timeout for target "execd" on host "tools-exec-1213.eqiad.wmflabs", can't deliver job "166175"
12/29/2015 20:18:03|listen|tools-grid-master|E|commlib error: can't resolve host name (can't resolve rdata hostname "tools-exec-1207.eqiad.wmflabs")
12/29/2015 20:18:03|listen|tools-grid-master|E|commlib error: local host name error (remote rdata host name "tools-exec-1207.eqiad.wmflabs" is not equal to local resolved host name "(HOST_NOT_RESOLVABLE)")

[...]

12/29/2015 20:21:51| timer|tools-grid-shadow|W|act_qmaster file has been changed by another qmaster
12/29/2015 20:21:51|  main|tools-grid-shadow|E|jvm thread is not running
12/29/2015 20:21:53|  main|tools-grid-shadow|E|error writing object with key "USER:tools.danmicholobot" into berkeley database: (22) Invalid argument
12/29/2015 20:21:53|  main|tools-grid-shadow|E|error writing object with key "USER:tools.gerrit-reports" into berkeley database: (22) Invalid argument

[...]

12/29/2015 22:08:21|  main|tools-grid-master|E|jvm thread is not running

[... nothing after 22:08 until 00:30 ...]

NEWWed Dec 30 00:30:23 UTC 2015
12/30/2015 00:39:21|  main|tools-grid-shadow|E|couldn't open database environment for server "local spooling", directory "/var/spool/gridengine/spooldb": (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
12/30/2015 00:39:21|  main|tools-grid-shadow|E|startup of rule "default rule" in context "berkeleydb spooling" failed
12/30/2015 00:39:53|  main|tools-grid-master|E|couldn't open berkeley database "sge": (2) No such file or directory
12/30/2015 00:39:53|  main|tools-grid-master|E|startup of rule "default rule" in context "berkeleydb spooling" failed
12/30/2015 00:42:45|  main|tools-grid-master|E|couldn't open berkeley database "sge": (2) No such file or directory
12/30/2015 00:42:45|  main|tools-grid-master|E|startup of rule "default rule" in context "berkeleydb spooling" failed
12/30/2015 00:46:01|  main|tools-grid-master|E|couldn't open berkeley database "sge": (13) Permission denied
12/30/2015 00:46:01|  main|tools-grid-master|E|startup of rule "default rule" in context "berkeleydb spooling" failed
12/30/2015 00:46:35|  main|tools-grid-master|E|couldn't open berkeley database "sge": (22) Invalid argument
12/30/2015 00:46:35|  main|tools-grid-master|E|startup of rule "default rule" in context "berkeleydb spooling" failed
12/30/2015 00:47:22|  main|tools-grid-master|E|couldn't open berkeley database "sge": (22) Invalid argument
12/30/2015 00:47:22|  main|tools-grid-master|E|startup of rule "default rule" in context "berkeleydb spooling" failed

So: for some reason the grid master started around 20:18 on 29 Dec and happily started writing to files the shadow master had in use. That timing corresponds to the jump in job ids I noticed:

0 2015-12-29 20:16:48 tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs tools.wikiviewstats lighttpd-wikiviewstats 547749
0 2015-12-29 20:18:45 tools-webgrid-lighttpd-1403.eqiad.wmflabs tools.wikiviewstats lighttpd-wikiviewstats 9999897

where maybe one is the job counter on the master and one on the shadow host? I'm not entirely sure.
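
One way to catch this kind of split-brain earlier would be to compare the act_qmaster file (the one the shadow complains about at 20:21:51 above) with what is actually running locally. A sketch, with SGE_ROOT and the mail alert as assumptions:

#!/bin/bash
# Sketch: warn when sge_qmaster is running here although the act_qmaster file
# points at a different host, i.e. two masters may be fighting over the BDB.
SGE_ROOT=/var/lib/gridengine            # assumption; use the real SGE_ROOT
ACT=$(cat "$SGE_ROOT/default/common/act_qmaster")

if pgrep -x sge_qmaster >/dev/null && [ "$ACT" != "$(hostname -f)" ]; then
    echo "sge_qmaster running on $(hostname -f) but act_qmaster says $ACT" \
        | mail -s "possible duplicate gridengine master" root
fi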

So, to continue, the other jumps to high ids happened at:

Fri, 19 Jun 2015 05:20:50 GMT
Tue, 01 Sep 2015 06:18:02 GMT
[sge message log starts here]
Thu, 24 Sep 2015 13:56:02 GMT
Sat, 26 Sep 2015 03:04:02 GMT
Tue, 24 Nov 2015 18:28:35 GMT
Wed, 02 Dec 2015 18:33:45 GMT
Thu, 03 Dec 2015 19:06:36 GMT
Fri, 04 Dec 2015 16:04:43 GMT
Fri, 04 Dec 2015 19:33:57 GMT
Wed, 16 Dec 2015 08:55:37 GMT
Tue, 29 Dec 2015 20:18:28 GMT
Wed, 30 Dec 2015 03:22:03 GMT

What all of them seem to have in common is some sort of server crash/restart cycle:

$ grep /data/project/.system/gridengine/spool/qmaster/messages -e 'starting up GE'
09/24/2015 13:55:20|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
09/24/2015 14:49:06|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
09/24/2015 14:50:45|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
09/24/2015 14:51:44|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
09/24/2015 14:52:17|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
09/24/2015 14:53:33|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
09/24/2015 14:54:23|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
09/24/2015 14:54:55|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
09/24/2015 14:55:59|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
09/24/2015 14:57:08|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
09/24/2015 14:57:25|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
09/26/2015 03:03:55|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
09/26/2015 03:05:05|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
09/26/2015 03:12:21|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
11/24/2015 18:28:11|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
11/24/2015 18:29:24|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
11/24/2015 18:29:53|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
12/02/2015 18:33:45|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
12/03/2015 19:06:35|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
12/03/2015 19:20:17|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
12/03/2015 19:37:27|  main|tools-master|I|starting up GE 6.2u5 (lx26-amd64)
12/04/2015 16:04:42|  main|tools-grid-shadow|I|starting up GE 6.2u5 (lx26-amd64)
12/04/2015 19:33:56|  main|tools-grid-master|I|starting up GE 6.2u5 (lx26-amd64)
12/16/2015 08:55:36|  main|tools-grid-shadow|I|starting up GE 6.2u5 (lx26-amd64)
12/29/2015 20:17:56|  main|tools-grid-master|I|starting up GE 6.2u5 (lx26-amd64)
12/30/2015 02:46:55|  main|tools-grid-master|I|starting up GE 6.2u5 (lx26-amd64)
12/30/2015 03:21:55|  main|tools-grid-master|I|starting up GE 6.2u5 (lx26-amd64)

which, unfortunately, does not tell us which is the cause of which.
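
For eyeballing that, the 'starting up GE' timestamps can be normalized into the same format as the jump times listed above (a small sketch, assuming GNU date and that the qmaster log timestamps are UTC):

# Sketch: reformat the "starting up GE" timestamps so they can be compared
# side by side with the jump times listed above.
grep -e 'starting up GE' /data/project/.system/gridengine/spool/qmaster/messages \
    | awk -F'|' '{ print $1, $3 }' \
    | while read -r d t host; do
          echo "$(date -u -d "$d $t" '+%a, %d %b %Y %H:%M:%S %Z') $host"
      done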

The self-enabling shadow makes me itchy, so I looked at tools-grid-shadow. It has the class role::labs::tools::shadow enabled, which includes toollabs::shadow, which includes gridengine::shadow_master, which – among many other things – has:

service { 'gridengine-shadow':
    ensure  => running,
    require => File['/etc/init/gridengine-shadow.conf', "${sgeroot}/default/common/shadow_masters"],
}

Yet there is no service running by that name and, more importantly, there are no Puppet messages about trying to restart it:

scfc@tools-grid-shadow:~$ sudo puppet agent -tv
Info: Retrieving plugin
Info: Loading facts in /var/lib/puppet/lib/facter/labsprojectfrommetadata.rb
Info: Loading facts in /var/lib/puppet/lib/facter/physicalcorecount.rb
Info: Loading facts in /var/lib/puppet/lib/facter/root_home.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_vardir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/ganeti.rb
Info: Loading facts in /var/lib/puppet/lib/facter/lldp.rb
Info: Loading facts in /var/lib/puppet/lib/facter/initsystem.rb
Info: Loading facts in /var/lib/puppet/lib/facter/puppet_config_dir.rb
Info: Loading facts in /var/lib/puppet/lib/facter/pe_version.rb
Info: Loading facts in /var/lib/puppet/lib/facter/apt.rb
Info: Caching catalog for tools-grid-shadow.tools.eqiad.wmflabs
Info: Applying configuration version '1452695636'
Notice: Finished catalog run in 24.31 seconds
scfc@tools-grid-shadow:~$

Which apparently is due to:

scfc@tools-grid-shadow:~$ sudo service gridengine-shadow status; echo $?
gridengine-shadow start/running, process 978
0
scfc@tools-grid-shadow:~$

But:

scfc@tools-grid-shadow:~$ ps 978
  PID TTY      STAT   TIME COMMAND
scfc@tools-grid-shadow:~$
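
A quick cross-check for this kind of stale upstart state could look like this (a sketch; the service name is just a default, and parsing the upstart status line is best-effort):

#!/bin/bash
# Sketch: detect upstart jobs that claim to be running under a PID that no
# longer exists (like gridengine-shadow and process 978 above).
svc="${1:-gridengine-shadow}"

pid=$(service "$svc" status 2>/dev/null | grep -o 'process [0-9]*' | awk '{ print $2 }')
if [ -n "$pid" ] && ! ps -p "$pid" > /dev/null 2>&1; then
    echo "$svc claims to be running as PID $pid, but no such process exists"
fi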

I'd prefer it if (in the mid to long term) we could use the same setup for the grid master as we use for the proxy, Redis, etc., i.e. that if the master is broken, it takes manual intervention to enable the shadow.

Change 264390 had a related patch set uploaded (by Yuvipanda):
gridengine: Do not have puppet autostart the master

https://gerrit.wikimedia.org/r/264390

After much shenanigans, it was determined that Puppet, in its ignorance, was starting a second master process, which we believe was conflicting (it may have started more than one; I didn't test multiple Puppet runs):

root@tools-grid-master:/var/spool/gridengine# ps -ef | grep grid
sgeadmin 11372     1 18 20:30 ?        00:11:11 /usr/lib/gridengine/sge_qmaster
root     15127     1  0 21:31 ?        00:00:00 /usr/lib/gridengine/sge_qmaster
root     15136  3492  0 21:31 pts/2    00:00:00 grep --color=auto grid

Notice: /Stage[main]/Gridengine::Master/Service[gridengine-master]/ensure: ensure changed 'stopped' to 'running'
Info: /Stage[main]/Gridengine::Master/Service[gridengine-master]: Unscheduling refresh on Service[gridengine-master]
Notice: /Stage[main]/Toollabs::Master/Gridengine::Collectors::Queues[webgrid-generic]/Gridengine::Collector[webgrid-generic]/Exec[collect-webgrid-generic-resource]/returns: executed successfully
Notice: /Stage[main]/Gridengine::Master/Gridengine::Resourcedir[checkpoints]/Exec[track-checkpoints]/returns: executed successfully
Notice: Finished catalog run in 25.31 seconds

root@tools-grid-master:/var/spool/gridengine# ps -ef | grep grid
sgeadmin 15161     1 19 21:31 ?        00:00:58 /usr/lib/gridengine/sge_qmaster
root     16313     1  0 21:36 ?        00:00:00 /usr/lib/gridengine/sge_qmaster
root     16573  3492  0 21:36 pts/2    00:00:00 grep --color=auto grid
root@tools-grid-master:/var/spool/gridengine# kill -9 16313
root@tools-grid-master:/var/spool/gridengine# kill -9 15161
root@tools-grid-master:/var/spool/gridengine# ps -ef | grep grid
root     16575  3492  0 21:36 pts/2    00:00:00 grep --color=auto grid
root@tools-grid-master:/var/spool/gridengine# service gridengine-master start
root@tools-grid-master:/var/spool/gridengine# ps -ef | grep grid
sgeadmin 16597     1  7 21:36 ?        00:00:00 /usr/lib/gridengine/sge_qmaster
root     16603  3492  0 21:36 pts/2    00:00:00 grep --color=auto grid
root@tools-grid-master:/var/spool/gridengine#

root@tools-grid-master:/var/spool/gridengine# puppet agent --disable "PUPPET STARTS A DUPE MASTER PROC"
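
Until that's properly fixed, a trivial watchdog (sketch; the mail alert is a placeholder) could at least tell us when more than one sge_qmaster is running on the same host:

#!/bin/bash
# Sketch: alert when more than one sge_qmaster is running on this host --
# the "two masters writing to the same BDB" situation seen above.
count=$(pgrep -cx sge_qmaster)

if [ "${count:-0}" -gt 1 ]; then
    ps -C sge_qmaster -o pid=,user=,lstart=,args= \
        | mail -s "$(hostname -f): $count sge_qmaster processes running" root
fi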

We think maybe the lines:

01/15/2016 18:46:53| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.

are a symptom of overlapping masters trying to manage the same resources.

First instance I see is:

12/30/2015 02:47:55| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.

I haven't seen one for a bit now, but the last is:

01/15/2016 18:46:53| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.

(found via: grep -i corrupted /var/spool/gridengine/qmaster/messages | tail -n 1)

For reference, the current date is:

Fri Jan 15 22:07:26 UTC 2016

Change 264390 merged by Yuvipanda:
gridengine: Do not have puppet autostart the master

https://gerrit.wikimedia.org/r/264390

https://gerrit.wikimedia.org/r/264390 got merged a week ago - has this issue been seen again?
Is this task resolved? Or is more work needed?

It seems it's still happening:

01/16/2016 12:00:27|worker|tools-grid-master|E|error writing object with key "USER:tools.sbot" into berkeley database: (22) Invalid argument
01/16/2016 12:00:27|worker|tools-grid-master|W|aborting transaction (rollback)
01/16/2016 12:00:45| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/16/2016 12:00:45| timer|tools-grid-master|E|error checkpointing berkeley db: (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
01/16/2016 12:00:45| timer|tools-grid-master|E|trigger function of rule "default rule" in context "berkeleydb spooling" failed
01/16/2016 12:00:45|worker|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/16/2016 12:00:45|worker|tools-grid-master|E|error starting a transaction: (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
01/16/2016 12:00:45|worker|tools-grid-master|E|transaction function of rule "default rule" in context "berkeleydb spooling" failed
01/16/2016 12:00:45|worker|tools-grid-master|E|cannot close transaction: There is no open transaction
01/16/2016 12:00:45|worker|tools-grid-master|E|transaction function of rule "default rule" in context "berkeleydb spooling" failed
01/16/2016 12:03:42|worker|tools-grid-master|W|unable to find job 2381286 from the scheduler order package
01/17/2016 15:05:18|worker|tools-grid-master|E|error writing object with key "USER:tools.kotools" into berkeley database: (22) Invalid argument
01/17/2016 15:05:18|worker|tools-grid-master|W|aborting transaction (rollback)
01/17/2016 15:05:45| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/17/2016 15:05:45| timer|tools-grid-master|E|error checkpointing berkeley db: (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
01/17/2016 15:05:45| timer|tools-grid-master|E|trigger function of rule "default rule" in context "berkeleydb spooling" failed
01/17/2016 15:06:00|worker|tools-grid-master|W|job 2416877.1 failed on host tools-exec-1218.eqiad.wmflabs assumedly after job because: job 2416877.1 died through signal INT (2)
01/17/2016 15:08:45|worker|tools-grid-master|E|The job -j of user(s) tools.toolschecker does not exist
01/17/2016 15:08:54|worker|tools-grid-master|W|unable to find job 2416994 from the scheduler order package
01/17/2016 15:08:54|worker|tools-grid-master|W|Skipping remaining 0 orders
01/17/2016 15:08:54|schedu|tools-grid-master|E|unable to find job 2416994 from the scheduler order package
01/17/2016 15:09:19|worker|tools-grid-master|E|error writing object with key "USER:tools.kotools" into berkeley database: (22) Invalid argument
01/17/2016 15:09:19|worker|tools-grid-master|W|aborting transaction (rollback)
01/17/2016 15:09:45| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/17/2016 15:09:45| timer|tools-grid-master|E|error checkpointing berkeley db: (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
01/17/2016 15:09:45| timer|tools-grid-master|E|trigger function of rule "default rule" in context "berkeleydb spooling" failed
01/18/2016 00:08:05|worker|tools-grid-master|E|error writing object with key "USER:tools.giftbot" into berkeley database: (22) Invalid argument
01/18/2016 00:08:05|worker|tools-grid-master|W|aborting transaction (rollback)
01/18/2016 00:08:45| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/18/2016 00:08:45| timer|tools-grid-master|E|error checkpointing berkeley db: (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
01/18/2016 00:08:45| timer|tools-grid-master|E|trigger function of rule "default rule" in context "berkeleydb spooling" failed
01/18/2016 04:01:05|worker|tools-grid-master|E|error writing object with key "USER:tools.danmicholobot" into berkeley database: (22) Invalid argument
01/18/2016 04:01:05|worker|tools-grid-master|W|aborting transaction (rollback)
01/18/2016 04:01:45| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/18/2016 04:01:45| timer|tools-grid-master|E|error checkpointing berkeley db: (-30973) BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery
01/18/2016 04:01:45| timer|tools-grid-master|E|trigger function of rule "default rule" in context "berkeleydb spooling" failed

etc. It's happening fairly regularly, still:

01/22/2016 18:55:36| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/22/2016 18:59:36| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/22/2016 19:03:36| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/22/2016 19:07:36| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.
01/22/2016 19:11:36| timer|tools-grid-master|E|Corrupted database detected. Freeing all resources to prepare for a reconnect with recovery.

So we should probably consider doing the dump-and-reload cycle.
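
For reference, the dump-and-reload cycle would be roughly this (a sketch under assumptions: the db_* utility names are the Debian db-util ones, the database file names are the usual sge/sge_job pair, and the master must be stopped first):

#!/bin/bash
# Sketch of a BDB dump-and-reload cycle for the gridengine spool databases
# (usually "sge" for config and "sge_job" for jobs). Run only with the
# qmaster stopped; utility names and paths are assumptions.
set -e
SPOOL=/var/spool/gridengine/spooldb

service gridengine-master stop
cp -a "$SPOOL" "${SPOOL}.bak-$(date +%Y%m%dT%H%M)"   # untouched copy, just in case

cd "$SPOOL"
db_recover -h .                        # bring the environment to a consistent state
for db in sge sge_job; do
    db_dump -f "${db}.dump" "$db"      # dump to flat text
    rm -f "$db"                        # (a full clean would also drop __db.* / log.*)
    db_load -f "${db}.dump" "$db"      # reload into a fresh database file
done

service gridengine-master start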

The restarts /have/ disappeared, so at least that potential source of issues is gone.

yuvipanda claimed this task.

Sooo...

we announced downtime and messed around. Chase and Valhallasw found that dumping and reloading the queue bdb had no effect, but dumping and reloading the *config* bdb worked! No corruption messages since, so I'm going to mark this as resolved.

yuvipanda set Security to None.