labstore1006 spontaneous reboot
Closed, Resolved · Public

Description

Alert in Icinga about NFS being down (alert history).

$ uptime
 12:04:20 up  2:35,  1 user,  load average: 0.30, 0.45, 0.57

$ last
gtirloni pts/0        bast1002.wikimed Sat Mar  2 11:50   still logged in   
reboot   system boot  4.9.0-0.bpo.8-am Sat Mar  2 09:29 - 12:04  (02:35)    

wtmp begins Sat Mar  2 09:29:01 2019
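
As a quick cross-check, the boot time implied by uptime can be compared with the "system boot" line from last. The sketch below is only illustrative: the function name is hypothetical, the year 2019 is taken from the wtmp line, and uptime only has minute resolution here, so the result is approximate.

```python
from datetime import datetime, timedelta


def boot_time(now: datetime, up_hours: int, up_minutes: int) -> datetime:
    """Estimate the boot time from the wall clock and the 'up H:MM' field of uptime."""
    return now - timedelta(hours=up_hours, minutes=up_minutes)


# Values copied from the uptime output above: "12:04:20 up  2:35".
estimated = boot_time(datetime(2019, 3, 2, 12, 4, 20), 2, 35)
print(estimated.strftime("%H:%M"))
```

The estimate lands in the same minute as the 09:29 boot reported by last, so the two sources agree on when the host came back.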

Lots of garbage before the reboot but no specific reason in /var/log/syslog:

Mar  2 09:21:01 labstore1006 CRON[20317]: (dumpsgen) CMD (bash -c '/usr/bin/rsync -rt  --chmod=go-w stat1007.eqiad.wmnet::srv/dumps/pagecounts-ez/ /srv/dumps/xmldatadumps/public/other/pagecounts-ez/')
Mar  2 09:21:01 labstore1006 CRON[20318]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Mar  2 09:22:01 labstore1006 CRON[20444]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
Mar  2 09:23:01 labstore1006 CRON[20585]: (prometheus) CMD (/usr/local/bin/prometheus-puppet-agent-stats --outfile /var/lib/prometheus/node.d/puppet_agent.prom)
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Mar  2 09:29:09 labstore1006 systemd-modules-load[356]: Inserted module 'nf_conntrack'
Mar  2 09:29:09 labstore1006 systemd-modules-load[356]: Inserted module 'fuse'
Mar  2 09:29:09 labstore1006 systemd-modules-load[356]: Inserted module 'ipmi_devintf'
Mar  2 09:29:09 labstore1006 systemd[1]: Starting Create Static Device Nodes in /dev...

/var/log/auth.log:

Mar  2 09:23:16 labstore1006 sudo:   nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/lib/nagios/plugins/check_ferm
Mar  2 09:23:16 labstore1006 sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
Mar  2 09:23:16 labstore1006 sudo: pam_unix(sudo:session): session closed for user root
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Mar  2 09:29:09 labstore1006 CRON[665]: pam_unix(cron:session): session opened for user root by (uid=0)
Mar  2 09:29:09 labstore1006 systemd-logind[669]: New seat seat0.
Mar  2 09:29:09 labstore1006 systemd-logind[669]: Watching system buttons on /dev/input/event0 (Power Button)

/var/log/daemon:

Mar  2 09:19:10 labstore1006 systemd[1]: Starting Time & Date Service...
Mar  2 09:19:10 labstore1006 dbus[674]: [system] Successfully activated service 'org.freedesktop.timedate1'
Mar  2 09:19:10 labstore1006 systemd-timedated[20092]: /etc/localtime should be a symbolic link to a time zone data file in /usr/share/zoneinfo/.
Mar  2 09:19:10 labstore1006 systemd[1]: Started Time & Date Service.
Mar  2 09:29:09 labstore1006 systemd-modules-load[356]: Inserted module 'nf_conntrack'
Mar  2 09:29:09 labstore1006 systemd-modules-load[356]: Inserted module 'fuse'
Mar  2 09:29:09 labstore1006 systemd-modules-load[356]: Inserted module 'ipmi_devintf'

/var/log/kern.log:

Feb 28 06:25:03 labstore1006 kernel: [7478114.408796] Process accounting resumed
Mar  1 06:25:04 labstore1006 kernel: [7564515.091137] Process accounting resumed
Mar  2 06:25:03 labstore1006 kernel: [7650914.361789] Process accounting resumed
Mar  2 09:29:09 labstore1006 kernel: [    0.000000] microcode: microcode updated early to revision 0xb00002e, date = 2018-04-19
Mar  2 09:29:09 labstore1006 kernel: [    0.000000] Linux version 4.9.0-0.bpo.8-amd64 (debian-kernel@lists.debian.org) (gcc version 4.9.2 (Debian 4.9.2-10+deb8u1) ) #1 SMP Debian 4.9.110-3+deb9u5~deb8u1 (2018-10-03)

/var/log/messages:

Mar  2 09:14:41 labstore1006 puppet-agent-cronjob: INFO:debmonitor:Found 4 upgradable binary packages (including new dependencies)
Mar  2 09:14:41 labstore1006 puppet-agent-cronjob: INFO:debmonitor:Successfully sent the upgradable update to the DebMonitor server
Mar  2 09:29:09 labstore1006 kernel: [    0.000000] microcode: microcode updated early to revision 0xb00002e, date = 2018-04-19
Mar  2 09:29:09 labstore1006 kernel: [    0.000000] Linux version 4.9.0-0.bpo.8-amd64 (debian-kernel@lists.debian.org) (gcc version 4.9.2 (Debian 4.9.2-10+deb8u1) ) #1 SMP Debian 4.9.110-3+deb9u5~deb8u1 (2018-10-03)
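
The ^@ runs in syslog and auth.log are NUL bytes, which typically show up when a host goes down before buffered log writes reach the disk. A minimal sketch for locating such runs in a log file follows; the function names and the 16-byte threshold are assumptions for illustration, and the whole file is read into memory, which is fine for a one-off check but not for very large logs.

```python
import re
from typing import List, Tuple


def find_nul_runs(data: bytes, min_len: int = 16) -> List[Tuple[int, int]]:
    """Return (offset, length) for every run of at least min_len NUL bytes."""
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]


def scan_log(path: str, min_len: int = 16) -> List[Tuple[int, int]]:
    """Read a log file in binary mode and report its NUL runs."""
    with open(path, "rb") as fh:
        return find_nul_runs(fh.read(), min_len)


# Example call (path taken from the description above):
#   scan_log("/var/log/syslog")
```

The last complete line before each reported offset is the last entry written before the interruption, which is how the 09:23 entries above were identified.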

Last firmware messages:

/system1/log1/record60
  Targets
  Properties
    number=60
    severity=Caution
    date=[NOT SET]
    time=
    description=Option ROM POST Error: bay: 11 (SAS)

/system1/log1/record61
  Targets
  Properties
    number=61
    severity=Caution
    date=[NOT SET]
    time=
    description=Option ROM POST Error: bay: 12 (SAS)

/system1/log1/record62
  Targets
  Properties
    number=62
    severity=Caution
    date=03/02/2019
    time=09:27
    description=Option ROM POST Error: 1719-Slot 1 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0x12) Action: Install the latest controller firmware. If the problem persists, replace the controller.

/system1/log1/record63
  Targets
  Properties
    number=63
    severity=Caution
    date=[NOT SET]
    time=
    description=Option ROM POST Error: 1719-Slot 3 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0x12) Action: Install the latest controller firmware. If the problem persists, replace the controller.
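
Record 62 is the only entry above with a date and time, and its 09:27 stamp falls between the last pre-reboot log lines and the 09:29 boot (assuming the firmware clock matches the OS clock). A small sketch for pulling such records out of the text dump is shown below; the function names are hypothetical and the parser only assumes the key=value layout visible in the records above.

```python
from typing import Dict, List


def parse_records(text: str) -> List[Dict[str, str]]:
    """Parse '/system1/log1/recordN' blocks into dictionaries of their properties."""
    records: List[Dict[str, str]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/system1/log1/record"):
            current = {"path": line}
            records.append(current)
        elif current is not None and "=" in line:
            # Split on the first '=' only; descriptions may contain '=' themselves.
            key, _, value = line.partition("=")
            current[key] = value
    return records


def controller_failures(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only records whose description mentions a controller failure."""
    return [r for r in records
            if "controller failure" in r.get("description", "").lower()]
```

Filtering on the description rather than on severity is deliberate here, since every record in the dump carries the same Caution severity.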

Event Timeline

GTirloni created this task. Mar 2 2019, 12:11 PM
Restricted Application added a subscriber: Aklapper. Mar 2 2019, 12:11 PM

Mentioned in SAL (#wikimedia-operations) [2019-03-02T12:12:18Z] <gtirloni> labstore1006 started nfsd T217473

The POST errors may be uncleared issues from when we installed the additional shelves here: T196651. They were originally installed incorrectly, which made the server unbootable. When it was corrected, errors had to be cleared and some never go away because it "remembers" being in a slightly more HA configuration with additional pathways to disks. That would be consistent with the vague POST errors here (but they might be totally different--see below!).

Our monitoring also doesn't work correctly with this configuration, which helps tons T199248. I don't imagine that there's any connection to the new microcode package (since it wasn't on the hit list for reboots that @jbond had--I think).

All that might be helpful to know, but, dmesg says:

[    0.244633] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d is 330)

That might cause this. Fixing it will cause havoc on Toolforge when it is down, but there is a plan we can stage for that.
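
For reference, the value in that dmesg line can be decoded under one assumption: that MSR 0x38d is Intel's fixed-function performance counter control register (IA32_FIXED_CTR_CTRL), which uses one 4-bit field per fixed counter. The sketch below only illustrates that layout; the function name is hypothetical and nothing here was read from the host itself.

```python
from typing import Dict, List


def decode_fixed_ctr_ctrl(value: int, counters: int = 3) -> List[Dict[str, object]]:
    """Split the register value into per-counter enable flags (4 bits per counter)."""
    decoded = []
    for index in range(counters):
        field = (value >> (4 * index)) & 0xF
        decoded.append({
            "counter": index,
            "os": bool(field & 0x1),          # counts in ring 0
            "usr": bool(field & 0x2),         # counts in rings above 0
            "any_thread": bool(field & 0x4),
            "pmi": bool(field & 0x8),         # interrupt on overflow
        })
    return decoded


for entry in decode_fixed_ctr_ctrl(0x330):
    print(entry)
```

Under that assumption, 0x330 means fixed counters 1 and 2 were already enabled for both OS and user mode when the kernel started, which is the condition the kernel reports as firmware having claimed PMU resources.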

@ArielGlenn this was likely a brief outage for the dumps website. ^^

@Bstorm sorry for the late response. labstore1006.wikimedia.org is not one I rebooted, and it's not mentioned in the task (https://phabricator.wikimedia.org/T216802). However, I have copied in @MoritzMuehlenhoff just in case there were some hosts not mentioned on the task.

The relevant output from zcat pacct.1.gz | lastcomm -f - | more is also unenlightening:

nrpe              F    nagios   __         0.00 secs Sat Mar  2 09:23
nrpe              F    nagios   __         0.00 secs Sat Mar  2 09:23
prometheus-pupp        promethe __         0.08 secs Sat Mar  2 09:24
sh               S     promethe __         0.00 secs Sat Mar  2 09:24
cron             SF    root     __         0.00 secs Sat Mar  2 09:24
nrpe              F    nagios   __         0.00 secs Sat Mar  2 09:24
ip                     nagios   __         0.00 secs Sat Mar  2 09:24
check_eth         F    nagios   __         0.00 secs Sat Mar  2 09:24
grep                   nagios   __         0.00 secs Sat Mar  2 09:24
check_eth         F    nagios   __         0.00 secs Sat Mar  2 09:24
grep                   nagios   __         0.00 secs Sat Mar  2 09:24
ethtool                nagios   __         0.00 secs Sat Mar  2 09:24
awk                    nagios   __         0.00 secs Sat Mar  2 09:24
check_eth         F    nagios   __         0.00 secs Sat Mar  2 09:24
ip                     nagios   __         0.00 secs Sat Mar  2 09:24
check_eth         F    nagios   __         0.00 secs Sat Mar  2 09:24
grep                   nagios   __         0.00 secs Sat Mar  2 09:24
check_eth         F    nagios   __         0.00 secs Sat Mar  2 09:24
grep                   nagios   __         0.00 secs Sat Mar  2 09:24
ip                     nagios   __         0.00 secs Sat Mar  2 09:24
check_eth         F    nagios   __         0.00 secs Sat Mar  2 09:24
grep                   nagios   __         0.00 secs Sat Mar  2 09:24
check_eth         F    nagios   __         0.00 secs Sat Mar  2 09:24
grep                   nagios   __         0.00 secs Sat Mar  2 09:24
ethtool                nagios   __         0.00 secs Sat Mar  2 09:24
awk                    nagios   __         0.00 secs Sat Mar  2 09:24
check_eth         F    nagios   __         0.00 secs Sat Mar  2 09:24
check_eth              nagios   __         0.00 secs Sat Mar  2 09:24
sh                     nagios   __         0.00 secs Sat Mar  2 09:24
nrpe              F    nagios   __         0.00 secs Sat Mar  2 09:24
nrpe              F    nagios   __         0.00 secs Sat Mar  2 09:24
nrpe              F    nagios   __         0.00 secs Sat Mar  2 09:24
check_conntrack        nagios   __         0.01 secs Sat Mar  2 09:24
sh                     nagios   __         0.00 secs Sat Mar  2 09:24
nrpe              F    nagios   __         0.00 secs Sat Mar  2 09:24
nrpe              F    nagios   __         0.00 secs Sat Mar  2 09:24
nrpe              F    nagios   __         0.00 secs Sat Mar  2 09:24
ps                     nagios   __         0.01 secs Sat Mar  2 09:24
check_procs            nagios   __         0.00 secs Sat Mar  2 09:24
sh                     nagios   __         0.00 secs Sat Mar  2 09:24
nrpe              F    nagios   __         0.00 secs Sat Mar  2 09:24
nrpe              F    nagios   __         0.00 secs Sat Mar  2 09:24
nrpe              F    nagios   __         0.00 secs Sat Mar  2 09:24
iptables         S     root     __         0.00 secs Sat Mar  2 09:24
sed                    root     __         0.00 secs Sat Mar  2 09:24
check_ferm        F    root     __         0.00 secs Sat Mar  2 09:24
check_ferm       S     root     __         0.00 secs Sat Mar  2 09:24
sudo             S     root     __         0.00 secs Sat Mar  2 09:24
sh                     nagios   __         0.00 secs Sat Mar  2 09:24
nrpe              F    nagios   __         0.00 secs Sat Mar  2 09:24
nrpe              F    nagios   __         0.00 secs Sat Mar  2 09:24
                                                                       <---   reboot here
                       root     __         0.00 secs Thu Jan  1 00:00
                       root     __         0.00 secs Thu Jan  1 00:00
accton           S     root     __         0.00 secs Sat Mar  2 09:29
lspci                  root     __         0.00 secs Sat Mar  2 09:29
cut                    root     __         0.00 secs Sat Mar  2 09:29
acct             S     root     __         0.00 secs Sat Mar  2 09:29
hp-health         F    root     __         0.00 secs Sat Mar  2 09:29
lspci                  root     __         0.01 secs Sat Mar  2 09:29
cut                    root     __         0.00 secs Sat Mar  2 09:29
hp-health         F    root     __         0.00 secs Sat Mar  2 09:29
mcelog                 root     __         0.00 secs Sat Mar  2 09:29
lspci                  root     __         0.01 secs Sat Mar  2 09:29
cut                    root     __         0.00 secs Sat Mar  2 09:29
hp-health         F    root     __         0.00 secs Sat Mar  2 09:29
mcelog           S     root     __         0.00 secs Sat Mar  2 09:29
mcelog           S     root     __         0.00 secs Sat Mar  2 09:29
lspci                  root     __         0.01 secs Sat Mar  2 09:29
cut                    root     __         0.00 secs Sat Mar  2 09:29
hp-health         F    root     __         0.00 secs Sat Mar  2 09:29
lspci                  root     __         0.00 secs Sat Mar  2 09:29
cut                    root     __         0.00 secs Sat Mar  2 09:29
hp-health         F    root     __         0.00 secs Sat Mar  2 09:29
lspci                  root     __         0.01 secs Sat Mar  2 09:29
cut                    root     __         0.00 secs Sat Mar  2 09:29
hp-health         F    root     __         0.00 secs Sat Mar  2 09:29
lspci                  root     __         0.02 secs Sat Mar  2 09:29
cut                    root     __         0.00 secs Sat Mar  2 09:29
hp-health         F    root     __         0.00 secs Sat Mar  2 09:29
lspci                  root     __         0.01 secs Sat Mar  2 09:29
cut                    root     __         0.00 secs Sat Mar  2 09:29
hp-health         F    root     __         0.00 secs Sat Mar  2 09:29
lspci                  root     __         0.00 secs Sat Mar  2 09:29
cut                    root     __         0.00 secs Sat Mar  2 09:29
hp-health         F    root     __         0.00 secs Sat Mar  2 09:29
modprobe         S     root     __         0.00 secs Sat Mar  2 09:29
lspci                  root     __         0.01 secs Sat Mar  2 09:29
cut                    root     __         0.00 secs Sat Mar  2 09:29
hp-health         F    root     __         0.00 secs Sat Mar  2 09:29
ls                     root     __         0.00 secs Sat Mar  2 09:29
lldpcli          S     root     __         0.00 secs Sat Mar  2 09:29
wc                     root     __         0.00 secs Sat Mar  2 09:29
hp-health         F    root     __         0.00 secs Sat Mar  2 09:29
dmidecode        S     root     __         0.00 secs Sat Mar  2 09:29
grep                   root     __         0.00 secs Sat Mar  2 09:29
pidof                  root     __         0.01 secs Sat Mar  2 09:29
sysctl                 root     __         0.00 secs Sat Mar  2 09:29
hp-health         F    root     __         0.00 secs Sat Mar  2 09:29
run-parts              root     __         0.00 secs Sat Mar  2 09:29
sysstat          S     root     __         0.00 secs Sat Mar  2 09:29
pidof                  root     __         0.00 secs Sat Mar  2 09:29
lspci                  root     __         0.00 secs Sat Mar  2 09:29
cut                    root     __         0.00 secs Sat Mar  2 09:29
hp-health         F    root     __         0.00 secs Sat Mar  2 09:29
systemd-user-se  S     root     __         0.00 secs Sat Mar  2 09:29
lspci                  root     __         0.02 secs Sat Mar  2 09:29
cut                    root     __         0.00 secs Sat Mar  2 09:29
hp-health         F    root     __         0.00 secs Sat Mar  2 09:29
ls                     root     __         0.00 secs Sat Mar  2 09:29
wc                     root     __         0.00 secs Sat Mar  2 09:29
hp-health         F    root     __         0.00 secs Sat Mar  2 09:29
hp-health         F    root     __         0.00 secs Sat Mar  2 09:29
systemd-cgroups  S     root     __         0.00 secs Sat Mar  2 09:29
systemd-cgroups  S     root     __         0.00 secs Sat Mar  2 09:29
systemd-cgroups  S     root     __         0.00 secs Sat Mar  2 09:29
systemd-cgroups  S     root     __         0.00 secs Sat Mar  2 09:29
systemd-cgroups  S     root     __         0.00 secs Sat Mar  2 09:29
systemd-cgroups  S     root     __         0.00 secs Sat Mar  2 09:29
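
The reboot window can also be found mechanically by looking for the largest gap between consecutive timestamps in the lastcomm output. This is a rough sketch under a few assumptions: the function names are hypothetical, the year is supplied by the caller because lastcomm omits it (2019 per the last output), and records stamped Jan 1 00:00 are treated as placeholders and skipped.

```python
import re
from datetime import datetime
from typing import Iterable, Optional, Tuple

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}
# Matches the trailing "Mar  2 09:23" part of a lastcomm line.
STAMP = re.compile(r"(\w{3})\s+(\d{1,2})\s+(\d{2}):(\d{2})\s*$")


def parse_stamp(line: str, year: int) -> Optional[datetime]:
    """Return the timestamp at the end of a lastcomm line, or None if absent."""
    match = STAMP.search(line)
    if not match or match.group(1) not in MONTHS:
        return None
    stamp = datetime(year, MONTHS[match.group(1)], int(match.group(2)),
                     int(match.group(3)), int(match.group(4)))
    # Skip epoch placeholder records such as "Thu Jan  1 00:00".
    if (stamp.month, stamp.day, stamp.hour, stamp.minute) == (1, 1, 0, 0):
        return None
    return stamp


def largest_gap(lines: Iterable[str], year: int) -> Optional[Tuple[datetime, datetime]]:
    """Return the pair of consecutive timestamps with the largest gap between them."""
    stamps = sorted(s for s in (parse_stamp(line, year) for line in lines) if s)
    if len(stamps) < 2:
        return None
    return max(zip(stamps, stamps[1:]), key=lambda pair: pair[1] - pair[0])
```

Applied to the listing above with year=2019, the gap sits between the last 09:24 record and the first 09:29 record, matching the "reboot here" marker.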

labstore1006 is unrelated to the Westmere reboots; labsdb1006 has been running microcode with the L1TF and SSBD fixes for a while. Nothing in the system logs indicates an OS failure, so this probably needs DC ops investigation to see whether a hardware error was logged.

Bstorm added a comment. Mar 4 2019, 4:55 PM

Yup. The firmware corruption message and the breaks/binary nonsense in the logs suggest a flat-out hardware issue to investigate. We can fail services over to the other partner (which will tend to run at high load, so the load alert might have to be disabled) once DC Ops is in a position to do reboots and such.

Restricted Application added a project: Operations. Mar 4 2019, 4:55 PM

Change 494273 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: remove labstore1006 for failover

https://gerrit.wikimedia.org/r/494273

Change 494275 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] dumps distrubution: reduce TTL for failover of dumps.wikimedia.org

https://gerrit.wikimedia.org/r/494275

Change 494275 merged by Bstorm:
[operations/dns@master] dumps distrubution: reduce TTL for failover of dumps.wikimedia.org

https://gerrit.wikimedia.org/r/494275

Change 494273 merged by Bstorm:
[operations/puppet@production] dumps distribution: remove labstore1006 for failover

https://gerrit.wikimedia.org/r/494273

Mentioned in SAL (#wikimedia-operations) [2019-03-04T18:25:12Z] <bstorm_> disabled notifications for high load on labstore1007 while failed over T217473

Change 494284 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] dumps distribution: swap do_acme for dumps server failover

https://gerrit.wikimedia.org/r/494284

Change 494286 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] dumps distribution: fail over to labstore1007 for web access

https://gerrit.wikimedia.org/r/494286

Change 494286 merged by Bstorm:
[operations/dns@master] dumps distribution: fail over to labstore1007 for web access

https://gerrit.wikimedia.org/r/494286

Change 494284 merged by Bstorm:
[operations/puppet@production] dumps distribution: swap do_acme for dumps server failover

https://gerrit.wikimedia.org/r/494284

Mentioned in SAL (#wikimedia-operations) [2019-03-04T19:03:29Z] <bstorm_> dumps.wikimedia.org is now running off labstore1007 T217473

Mentioned in SAL (#wikimedia-cloud) [2019-03-04T19:07:19Z] <bstorm_> umounted /mnt/nfs/dumps-labstore1006.wikimedia.org for T217473
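
The duration of the failover can be read off the SAL entries above. The sketch below parses the "Mentioned in SAL" line format as it appears in this timeline; the regular expression and function name are assumptions based only on those lines, not on any SAL tooling.

```python
import re
from datetime import datetime, timezone
from typing import Dict, Optional

SAL_LINE = re.compile(r"Mentioned in SAL \((#[\w-]+)\) \[(\S+)\] <(\S+)> (.*)")


def parse_sal(line: str) -> Optional[Dict[str, object]]:
    """Parse one 'Mentioned in SAL' timeline line into its fields."""
    match = SAL_LINE.match(line)
    if not match:
        return None
    channel, stamp, user, message = match.groups()
    when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return {"channel": channel, "when": when, "user": user, "message": message}


start = parse_sal("Mentioned in SAL (#wikimedia-operations) [2019-03-04T18:25:12Z] "
                  "<bstorm_> disabled notifications for high load on labstore1007 "
                  "while failed over T217473")
done = parse_sal("Mentioned in SAL (#wikimedia-operations) [2019-03-04T19:03:29Z] "
                 "<bstorm_> dumps.wikimedia.org is now running off labstore1007 T217473")
print(done["when"] - start["when"])
```

Measured this way, a little under 40 minutes passed between disabling the load notifications and dumps.wikimedia.org being served from labstore1007.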

Bstorm assigned this task to Cmjohnson. Mar 4 2019, 7:14 PM
Bstorm added a subscriber: Cmjohnson.

Ok, this host should now be reasonably safe to work on for checking for firmware issues by DC Ops.

NOTE: I haven't downtimed this in icinga yet since I don't know the timeframe, so please downtime it (or ask WMCS to, if you need us) when you are ready to reboot it for checking things @Cmjohnson.

Mentioned in SAL (#wikimedia-cloud) [2019-03-04T19:37:28Z] <bstorm_> umounted /mnt/nfs/dumps-labstore1006.wikimedia.org across all VPS projects for T217473

Change 494303 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[operations/puppet@production] move labstore1006 to a role that does no rsync fetches for now

https://gerrit.wikimedia.org/r/494303

Change 494303 merged by ArielGlenn:
[operations/puppet@production] move labstore1006 to a role that does no rsync fetches for now

https://gerrit.wikimedia.org/r/494303

The above changeset is live on labstore1006. I've commented out the crons that fetch from stat1007, and a run of puppet over there verifies that they won't come back until that patch is reverted. That should limit the cronspam to a reasonable level.

bd808 moved this task from Backlog to Dumps on the Data-Services board. Mar 5 2019, 4:16 PM

Mentioned in SAL (#wikimedia-operations) [2019-03-13T18:07:28Z] <bstorm_> downtime labstore1006 for troubleshooting T217473

Mentioned in SAL (#wikimedia-operations) [2019-03-13T18:09:01Z] <bstorm_> rebooting labstore1006 T217473

Cmjohnson removed a subscriber: Cmjohnson.

I updated all the firmware on this server. I am removing the DC Ops tag. If this turns out to be a hardware issue, please add it back.

Change 496607 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] dumps distribution: set dumps ttl to 5m to prep for failback

https://gerrit.wikimedia.org/r/496607

Change 496607 merged by Bstorm:
[operations/dns@master] dumps distribution: set dumps ttl to 5m to prep for failback

https://gerrit.wikimedia.org/r/496607

Change 496614 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/dns@master] dumps distribution: fail dumps back to labstore1006

https://gerrit.wikimedia.org/r/496614

Change 496614 merged by Bstorm:
[operations/dns@master] dumps distribution: fail dumps back to labstore1006

https://gerrit.wikimedia.org/r/496614

GTirloni removed a subscriber: GTirloni. Mar 21 2019, 9:06 PM
Bstorm closed this task as Resolved. May 22 2019, 3:27 PM

This seems ok for now following the firmware upgrades. I'm going to close it.