
Stale nic firmware files on some hosts
Closed, Resolved (Public)

Description

It looks like some hosts have stale files for NIC firmware. I poked at thanos-be2001, and the timer for prometheus-nic-firmware-textfile indeed showed n/a, as if the timer had never started. Starting/stopping the related service seems to have fixed the issue (?). I have left the other hosts untouched.

WARNING	2020-05-22 08:29:40	0d 15h 59m 22s	3/3	cluster={misc,thanos} file=nic_firmware.prom instance={thanos-be2001:9100,thanos-be2002:9100,thanos-be2003:9100,thanos-be2004:9100,thanos-fe2001:9100,thanos-fe2002:9100,thanos-fe2003:9100} job=node site=codfw	
Stale file for node-exporter textfile in eqiad

WARNING	2020-05-22 08:14:46	0d 11h 25m 22s	3/3	cluster=mysql file=nic_firmware.prom instance=db1140:9100 job=node site=eqiad

Event Timeline

Thanks for the report, I'll have to dig through Puppet logs I guess.

Here are all the eqiad hosts where the file has an mtime more than 10 minutes in the past:
https://w.wiki/RYU

In codfw it's just the thanos-{fe,be} hosts.

ulsfo and esams are fine, eqsin has one host (cp5012).
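
For anyone wanting to repeat the mtime check above on a single host, here is a minimal sketch in Python. It is not the query behind the link; the function names and the default threshold constant are hypothetical, and the textfile directory is left as a parameter because its path is not given in this task.

```python
import os
import time

# 10 minutes, the threshold mentioned in the comment above.
STALE_AFTER_SECONDS = 10 * 60


def is_stale(path, now=None, max_age=STALE_AFTER_SECONDS):
    """Return True if the file's mtime is more than max_age seconds before now."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path) > max_age


def stale_textfiles(directory, now=None, max_age=STALE_AFTER_SECONDS):
    """Return the sorted names of .prom files in directory that are stale."""
    names = sorted(n for n in os.listdir(directory) if n.endswith(".prom"))
    return [n for n in names if is_stale(os.path.join(directory, n), now, max_age)]
```

Calling stale_textfiles() on the node-exporter textfile directory of an affected host should list nic_firmware.prom, assuming the file was written at least once and never refreshed.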

P11283 contains syslog from one of the hosts where it didn't start properly.

We see the service unit being installed, a systemctl daemon-reload, the timer unit being installed, then three systemctl daemon-reloads interleaved with the timer unit being started (the timer's description is "Periodic execution of prometheus-nic-firmware-textfile.service"). Finally, we see the Puppet exec block that @Joe and I added invoking a systemctl start on the service unit.

I compared logs from thanos-be2002 (where it didn't work) and dns1001 (where it did) and they're practically identical; the only difference that shows up on thanos-be2002 is these two lines in the first systemd daemon-reload:

May 19 15:05:18 thanos-be2002 systemd[1]: serial-getty@ttyS1.service: Current command vanished from the unit file, execution of the command list won't be resumed.
May 19 15:05:18 thanos-be2002 systemd[1]: getty@tty1.service: Current command vanished from the unit file, execution of the command list won't be resumed.

Hard to imagine that being related, though...
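
A rough sketch of how such a comparison could be done, assuming plain syslog lines in the format shown above. This is not the method actually used here; the helper names are hypothetical. It strips the timestamp and hostname prefix from each line, then reports messages that appear in one log but not the other.

```python
import re

# Matches a leading "May 19 15:05:18 thanos-be2002 " style prefix
# (month, day, time, hostname), as seen in the log lines quoted above.
_PREFIX = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} \S+ ")


def normalize(line):
    """Drop the timestamp and hostname so lines from different hosts compare equal."""
    return _PREFIX.sub("", line.rstrip("\n"))


def only_in_first(log_a, log_b):
    """Return normalized lines of log_a whose message never appears in log_b."""
    seen = {normalize(line) for line in log_b}
    return [normalize(line) for line in log_a if normalize(line) not in seen]
```

Passing the thanos-be2002 lines as log_a and the dns1001 lines as log_b would surface host-specific messages such as the two getty lines. Messages that embed a hostname or a varying PID in their body would still show up as differences, so the output needs a manual read.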

Change 598042 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] node_nic_firmware: add timer schedules

https://gerrit.wikimedia.org/r/598042

Change 598050 had a related patch set uploaded (by CDanis; owner: CDanis):
[operations/puppet@production] systemd::timer::job: unkludge OnUnitInactiveSec/OnUnitActiveSec

https://gerrit.wikimedia.org/r/598050

Change 598042 abandoned by CDanis:
node_nic_firmware: don't brick the timer on reboot

Reason:
superseded by I3d37ea50

https://gerrit.wikimedia.org/r/598042

Change 598050 merged by CDanis:
[operations/puppet@production] systemd::timer::job: unkludge OnUnitInactiveSec/OnUnitActiveSec

https://gerrit.wikimedia.org/r/598050

Mentioned in SAL (#wikimedia-operations) [2020-05-22T15:25:47Z] <cdanis> fixing prometheus-nic-firmware-textfile.service wherever it is broken T253374

CDanis closed this task as Resolved. May 22 2020, 3:36 PM

Fixed, and shouldn't happen again for this or other similar systemd::timer::jobs. See 598050's patch description for details.