
graphite1004 freezing
Closed, ResolvedPublic

Description

Since it was rebooted for a kernel downgrade (T297180), graphite1004 has frozen at least twice, requiring power cycles via mgmt. apache, uwsgi, and syslog all give no indication as to what the issue could be. There were no significant outliers when using https://wikitech.wikimedia.org/wiki/Graphite#Operations_troubleshooting to look at requests.
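For context, that request sweep amounts to looking for unusually heavy or slow queries; a rough sketch of that kind of check (log paths are assumptions, not the exact commands used):

$ # Top requested graphite-web paths in the recent apache access log
$ # (standard Debian log location assumed):
$ sudo tail -n 10000 /var/log/apache2/access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head -20
$ # Any recent warnings across units before the freeze:
$ sudo journalctl --since "-2h" -p warning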

Screenshot 2021-12-07 at 18-44-13 Host overview - Grafana.png (344 KB)

P18073 (via @herron) shows what top looked like when it froze the second time.

@RLazarus also noticed that disk usage spikes upon the reboot, but it's unclear if that's expected or not.

Screenshot 2021-12-07 at 18-45-46 Host overview - Grafana.png (333 KB)

Event Timeline

Legoktm triaged this task as Unbreak Now! priority.

Change 744906 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] graphite1004: set profile::monitoring::is_critical: true

https://gerrit.wikimedia.org/r/744906

Change 744906 merged by Herron:

[operations/puppet@production] graphite1004: set profile::monitoring::is_critical: true

https://gerrit.wikimedia.org/r/744906
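For reference, the change above is a one-line hiera override that makes host-level alerts for graphite1004 page rather than just notify; a minimal sketch, assuming the usual host-level hieradata layout in operations/puppet (the exact file path is an assumption, the key and value come from the commit subject):

$ echo 'profile::monitoring::is_critical: true' >> hieradata/hosts/graphite1004.yaml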

Thank you folks for investigating this! I am taking a look too and so far have failed to find anything of note.

Change 745197 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] syslog: add netconsole::server

https://gerrit.wikimedia.org/r/745197

Change 745205 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: enable netconsole client

https://gerrit.wikimedia.org/r/745205

Change 745197 merged by Filippo Giunchedi:

[operations/puppet@production] syslog: add netconsole::server

https://gerrit.wikimedia.org/r/745197

For the record, for testing purposes I've manually enabled netconsole on graphite1004 and pointed it at centrallog1001. Once the patch series above is merged, the same config will be in puppet too.
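For anyone repeating this by hand: netconsole is a kernel module that mirrors kernel log messages over UDP, so it keeps emitting even when the disk or userland is wedged. A minimal sketch of the manual setup, with placeholder addresses (the real graphite1004/centrallog1001 values would be substituted):

$ # netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
$ sudo modprobe netconsole netconsole=6666@10.0.0.10/eno1,6666@10.0.0.20/aa:bb:cc:dd:ee:ff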

@fgiunchedi two other things that it would be good to have your input on:

  1. How critical is it that graphite stays up? If it goes down again, should it page us (currently it will), and should we be waking people up (aka you :)) to investigate?
  2. Is the failover procedure still up to date? And is it reliable enough that, if graphite goes down, we should attempt it ourselves, or should it be left to someone with more graphite experience (again, probably you)?

Mentioned in SAL (#wikimedia-operations) [2021-12-09T00:11:27Z] <rzl> rzl@graphite1004:~$ sudo shutdown -r now T297265

Mentioned in SAL (#wikimedia-operations) [2021-12-09T00:26:07Z] <rzl> graphite1004.mgmt: /admin1-> racadm serveraction powercycle (T297265)
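For reference, the recovery path via the management interface logged above looks roughly like this (the mgmt hostname pattern is assumed from the usual naming convention):

$ ssh root@graphite1004.mgmt.eqiad.wmnet       # iDRAC shell, the /admin1-> prompt above
/admin1-> racadm serveraction powerstatus      # confirm the host's current power state
/admin1-> racadm serveraction powercycle       # hard reset, since the OS is unresponsive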

$ sudo ipmi-sel
ID  | Date        | Time     | Name             | Type                        | Event
1   | Jun-01-2018 | 10:56:02 | SEL              | Event Logging Disabled      | Log Area Reset/Cleared
2   | Dec-19-2018 | 15:58:30 | Status           | Power Supply                | Power Supply input lost (AC/DC)
3   | Dec-19-2018 | 15:58:40 | Status           | Power Supply                | Power Supply input lost (AC/DC)
4   | Sep-10-2019 | 15:07:07 | Status           | Power Supply                | Power Supply input lost (AC/DC)
5   | Sep-10-2019 | 15:07:08 | PS Redundancy    | Power Supply                | Redundancy Lost
6   | Sep-10-2019 | 15:11:28 | PS Redundancy    | Power Supply                | Fully Redundant
7   | Sep-10-2019 | 15:11:32 | Status           | Power Supply                | Power Supply input lost (AC/DC)
8   | Oct-31-2019 | 11:48:07 | Status           | Power Supply                | Power Supply input lost (AC/DC)
9   | Oct-31-2019 | 11:48:09 | PS Redundancy    | Power Supply                | Redundancy Lost
10  | Oct-31-2019 | 11:51:27 | Status           | Power Supply                | Power Supply input lost (AC/DC)
11  | Oct-31-2019 | 11:51:29 | PS Redundancy    | Power Supply                | Fully Redundant
12  | Oct-31-2019 | 11:58:10 | Status           | Power Supply                | Power Supply input lost (AC/DC)
13  | Oct-31-2019 | 11:58:14 | PS Redundancy    | Power Supply                | Redundancy Lost
14  | Oct-31-2019 | 12:01:00 | Status           | Power Supply                | Power Supply input lost (AC/DC)
15  | Oct-31-2019 | 12:01:04 | PS Redundancy    | Power Supply                | Fully Redundant
16  | Sep-26-2021 | 00:21:40 | Mem ECC Warning  | Memory                      | transition to Non-Critical from OK
17  | Sep-26-2021 | 00:25:50 | Mem ECC Warning  | Memory                      | transition to Critical from less severe

Did we replace the memory after the events on Sept 26th?

> Did we replace the memory after the events on Sept 26th?

I cannot find any Phabricator tickets mentioning graphite1004 that are about replacing memory.
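One way to tell whether the ECC condition is ongoing rather than a one-off is to check the kernel's EDAC counters on the host itself (edac-util is from the edac-utils package); a sketch:

$ sudo edac-util -v                              # per-DIMM corrected/uncorrected error counts
$ cat /sys/devices/system/edac/mc/mc*/ce_count   # raw corrected-error counters since boot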

Change 745352 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/dns@master] discovery: move read traffic to graphite2003

https://gerrit.wikimedia.org/r/745352

Change 745353 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/dns@master] wmnet: move writes to graphite2003

https://gerrit.wikimedia.org/r/745353

Change 745354 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] profile: move statsd writes to graphite2003

https://gerrit.wikimedia.org/r/745354

Change 745355 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/mediawiki-config@master] ProductionServices: use graphite2003 for statsd

https://gerrit.wikimedia.org/r/745355

Change 745356 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] graphite: check graphite2003 metrics

https://gerrit.wikimedia.org/r/745356

Change 745352 merged by Cwhite:

[operations/dns@master] discovery: move read traffic to graphite2003

https://gerrit.wikimedia.org/r/745352

It would be good to rule out the memory, although the timestamps of the SEL events and the hangs don't line up closely. FWIW the host was reimaged on 10/23 too, but it seems to have been stable until T297180.

I'm also wondering if we should put the graphite hosts back into their last known good state, i.e. from before T297180. On the other hand, if graphite2003 shows the same symptoms after failover, that would clarify whether this is a hardware or a software issue.

Change 745356 merged by Cwhite:

[operations/puppet@production] graphite: check graphite2003 metrics

https://gerrit.wikimedia.org/r/745356

Change 745353 merged by Cwhite:

[operations/dns@master] wmnet: move writes to graphite2003

https://gerrit.wikimedia.org/r/745353

Mentioned in SAL (#wikimedia-operations) [2021-12-09T02:54:18Z] <cwhite> failover statsd ingest host to graphite2003 T297265

Change 745354 merged by Cwhite:

[operations/puppet@production] profile: move statsd writes to graphite2003

https://gerrit.wikimedia.org/r/745354

Change 745355 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices: use graphite2003 for statsd

https://gerrit.wikimedia.org/r/745355

Mentioned in SAL (#wikimedia-operations) [2021-12-09T03:31:59Z] <cwhite@deploy1002> Synchronized wmf-config/ProductionServices.php: fail over statsd to graphite2003 T297265 (duration: 01m 05s)

Mentioned in SAL (#wikimedia-operations) [2021-12-09T03:34:47Z] <cwhite> bounce navtiming on webperf1001 to pick up statsd changes T297265

Change 745359 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] icinga: make graphite2003 paging

https://gerrit.wikimedia.org/r/745359

Change 745359 merged by Herron:

[operations/puppet@production] icinga: make graphite2003 paging

https://gerrit.wikimedia.org/r/745359

We are failed over to graphite2003 for now.

colewhite lowered the priority of this task from Unbreak Now! to High. (Dec 9 2021, 4:09 AM)

Thank you folks for taking care of this!

> @fgiunchedi two other things that it would be good to have your input on:
>
>   1. How critical is it that graphite stays up? If it goes down again, should it page us (currently it will), and should we be waking people up (aka you :)) to investigate?

I would like to be able to say "it isn't critical" at some point in the future; at this time it is critical due to MediaWiki metrics :(

>   2. Is the failover procedure still up to date? And is it reliable enough that, if graphite goes down, we should attempt it ourselves, or should it be left to someone with more graphite experience (again, probably you)?

Yes, the failover procedure is up to date (as you've found out by now!); I did it recently for the Bullseye upgrade.

The temporary netconsole client on graphite1004 paid off, see https://phabricator.wikimedia.org/P18076 for logs from the host (journalctl -u netconsole on centrallog1001).
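Since netconsole emits plain UDP datagrams (port 6666 by default, assumed here), arrival on the receiving side can also be confirmed ad hoc, independently of whatever the puppetized netconsole::server wires into syslog; for example:

$ sudo tcpdump -A -n udp port 6666   # prints each kernel message payload as it arrives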

I looked at the stack trace, and to me it looks like either a kernel bug (we've never run graphite with 5.10.0-8-amd64, as per the Thanos metrics) or faulty hardware (the SSDs are kinda old, but I believe we'd be seeing different failures from at least one of the drives).

Change 745205 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: enable netconsole client

https://gerrit.wikimedia.org/r/745205

> it looks like either a kernel bug (we've never run graphite with 5.10.0-8-amd64 […])

So it sounds like in T297180 the kernel was downgraded to the wrong version -- we downgraded from 5.10.70 to 5.10.46, but in fact that was also a new version for this host, and the rollback target should have been 4.9.258.

We could revert correctly to the known-good 4.9.258 and fail the service back to eqiad, or we could stay in codfw, wait, and roll forward to the fixed version when it's available.

Important note: graphite2003 is in the same state, running 5.10.46 (per Thanos). We don't know what tickles this bug, so it might be fine, or it might start freezing in the same way at any time.
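When sorting out which kernel a host is actually running versus what it could be rolled to, something like the following helps; grub-reboot only affects the next boot, so a crash falls back to the default kernel (the menu entry title below is an assumption and would be copied from the host's grub.cfg):

$ uname -r                                              # currently running kernel
$ dpkg -l 'linux-image-*' | awk '/^ii/ {print $2, $3}'  # installed kernel packages
$ sudo grub-reboot 'Advanced options for Debian GNU/Linux>Debian GNU/Linux, with Linux 5.10.0-9-amd64'
$ sudo systemctl reboot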

> it looks like either a kernel bug (we've never run graphite with 5.10.0-8-amd64 […])
>
> So it sounds like in T297180 the kernel was downgraded to the wrong version -- we downgraded from 5.10.70 to 5.10.46, but in fact that was also a new version for this host, and the rollback target should have been 4.9.258.

The host ran 4.9.258 when it was on Stretch; for Bullseye (which it has only been running for a short while) only 5.10.x kernels exist.

graphite1004 is from 2018 and probably still has its pristine, original firmware. My suggestion would be to get DC ops to upgrade all firmware to recent versions. I don't believe the 5.10.46–5.10.70 bug manifested as this issue in particular; it's more likely just a coincidence.

But in general, given that these hosts ran fine before with 5.10.70, we can also easily revert to that. The downgrade to .46 was done out of caution because of the conntrack bug which hit mx2001, but compared to the two crashes of 1004/2003 with .46, the conntrack one is still hypothetical, while the other two are real...
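On the firmware-age point above: current component versions can be read over the same management interface before involving DC ops, e.g.:

/admin1-> racadm getversion   # lists BIOS, iDRAC, and other firmware component versions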

Host rebooted by filippo@cumin1001 with reason: None

I've rolled back graphite2003 to 5.10.0-9-amd64. Next steps, as per the IRC conversation, are to wait and see whether graphite2003 stays stable, and to consider upgrading the firmware on graphite1004, since we might want that anyway.

Host rebooted by filippo@cumin1001 with reason: None

Host rebooted by filippo@cumin1001 with reason: revert back to linux 5.10.0-9 since graphite2003 has been stable so far

fgiunchedi claimed this task.

Reverting to 5.10.0-9 has brought back stability; resolving. We still have T297433 to update the firmware, which will happen when DC ops can get to it.