
db1131 alerting due to network hiccup
Closed, ResolvedPublic

Description

We just had a page for db1131 go out to SRE:

Notification Type: PROBLEM
Host: db1131
State: DOWN
Address: 10.64.32.6
Info: PING CRITICAL - Packet loss = 100%

Date/Time: Thu Nov 18 03:23:50 UTC 2021

It seems it was a network hiccup:

image.png (995×1 px, 72 KB)

image.png (788×879 px, 79 KB)

Open questions:

  • Investigate the root cause of the network hiccup
  • Maybe we need to revisit the alerting for hosts if they start to send false alerts often.

Event Timeline

daemon logs during the hiccup:

Nov 18 03:23:15 db1131 systemd[1]: Created slice User Slice of UID 112.
Nov 18 03:23:15 db1131 systemd[1]: Starting User Runtime Directory /run/user/112...
Nov 18 03:23:15 db1131 systemd[1]: Started User Runtime Directory /run/user/112.
Nov 18 03:23:15 db1131 systemd[1]: Starting User Manager for UID 112...
Nov 18 03:23:16 db1131 systemd[14576]: gpgconf: running /usr/bin/gpg-agent failed (exitcode=2): General error
Nov 18 03:23:16 db1131 systemd[14576]: gpgconf: fatal error (exit status 1)
Nov 18 03:23:16 db1131 systemd[14576]: Reached target Timers.
Nov 18 03:23:16 db1131 systemd[14576]: Reached target Paths.
Nov 18 03:23:16 db1131 systemd[14576]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
Nov 18 03:23:16 db1131 systemd[14576]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Nov 18 03:23:16 db1131 systemd[14576]: Listening on GnuPG network certificate management daemon.
Nov 18 03:23:16 db1131 systemd[14576]: Listening on GnuPG cryptographic agent and passphrase cache.
Nov 18 03:23:16 db1131 systemd[14576]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Nov 18 03:23:16 db1131 systemd[14576]: Reached target Sockets.
Nov 18 03:23:16 db1131 systemd[14576]: Reached target Basic System.
Nov 18 03:23:16 db1131 systemd[14576]: Reached target Default.
Nov 18 03:23:16 db1131 systemd[14576]: Startup finished in 46ms.
Nov 18 03:23:16 db1131 systemd[1]: Started User Manager for UID 112.
Nov 18 03:23:16 db1131 systemd[1]: Started Session 49565 of user debmonitor.
Nov 18 03:24:00 db1131 systemd[1]: Started Regular job to collect puppet agent stats.
Nov 18 03:24:00 db1131 systemd[1]: prometheus_puppet_agent_stats.service: Succeeded.
Nov 18 03:25:01 db1131 systemd[1]: Started Regular job to collect puppet agent stats.
Nov 18 03:25:01 db1131 systemd[1]: prometheus_puppet_agent_stats.service: Succeeded.
Nov 18 03:25:09 db1131 systemd[1]: Started Update NIC firmware stats exported by node_exporter.
Nov 18 03:25:09 db1131 systemd[1]: prometheus-nic-firmware-textfile.service: Succeeded.
Nov 18 03:25:29 db1131 systemd[1]: session-49565.scope: Succeeded.

gpg-agent failed fatally during the hiccup; this could be either a cause or a symptom.

Oh the syslog is juicy:

Nov 18 03:22:06 db1131 kernel: [15945840.380917] tg3 0000:04:00.0 eno1: Link is down
Nov 18 03:23:05 db1131 mysqld[3170]: 2021-11-18  3:23:05 2602606439 [ERROR] Event Scheduler: [root@localhost][ops.wmf_slave_wikiuser_sleep] Unknown thread id: 2602605928
Nov 18 03:23:05 db1131 mysqld[3170]: 2021-11-18  3:23:05 2602606439 [Note] Event Scheduler: [root@localhost][ops.wmf_slave_wikiuser_sleep] At line 32 in ops.wmf_slave_wikiuser_sleep
Nov 18 03:23:05 db1131 mysqld[3170]: 2021-11-18  3:23:05 2602606439 [Note] Event Scheduler: [root@localhost].[ops.wmf_slave_wikiuser_sleep] event execution failed.
Nov 18 03:23:08 db1131 systemd[1]: Started Regular job to collect puppet agent stats.
Nov 18 03:23:08 db1131 systemd[1]: prometheus_puppet_agent_stats.service: Succeeded.
Nov 18 03:23:11 db1131 mysqld[3170]: 2021-11-18  3:23:11 1395380077 [ERROR] Slave I/O: error reconnecting to master 'repl@db1173.eqiad.wmnet:3306' - retry-time: 60  maximum-retries: 86400  message: Unknown MySQL server host 'db1173.eqiad.wmnet' (-3), Internal MariaDB error code: 2005
Nov 18 03:23:15 db1131 systemd[1]: Created slice User Slice of UID 112.
Nov 18 03:23:15 db1131 systemd[1]: Starting User Runtime Directory /run/user/112...
Nov 18 03:23:15 db1131 systemd[1]: Started User Runtime Directory /run/user/112.
Nov 18 03:23:15 db1131 systemd[1]: Starting User Manager for UID 112...
Nov 18 03:23:16 db1131 systemd[14576]: gpgconf: running /usr/bin/gpg-agent failed (exitcode=2): General error
Nov 18 03:23:16 db1131 systemd[14576]: gpgconf: fatal error (exit status 1)
Nov 18 03:23:16 db1131 systemd[14576]: Reached target Timers.
Nov 18 03:23:16 db1131 systemd[14576]: Reached target Paths.
Nov 18 03:23:16 db1131 systemd[14576]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers).
Nov 18 03:23:16 db1131 systemd[14576]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
Nov 18 03:23:16 db1131 systemd[14576]: Listening on GnuPG network certificate management daemon.
Nov 18 03:23:16 db1131 systemd[14576]: Listening on GnuPG cryptographic agent and passphrase cache.
Nov 18 03:23:16 db1131 systemd[14576]: Listening on GnuPG cryptographic agent and passphrase cache (restricted).
Nov 18 03:23:16 db1131 systemd[14576]: Reached target Sockets.
Nov 18 03:23:16 db1131 systemd[14576]: Reached target Basic System.
Nov 18 03:23:16 db1131 systemd[14576]: Reached target Default.
Nov 18 03:23:16 db1131 systemd[14576]: Startup finished in 46ms.
Nov 18 03:23:16 db1131 systemd[1]: Started User Manager for UID 112.
Nov 18 03:23:16 db1131 systemd[1]: Started Session 49565 of user debmonitor.
Nov 18 03:23:16 db1131 puppet-agent-cronjob: su: warning: cannot change directory to /nonexistent: No such file or directory
Nov 18 03:23:16 db1131 puppet-agent-cronjob: INFO:debmonitor:Found 552 installed binary packages
Nov 18 03:23:16 db1131 puppet-agent-cronjob: INFO:debmonitor:Found 9 upgradable binary packages (including new dependencies)
Nov 18 03:23:22 db1131 puppet-agent-cronjob: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fc272b066d8>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /hosts/db1131.eqiad.wmnet/update
Nov 18 03:23:58 db1131 kernel: [15945952.439595] tg3 0000:04:00.0 eno1: Link is up at 1000 Mbps, full duplex
Nov 18 03:23:58 db1131 kernel: [15945952.439610] tg3 0000:04:00.0 eno1: Flow control is off for TX and off for RX
Nov 18 03:23:58 db1131 kernel: [15945952.439613] tg3 0000:04:00.0 eno1: EEE is disabled
Nov 18 03:24:00 db1131 systemd[1]: Started Regular job to collect puppet agent stats.
Nov 18 03:24:00 db1131 systemd[1]: prometheus_puppet_agent_stats.service: Succeeded.
Nov 18 03:24:02 db1131 kernel: [15945956.681518] tg3 0000:04:00.0 eno1: Link is down
Nov 18 03:24:09 db1131 kernel: [15945963.733365] tg3 0000:04:00.0 eno1: Link is up at 1000 Mbps, full duplex
Nov 18 03:24:09 db1131 kernel: [15945963.733377] tg3 0000:04:00.0 eno1: Flow control is off for TX and off for RX
Nov 18 03:24:09 db1131 kernel: [15945963.733381] tg3 0000:04:00.0 eno1: EEE is disabled
Nov 18 03:24:15 db1131 kernel: [15945969.461256] tg3 0000:04:00.0 eno1: Link is down
Nov 18 03:24:22 db1131 kernel: [15945977.010944] tg3 0000:04:00.0 eno1: Link is up at 1000 Mbps, full duplex
Nov 18 03:24:22 db1131 kernel: [15945977.010957] tg3 0000:04:00.0 eno1: Flow control is off for TX and off for RX
Nov 18 03:24:22 db1131 kernel: [15945977.010960] tg3 0000:04:00.0 eno1: EEE is disabled
Nov 18 03:24:24 db1131 kernel: [15945978.872003] tg3 0000:04:00.0 eno1: Link is down
Nov 18 03:24:28 db1131 kernel: [15945982.415605] tg3 0000:04:00.0 eno1: Link is up at 1000 Mbps, full duplex
Nov 18 03:24:28 db1131 kernel: [15945982.415618] tg3 0000:04:00.0 eno1: Flow control is off for TX and off for RX
Nov 18 03:24:28 db1131 kernel: [15945982.415621] tg3 0000:04:00.0 eno1: EEE is disabled
Nov 18 03:25:01 db1131 CRON[14784]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)

It seems the network interface went down.

It actually flapped several times (from kern.log):

Nov 18 01:35:21 db1131 kernel: [15939435.175199] tg3 0000:04:00.0 eno1: Link is down
Nov 18 01:35:40 db1131 kernel: [15939454.725701] tg3 0000:04:00.0 eno1: Link is up at 1000 Mbps, full duplex
Nov 18 01:35:40 db1131 kernel: [15939454.725711] tg3 0000:04:00.0 eno1: Flow control is off for TX and off for RX
Nov 18 01:35:40 db1131 kernel: [15939454.725713] tg3 0000:04:00.0 eno1: EEE is disabled

Nov 18 01:35:41 db1131 kernel: [15939455.149812] tg3 0000:04:00.0 eno1: Link is down
Nov 18 01:35:44 db1131 kernel: [15939458.756397] tg3 0000:04:00.0 eno1: Link is up at 1000 Mbps, full duplex
Nov 18 01:35:44 db1131 kernel: [15939458.756410] tg3 0000:04:00.0 eno1: Flow control is off for TX and off for RX
Nov 18 01:35:44 db1131 kernel: [15939458.756414] tg3 0000:04:00.0 eno1: EEE is disabled

Nov 18 02:58:30 db1131 kernel: [15944424.039480] tg3 0000:04:00.0 eno1: Link is down
Nov 18 02:58:33 db1131 kernel: [15944427.484025] tg3 0000:04:00.0 eno1: Link is up at 1000 Mbps, full duplex
Nov 18 02:58:33 db1131 kernel: [15944427.484041] tg3 0000:04:00.0 eno1: Flow control is off for TX and off for RX
Nov 18 02:58:33 db1131 kernel: [15944427.484044] tg3 0000:04:00.0 eno1: EEE is disabled

Nov 18 03:22:06 db1131 kernel: [15945840.380917] tg3 0000:04:00.0 eno1: Link is down
Nov 18 03:23:58 db1131 kernel: [15945952.439595] tg3 0000:04:00.0 eno1: Link is up at 1000 Mbps, full duplex
Nov 18 03:23:58 db1131 kernel: [15945952.439610] tg3 0000:04:00.0 eno1: Flow control is off for TX and off for RX
Nov 18 03:23:58 db1131 kernel: [15945952.439613] tg3 0000:04:00.0 eno1: EEE is disabled
Nov 18 03:24:02 db1131 kernel: [15945956.681518] tg3 0000:04:00.0 eno1: Link is down
Nov 18 03:24:09 db1131 kernel: [15945963.733365] tg3 0000:04:00.0 eno1: Link is up at 1000 Mbps, full duplex
Nov 18 03:24:09 db1131 kernel: [15945963.733377] tg3 0000:04:00.0 eno1: Flow control is off for TX and off for RX
Nov 18 03:24:09 db1131 kernel: [15945963.733381] tg3 0000:04:00.0 eno1: EEE is disabled

Nov 18 03:24:15 db1131 kernel: [15945969.461256] tg3 0000:04:00.0 eno1: Link is down
Nov 18 03:24:22 db1131 kernel: [15945977.010944] tg3 0000:04:00.0 eno1: Link is up at 1000 Mbps, full duplex
Nov 18 03:24:22 db1131 kernel: [15945977.010957] tg3 0000:04:00.0 eno1: Flow control is off for TX and off for RX
Nov 18 03:24:22 db1131 kernel: [15945977.010960] tg3 0000:04:00.0 eno1: EEE is disabled

Nov 18 03:24:24 db1131 kernel: [15945978.872003] tg3 0000:04:00.0 eno1: Link is down
Nov 18 03:24:28 db1131 kernel: [15945982.415605] tg3 0000:04:00.0 eno1: Link is up at 1000 Mbps, full duplex
Nov 18 03:24:28 db1131 kernel: [15945982.415618] tg3 0000:04:00.0 eno1: Flow control is off for TX and off for RX
Nov 18 03:24:28 db1131 kernel: [15945982.415621] tg3 0000:04:00.0 eno1: EEE is disabled
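The flap pattern above can be quantified straight from kern.log. A minimal sketch, assuming the tg3 "Link is down" / "Link is up" message format shown in the excerpts (the embedded sample is a subset of the log above):

```python
import re

# Sample from the kern.log excerpt above (subset).
LOG = """\
Nov 18 03:22:06 db1131 kernel: [15945840.380917] tg3 0000:04:00.0 eno1: Link is down
Nov 18 03:23:58 db1131 kernel: [15945952.439595] tg3 0000:04:00.0 eno1: Link is up at 1000 Mbps, full duplex
Nov 18 03:24:02 db1131 kernel: [15945956.681518] tg3 0000:04:00.0 eno1: Link is down
Nov 18 03:24:09 db1131 kernel: [15945963.733365] tg3 0000:04:00.0 eno1: Link is up at 1000 Mbps, full duplex
"""

def link_flaps(log, iface="eno1"):
    """Return (down_count, total_down_seconds) for an interface,
    using the kernel uptime timestamps in square brackets."""
    pat = re.compile(
        r"\[\s*([0-9.]+)\] tg3 \S+ " + re.escape(iface) + r": Link is (down|up)"
    )
    events = [(float(m.group(1)), m.group(2))
              for m in (pat.search(l) for l in log.splitlines()) if m]
    downs, downtime, down_at = 0, 0.0, None
    for ts, state in events:
        if state == "down":
            downs += 1
            down_at = ts
        elif down_at is not None:
            downtime += ts - down_at
            down_at = None
    return downs, downtime
```

For the four-line sample this reports 2 flaps totalling roughly two minutes of link downtime, which matches the gap Icinga paged on.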

This is where my (non-existent) network knowledge ends. Could it be the NIC having issues and needing a replacement? A loose Ethernet cable?
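One way to tell a marginal cable from a failing NIC is to watch the interface's error counters: `rx_crc_errors` climbing steadily usually points at bad cabling or connectors, while errors with a known-good cable point more at the NIC. A sketch that reads the standard Linux netdev statistics files from sysfs (the `base` parameter exists only so the function can be exercised against a fake tree):

```python
import os

def nic_error_counters(iface="eno1", base="/sys/class/net"):
    """Read error counters relevant to cable-vs-NIC diagnosis.

    The counter names are standard Linux netdev statistics exposed
    under /sys/class/net/<iface>/statistics/.
    """
    stats = {}
    stats_dir = os.path.join(base, iface, "statistics")
    for name in ("rx_crc_errors", "rx_errors", "tx_errors"):
        path = os.path.join(stats_dir, name)
        try:
            with open(path) as f:
                stats[name] = int(f.read().strip())
        except OSError:
            stats[name] = None  # interface or counter not present
    return stats
```

Sampling these counters before and after a suspected flap window (e.g. from a Prometheus textfile collector) would show whether errors accumulate between link transitions, not just at them.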

Mentioned in SAL (#wikimedia-operations) [2021-11-18T05:08:03Z] <ladsgroup@cumin1001> dbctl commit (dc=all): 'Depooling db1131 due to network issues (T295952)', diff saved to https://phabricator.wikimedia.org/P17758 and previous config saved to /var/cache/conftool/dbconfig/20211118-050802-ladsgroup.json

Hi, ops-eqiad. Is it possible to check the Ethernet cable of db1131 (C8)? Thanks!

Change 739698 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/puppet@production] mediawiki: Disable notification for db1131

https://gerrit.wikimedia.org/r/739698

Change 739698 merged by Ladsgroup:

[operations/puppet@production] mediawiki: Disable notification for db1131

https://gerrit.wikimedia.org/r/739698

Marostegui moved this task from Triage to In progress on the DBA board.

Raising this to high as this is the candidate master for s6.

@Cmjohnson @Jclark-ctr can you double check if the cable is loose?

Hopefully it's just a loose cable. If not, just an FYI: the out-of-warranty date is listed as November 13, so we may or may not be able to RMA parts with Dell for it. Thanks, Willy

@Marostegui the cable has been replaced, I am able to ssh into the server now. Please take a look

Thanks @Cmjohnson - I can access the host too. It was accessible before as well, but we saw the network flapping a few times. I am going to leave the host depooled for the weekend, and if the network hasn't flapped during Friday and the weekend I will start repooling it.
Thank you!

@Marostegui when I switched out the cable earlier I had left my laptop in another building, so I relayed the information to Chris. The connection activity LED on the NIC was flashing abnormally; after replacing the cable it was more consistent with how it would usually flash. It did have a connection prior to me changing the cable as well.

Oh sweet - thanks for the information John. So maybe it was indeed a bad cable. I will check on Monday if we had some other flapping since the cable change, and if all goes well I will repool and close this task.

Thanks for being so fast!!

Marostegui lowered the priority of this task from High to Medium. Thu, Nov 18, 7:48 PM

No more network flaps, so going to start repooling.

Change 740384 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] db1131: Enable notifications

https://gerrit.wikimedia.org/r/740384

Change 740384 merged by Marostegui:

[operations/puppet@production] db1131: Enable notifications

https://gerrit.wikimedia.org/r/740384

Host repooled - thanks DCOps for the fast response!

Maybe we need to revisit the alerting for hosts if they start to send false alerts often.

@Ladsgroup: I'm not following, why would a networking problem be a 'false' alert?

Sorry for the misunderstanding. I created the ticket right after the alert, when we were not yet sure what the root cause was, and only a couple of days after we introduced these alerts. My thinking was that if an alert fires too often, has no user impact, and is self-healing, we should drop it. I have had a lot of flapping alerts in the past that made me numb to alerting, and that's not good.
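One common middle ground for the alerting question is flap suppression: page on the first occurrence, but stop paging when the same check fires repeatedly within a short window. A minimal sketch of the idea; the class name, thresholds, and interface are all illustrative, not any actual Icinga or WMF policy:

```python
import time
from collections import deque

class FlapSuppressor:
    """Decide whether a recurring alert should still page a human.

    Keeps a sliding window of recent fire times; once the check has
    fired more than max_fires times within window_s seconds, further
    occurrences are recorded but not paged.
    """
    def __init__(self, max_fires=3, window_s=3600):
        self.max_fires = max_fires
        self.window_s = window_s
        self.fires = deque()

    def should_page(self, now=None):
        now = time.time() if now is None else now
        self.fires.append(now)
        # Drop fire times that have aged out of the window.
        while self.fires and now - self.fires[0] > self.window_s:
            self.fires.popleft()
        return len(self.fires) <= self.max_fires
```

With `max_fires=2, window_s=60`, the first two fires page, the third within the minute is suppressed, and a fire after a quiet period pages again - which addresses the "numb to alerts" concern without dropping the alert entirely.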