ManagementSSHDown parse1002.eqiad.wmnet
Closed, Resolved (Public)

Description

Common information

  • alertname: ManagementSSHDown
  • instance: parse1002.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: A3
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: dcops

Firing alerts


  • dashboard: TODO
  • description: The management interface at parse1002.mgmt:22 has been unresponsive for multiple hours.
  • runbook: https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
  • summary: Unresponsive management for parse1002.mgmt:22
  • alertname: ManagementSSHDown
  • instance: parse1002.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: A3
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: dcops
  • Source

Details

Other Assignee
Jclark-ctr

Event Timeline

parse1002.eqiad.wmnet is down / unreachable but is still in the pool of hosts of the deployment tool. That has caused the MediaWiki train to fail overnight and is causing every MediaWiki deployment to error out due to a timeout when trying to reach that host.

Can someone please remove the host from the pool of MediaWiki target hosts? Thanks!

hashar renamed this task from ManagementSSHDown to ManagementSSHDown parse1002.eqiad.wmnet. Apr 23 2024, 8:23 AM

> parse1002.eqiad.wmnet is down / unreachable but is still in the pool of hosts of the deployment tool. That has caused the MediaWiki train to fail overnight and is causing every MediaWiki deployment to error out due to a timeout when trying to reach that host.
>
> Can someone please remove the host from the pool of MediaWiki target hosts? Thanks!

I've just run puppet node deactivate parse1002.eqiad.wmnet and forced a Puppet run to have the node removed.
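The removal amounts to two steps: deactivate the node, then force a Puppet run on the deploy server so the generated target lists drop the host. A hedged sketch (the run-puppet-agent wrapper name and the /etc/dsh/group path are assumptions based on this task, not a verified runbook):

```shell
# On the Puppet server: deactivate the node so its exported resources
# (including its scap/dsh target entry) are purged on dependent hosts.
sudo puppet node deactivate parse1002.eqiad.wmnet

# On the deploy server: force a Puppet run, then confirm the host is
# gone from the dsh target groups that scap uses.
sudo run-puppet-agent
grep -R parse1002 /etc/dsh/group/ || echo "parse1002 removed from dsh groups"
```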

In the long run, we need to make the hack that caused this irrelevant and remove it. It's an architecturally problematic hack.

Mentioned in SAL (#wikimedia-operations) [2024-04-23T10:43:40Z] <jayme> kubectl cordon parse1002.eqiad.wmnet - T363086
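Cordoning, as logged above, only marks the node unschedulable; pods already running on it stay put until they are deleted (which is why pods get explicitly evacuated later in this task). A minimal sketch of the cordon lifecycle:

```shell
# Mark the node unschedulable; existing pods keep running on it.
kubectl cordon parse1002.eqiad.wmnet

# The node now reports SchedulingDisabled in its STATUS column.
kubectl get node parse1002.eqiad.wmnet

# Reverse it once the hardware is healthy again.
kubectl uncordon parse1002.eqiad.wmnet
```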

@akosiaris @hashar I reset the iDRAC with no change; I will need to reboot the server and hook a crash cart up to it. Please advise if I am able to reboot.

> @akosiaris @hashar I reset the iDRAC with no change; I will need to reboot the server and hook a crash cart up to it. Please advise if I am able to reboot.

On our side, the node is disabled and we can't do anything to it, you have my go ahead.

The server is out of warranty. I performed a reboot and it came up with no issues, swapped the iDRAC cable, and updated the iDRAC firmware. It seems to be up and running now. @akosiaris

Cool. Thanks.

I've just uncordoned it; it should receive MediaWiki payloads in the next deployment. I've also checked, and it's again a scap target for the kubernetes-workers group.

akosiaris claimed this task.

I am resolving, hopefully we won't see a recurrence.

scap does the docker pull on any of the k8s workers as defined by the kubernetes-workers group, and parse1002 is in that group:

deploy1002$ grep -R parse1002 /etc/dsh/group
/etc/dsh/group/kubernetes-workers:parse1002.eqiad.wmnet

The host is apparently down again: today scap failed to get it to pull the MediaWiki k8s image due to an ssh timeout:

/usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-04-26-093629-publish (ran as mwdeploy@parse1002.eqiad.wmnet) returned [255]: ssh: connect to host parse1002.eqiad.wmnet port 22: Connection timed out
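The ManagementSSHDown alert's ssh_banner module boils down to checking that the port answers with an SSH identification string within a timeout. A hypothetical local equivalent of that check (a sketch, not the actual Prometheus blackbox exporter configuration):

```shell
# Print the SSH identification string (RFC 4253, section 4.2) from
# host:port, or fail if the connection times out or the banner is wrong.
check_ssh_banner() {
    local host=$1 port=${2:-22} banner
    # An SSH server sends its banner immediately on connect, so one
    # line read through bash's /dev/tcp is enough for the probe.
    banner=$(timeout 3 bash -c \
        "exec 3<>/dev/tcp/$host/$port && head -n1 <&3") || return 1
    case "$banner" in
        SSH-*) printf '%s\n' "$banner" ;;
        *)     return 1 ;;
    esac
}
```

In the state described above, check_ssh_banner parse1002.mgmt 22 would fail with a timeout, which is exactly what fires the alert.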

Removing assignee that was automatically set by Phabricator when the task got marked as resolved.

Mentioned in SAL (#wikimedia-operations) [2024-04-26T11:28:31Z] <claime> Deactivating puppet for parse1002 - T363086

Mentioned in SAL (#wikimedia-operations) [2024-04-26T11:29:14Z] <claime> Forcing puppet run on deploy server - T363086

Mentioned in SAL (#wikimedia-operations) [2024-04-26T11:33:27Z] <claime> Forcing puppet run on O:alerting_host - T363086

Mentioned in SAL (#wikimedia-operations) [2024-04-26T11:53:56Z] <claime> Silencing all alerts matching parse1002.* for 4 days - T363086

Downtimed, silence ID e5915daa-08f1-45f6-b805-fee5078d64da
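A silence like the one logged above can be created with amtool against Alertmanager; a hedged sketch (matcher and duration inferred from the SAL entry, URL and author are placeholders, and the actual silence may have been created through the web UI or a cookbook):

```shell
# Silence every alert whose instance label matches parse1002.* for 4 days.
amtool silence add \
    --alertmanager.url=http://localhost:9093 \
    --duration=96h --author=claime \
    --comment='parse1002 hardware issues - T363086' \
    'instance=~"parse1002.*"'
```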

Jclark-ctr claimed this task.

@Clement_Goubert @akosiaris Since this failed again, I reset the iDRAC again and it is back up right now. The iDRAC is not showing anything and the server is out of warranty. Given my limited access, can you check whether there are any errors in dmesg or the log files?

@Jclark-ctr

syslog doesn't have anything; these are the last few lines:

2024-04-25T19:17:00.091655+00:00 parse1002 systemd[1]: Starting Export confd Prometheus metrics...
2024-04-25T19:17:00.209393+00:00 parse1002 systemd[1]: confd_prometheus_metrics.service: Succeeded.
2024-04-25T19:17:00.209655+00:00 parse1002 systemd[1]: Finished Export confd Prometheus metrics.
2024-04-25T19:17:01.982996+00:00 parse1002 CRON[1612839]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)

The systemd journal doesn't have anything more telling; these are the last few lines:

Apr 25 19:17:33 parse1002 sudo[1622784]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=110)
Apr 25 19:17:33 parse1002 sudo[1622784]: pam_unix(sudo:session): session closed for user root
Apr 25 19:17:33 parse1002 sudo[1622803]: prometheus : PWD=/ ; USER=root ; COMMAND=/usr/sbin/ipmi-dcmi --get-system-power-statistics --config-file /tmp/ipmi_exporter-9292496e52f506478413e605ff7e0718
Apr 25 19:17:33 parse1002 sudo[1622803]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=110)
Apr 25 19:17:33 parse1002 sudo[1622803]: pam_unix(sudo:session): session closed for user root
Apr 25 19:17:33 parse1002 sudo[1622816]: prometheus : PWD=/ ; USER=root ; COMMAND=/usr/sbin/ipmi-sel --info --config-file /tmp/ipmi_exporter-d279198d79306040f6233c73b14ec381
Apr 25 19:17:33 parse1002 sudo[1622816]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=110)
Apr 25 19:17:33 parse1002 sudo[1622816]: pam_unix(sudo:session): session closed for user root
Apr 25 19:17:34 parse1002 ulogd[1140]: [fw-in-drop] IN=eno1 OUT= MAC=b0:4f:13:b3:22:ae:e4:3d:1a:54:40:c7:08:00 SRC=10.64.16.95 DST=10.2.2.75 LEN=40 TOS=00 PREC=0x00 TTL=62 ID=0 DF PROTO=TCP SPT=44301 DPT=4450 SEQ=2591393864 ACK=0 WINDOW=0 RST URGP=0 MARK=0
Apr 25 19:17:38 parse1002 ulogd[1140]: [fw-in-drop] IN=eno1 OUT= MAC=b0:4f:13:b3:22:ae:e4:3d:1a:54:40:c7:08:00 SRC=10.194.156.43 DST=10.2.2.81 LEN=40 TOS=00 PREC=0x00 TTL=59 ID=0 DF PROTO=TCP SPT=42382 DPT=4446 SEQ=1238033852 ACK=0 WINDOW=0 RST URGP=0 MARK=0
Apr 25 19:17:38 parse1002 ulogd[1140]: [fw-in-drop] IN=eno1 OUT= MAC=b0:4f:13:b3:22:ae:e4:3d:1a:54:40:c7:08:00 SRC=10.192.32.71 DST=10.2.2.81 LEN=40 TOS=00 PREC=0x00 TTL=60 ID=0 DF PROTO=TCP SPT=34352 DPT=4446 SEQ=2259236438 ACK=0 WINDOW=0 RST URGP=0 MARK=0
Apr 25 19:17:38 parse1002 ulogd[1140]: [fw-in-drop] IN=eno1 OUT= MAC=b0:4f:13:b3:22:ae:e4:3d:1a:54:40:c7:08:00 SRC=10.192.32.71 DST=10.2.2.81 LEN=40 TOS=00 PREC=0x00 TTL=60 ID=0 DF PROTO=TCP SPT=34352 DPT=4446 SEQ=2259236438 ACK=0 WINDOW=0 RST URGP=0 MARK=0

None of the above are unheard of.

And in kern.log there isn't anything either:

2024-04-25T19:06:58.640261+00:00 parse1002 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): cali1964e5ada51: link becomes ready
2024-04-25T19:07:25.524214+00:00 parse1002 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): cali291be25919c: link becomes ready
2024-04-25T19:08:35.680253+00:00 parse1002 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): califeb8dc02e7f: link becomes ready
2024-04-29T18:19:56.428179+00:00 parse1002 kernel: microcode: microcode updated early to revision 0x5003604, date = 2023-03-17

If there is some evidence of what has happened, it's probably not in the OS of the box.

The SEL is a bit weird; I see:

$ sudo ipmi-sel
ID  | Date        | Time     | Name             | Type                        | Event
1   | Jul-29-2023 | 00:24:45 | SEL              | Event Logging Disabled      | Log Area Reset/Cleared

I would expect way more in SEL after 1 year.

The iDRAC is still up after almost 24 hours. I moved the iDRAC port on the switch to a different group of ports and will monitor it.

@akosiaris The iDRAC has stayed up for 4 days now; possibly my relocating it to a different port helped. We won't know until it is put in use again. This server is out of warranty; if it fails again, we could look at swapping it with another decom server?

Re-resolving, let's see how it fares this time around.

taavi subscribed.

And it's down again. I ran sudo puppet node deactivate parse1002.eqiad.wmnet again to have it removed from the scap mediawiki image pulling list.

> This server is out of warranty; if it fails again, we could look at swapping it with another decom server?

@Jclark-ctr, this failed again, should we look at swapping it with another decom? Is there anything available?

VRiley-WMF updated Other Assignee, added: Jclark-ctr.

Investigated this unit with the assistance of Dell. After some troubleshooting and pulling logs, they will be sending out a new motherboard as a replacement (tomorrow). Will update this ticket as changes are made.

Mentioned in SAL (#wikimedia-sre) [2024-05-29T18:16:39Z] <rzl> evacuate cordoned node parse1002: kubectl -n linkrecommendation delete pod linkrecommendation-internal-load-datasets-28616700-7gsqs; kubectl -n linkrecommendation delete pod linkrecommendation-internal-load-datasets-28616700-xl7t4; kubectl -n toolhub delete pod toolhub-main-crawler-28616760-jrhbb # T363086

The motherboard for parse1002 has been replaced with a brand new one that Dell has shipped out. All the cables have been hooked back into it.

After some more troubleshooting, we reset the iDRAC to factory settings and now we can log into the machine via iDRAC.

Thank you @VRiley-WMF ! Tell us when we can bring it back in the cluster.

@Clement_Goubert I was just able to provision the server and I believe you should be able to add it back. Let us know if there are any other issues with it!

Thanks. For information tracking: I put the server back to Active in Netbox, ran the sre.dns.netbox cookbook, and am running homer 'cr*eqiad*' commit right now to restore BGP connectivity.
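The bring-back sequence described above, sketched as commands (the exact cookbook and homer invocations and commit messages are assumptions based on the comment's wording):

```shell
# Regenerate DNS records from Netbox after flipping the host to Active.
sudo cookbook sre.dns.netbox "parse1002 back to active - T363086"

# Push the router configuration to restore the host's BGP session.
homer 'cr*eqiad*' commit "Restore BGP for parse1002 - T363086"
```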

Mentioned in SAL (#wikimedia-operations) [2024-05-31T15:43:17Z] <claime> pooling and uncordoning parse1002 - T363086

Server back in the cluster, resolving. Thanks again :)