
ManagementSSHDown parse1002.eqiad.wmnet
Open, Needs Triage, Public

Description

Common information

  • alertname: ManagementSSHDown
  • instance: parse1002.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: A3
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: dcops

Firing alerts


  • dashboard: TODO
  • description: The management interface at parse1002.mgmt:22 has been unresponsive for multiple hours.
  • runbook: https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
  • summary: Unresponsive management for parse1002.mgmt:22
  • alertname: ManagementSSHDown
  • instance: parse1002.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: A3
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: dcops

Event Timeline

parse1002.eqiad.wmnet is down / unreachable but is still in the pool of hosts of the deploy tool. That caused the MediaWiki train to fail overnight and is causing every MediaWiki deployment to error out due to a timeout when trying to reach that host.

Can someone please remove the host from the pool of MediaWiki target hosts? Thanks!

hashar renamed this task from ManagementSSHDown to ManagementSSHDown parse1002.eqiad.wmnet. (Tue, Apr 23, 8:23 AM)

I've just run puppet node deactivate parse1002.eqiad.wmnet and forced a puppet run to have the node removed (a sketch of the sequence is below).

In the long run, we need to make the hack that caused this irrelevant and remove it. It's an architecturally problematic hack.
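
For reference, a minimal sketch of that sequence, assuming standard Puppet tooling (the exact Wikimedia wrapper commands may differ):

# On a Puppet server / PuppetDB host: deactivate the node so exported
# resources (such as the scap/dsh target lists) stop including it.
sudo puppet node deactivate parse1002.eqiad.wmnet

# On the deploy server: force an agent run so /etc/dsh/group/* is regenerated.
sudo puppet agent --test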

Mentioned in SAL (#wikimedia-operations) [2024-04-23T10:43:40Z] <jayme> kubectl cordon parse1002.eqiad.wmnet - T363086
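
For context, cordoning marks the Kubernetes node unschedulable without evicting the pods already running on it. A sketch of the surrounding commands (standard kubectl, matching the action logged above):

# Mark the node unschedulable so no new pods are placed on it.
kubectl cordon parse1002.eqiad.wmnet

# The node should now report SchedulingDisabled in its status.
kubectl get node parse1002.eqiad.wmnet

# Once the hardware work is done, allow scheduling again.
kubectl uncordon parse1002.eqiad.wmnet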

@akosiaris @hashar I reset the iDRAC with no change. I will need to reboot the server and hook a crash cart up to it. Please advise if I am able to reboot.

On our side, the node is disabled and we aren't doing anything with it; you have my go-ahead.

The server is out of warranty. I performed a reboot and it came up with no issues, swapped the iDRAC cable, and updated the iDRAC firmware. It seems to be up and running now. @akosiaris

Cool. Thanks.

I've just uncordoned it; it should receive MediaWiki payloads in the next deployment. I've also checked that it's again a scap target for the kubernetes-workers group.

akosiaris claimed this task.

I am resolving this; hopefully we won't see a recurrence.

scap does the docker pull on any of the k8s workers as defined by the kubernetes-workers group, and parse1002 is in that group:

deploy1002$ grep -R parse1002 /etc/dsh/group
/etc/dsh/group/kubernetes-workers:parse1002.eqiad.wmnet
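
As a sketch, one way to spot an unreachable target ahead of a deployment is to test SSH against every host in that dsh group (the BatchMode/ConnectTimeout options here are just illustrative choices):

# Check SSH reachability of every host in the kubernetes-workers dsh group.
while read -r host; do
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
    echo "$host OK"
  else
    echo "$host UNREACHABLE"
  fi
done < /etc/dsh/group/kubernetes-workers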

The host is apparently down again; today scap failed to get it to pull the MediaWiki k8s image due to an SSH timeout:

/usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-04-26-093629-publish (ran as mwdeploy@parse1002.eqiad.wmnet) returned [255]: ssh: connect to host parse1002.eqiad.wmnet port 22: Connection timed out

Removing the assignee that was automatically set by Phabricator when the task got marked as resolved.

Mentioned in SAL (#wikimedia-operations) [2024-04-26T11:28:31Z] <claime> Deactivating puppet for parse1002 - T363086

Mentioned in SAL (#wikimedia-operations) [2024-04-26T11:29:14Z] <claime> Forcing puppet run on deploy server - T363086

Mentioned in SAL (#wikimedia-operations) [2024-04-26T11:33:27Z] <claime> Forcing puppet run on O:alerting_host - T363086

Mentioned in SAL (#wikimedia-operations) [2024-04-26T11:53:56Z] <claime> Silencing all alerts matching parse1002.* for 4 days - T363086

Downtimed, silence ID e5915daa-08f1-45f6-b805-fee5078d64da
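
For reference, a hedged sketch of creating an equivalent silence with amtool; the actual tooling used may have been a cookbook or the Alertmanager UI, and the Alertmanager URL below is illustrative:

# Silence every alert whose instance label matches parse1002.* for 4 days.
amtool --alertmanager.url=http://alertmanager.example:9093 silence add \
  'instance=~"parse1002.*"' \
  --duration=4d --comment='parse1002 hardware issues - T363086'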

Jclark-ctr claimed this task.

@Clement_Goubert @akosiaris Since this failed again, I reset the iDRAC again and it is back up right now. The iDRAC is not showing anything and the server is out of warranty. Given my limited access, can you check and see if there are any errors in dmesg or the log files?

@Jclark-ctr

syslog doesn't have anything; these are the last few lines:

2024-04-25T19:17:00.091655+00:00 parse1002 systemd[1]: Starting Export confd Prometheus metrics...
2024-04-25T19:17:00.209393+00:00 parse1002 systemd[1]: confd_prometheus_metrics.service: Succeeded.
2024-04-25T19:17:00.209655+00:00 parse1002 systemd[1]: Finished Export confd Prometheus metrics.
2024-04-25T19:17:01.982996+00:00 parse1002 CRON[1612839]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)

The systemd journal doesn't have anything more telling either; these are the last few lines:

Apr 25 19:17:33 parse1002 sudo[1622784]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=110)
Apr 25 19:17:33 parse1002 sudo[1622784]: pam_unix(sudo:session): session closed for user root
Apr 25 19:17:33 parse1002 sudo[1622803]: prometheus : PWD=/ ; USER=root ; COMMAND=/usr/sbin/ipmi-dcmi --get-system-power-statistics --config-file /tmp/ipmi_exporter-9292496e52f506478413e605ff7e0718
Apr 25 19:17:33 parse1002 sudo[1622803]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=110)
Apr 25 19:17:33 parse1002 sudo[1622803]: pam_unix(sudo:session): session closed for user root
Apr 25 19:17:33 parse1002 sudo[1622816]: prometheus : PWD=/ ; USER=root ; COMMAND=/usr/sbin/ipmi-sel --info --config-file /tmp/ipmi_exporter-d279198d79306040f6233c73b14ec381
Apr 25 19:17:33 parse1002 sudo[1622816]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=110)
Apr 25 19:17:33 parse1002 sudo[1622816]: pam_unix(sudo:session): session closed for user root
Apr 25 19:17:34 parse1002 ulogd[1140]: [fw-in-drop] IN=eno1 OUT= MAC=b0:4f:13:b3:22:ae:e4:3d:1a:54:40:c7:08:00 SRC=10.64.16.95 DST=10.2.2.75 LEN=40 TOS=00 PREC=0x00 TTL=62 ID=0 DF PROTO=TCP SPT=44301 DPT=4450 SEQ=2591393864 ACK=0 WINDOW=0 RST URGP=0 MARK=0
Apr 25 19:17:38 parse1002 ulogd[1140]: [fw-in-drop] IN=eno1 OUT= MAC=b0:4f:13:b3:22:ae:e4:3d:1a:54:40:c7:08:00 SRC=10.194.156.43 DST=10.2.2.81 LEN=40 TOS=00 PREC=0x00 TTL=59 ID=0 DF PROTO=TCP SPT=42382 DPT=4446 SEQ=1238033852 ACK=0 WINDOW=0 RST URGP=0 MARK=0
Apr 25 19:17:38 parse1002 ulogd[1140]: [fw-in-drop] IN=eno1 OUT= MAC=b0:4f:13:b3:22:ae:e4:3d:1a:54:40:c7:08:00 SRC=10.192.32.71 DST=10.2.2.81 LEN=40 TOS=00 PREC=0x00 TTL=60 ID=0 DF PROTO=TCP SPT=34352 DPT=4446 SEQ=2259236438 ACK=0 WINDOW=0 RST URGP=0 MARK=0
Apr 25 19:17:38 parse1002 ulogd[1140]: [fw-in-drop] IN=eno1 OUT= MAC=b0:4f:13:b3:22:ae:e4:3d:1a:54:40:c7:08:00 SRC=10.192.32.71 DST=10.2.2.81 LEN=40 TOS=00 PREC=0x00 TTL=60 ID=0 DF PROTO=TCP SPT=34352 DPT=4446 SEQ=2259236438 ACK=0 WINDOW=0 RST URGP=0 MARK=0

None of the above are unheard of.
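
If we wanted to dig a bit further, a sketch of journal queries for the previous boot (this assumes persistent journald storage):

# Tail of the previous boot's journal, i.e. right before the hang.
journalctl -b -1 -n 50

# Only error-level (or worse) messages from that boot.
journalctl -b -1 -p err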

And in kern.log there isn't anything either:

2024-04-25T19:06:58.640261+00:00 parse1002 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): cali1964e5ada51: link becomes ready
2024-04-25T19:07:25.524214+00:00 parse1002 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): cali291be25919c: link becomes ready
2024-04-25T19:08:35.680253+00:00 parse1002 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): califeb8dc02e7f: link becomes ready
2024-04-29T18:19:56.428179+00:00 parse1002 kernel: microcode: microcode updated early to revision 0x5003604, date = 2023-03-17

If there is some evidence of what happened, it's probably not in the OS of the box.

The SEL is a bit weird; I see:

$ sudo ipmi-sel
ID  | Date        | Time     | Name             | Type                        | Event
1   | Jul-29-2023 | 00:24:45 | SEL              | Event Logging Disabled      | Log Area Reset/Cleared

I would expect way more entries in the SEL after a year.
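
A sketch of further BMC-side checks with the freeipmi tools already used by the exporter above (run as root on the host; the exporter's --config-file options are omitted):

# Number of SEL entries and remaining log space.
sudo ipmi-sel --info

# Full SEL listing (currently only the Log Area Reset/Cleared entry).
sudo ipmi-sel

# BMC / iDRAC device and firmware information.
sudo bmc-info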

The iDRAC is still up after almost 24 hours. I did move the iDRAC port on the switch to a different group of ports; I will monitor it.

@akosiaris The iDRAC has stayed up for 4 days now; possibly relocating it to a different switch port helped. We won't know until the server is put back in use. This server is out of warranty; if it fails again, we could look at swapping it with another decommissioned server?