ManagementSSHDown parse1002.eqiad.wmnet
Closed, Resolved (Public)

Description

Common information

  • alertname: ManagementSSHDown
  • instance: parse1002.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: A3
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: dcops

Firing alerts


  • dashboard: TODO
  • description: The management interface at parse1002.mgmt:22 has been unresponsive for multiple hours.
  • runbook: https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
  • summary: Unresponsive management for parse1002.mgmt:22
  • alertname: ManagementSSHDown
  • instance: parse1002.mgmt:22
  • job: probes/mgmt
  • module: ssh_banner
  • prometheus: ops
  • rack: A3
  • severity: task
  • site: eqiad
  • source: prometheus
  • team: dcops
  • Source

Details

Other Assignee
Jclark-ctr

Event Timeline

parse1002.eqiad.wmnet is down / unreachable but is still in the pool of hosts of the deployment tool. That has caused the MediaWiki train to fail overnight and is causing every MediaWiki deployment to error out due to a timeout when trying to reach that host.

Can someone please remove the host from the pool of MediaWiki target hosts? Thanks!

hashar renamed this task from ManagementSSHDown to ManagementSSHDown parse1002.eqiad.wmnet. Apr 23 2024, 8:23 AM

> parse1002.eqiad.wmnet is down / unreachable but is still in the pool of hosts of the deployment tool. That has caused the MediaWiki train to fail overnight and is causing every MediaWiki deployment to error out due to a timeout when trying to reach that host.
>
> Can someone please remove the host from the pool of MediaWiki target hosts? Thanks!

I've just run puppet node deactivate parse1002.eqiad.wmnet and forced a Puppet run to have the node removed.
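The removal amounts to two steps: deactivate the node, then force a Puppet run on the deploy server so the generated target lists drop the host. A hedged sketch (the run-puppet-agent wrapper name and the /etc/dsh/group path are assumptions based on this task, not a verified runbook):

```shell
# On the Puppet server: deactivate the node so its exported resources
# (including its scap/dsh target entry) are purged on dependent hosts.
sudo puppet node deactivate parse1002.eqiad.wmnet

# On the deploy server: force a Puppet run, then confirm the host is
# gone from the dsh target groups that scap uses.
sudo run-puppet-agent
grep -R parse1002 /etc/dsh/group/ || echo "parse1002 removed from dsh groups"
```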

In the long run, we need to make the hack that caused this irrelevant and remove it. It's an architecturally problematic hack.

Mentioned in SAL (#wikimedia-operations) [2024-04-23T10:43:40Z] <jayme> kubectl cordon parse1002.eqiad.wmnet - T363086
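Cordoning, as logged above, only marks the node unschedulable; pods already running on it stay put until they are deleted (which is why pods get explicitly evacuated later in this task). A minimal sketch of the cordon lifecycle:

```shell
# Mark the node unschedulable; existing pods keep running on it.
kubectl cordon parse1002.eqiad.wmnet

# The node now reports SchedulingDisabled in its STATUS column.
kubectl get node parse1002.eqiad.wmnet

# Reverse it once the hardware is healthy again.
kubectl uncordon parse1002.eqiad.wmnet
```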

@akosiaris @hashar I reset the iDRAC with no change; I will need to reboot the server and hook a crash cart up to it. Please advise if I am able to reboot.

> @akosiaris @hashar I reset the iDRAC with no change; I will need to reboot the server and hook a crash cart up to it. Please advise if I am able to reboot.

On our side, the node is disabled and we can't do anything to it, you have my go ahead.

The server is out of warranty. I performed a reboot and it came up with no issues, swapped the iDRAC cable, and updated the iDRAC firmware. It seems to be up and running now. @akosiaris

Cool. Thanks.

I've just uncordoned it; it should receive MediaWiki payloads in the next deployment. I've also checked, and it's again a scap target for the kubernetes-workers group.

akosiaris claimed this task.

I am resolving, hopefully we won't see a recurrence.

scap does the docker pull on any of the k8s workers as defined by the kubernetes-workers group, and parse1002 is in that group:

deploy1002$ grep -R parse1002 /etc/dsh/group
/etc/dsh/group/kubernetes-workers:parse1002.eqiad.wmnet

The host is apparently down again: today scap failed to get it to pull the MediaWiki k8s image due to an ssh timeout:

/usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-04-26-093629-publish (ran as mwdeploy@parse1002.eqiad.wmnet) returned [255]: ssh: connect to host parse1002.eqiad.wmnet port 22: Connection timed out
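The ManagementSSHDown alert's ssh_banner module boils down to checking that the port answers with an SSH identification string within a timeout. A hypothetical local equivalent of that check (a sketch, not the actual Prometheus blackbox exporter configuration):

```shell
# Print the SSH identification string (RFC 4253, section 4.2) from
# host:port, or fail if the connection times out or the banner is wrong.
check_ssh_banner() {
    local host=$1 port=${2:-22} banner
    # An SSH server sends its banner immediately on connect, so one
    # line read through bash's /dev/tcp is enough for the probe.
    banner=$(timeout 3 bash -c \
        "exec 3<>/dev/tcp/$host/$port && head -n1 <&3") || return 1
    case "$banner" in
        SSH-*) printf '%s\n' "$banner" ;;
        *)     return 1 ;;
    esac
}
```

In the state described above, check_ssh_banner parse1002.mgmt 22 would fail with a timeout, which is exactly what fires the alert.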

Removing assignee that was automatically set by Phabricator when the task got marked as resolved.

Mentioned in SAL (#wikimedia-operations) [2024-04-26T11:28:31Z] <claime> Deactivating puppet for parse1002 - T363086

Mentioned in SAL (#wikimedia-operations) [2024-04-26T11:29:14Z] <claime> Forcing puppet run on deploy server - T363086

Mentioned in SAL (#wikimedia-operations) [2024-04-26T11:33:27Z] <claime> Forcing puppet run on O:alerting_host - T363086

Mentioned in SAL (#wikimedia-operations) [2024-04-26T11:53:56Z] <claime> Silencing all alerts matching parse1002.* for 4 days - T363086

Downtimed, silence ID e5915daa-08f1-45f6-b805-fee5078d64da
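A silence like the one logged above can be created with amtool against Alertmanager; a hedged sketch (matcher and duration inferred from the SAL entry, URL and author are placeholders, and the actual silence may have been created through the web UI or a cookbook):

```shell
# Silence every alert whose instance label matches parse1002.* for 4 days.
amtool silence add \
    --alertmanager.url=http://localhost:9093 \
    --duration=96h --author=claime \
    --comment='parse1002 hardware issues - T363086' \
    'instance=~"parse1002.*"'
```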

Jclark-ctr claimed this task.

@Clement_Goubert @akosiaris Since this failed again, I reset the iDRAC again and it is back up right now. The iDRAC is not showing anything and the server is out of warranty. Given my limited access, can you check whether there are any errors in dmesg or the log files?

@Jclark-ctr

syslog doesn't have anything; these are the last few lines:

2024-04-25T19:17:00.091655+00:00 parse1002 systemd[1]: Starting Export confd Prometheus metrics...
2024-04-25T19:17:00.209393+00:00 parse1002 systemd[1]: confd_prometheus_metrics.service: Succeeded.
2024-04-25T19:17:00.209655+00:00 parse1002 systemd[1]: Finished Export confd Prometheus metrics.
2024-04-25T19:17:01.982996+00:00 parse1002 CRON[1612839]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)

The systemd journal doesn't have anything more telling; these are the last few lines:

Apr 25 19:17:33 parse1002 sudo[1622784]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=110)
Apr 25 19:17:33 parse1002 sudo[1622784]: pam_unix(sudo:session): session closed for user root
Apr 25 19:17:33 parse1002 sudo[1622803]: prometheus : PWD=/ ; USER=root ; COMMAND=/usr/sbin/ipmi-dcmi --get-system-power-statistics --config-file /tmp/ipmi_exporter-9292496e52f506478413e605ff7e0718
Apr 25 19:17:33 parse1002 sudo[1622803]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=110)
Apr 25 19:17:33 parse1002 sudo[1622803]: pam_unix(sudo:session): session closed for user root
Apr 25 19:17:33 parse1002 sudo[1622816]: prometheus : PWD=/ ; USER=root ; COMMAND=/usr/sbin/ipmi-sel --info --config-file /tmp/ipmi_exporter-d279198d79306040f6233c73b14ec381
Apr 25 19:17:33 parse1002 sudo[1622816]: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=110)
Apr 25 19:17:33 parse1002 sudo[1622816]: pam_unix(sudo:session): session closed for user root
Apr 25 19:17:34 parse1002 ulogd[1140]: [fw-in-drop] IN=eno1 OUT= MAC=b0:4f:13:b3:22:ae:e4:3d:1a:54:40:c7:08:00 SRC=10.64.16.95 DST=10.2.2.75 LEN=40 TOS=00 PREC=0x00 TTL=62 ID=0 DF PROTO=TCP SPT=44301 DPT=4450 SEQ=2591393864 ACK=0 WINDOW=0 RST URGP=0 MARK=0
Apr 25 19:17:38 parse1002 ulogd[1140]: [fw-in-drop] IN=eno1 OUT= MAC=b0:4f:13:b3:22:ae:e4:3d:1a:54:40:c7:08:00 SRC=10.194.156.43 DST=10.2.2.81 LEN=40 TOS=00 PREC=0x00 TTL=59 ID=0 DF PROTO=TCP SPT=42382 DPT=4446 SEQ=1238033852 ACK=0 WINDOW=0 RST URGP=0 MARK=0
Apr 25 19:17:38 parse1002 ulogd[1140]: [fw-in-drop] IN=eno1 OUT= MAC=b0:4f:13:b3:22:ae:e4:3d:1a:54:40:c7:08:00 SRC=10.192.32.71 DST=10.2.2.81 LEN=40 TOS=00 PREC=0x00 TTL=60 ID=0 DF PROTO=TCP SPT=34352 DPT=4446 SEQ=2259236438 ACK=0 WINDOW=0 RST URGP=0 MARK=0
Apr 25 19:17:38 parse1002 ulogd[1140]: [fw-in-drop] IN=eno1 OUT= MAC=b0:4f:13:b3:22:ae:e4:3d:1a:54:40:c7:08:00 SRC=10.192.32.71 DST=10.2.2.81 LEN=40 TOS=00 PREC=0x00 TTL=60 ID=0 DF PROTO=TCP SPT=34352 DPT=4446 SEQ=2259236438 ACK=0 WINDOW=0 RST URGP=0 MARK=0

None of the above are unheard of.

And in kern.log there isn't anything either:

2024-04-25T19:06:58.640261+00:00 parse1002 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): cali1964e5ada51: link becomes ready
2024-04-25T19:07:25.524214+00:00 parse1002 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): cali291be25919c: link becomes ready
2024-04-25T19:08:35.680253+00:00 parse1002 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): califeb8dc02e7f: link becomes ready
2024-04-29T18:19:56.428179+00:00 parse1002 kernel: microcode: microcode updated early to revision 0x5003604, date = 2023-03-17

If there is some evidence of what has happened, it's probably not in the OS of the box.

The SEL is a bit weird; I see:

$ sudo ipmi-sel
ID  | Date        | Time     | Name             | Type                        | Event
1   | Jul-29-2023 | 00:24:45 | SEL              | Event Logging Disabled      | Log Area Reset/Cleared

I would expect way more in SEL after 1 year.

The iDRAC is still up after almost 24 hours. I moved the iDRAC port on the switch to a different group of ports and will monitor it.

@akosiaris The iDRAC has stayed up for 4 days now; possibly my relocating it to a different port helped. We won't know until it is put in use again. This server is out of warranty; if it fails again, we could look at swapping it with another decom server?

Re-resolving, let's see how it fares this time around.

taavi subscribed.

And it's down again. I ran sudo puppet node deactivate parse1002.eqiad.wmnet again to have it removed from the scap mediawiki image pulling list.

> This server is out of warranty; if it fails again, we could look at swapping it with another decom server?

@Jclark-ctr, this failed again, should we look at swapping it with another decom? Is there anything available?

VRiley-WMF updated Other Assignee, added: Jclark-ctr.

Investigated this unit with the assistance of Dell. After some troubleshooting and pulling logs, they will be sending out a new motherboard as a replacement (tomorrow). Will update this ticket as changes are made.

Mentioned in SAL (#wikimedia-sre) [2024-05-29T18:16:39Z] <rzl> evacuate cordoned node parse1002: kubectl -n linkrecommendation delete pod linkrecommendation-internal-load-datasets-28616700-7gsqs; kubectl -n linkrecommendation delete pod linkrecommendation-internal-load-datasets-28616700-xl7t4; kubectl -n toolhub delete pod toolhub-main-crawler-28616760-jrhbb # T363086

The motherboard for parse1002 has been replaced with a brand new one that Dell has shipped out. All the cables have been hooked back into it.

After some more troubleshooting, we reset the iDRAC to factory settings and now we can log into the machine via iDRAC.

Thank you @VRiley-WMF ! Tell us when we can bring it back in the cluster.

@Clement_Goubert I was just able to provision the server and I believe you should be able to add it back. Let us know if there are any other issues with it!

Thanks. For information tracking: I put the server back to Active in Netbox, ran the sre.dns.netbox cookbook, and am running homer 'cr*eqiad*' commit right now to restore BGP connectivity.
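The bring-back sequence described above, sketched as commands (the exact cookbook and homer invocations and commit messages are assumptions based on the comment's wording):

```shell
# Regenerate DNS records from Netbox after flipping the host to Active.
sudo cookbook sre.dns.netbox "parse1002 back to active - T363086"

# Push the router configuration to restore the host's BGP session.
homer 'cr*eqiad*' commit "Restore BGP for parse1002 - T363086"
```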

Mentioned in SAL (#wikimedia-operations) [2024-05-31T15:43:17Z] <claime> pooling and uncordoning parse1002 - T363086

Server back in the cluster, resolving. Thanks again :)