Temperature Inlet Temp issue on clouddumps1001:9290
Closed, ResolvedPublic

Assigned To
Authored By
phaultfinder
Jan 14 2025, 7:24 PM
Referenced Files
F59619503: Screenshot 2025-05-02 at 13.22.50.png
May 2 2025, 11:24 AM
F59579857: Screenshot 2025-04-30 at 11.34.13.png
Apr 30 2025, 9:35 AM
F59355701: Screenshot 2025-04-23 at 11.18.05.png
Apr 23 2025, 9:22 AM
F59289653: Screenshot 2025-04-18 at 9.17.47 AM.png
Apr 18 2025, 1:20 PM
F59281535: Screenshot 2025-04-17 at 16.25.47.png
Apr 17 2025, 2:26 PM
F59236759: nDcjLThB.png
Apr 15 2025, 3:50 PM
F59236746: WRtIUNdE.png
Apr 15 2025, 3:50 PM
F58929716: Screenshot 2025-03-27 at 12.45.06.png
Mar 27 2025, 11:47 AM

Description

Common information

  • alertname: Temperature
  • cluster: wmcs
  • id: 3
  • instance: clouddumps1001:9290
  • job: ipmi
  • name: Inlet Temp
  • prometheus: ops
  • severity: critical
  • site: eqiad
  • source: prometheus
  • team: wmcs

Event Timeline

ganeti1044 has been relocated to U30 in the same rack with the same connections. Please let us know if there are still issues with the temperatures.

Draining ganeti1029.eqiad.wmnet of running VMs

Draining ganeti1030.eqiad.wmnet of running VMs

The alert fired again a few minutes ago, then went back to normal:

Screenshot 2025-02-27 at 10.36.13.png (1×2 px, 345 KB)

I would like to check in on this: have there been any further warnings?

@VRiley-WMF No more warnings since 16:00 UTC yesterday!

Screenshot 2025-02-28 at 16.46.15.png (1×2 px, 397 KB)

I will resolve this again, and reopen it if the alert triggers again.

@fnegri @VRiley-WMF did this need to be reopened?

idrac shows The system inlet temperature is greater than the upper warning threshold. Sun 02 Mar 2025 05:30:16

Screenshot 2025-03-03 at 12.01.17 PM.png (774×2 px, 150 KB)

fyi 2 fans are only spinning at 75%

The alert has not fired since Feb 27. The alert is based on ipmi_temperature_state, and that seems to trigger when the temperature goes beyond 37 celsius (see graph below).

I'm reopening the task for further investigation because, even after moving the server next to it, the average Inlet Temp of clouddumps1001 is still hovering around 35 celsius. That is not enough to trigger the alert but, as @RobH said in the comment above, it seems too high.

Here's a graph of the past 30 days showing both ipmi_temperature_celsius and ipmi_temperature_state:

Screenshot 2025-03-04 at 12.53.15.png (828×2 px, 355 KB)
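
For anyone who wants to reproduce the "how often did we cross the line" reading of the graph offline, here is a minimal sketch. It assumes the temperature samples have already been exported as (timestamp, celsius) pairs. The 37 °C value is only the figure observed in this thread, not a documented threshold, and the function name is hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumption: 37 °C is the value observed in this thread, not a documented limit.
ASSUMED_THRESHOLD_C = 37.0


def over_threshold_spans(
    samples: Iterable[Tuple[float, float]],
    threshold: float = ASSUMED_THRESHOLD_C,
) -> List[Tuple[float, float]]:
    """Return (first_ts, last_ts) for each run of samples strictly above threshold."""
    spans: List[Tuple[float, float]] = []
    start = None
    last = None
    for ts, celsius in samples:
        if celsius > threshold:
            if start is None:
                start = ts  # a new run begins here
            last = ts
        elif start is not None:
            spans.append((start, last))  # the run ended on the previous sample
            start = None
    if start is not None:
        spans.append((start, last))  # the run was still open at the end of the data
    return spans
```

Many short spans would match the "flapping" behaviour described here, while one long span would point at a sustained temperature rise.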

The temperature remains very close to the threshold, and the alert has been firing intermittently since my previous comment.

Screenshot 2025-03-10 at 18.14.37.png (952×2 px, 413 KB)

Understood, I'm currently investigating this

Is there a timeframe for us to take this server down?

I think this one is tricky to depool, there are some notes at https://wikitech.wikimedia.org/wiki/Dumps/Dump_servers#Maintenance/downtime

It's probably a good exercise to try and route all traffic to clouddumps1002 because I think no one has done it recently :)

Routing the traffic to the other host would also clarify if the high temperature is somehow related to the user load on this server.

aborrero added a parent task: Unknown Object (Task). Mar 11 2025, 3:39 PM

Claiming this to take care of the depool, I will reassign to @VRiley-WMF when this is safe to shut down.

fnegri changed the task status from Open to In Progress. Mar 11 2025, 3:52 PM
fnegri added a subscriber: BTullis.

I did some more research on how to depool this host, and found there's an additional tip here: https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Dumps#Cloud_VPS_NFS

To failover the Cloud VPS NFS server to the other node you need to:

  • Update the dumps_dist_active_vps Hiera variable
  • Wait for a full Puppet cycle or two for most VMs to pick it up

But I don't feel confident that changing that Hiera is enough to route all traffic to the other host. Also, how will NFS clients react when the host gets shut down? Will they try reconnecting to the other host, or will they just hang causing a CloudVPS/Toolforge outage?

Finally, I don't have a clear picture of which clients are connecting to clouddumps hosts. Is it only Cloud VPS VMs using profile::wmcs::nfsclient, or are there more? I see there's an Nginx server listening as well.

I'm out all of next week so I'm reassigning this to @Andrew — it would be good if someone from Data Engineering could also double-check the depool procedure, maybe @BTullis?

Finally, I don't have a clear picture of which clients are connecting to clouddumps hosts. Is it only Cloud VPS VMs using profile::wmcs::nfsclient, or are there more?

The stat* physical hosts also mount the NFS volumes:

$ ssh stat1009.eqiad.wmnet
$ mount | grep nfs
clouddumps1002.wikimedia.org:/ on /mnt/nfs/dumps-clouddumps1002.wikimedia.org type nfs4 (ro,relatime,vers=4.2,rsize=8192,wsize=8192,namlen=255,hard,proto=tcp,timeo=14,retrans=2,sec=sys,clientaddr=10.64.21.17,local_lock=none,addr=208.80.154.71)
clouddumps1001.wikimedia.org:/ on /mnt/nfs/dumps-clouddumps1001.wikimedia.org type nfs4 (ro,relatime,vers=4.2,rsize=8192,wsize=8192,namlen=255,hard,proto=tcp,timeo=14,retrans=2,sec=sys,clientaddr=10.64.21.17,local_lock=none,addr=208.80.154.142)

These are the "Production NFS" clients from https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Dumps#Interfaces
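
To audit several clients the same way, the `mount` output can be parsed rather than read by eye. This is a rough sketch that assumes lines shaped like the two above. The function name and the returned keys are hypothetical. The `hard` flag is surfaced because hard-mounted NFS clients block on I/O while the server is unreachable, instead of returning an error.

```python
import re
from typing import Dict, List

# Matches lines such as:
#   server:/export on /mount/point type nfs4 (opt1,opt2,...)
MOUNT_RE = re.compile(
    r"^(?P<server>[^:\s]+):(?P<export>\S+) on (?P<mountpoint>\S+) "
    r"type (?P<fstype>nfs4?) \((?P<options>[^)]*)\)$"
)


def parse_nfs_mounts(mount_output: str) -> List[Dict[str, object]]:
    """Extract NFS mounts (server, mountpoint, key options) from `mount` output."""
    mounts: List[Dict[str, object]] = []
    for line in mount_output.splitlines():
        match = MOUNT_RE.match(line.strip())
        if match is None:
            continue  # not an NFS mount line
        options = match.group("options").split(",")
        mounts.append(
            {
                "server": match.group("server"),
                "mountpoint": match.group("mountpoint"),
                "read_only": "ro" in options,
                "hard": "hard" in options,  # hard mounts retry until the server returns
            }
        )
    return mounts
```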

I see there's an Nginx server listening as well.

In normal operation one of the 2 hosts serves NFS and the other serves https://dumps.wikimedia.org and the rsync mirrors. I think the nginx process always runs on both hosts and the dumps.wikimedia.org CNAME DNS record is used to control which is the active host at any given time.

clouddumps1001 is also mounted from an-launcher1002, and this path is used for running a few DAGs.

I also found this reference in PAWS: https://gitlab.wikimedia.org/repos/cloud/paws/-/blob/main/paws/production.yaml?ref_type=heads#L18-22

Without this, dumps becomes inaccessible and can hang the host

I checked that this is still in use by logging into https://hub-paws.wmcloud.org

image.png (1×1 px, 145 KB)

I did a quick check of which hostnames have this share mounted.

btullis@clouddumps1001:~$ for a in $(ss -tano state established sport nfs|awk '{print $4}'|cut -f1 -d":"|grep -v Address|sort|uniq); do host $a | grep -v alias | awk '{print $5}'; done
wdqs2009.codfw.wmnet.
wcqs2001.codfw.wmnet.
wdqs2010.codfw.wmnet.
stat1010.eqiad.wmnet.
wdqs1024.eqiad.wmnet.
an-launcher1002.eqiad.wmnet.
stat1009.eqiad.wmnet.
stat1011.eqiad.wmnet.
stat1008.eqiad.wmnet.
nat.cloudgw.eqiad1.wikimediacloud.org.
mail.toolsbeta.wmcloud.org.
instance-toolsbeta-mail-2.toolsbeta.wmcloud.org.
beta.math.wmflabs.org.
instance-math24.math.wmcloud.org.
mardi.math.wmflabs.org.
login.tools.wmflabs.org.
dev.toolforge.org.
instance-tools-bastion-12.tools.wmcloud.org.
instance-tools-checker-5.tools.wmcloud.org.
checker.tools.wmflabs.org.
instance-tools-bastion-13.tools.wmcloud.org.
bastion.toolforge.org.
instance-tools-mail-4.tools.wmcloud.org.
mail.tools.wmcloud.org.
instance-tools-sgebastion-10.tools.wmcloud.org.
login-buster.toolforge.org.
nat.cloudgw.codfw1dev.wikimediacloud.org.

PAWS doesn't show up on that list, so I presume that it's NAT'd by nat.cloudgw.eqiad1.wikimediacloud.org.
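
The one-liner above lists unique peers but not how many connections each one holds, which matters for NAT'd clients such as PAWS that all appear as a single address. Here is a small sketch for counting connections per peer from saved `ss -tano state established sport nfs` output. It assumes the column layout the one-liner relies on, with the peer address in the fourth field; the function name is hypothetical.

```python
from collections import Counter
from typing import Dict


def connections_per_peer(ss_output: str) -> Dict[str, int]:
    """Count established connections per peer IP from saved `ss` output."""
    counts: Counter = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Data rows start with the numeric Recv-Q column; this skips the header.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        peer = fields[3]  # "address:port", as in the awk '{print $4}' above
        address = peer.rsplit(":", 1)[0]  # split on the last colon only
        counts[address] += 1
    return dict(counts)
```

A peer with a large count, such as a NAT gateway, is probably fronting many clients, so its entry in the list above understates how many consumers sit behind it.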

So from the Data-Platform-SRE side, I think that we would be OK with downtime. Processes trying to access this path will just hang, waiting for it to return.

But I think that the NFS clients in WMCS might have more of an issue, as @fnegri indicated.

fnegri added subscribers: dcaro, aborrero.

I'm back from holidays and I'm re-assigning this task to myself. @aborrero is quite confident that updating the hiera keys dumps_dist_active_vps and dumps_dist_active_web as described at Dumps#Failover should be enough for a safe failover.

@dcaro is concerned about active NFS mounts from pods, those might require a restart of the NFS server (unless Puppet is doing it automatically).

A similar failover was done in T321313, related patches:

Change #1131051 had a related patch set uploaded (by FNegri; author: FNegri):

[operations/puppet@production] Failover all dumps traffic to clouddumps1002

https://gerrit.wikimedia.org/r/1131051

Change #1131051 merged by FNegri:

[operations/puppet@production] Failover all dumps traffic to clouddumps1002

https://gerrit.wikimedia.org/r/1131051

@dcaro is concerned about active NFS mounts from pods, those might require a restart of the NFS server (unless Puppet is doing it automatically).

This is what Puppet did on NFS workers, the symlinks were updated but the service was not restarted:

2025-03-26T13:19:32.034169+00:00 tools-k8s-worker-nfs-43 puppet-agent[355401]: Applying configuration version '(673839ca92) jelto - aptrepo: upgrade gitlab-ce and gitlab-runner to 17.8'
2025-03-26T13:19:38.036936+00:00 tools-k8s-worker-nfs-43 puppet-agent[355401]: (/Stage[main]/Profile::Wmcs::Nfsclient/File[/public/dumps/public]/target) target changed '/mnt/nfs/dumps-clouddumps1001.wikimedia.org/' to '/mnt/nfs/dumps-clouddumps1002.wikimedia.org/'
2025-03-26T13:19:38.041939+00:00 tools-k8s-worker-nfs-43 puppet-agent[355401]: (/Stage[main]/Profile::Wmcs::Nfsclient/File[/public/dumps/incr]/target) target changed '/mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/incr' to '/mnt/nfs/dumps-clouddumps1002.wikimedia.org/other/incr'
2025-03-26T13:19:38.046222+00:00 tools-k8s-worker-nfs-43 puppet-agent[355401]: (/Stage[main]/Profile::Wmcs::Nfsclient/File[/public/dumps/pagecounts-all-sites]/target) target changed '/mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/pagecounts-all-sites' to '/mnt/nfs/dumps-clouddumps1002.wikimedia.org/other/pagecounts-all-sites'
2025-03-26T13:19:38.049072+00:00 tools-k8s-worker-nfs-43 puppet-agent[355401]: (/Stage[main]/Profile::Wmcs::Nfsclient/File[/public/dumps/pagecounts-raw]/target) target changed '/mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/pagecounts-raw' to '/mnt/nfs/dumps-clouddumps1002.wikimedia.org/other/pagecounts-raw'
2025-03-26T13:19:38.052357+00:00 tools-k8s-worker-nfs-43 puppet-agent[355401]: (/Stage[main]/Profile::Wmcs::Nfsclient/File[/public/dumps/pageviews]/target) target changed '/mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/pageviews' to '/mnt/nfs/dumps-clouddumps1002.wikimedia.org/other/pageviews'
2025-03-26T13:19:39.359024+00:00 tools-k8s-worker-nfs-43 puppet-agent[355401]: Applied catalog in 7.44 seconds
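
To check the same thing across many workers without reading each log by hand, the symlink changes can be pulled out of puppet-agent log text. This is a sketch that assumes every log entry sits on a single line and follows the "target changed '…' to '…'" wording seen above; the function name is hypothetical.

```python
import re
from typing import List, Tuple

# Captures the managed symlink path plus its old and new targets.
TARGET_CHANGE_RE = re.compile(
    r"File\[(?P<link>[^\]]+)\]/target\) target changed "
    r"'(?P<old>[^']*)' to '(?P<new>[^']*)'"
)


def symlink_changes(log_text: str) -> List[Tuple[str, str, str]]:
    """Return (link, old_target, new_target) for each symlink retarget in the log."""
    changes: List[Tuple[str, str, str]] = []
    for line in log_text.splitlines():
        match = TARGET_CHANGE_RE.search(line)
        if match:
            changes.append(
                (match.group("link"), match.group("old"), match.group("new"))
            )
    return changes
```

An empty result for a host after a Puppet run would suggest that host has not picked up the failover yet, or that its links already pointed at the new server.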

It's not that the service needs restarting (that would not help). The concern is that if any processes (e.g. pods) have a file open from before the link changed, that handle will still point to the old server. When the old NFS server goes away, that file handle will get stuck, and the pods might need restarting (they might also just fail by themselves after 5 minutes or so).

It's not that the service needs restarting, but if there are any processes (e.g. pods) that have a file open from before the link changed, that handle will still point to the old server

I see, is there any way of finding out how many processes have a handle pointing to the old server? I guess we can just wait a few hours and hope that most of those will be closed by then.
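
One possible way to answer this on a single client is to walk `/proc/<pid>/fd` and look for descriptors that resolve to paths under the old mount point. The sketch below is only that: the function name is hypothetical, it has to run as root to see other users' processes, and processes in other mount namespaces (such as pods) may report different paths. It also ignores working directories and memory-mapped files.

```python
import os
from typing import Dict, List


def pids_with_open_files(
    mount_prefix: str, proc_root: str = "/proc"
) -> Dict[int, List[str]]:
    """Map PID -> open file paths that live under mount_prefix."""
    prefix = mount_prefix.rstrip("/") + "/"
    hits: Dict[int, List[str]] = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries are processes
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # descriptor closed while we were looking
            if target.startswith(prefix):
                hits.setdefault(int(entry), []).append(target)
    return hits
```

For example, `pids_with_open_files("/mnt/nfs/dumps-clouddumps1001.wikimedia.org")` would list the processes on that client still holding files on the old server. Since reading a symlink in `/proc` does not touch the NFS server, this should be safe to run even when the server is already unreachable.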

I'm also a bit concerned by the fact there are still many open connections to the "depooled" server and I have no idea how many of those are "open but idle" and how many are in active use:

fnegri@clouddumps1001:~$ ss -tano state established sport nfs |wc -l
122

fnegri@clouddumps1001:~$ for a in $(ss -tano state established sport nfs|awk '{print $4}'|cut -f1 -d":"|grep -v Address|sort|uniq); do host $a | grep -v alias | awk '{print $5}'; done
wdqs2009.codfw.wmnet.
wcqs2001.codfw.wmnet.
wdqs2010.codfw.wmnet.
stat1010.eqiad.wmnet.
wdqs1024.eqiad.wmnet.
an-launcher1002.eqiad.wmnet.
stat1009.eqiad.wmnet.
stat1011.eqiad.wmnet.
stat1008.eqiad.wmnet.
nat.cloudgw.eqiad1.wikimediacloud.org.
instance-toolsbeta-mail-2.toolsbeta.wmcloud.org.
mail.toolsbeta.wmcloud.org.
beta.math.wmflabs.org.
instance-math24.math.wmcloud.org.
mardi.math.wmflabs.org.
login.tools.wmflabs.org.
dev.toolforge.org.
instance-tools-bastion-12.tools.wmcloud.org.
instance-tools-checker-5.tools.wmcloud.org.
checker.tools.wmflabs.org.
bastion.toolforge.org.
instance-tools-bastion-13.tools.wmcloud.org.
mail.tools.wmcloud.org.
instance-tools-mail-4.tools.wmcloud.org.
instance-tools-sgebastion-10.tools.wmcloud.org.
login-buster.toolforge.org.
nat.cloudgw.codfw1dev.wikimediacloud.org.

There are still some spikes of network activity on clouddumps1001, I will wait until tomorrow to see if that continues overnight or not:

Screenshot 2025-03-26 at 15.25.43.png (540×1 px, 116 KB)

Network activity did not decrease much after I merged my patch:

Screenshot 2025-03-27 at 12.45.06.png (530×1 px, 111 KB)

NFS connections did not decrease either:

fnegri@clouddumps1001:~$ ss -tano state established sport nfs |wc -l
122

@VRiley-WMF sorry for the lack of updates, I had to prioritize other things. This server is still alerting, and I didn't manage to depool it completely: there are still some connections going to it.

After discussing with @Andrew, we think we're ready to shut it down anyway so that you can try to swap the fans, but we would like to coordinate with you so that the host does not remain offline for more than a few hours.

Would you be available this week or the next, at a time between 15:00 UTC and 17:00 UTC? If that does not work for you we can find other slots!

I'm available for this activity at any time. 15:00 UTC - 17:00 UTC works for me.

Great! I talked with @VRiley-WMF and the plan is:

  1. I'm gonna shut down the server tomorrow for about 1 hour, to check if there's any unexpected impact, then take it back online.
  2. If the 1-hour downtime does not cause immediate trouble, we'll schedule a longer 6-hour downtime for @VRiley-WMF to do more troubleshooting (possibly later tomorrow or later this week).

Icinga downtime and Alertmanager silence (ID=591bcb32-8025-4bce-af2c-49d023d1b4ca) set by fnegri@cumin1002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: down for maintenance

clouddumps1001.wikimedia.org

I'm gonna shut down the server tomorrow for about 1 hour, to check if there's any unexpected impact, then take it back online.

I did this and it caused some issues on both Cloud VPS and Toolforge. I created T391369: If the inactive clouddumps host goes down, it causes a ripple effect on Cloud VPS and Toolforge.

We need to fix that before we can shut down clouddumps1001 for longer.

fnegri changed the task status from In Progress to Stalled. Apr 9 2025, 1:27 PM

@fnegri I had looked at this briefly. It looks like this might be a bad intake sensor rather than actual overheating: comparing this host with clouddumps1002, all other sensor readings look identical except inlet air. We should compare inlet air temps with neighboring devices before assuming the sensor is bad, though. If we are able to relocate the server, we should move it lower in the rack.

WRtIUNdE.png (289×600 px, 35 KB)

nDcjLThB.png (280×900 px, 53 KB)

Thanks @Jclark-ctr, do you think there is a way to disable the sensor so that it will not trigger the alert? We could also silence that alert indefinitely, only for this specific host, but it's a bit of a hack.

I still want to fix T391369: If the inactive clouddumps host goes down, it causes a ripple effect on Cloud VPS and Toolforge because we might have to take this server offline for other reasons in the future.

We should compare inlet air temps with neighboring devices before assuming the sensor is bad, though.

The next two servers in the rack (an-worker1150 and clouddb1016) also have a similar Inlet Temp, so maybe it's not a bad sensor.

https://grafana.wikimedia.org/goto/C3i1lGJHR?orgId=1

Screenshot 2025-04-17 at 16.25.47.png (1×2 px, 416 KB)

fnegri changed the task status from Stalled to In Progress. Apr 17 2025, 2:33 PM
fnegri lowered the priority of this task from High to Medium.

@fnegri I installed several blanking panels on the hot aisle to help prevent hot air from bleeding into the cold aisle between Row B and Row C. After stepping back to assess the broader airflow pattern, I noticed that the fins of the overhead air vent near it may also be contributing to the issue. The far left one is pointed in the wrong direction and is the one that would be pointed directly at that rack. Let's see if this makes a difference.

Screenshot 2025-04-18 at 9.17.47 AM.png (1×1 px, 2 MB)

Also updated idrac firmware from 5.00.20.00 to 7.00.00.181

No alerts for 4 days, and temps and fan speeds have dropped. Closing this ticket for Temp.
The system inlet temperature is greater than the upper warning threshold. Thu Apr 17 2025 20:45:16

There was definitely an improvement, but Inlet Temp for clouddumps1001 remains about 4°C above the next two hosts in the rack (an-worker1150 and clouddb1016). Those two hosts saw a bigger improvement from your intervention, see the chart below. I think we can resolve the task for now, because all 3 hosts are no longer close to the alert threshold.

https://grafana.wikimedia.org/goto/VrFvy-JHg?orgId=1

Screenshot 2025-04-23 at 11.18.05.png (952×3 px, 481 KB)

Reopening as unfortunately the alert is still flapping. It looks like the whole rack's temperature ramped up in the last few hours. Here's a graph showing the Inlet Temp of the top 3 servers in the rack during the past 30 days.

Screenshot 2025-04-30 at 11.34.13.png (954×2 px, 612 KB)

https://grafana.wikimedia.org/goto/Yy-mQnxNg?orgId=1

Made some more adjustments; will see how this affects cooling.

fnegri moved this task from In progress to Done on the cloud-services-team (FY2024/2025-Q3-Q4) board.

Thanks @Jclark-ctr, looking good:

Screenshot 2025-05-02 at 13.22.50.png (960×2 px, 553 KB)

I'll mark this as Resolved, and let you know in case the alert triggers again.

Mentioned in SAL (#wikimedia-cloud) [2025-05-23T09:00:01Z] <dhinus> failover dumps_dist_active_vps to clouddumps1001 (T383723)