
swift hosts (thanos-fe1001, ms-be2012) with failed prometheus-ipmi-exporter services
Closed, Declined (Public)

Description

Found a couple of swift hosts (thanos-fe1001, ms-be2012) with failed prometheus-ipmi-exporter services today. Looks like a race with swift-proxy, which ended up holding port 9290 as the source port of an outbound connection, so the exporter could no longer bind it.

Jun 23 15:59:08 ms-fe2012 prometheus-ipmi-exporter[971859]: time="2022-06-23T15:59:08Z" level=fatal msg="listen tcp :9290: bind: address already in use" source="main.go:150"
ms-fe2012:~# lsof -i | grep 9290
swift-pro    866                      swift   66u  IPv4 1836545862      0t0  TCP ms-fe2012.codfw.wmnet:9290->ms-fe2009.codfw.wmnet:11211 (ESTABLISHED)
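For illustration, a minimal sketch of the same failure mode, assuming Linux and Python 3. It is not how swift-proxy or the exporter are implemented: the function name is hypothetical, and a throwaway local listener stands in for the remote peer on port 11211. The kernel picks the client's source port, and a later attempt to listen on that port fails with "address already in use".

```python
import errno
import socket


def ephemeral_port_blocks_listener() -> bool:
    """Return True if listening on a port that is in use as a source port fails."""
    # Throwaway local server, standing in for the remote peer (assumption).
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)

    # Outbound connection: the kernel assigns an ephemeral source port.
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(server.getsockname())
    src_port = client.getsockname()[1]

    # A service now tries to listen on that same port, like the exporter on :9290.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        listener.bind(("", src_port))
        listener.listen(1)
        return False
    except OSError as exc:
        return exc.errno == errno.EADDRINUSE
    finally:
        listener.close()
        client.close()
        server.close()


if __name__ == "__main__":
    print("listener blocked:", ephemeral_port_blocks_listener())
```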

Event Timeline

herron triaged this task as Medium priority. Jun 23 2022, 5:18 PM
herron created this task.

Looks like we customize the ephemeral port range on the swift hosts to 1024-65535. Maybe we can raise the lower bound of the range that swift-proxy's source ports are chosen from, to help prevent this?

Change 808040 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] swift: update ephemeral port range from 1024-65535 to 10240-65535

https://gerrit.wikimedia.org/r/808040
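A small sketch of why the proposed range would help, assuming the range is expressed the way Linux exposes it in /proc/sys/net/ipv4/ip_local_port_range (two whitespace-separated integers). The helper names are hypothetical; the port and both ranges come from this task.

```python
def parse_port_range(sysctl_value: str) -> tuple:
    """Parse an ip_local_port_range value such as '1024\t65535' into (low, high)."""
    low, high = (int(part) for part in sysctl_value.split())
    return low, high


def in_ephemeral_range(port: int, sysctl_value: str) -> bool:
    """Return True if the kernel may hand out `port` as an ephemeral source port."""
    low, high = parse_port_range(sysctl_value)
    return low <= port <= high


if __name__ == "__main__":
    exporter_port = 9290
    for value in ("1024\t65535", "10240\t65535"):
        print(value.split(), in_ephemeral_range(exporter_port, value))
```

On a live host the current value could be read with `open("/proc/sys/net/ipv4/ip_local_port_range").read()` and passed to `in_ephemeral_range`.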

This issue seems to be related to deploying ipmi-exporter fleetwide for the first time while the hosts and swift had already been running for a while, so the exporter port was already in use as an ephemeral port.

Under normal operations, AFAIK, the exporters start on reboot and grab the port first, and we haven't seen this issue (exporters unable to start) on thanos/swift hosts on a regular basis at least. Technically the issue can still happen if we restart an exporter and the kernel hands out its port as an ephemeral port in the meantime, though that is a pretty small window. Given the above, IMHO we're good to leave things as is, especially since we don't deploy new exporters fleetwide that often anymore.
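If someone does want to check for that window before restarting an exporter, a rough sketch follows. It assumes the Linux /proc/net/tcp text format (hex `address:port` in the second column, hex state code in the fourth) and covers IPv4 only. The function name and the sample table are made up for illustration, not taken from a real host.

```python
def local_port_states(proc_net_tcp: str, port: int) -> list:
    """Return the hex state codes of sockets whose local port equals `port`."""
    states = []
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like '0100007F:244A'; the port part is hex.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port:
            states.append(fields[3])
    return states


# Made-up sample: 0x244A is 9290, 0x2BCB is 11211; state 01 is ESTABLISHED, 0A is LISTEN.
SAMPLE = """\
  sl  local_address rem_address   st
   0: 0100007F:244A 0100007F:2BCB 01
   1: 00000000:0016 00000000:0000 0A
"""

if __name__ == "__main__":
    print(local_port_states(SAMPLE, 9290))
```

On a host, `local_port_states(open("/proc/net/tcp").read(), 9290)` returning a non-empty list would mean something already holds the port, much like the `lsof -i | grep 9290` check in the description.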

Change 808040 abandoned by Herron:

[operations/puppet@production] swift: update ephemeral port range from 1024-65535 to 10240-65535

Reason:

Didn't move forward with this; please see the task.

https://gerrit.wikimedia.org/r/808040

lmata subscribed.

Discussed in today's team meeting, boldly declining. Please re-open if you feel differently.