
Requests to prometheus pushgateway are timing out
Closed, Resolved (Public)

Description

We've seen a few calls to prometheus-pushgateway.discovery.wmnet fail with connection timeouts. The host behind it appears to be offline or unreachable.

curl: (28) Failed to connect to prometheus-pushgateway.discovery.wmnet port 80: Connection timed out
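The timeout above can be checked independently of curl with a plain TCP connect. This is a minimal sketch, not part of the original report: the helper name `can_connect` and the 5-second default timeout are assumptions chosen for illustration. Only the hostname and port 80 come from the curl error.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for illustration; the name and default timeout
    are assumptions, not something taken from the task.
    """
    try:
        # create_connection resolves the name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers connect timeouts, refused connections,
        # and DNS resolution failures.
        return False
```

Calling `can_connect("prometheus-pushgateway.discovery.wmnet", 80)` from an affected client would be expected to return `False` while the backing host is down. Note that it does not distinguish a timeout from a refused connection or a DNS failure.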

Event Timeline

colewhite triaged this task as Unbreak Now! priority. Apr 19 2024, 3:36 PM
colewhite edited projects, added Observability-Metrics; removed observability.
colewhite subscribed.
-------------------------------------------------------------------------------
Record:      10
Date/Time:   04/19/2024 08:01:45
Source:      system
Severity:    Critical
Description: The system memory has faced an uncorrectable multi-bit memory errors in the non-execution path of a memory device at the location DIMM_B3.
-------------------------------------------------------------------------------

... lots of self heal operations ...

-------------------------------------------------------------------------------
Record:      492
Date/Time:   04/19/2024 08:08:20
Source:      system
Severity:    Critical
Description: Multi-bit memory errors detected on a memory device at location(s) DIMM_B3.
-------------------------------------------------------------------------------

Seems like a bad stick of memory.

Change #1022027 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/dns@master] promote prometheus1006 as pushgateway primary

https://gerrit.wikimedia.org/r/1022027

Change #1022028 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] prometheus: promote prometheus1006 to pushgateway duty

https://gerrit.wikimedia.org/r/1022028

Change #1022028 merged by Cwhite:

[operations/puppet@production] prometheus: promote prometheus1006 to pushgateway duty

https://gerrit.wikimedia.org/r/1022028

Change #1022027 merged by Cwhite:

[operations/dns@master] promote prometheus1006 as pushgateway primary

https://gerrit.wikimedia.org/r/1022027

colewhite claimed this task.

Pushgateway was moved to prometheus1006.