
Page on cloudweb/horizon down
Closed, Resolved, Public

Description

Yesterday, as part of T376277: Reimage cloudweb hosts to trixie, both cloudweb1003 and cloudweb1004 were reimaged in quick succession, and on Trixie they failed healthchecks from pybal. As a result, horizon (and other cloudweb services, list TBD) was down. We didn't get paged, and we should have been.

Event Timeline

Availability as seen by network probes:

[Screenshot: 2025-12-02-113309_3680x1113_scrot.png, 158 KB]

I dug into this a little, currently:

  • the service::catalog entry for labweb-ssl has page: false, because enabling paging there would page SRE, not WMCS. The proper fix is resolving (by yours truly) T399807: Allow team customization for service::catalog probes
  • there is a paging probe for striker in profile::wmcs::striker::monitoring; however, it probes port 443 (i.e. striker directly), not envoy on 7443
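To make the gap in the two points above concrete, here is a minimal Python sketch of which team would get paged for a given failure. It is only an illustrative model, not the actual probe or alerting code: the `Probe` class and `paging_teams` helper are hypothetical, and it assumes that the labweb-ssl probe covers envoy on port 7443 and that the outage affected that port. The "after" state simply mirrors the title of the proposed change (paging enabled and routed to wmcs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """Hypothetical, simplified view of a monitoring probe."""
    name: str
    port: int
    page: bool
    team: str


def paging_teams(probes, failing_ports):
    """Return the set of teams paged when checks on failing_ports fail."""
    return {p.team for p in probes if p.page and p.port in failing_ports}


# Assumed state before the change: labweb-ssl does not page,
# and the striker probe pages but only watches port 443.
probes_before = [
    Probe("labweb-ssl", port=7443, page=False, team="sre"),
    Probe("striker", port=443, page=True, team="wmcs"),
]

# Assumed state after the change: labweb-ssl pages and is routed to wmcs.
probes_after = [
    Probe("labweb-ssl", port=7443, page=True, team="wmcs"),
    Probe("striker", port=443, page=True, team="wmcs"),
]

# A failure on envoy (7443) pages nobody before, and wmcs after.
print(paging_teams(probes_before, {7443}))
print(paging_teams(probes_after, {7443}))
```

In this model a failure limited to port 7443 is invisible to the paging striker probe on 443, which matches the observation that nobody was paged.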

Change #1215114 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: enable paging for labweb-ssl service and route to wmcs

https://gerrit.wikimedia.org/r/1215114

taavi triaged this task as High priority. Thu, Dec 4, 11:45 AM

Change #1215114 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: enable paging for labweb-ssl service and route to wmcs

https://gerrit.wikimedia.org/r/1215114

taavi assigned this task to fgiunchedi.