Yesterday, as part of T376277: Reimage cloudweb hosts to trixie, both cloudweb1003 and cloudweb1004 were reimaged in quick succession, and on Trixie they failed healthchecks from pybal. As a result, horizon (and other cloudweb services, list TBD) were down. We did not get paged, and we should have been.
Description
Details
Related Changes in Gerrit:
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| hieradata: enable paging for labweb-ssl service and route to wmcs | operations/puppet | production | +2 -1 |
| Status | Assigned | Task |
|---|---|---|
| Resolved | fgiunchedi | T411470 Page on cloudweb/horizon down |
| Resolved | fgiunchedi | T399807 Allow team customization for service::catalog probes |
Event Timeline
I dug into this a little. Currently:
- the service::catalog entry for labweb-ssl has page: false, because enabling paging there would page SRE rather than WMCS. The proper fix is to resolve (by yours truly) T399807: Allow team customization for service::catalog probes
- there is a paging probe for striker in profile::wmcs::striker::monitoring; however, it probes port 443 (i.e. striker directly), not envoy on 7443
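To illustrate the gap described above, here is a minimal Python sketch of why a paging probe on port 443 does not cover a failure on port 7443. It is not the production probe: the function names are hypothetical, and a plain TCP connect stands in for whatever HTTP/TLS checks the real probes perform.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def check_ports(host, ports, probe=port_is_open):
    """Map each port to whether the given probe reports it reachable."""
    return {port: probe(host, port) for port in ports}


def uncovered_failures(results, paging_ports):
    """Return the ports that are down but have no paging probe attached."""
    return sorted(
        port for port, ok in results.items()
        if not ok and port not in paging_ports
    )
```

With a paging probe only on 443, a failure confined to 7443 shows up in `uncovered_failures(results, {443})`, meaning nobody gets paged for it.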
Change #1215114 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] hieradata: enable paging for labweb-ssl service and route to wmcs
Change #1215114 merged by Filippo Giunchedi:
[operations/puppet@production] hieradata: enable paging for labweb-ssl service and route to wmcs
