
Add monitoring+alerting for codesearch-write-config
Closed, Resolved | Public

Description

Follow-up from T294915: CodeSearch's "deployed" profile hasn't yet worked out that WikiLambda is now branched for production. We need monitoring and alerting for when the codesearch-write-config systemd timer fails.

Event Timeline

Legoktm assigned this task to taavi.
10:34:20 <legoktm> majavah: also, is it possible to have alerting if a systemd unit fails, like we do in prod?
10:35:01 <legoktm> specifically the codesearch-write-config unit (https://phabricator.wikimedia.org/T294915) but even if it was all units that would be fine too
10:35:37 <legoktm> if not I'll have the codesearch web app export the systemd status as a bool in our current metrics endpoint
10:36:27 <legoktm> T294958
10:36:28 <+stashbot> T294958: Add monitoring+alerting for codesearch-write-config - https://phabricator.wikimedia.org/T294958
10:50:11 <majavah> legoktm: prometheus-node-exporter collects systemd stats, so that should be set now
10:50:36 <legoktm> majavah: for all units or just that specific one?
10:50:50 <majavah> just that specific one
10:51:19 <legoktm> would it be excessive/problematic/difficult to do it for all?
10:51:46 <majavah> T287309
10:51:46 <+stashbot> T287309: Some systemd services appear to be broken on all VMs - https://phabricator.wikimedia.org/T287309
10:52:10 <legoktm> ack, fair enough
10:52:13 <legoktm> thanks :)
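
For reference, the fallback legoktm mentions at 10:34 (having the codesearch web app export the systemd status as a bool in the metrics endpoint) could look roughly like the sketch below. It is not what was deployed, since prometheus-node-exporter ended up covering this. The unit name `codesearch-write-config.service` and the metric name `codesearch_systemd_unit_failed` are assumptions for illustration, not taken from the CodeSearch code.

```python
import subprocess

# Assumed unit name, derived from the task title; the real unit may differ.
UNIT = "codesearch-write-config.service"


def unit_failed(unit: str) -> bool:
    """Return True if systemd reports the unit as failed."""
    # `systemctl is-failed` exits with 0 when the unit is in the failed state
    # and non-zero otherwise; --quiet suppresses the printed state.
    result = subprocess.run(
        ["systemctl", "is-failed", "--quiet", unit],
        check=False,
    )
    return result.returncode == 0


def render_metric(unit: str, failed: bool) -> str:
    """Render one gauge in the Prometheus text exposition format."""
    name = "codesearch_systemd_unit_failed"  # hypothetical metric name
    return (
        f"# HELP {name} Whether the systemd unit is in the failed state.\n"
        f"# TYPE {name} gauge\n"
        f'{name}{{unit="{unit}"}} {int(failed)}\n'
    )


if __name__ == "__main__":
    # Print the metric as a metrics endpoint might serve it.
    print(render_metric(UNIT, unit_failed(UNIT)), end="")
```

An alert would then fire whenever the gauge is 1. Keeping the rendering separate from the `systemctl` call means the output format can be checked on a machine without systemd.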