We need those log lines; if we're skipping some, we should know ASAP.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Jgreen | T91508 [Epic] overhaul fundraising cluster monitoring
Invalid | | None | T197892 fundraising monitoring fixes (EPIC)
Declined | | None | T176924 Create alerts for rsyslog rate limiting
Event Timeline
As of this AM we're collecting the syslog total message count and dropped message count in Prometheus. They're on the main fundraising dashboard here: https://grafana.wikimedia.org/dashboard/db/fundraising-overview?refresh=1m&orgId=1. We haven't figured out alerting from Prometheus/Grafana yet, but that's the logical next step.
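With those two counters in place, the alert check itself can be quite small. A minimal sketch in Python, assuming both metrics are cumulative counters scraped together and that the total count includes the dropped messages; the `Sample` type and function names are hypothetical, not the actual metric names. Any nonzero drop fraction alerts by default, since every lost log line matters.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One scrape of two cumulative counters (hypothetical names)."""
    timestamp: float  # seconds since epoch
    total: int        # messages seen so far, assumed to include dropped ones
    dropped: int      # messages discarded by rate limiting so far


def counter_delta(prev: int, cur: int) -> int:
    """Increase of a monotonic counter between two scrapes.

    A smaller current value means the process restarted and the counter
    reset to zero, so the current value itself is the increase since then.
    """
    return cur - prev if cur >= prev else cur


def drop_ratio(prev: Sample, cur: Sample) -> float:
    """Fraction of messages dropped between two scrapes (0.0 with no traffic)."""
    total = counter_delta(prev.total, cur.total)
    dropped = counter_delta(prev.dropped, cur.dropped)
    if total == 0:
        return 0.0
    # Clamp in case the two counters were not read at exactly the same instant.
    return min(1.0, dropped / total)


def should_alert(prev: Sample, cur: Sample, threshold: float = 0.0) -> bool:
    """Alert on any drop at all by default (hypothetical threshold)."""
    return drop_ratio(prev, cur) > threshold
```

For example, going from 1000 total / 0 dropped to 2000 total / 50 dropped gives a drop ratio of 0.05, which trips the default threshold.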
Also, at this stage it would be impractical to alert, because the alarm would be going off constantly for the civicrm host, where queue consumers are still pushing too much log traffic.
Darn, I'd hoped the Civi host would be fine after cutting the message rate in half, but it looks like there are still spikes of rate limiting.
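One common way to keep a noisy host like civicrm from firing constantly is to alert only when the drop condition holds for several consecutive scrape intervals, similar in spirit to the `for:` duration on a Prometheus alerting rule. A hypothetical sketch in Python, tracked per host; the class name, threshold, and interval count are illustrative, not anything configured here.

```python
class SustainedAlert:
    """Fire only after `required` consecutive intervals exceed the threshold."""

    def __init__(self, threshold: float, required: int) -> None:
        self.threshold = threshold  # drop fraction that counts as a breach
        self.required = required    # consecutive breaches needed to fire
        self.streak = 0             # current run of consecutive breaches

    def update(self, ratio: float) -> bool:
        """Record one interval's drop ratio; return True if the alert fires."""
        if ratio > self.threshold:
            self.streak += 1
        else:
            # Any clean interval resets the run, so isolated spikes never fire.
            self.streak = 0
        return self.streak >= self.required
```

The trade-off is that short spikes like the ones on the Civi host would still lose log lines without alerting, so this suppresses the noise rather than fixing the underlying message rate.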