monitoring & alerting for purged
Closed, Resolved, Public

Description

  • kafka consumer group lag
  • purged-local backlog number
  • purged_event_lag metric

Event Timeline

ema triaged this task as Medium priority. Jun 26 2020, 9:47 AM
ema moved this task from Triage to Caching on the Traffic board.

Change 608019 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] purged: alert in case of high event lag

https://gerrit.wikimedia.org/r/608019

Change 608019 merged by Ema:
[operations/puppet@production] purged: alert in case of high event lag

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608019

Change 608564 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] purged: alert if local backlog grows past the given limits

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608564

Change 608564 merged by Ema:
[operations/puppet@production] purged: alert if local backlog grows past the given limits

https://gerrit.wikimedia.org/r/c/operations/puppet/+/608564

@CDanis all done except for rdkafka_consumer_topics_partitions_consumer_lag: grafana.wikimedia.org/explore returns nothing for that metric, even going back one month. Let me know if you think event-lag and local-backlog are enough for the scope of this ticket.

CDanis claimed this task.

I was thinking about this metric:
kafka_burrow_partition_lag{topic=~".*\\.resource-purge",group=~"cp.*"} -- grafana explore links for eqiad and codfw

However, the current monitoring is likely sufficient.
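
For anyone who wants to check the Burrow lag selector quoted above by hand, here is a minimal sketch in Python. It only builds an instant-query URL for the standard Prometheus HTTP API (`/api/v1/query`) and filters an already-fetched response. The base URL, the threshold, and the `partition` label are assumptions for illustration; they are not taken from this task or from the merged puppet changes.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode

# PromQL selector copied from the comment above (raw string keeps the \\. escape).
LAG_QUERY = r'kafka_burrow_partition_lag{topic=~".*\\.resource-purge",group=~"cp.*"}'


def build_query_url(base_url: str, query: str = LAG_QUERY) -> str:
    """Return an instant-query URL for the Prometheus HTTP API.

    base_url is hypothetical; point it at whichever Prometheus instance
    actually scrapes the Burrow exporter.
    """
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def partitions_over(response_body: str, threshold: float) -> List[Tuple[str, str, str, float]]:
    """List (group, topic, partition, lag) for samples whose lag exceeds threshold.

    Expects the JSON body of an instant query with a vector result.
    The "partition" label name is an assumption about the exporter's labels.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")

    over = []
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        # Instant-vector samples are [timestamp, "value-as-string"].
        lag = float(sample["value"][1])
        if lag > threshold:
            over.append((
                labels.get("group", ""),
                labels.get("topic", ""),
                labels.get("partition", ""),
                lag,
            ))
    # Worst offenders first.
    return sorted(over, key=lambda row: row[3], reverse=True)
```

The response body could be fetched with `urllib.request.urlopen(build_query_url(...)).read()` and passed to `partitions_over` with whatever lag threshold seems reasonable. This is a manual cross-check only, not a replacement for the event-lag and local-backlog alerts already merged.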