Create dashboards/alerts for new Cirrus Streaming Updater
Closed, Resolved (Public)

Description

AC:

  • Create a working dashboard for the Cirrus Streaming Updater. A first attempt is here.
  • Link relevant dashboards together
  • Create alerts
  • Decide alert urgency and paging strategy. As @dcausse pointed out today, downstream services such as ChangeProp are closely watched by mainline SREs because they serve far more than search. Once the Search Update Pipeline moves from ChangeProp to Flink, we lose that mainline SRE visibility. Mainline SREs also lack Flink experience, so their help will be limited. We will therefore have to watch the pipeline more closely and react more quickly.
  • Link all dashboards in cirrus-streaming-updater documentation.

Related Objects

Status      Assigned
Resolved    Gehel
Resolved    Gehel
Resolved    Ottomata
Resolved    gmodena
Open        None
Resolved    gmodena
Resolved    bking
Resolved    bking
Resolved    bking
Resolved    Gehel
Resolved    MatthewVernon
Resolved    bking
Resolved    bking
Resolved    EBernhardson
Resolved    EBernhardson
Resolved    dcausse
Invalid     None
Invalid     None
Resolved    dcausse
Resolved    EBernhardson
Duplicate   None
Open        None
Resolved    bking
Resolved    bking

Event Timeline

bking triaged this task as Medium priority.
bking moved this task from Incoming to In Progress on the Data-Platform-SRE board.
RKemper updated the task description.
bking renamed this task from Create dashboards/alerts for new Search Update Pipeline to Create dashboards/alerts for new Cirrus Streaming Updater. Oct 27 2023, 2:59 PM
Gehel raised the priority of this task from Medium to High. Dec 19 2023, 4:58 PM
Gehel moved this task from Incoming to Observability on the Data-Platform-SRE board.
Gehel removed bking as the assignee of this task. Dec 19 2023, 5:00 PM

Per today's Wednesday meeting, we need to link to the other relevant dashboards from our SUP dashboard. You can add a "Markdown cell" to the dashboard to hold those links.

Here's a list of metrics and alerts based on them:

  • alert if the combined Kafka message-in-rate (see Grafana panel) is 0 for more than 5 minutes, for the following topics
    • eqiad.cirrussearch.update_pipeline.update.rc0
    • codfw.cirrussearch.update_pipeline.update.rc0
  • alert if the combined Kafka message-in-rate (see Grafana panel) is above 0.1 for more than 5 minutes, for the following topics
    • eqiad.cirrussearch.update_pipeline.fetch_error.rc0
    • codfw.cirrussearch.update_pipeline.fetch_error.rc0
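
The two rules above share one shape: sum the per-topic rates, then fire only when a condition on that sum has held for more than 5 minutes. Here is a minimal Python sketch of that logic for illustration only. It is not the rule that was merged into operations/alerts. The function names, the sample format (timestamp in seconds, messages per second), and the reading of "0.1" as a threshold that the rate must exceed are all assumptions.

```python
from typing import Callable, Dict, List, Tuple

# One sample is (timestamp in seconds, message-in-rate); format is an assumption.
Series = List[Tuple[int, float]]


def combined_rate(per_topic: Dict[str, Series]) -> Series:
    """Sum the rates of all topics at each timestamp that every topic reports."""
    totals: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for series in per_topic.values():
        for ts, rate in series:
            totals[ts] = totals.get(ts, 0.0) + rate
            counts[ts] = counts.get(ts, 0) + 1
    n_topics = len(per_topic)
    return sorted((ts, total) for ts, total in totals.items()
                  if counts[ts] == n_topics)


def should_fire(series: Series,
                condition: Callable[[float], bool],
                hold_seconds: int = 300) -> bool:
    """True if `condition` has held without a break for more than
    `hold_seconds`, measured up to the latest sample."""
    if not series:
        return False
    start = None  # timestamp at which the current unbroken run began
    for ts, value in series:
        if condition(value):
            if start is None:
                start = ts
        else:
            start = None
    return start is not None and series[-1][0] - start > hold_seconds


if __name__ == "__main__":
    timestamps = range(0, 361, 60)  # hypothetical one-minute scrape interval
    update = {
        "eqiad.cirrussearch.update_pipeline.update.rc0": [(t, 0.0) for t in timestamps],
        "codfw.cirrussearch.update_pipeline.update.rc0": [(t, 0.0) for t in timestamps],
    }
    # No updates produced anywhere for six minutes: the first rule fires.
    print(should_fire(combined_rate(update), lambda v: v == 0))
    fetch_error = {
        "eqiad.cirrussearch.update_pipeline.fetch_error.rc0": [(t, 0.2) for t in timestamps],
        "codfw.cirrussearch.update_pipeline.fetch_error.rc0": [(t, 0.0) for t in timestamps],
    }
    # Sustained fetch errors above the assumed threshold: the second rule fires.
    print(should_fire(combined_rate(fetch_error), lambda v: v > 0.1))
```

Summing before comparing matters for the first rule: traffic normally flows through only one datacenter's topic at a time, so a per-topic zero check would fire on the idle side.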

Change #1042396 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] team-search-platform: Add kafka topic alerts for new search pipeline

https://gerrit.wikimedia.org/r/1042396

Change #1042396 abandoned by Bking:

[operations/alerts@master] team-search-platform: Add kafka topic alerts for new search pipeline

Reason:

Let's not add and change the files at the same time; we'll start by adding to search platform instead.

https://gerrit.wikimedia.org/r/1042396

Change #1043198 had a related patch set uploaded (by Bking; author: Bking):

[operations/alerts@master] team-search-platform: Add kafka topic alerts for new search pipeline

https://gerrit.wikimedia.org/r/1043198

bking changed the task status from Open to In Progress. Jun 13 2024, 9:59 PM
bking updated Other Assignee, added: RKemper.

Change #1043198 merged by jenkins-bot:

[operations/alerts@master] team-search-platform: Add kafka topic alerts for new search pipeline

https://gerrit.wikimedia.org/r/1043198

I believe the last merge satisfies the requirements for this ticket, so I'm closing it out. If you feel we are still missing alerts or dashboards, feel free to re-open this task or create a new one.