
Should wdqs LVS checks page
Open, High, Public

Description

During a recent incident, we didn't get paged until the entire service was unavailable, even though there were early warning signs of the issue in Icinga. This task is to evaluate whether the early alerts should also have paged, or alternatively whether we should stop paging altogether for this service.

Alerts which started the incident

<+icinga-wm> PROBLEM - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet 
          is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
          https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems

Early warning alert

PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL -
                   wdqs-heavy-queries_8888: Servers wdqs1013.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but
                   pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but
                   pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled
                   https://wikitech.wikimedia.org/wiki/PyBal

Acceptance Criteria
  • Evaluate whether we can/should disable paging on the PyBal "hosts marked down but pooled" alert as well
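
To help with the evaluation, here is a minimal sketch of how the text of the early warning alert could be parsed to see which backends each pool reports as marked down but pooled. It is illustration only, not how Icinga or PyBal decide anything today. The names `down_but_pooled`, `should_page`, `pool_sizes` and the 50% threshold are all hypothetical, and the pool sizes in the usage note are made up rather than the real wdqs host counts.

```python
import re

# Text of the early warning alert quoted in the description, joined onto one line.
SAMPLE = (
    "PYBAL CRITICAL - CRITICAL - "
    "wdqs-heavy-queries_8888: Servers wdqs1013.eqiad.wmnet, wdqs1004.eqiad.wmnet "
    "are marked down but pooled: "
    "wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet "
    "are marked down but pooled: "
    "wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet "
    "are marked down but pooled"
)

# Matches "<pool>_<port>: Servers a, b are marked down but pooled".
POOL_RE = re.compile(
    r"(?P<pool>[\w-]+_\d+): Servers (?P<servers>.+?) are marked down but pooled"
)


def down_but_pooled(message):
    """Return {pool name: sorted list of servers marked down but pooled}."""
    text = " ".join(message.split())  # collapse line wraps from the alert paste
    return {
        m.group("pool"): sorted(s.strip() for s in m.group("servers").split(","))
        for m in POOL_RE.finditer(text)
    }


def should_page(message, pool_sizes, max_down_fraction=0.5):
    """Hypothetical policy: page if any pool has at least this fraction down.

    pool_sizes maps pool name to total backend count. It is an assumed input,
    and pools missing from it are ignored.
    """
    for pool, servers in down_but_pooled(message).items():
        size = pool_sizes.get(pool)
        if size and len(servers) / size >= max_down_fraction:
            return True
    return False
```

For example, `should_page(SAMPLE, {"wdqs_80": 4})` would page under this made-up threshold (2 of 4 backends down), while `should_page(SAMPLE, {"wdqs_80": 10})` would not. Whether any such threshold is appropriate for wdqs is exactly what this task needs to decide.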

Event Timeline

We'll talk about this in the Weds meeting this week

bking removed bking as the assignee of this task. Oct 12 2022, 1:46 PM
bking subscribed.
Gehel triaged this task as High priority. Nov 21 2022, 4:39 PM
Gehel edited projects, added Discovery-Search; removed Discovery-Search (Current work).
Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board.