
Should wdqs LVS checks page
Open, Needs Triage, Public

Description

During a recent incident, we were not paged until the entire service was unavailable, even though there were early warning signs of the issue in Icinga. This task is to evaluate whether the early alerts should also have paged or, alternatively, whether we should stop paging altogether for this service.

Alerts which started the incident

<+icinga-wm> PROBLEM - LVS wdqs eqiad port 80/tcp - Wikidata Query Service IPv4 #page on wdqs.svc.eqiad.wmnet 
          is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
          https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems

Early warning alert

PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL -
                   wdqs-heavy-queries_8888: Servers wdqs1013.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but
                   pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but
                   pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled
                   https://wikitech.wikimedia.org/wiki/PyBal
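
To make the evaluation concrete, here is a rough sketch of how the early warning text could be turned into a per-pool count of servers that are marked down but pooled, and then checked against a paging threshold. This is only an illustration, not an existing Icinga or PyBal feature: the function names `parse_pybal_alert` and `should_page`, the 0.5 threshold, and the pool size of 4 used in the usage note are all assumptions, since the task does not state how many backends each pool has.

```python
import re

# Early warning text from this task, with the IRC line wrapping kept as pasted.
ALERT = """PYBAL CRITICAL - CRITICAL -
  wdqs-heavy-queries_8888: Servers wdqs1013.eqiad.wmnet, wdqs1004.eqiad.wmnet are marked down but
  pooled: wdqs-ssl_443: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but
  pooled: wdqs_80: Servers wdqs1004.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled"""

# Matches one "<pool>_<port>: Servers a, b are marked down but pooled" clause.
CLAUSE = re.compile(r"([\w-]+_\d+): Servers (.+?) are marked down but pooled")


def parse_pybal_alert(message):
    """Return {pool: sorted list of servers marked down but pooled}."""
    text = " ".join(message.split())  # collapse the line wrapping
    return {
        pool: sorted(name.strip() for name in servers.split(","))
        for pool, servers in CLAUSE.findall(text)
    }


def should_page(down_count, pool_size, max_down_fraction=0.5):
    """Hypothetical rule: page once this fraction of a pool is marked down."""
    return pool_size > 0 and down_count / pool_size >= max_down_fraction
```

For example, `parse_pybal_alert(ALERT)["wdqs_80"]` yields the two servers named in the alert. With an assumed pool size of 4, `should_page(2, 4)` would page, while the same two servers in a pool of 10 would not. The real pool sizes and an acceptable threshold would have to be settled as part of this task.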