
PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL
Open, High, Public

Description

Filing this for production errors and pages today

7:29:04 PM <icinga-wm> PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2313.codfw.wmnet, mw2409.codfw.wmnet, mw2438.codfw.wmnet, mw2331.codfw.wmnet, mw2392.codfw.wmnet, mw2414.codfw.wmnet, mw2312.codfw.wmnet, mw2375.codfw.wmnet, mw2338.codfw.wmnet, mw2325.codfw.wmnet, mw2393.codfw.wmnet, mw2314.codfw.wmnet, mw2386.codfw.wmnet, mw2408.codfw.wmnet, mw2387.codfw.wmnet, mw2269.codfw.wmnet, mw
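To see which backends PyBal itself still considers up/pooled (rather than going by the truncated alert text), something like the sketch below could query the instrumentation interface on the LVS host. The port (9090) and URL layout are assumptions about the instrumentation config, not taken from this task.

```python
# Sketch: ask PyBal's instrumentation interface on the LVS host for the state
# of the affected pool. Port and path are assumptions; output format may differ.
import requests

LVS_HOST = "lvs2014.codfw.wmnet"
POOL = "appservers-https_443"

resp = requests.get(f"http://{LVS_HOST}:9090/pools/{POOL}", timeout=5)
resp.raise_for_status()
print(resp.text)  # expected: one line per backend with its enabled/up/pooled state
```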

500 error spike
https://grafana.wikimedia.org/d/O_OXJyTVk/home-w-wiki-status?orgId=1&from=1707353807985&to=1707357407985

Screenshot 2024-02-07 at 7.36.19 PM.png (319 KB)

Event Timeline

lmata triaged this task as High priority. Feb 8 2024, 1:37 AM
lmata updated the task description.

Took a quick look at this - seems like a large latency excursion on the appserver side (https://grafana.wikimedia.org/goto/_0N8q7hIk?orgId=1), which correlates with a large spike in reads to db section s6 (https://grafana.wikimedia.org/goto/mAWw372Iz?orgId=1). Seems to have recovered fairly promptly, whatever this was.
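To double-check that correlation outside Grafana, a quick script could pull both series from Prometheus over the incident window and compare them side by side. This is a minimal sketch; the endpoint and metric names below are placeholders, not the exact series behind those dashboards.

```python
# Sketch: fetch appserver latency and s6 read rate over the incident window
# from Prometheus. Endpoint URL and metric/query names are illustrative
# assumptions, not the actual series used in the linked dashboards.
import requests

PROM = "http://prometheus.example.org/api/v1/query_range"  # hypothetical endpoint
WINDOW = {"start": "2024-02-08T00:15:00Z", "end": "2024-02-08T01:15:00Z", "step": "30s"}

def fetch(query):
    resp = requests.get(PROM, params={"query": query, **WINDOW}, timeout=10)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Hypothetical queries: appserver p99 latency and s6 read QPS.
latency = fetch('histogram_quantile(0.99, sum by (le) (rate(appserver_request_duration_seconds_bucket[5m])))')
s6_reads = fetch('sum(rate(mysql_global_status_queries{section="s6"}[5m]))')

print(latency)
print(s6_reads)
```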

Edit: This has the feel of a thundering herd on cache miss (or I guess invalidation, given the synchronized behavior). I see a very large correlated spike in misses for a specific key group (flaggedrevs_includes_synced, though there may be others an order of magnitude down that are harder to see) in https://grafana.wikimedia.org/goto/CcgUe7hIz?orgId=1, but that of course does not imply causation.
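For context on the thundering-herd hypothesis: the usual mitigation is to let only one worker regenerate a key on miss or invalidation while the other requests keep serving the previous (stale) value instead of all hitting the database at once. The sketch below is a generic Python illustration of that pattern, not MediaWiki's actual WANObjectCache implementation or the FlaggedRevs code path.

```python
# Minimal sketch of cache-stampede protection: on a miss (or invalidation),
# only the request that wins a short-lived regeneration lock recomputes the
# value; everyone else serves the stale copy rather than hammering the DB.
# Generic illustration only, not MediaWiki's WANObjectCache.
import time

cache = {}   # key -> (value, expires_at)
locks = {}   # key -> regen-lock expiry timestamp

def get_with_regen(key, ttl, regen, lock_ttl=10):
    now = time.time()
    entry = cache.get(key)
    if entry and entry[1] > now:
        return entry[0]                      # fresh hit

    if locks.get(key, 0) < now:
        locks[key] = now + lock_ttl          # we won the regeneration lock
        try:
            value = regen()                  # expensive recomputation (e.g. DB reads)
            cache[key] = (value, now + ttl)
            return value
        finally:
            locks.pop(key, None)

    if entry:
        return entry[0]                      # someone else is regenerating: serve stale
    return regen()                           # no stale copy at all: fall through
```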