There have been at least 2 stalls recently on db1075. The first happened just after the datacenter switchover codfw->eqiad, so we didn't give it much importance, assuming it was possibly caused by cold caches.
A new one happened around 2020-10-29 11:01, which also happened to coincide with a connectivity issue. However, while those 2 events (sudden increase in traffic) could be a trigger, the issue may not be directly related to them.
Checking tendril, I see a heavy query hitting at the time of the last event:
Hits  Tmax  Tavg  Tsum    Hosts   Users     Schemas
2425  21    8     21,210  db1075  wikiuser  ruwikinews

SELECT /* DynamicPageListHooks::renderDynamicPageList */ page_namespace, page_title, c1.cl_timestamp
FROM `page`
LEFT JOIN `flaggedpages` ON ((page_id = fp_page_id))
INNER JOIN `categorylinks` `c1` ON ((page_id = c1.cl_from) AND (c1.cl_to='????????????'))
INNER JOIN `categorylinks` `c2` ON ((page_id = c2.cl_from) AND (c2.cl_to='?????????'))
LEFT OUTER JOIN `categorylinks` `c3` ON ((page_id = c3.cl_from) AND (c3.cl_to='??_???????????'))
LEFT OUTER JOIN `categorylinks` `c4` ON ((page_id = c4.cl_from) AND (c4.cl_to='?????????_???????_??_?????'))
LEFT OUTER JOIN `categorylinks` `c5` ON ((page_id = c5.cl_from) AND (c5.cl_to='????????_???????'))
LEFT OUTER JOIN `categorylinks` `c6` ON ((page_id = c6.cl_from) AND (c6.cl_to='???????????_???????'))
WHERE page_namespace = 0 AND (fp_stable IS NOT NULL) AND page_is_redirect = 0 AND c3.cl_to IS NULL AND c4.cl_to IS NULL AND c5.cl_to IS NULL AND c6.cl_to IS NULL
ORDER BY page_id DESC LIMIT 18
/* a66f1dbef76187d62eb21f46256602f1 db1075 ruwikinews 2s */
Only ruwikinews activity was detected as slow at the time of the last incident, which could point to an inefficient code path.
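To make the shape of the query above easier to reason about, here is a minimal sketch of the same pattern on a toy dataset: required categories as INNER JOINs, excluded categories as LEFT OUTER JOIN ... IS NULL, then ORDER BY page_id DESC with a LIMIT. It is an illustration only and makes several assumptions: it uses SQLite in memory rather than MariaDB, the table definitions are cut down to the columns the query touches, only one excluded category is used instead of four, and the category names (News, Published, Archived) and the function names are hypothetical. It shows what the query selects, not why it is slow on db1075.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a cut-down, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_namespace INTEGER,
            page_title TEXT,
            page_is_redirect INTEGER
        );
        CREATE TABLE flaggedpages (fp_page_id INTEGER PRIMARY KEY, fp_stable INTEGER);
        CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT, cl_timestamp TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO page VALUES (?, 0, ?, 0)",
        [(1, "A"), (2, "B"), (3, "C"), (4, "D"), (5, "E")],
    )
    # Page 4 has no flaggedpages row, so fp_stable IS NOT NULL filters it out.
    conn.executemany(
        "INSERT INTO flaggedpages VALUES (?, ?)",
        [(1, 10), (2, 20), (3, 30), (5, 50)],
    )
    links = [
        (1, "News"), (1, "Published"),
        (2, "News"), (2, "Published"), (2, "Archived"),  # excluded category
        (3, "News"),                                      # missing "Published"
        (4, "News"), (4, "Published"),
        (5, "News"), (5, "Published"),
    ]
    conn.executemany(
        "INSERT INTO categorylinks VALUES (?, ?, '2020-10-29')", links
    )
    return conn


def run_dpl_like_query(conn, include_a, include_b, exclude):
    """Pages in both include categories, not in the exclude category."""
    sql = """
        SELECT page_namespace, page_title, c1.cl_timestamp
        FROM page
        LEFT JOIN flaggedpages ON page_id = fp_page_id
        INNER JOIN categorylinks c1 ON page_id = c1.cl_from AND c1.cl_to = ?
        INNER JOIN categorylinks c2 ON page_id = c2.cl_from AND c2.cl_to = ?
        LEFT OUTER JOIN categorylinks c3 ON page_id = c3.cl_from AND c3.cl_to = ?
        WHERE page_namespace = 0
          AND fp_stable IS NOT NULL
          AND page_is_redirect = 0
          AND c3.cl_to IS NULL
        ORDER BY page_id DESC
        LIMIT 18
    """
    return conn.execute(sql, (include_a, include_b, exclude)).fetchall()


if __name__ == "__main__":
    toy = build_toy_db()
    for row in run_dpl_like_query(toy, "News", "Published", "Archived"):
        print(row)
```

In this toy data only pages E and A satisfy every condition. Running EXPLAIN on the real query on db1075 would still be needed to see which join order and indexes the optimizer actually picks.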
Looking at db1075, high InnoDB activity was seen at the time of the issue:
with spikes in rows read, but no large change in user requests.
My working theory is that an infrequent but badly optimized query is overloading the server. Making this task private, as it currently contains security-sensitive information as well as an input path to DoS the databases.