I took a quick look at db1065 right after it paged. The page was an NRPE timeout, but:
- there were 2224 open connections
- of which many were not yet fully connected (still authenticating): | 13549850842 | unauthenticated user | connecting host | NULL | Connect | NULL | login
- of which 36 connections had been open for a few minutes (between 125 and 311 seconds) and were in the Copying to tmp table state
- of which most were of the type
SELECT /* ApiQueryContributors::execute */ rev_page AS `page`,rev_user AS `user`,MAX(rev_user_text) AS `username` FROM `revision` WHERE rev_page = '16283969' AND (rev_user != 0) AND ((rev_deleted & 4) = 0) GROUP BY rev_page, rev_user ORDER BY rev_user LIMIT 501
- when I checked Icinga, many checks for this host had notifications disabled. That doesn't look right to me if the host is in production, as it appears to be from mediawiki-config (weight 50 and role API)
- it could all be unrelated, given that a couple of other NRPE timeouts happened at the same time (times in UTC+1):
Mon 22:29:53 icinga-wm| PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
Mon 22:30:44 icinga-wm| PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
Mon 22:31:38 icinga-wm| PROBLEM - MariaDB Slave IO: s1 on db1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
Mon 22:31:53 icinga-wm| RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
Mon 22:31:53 icinga-wm| RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
Mon 22:32:12 * | volans looking
Mon 22:32:29 icinga-wm| RECOVERY - MariaDB Slave IO: s1 on db1065 is OK: OK slave_io_state Slave_IO_Running: Yes
But given the anomalous number of connected clients and the long-running queries, I'm opening this for further investigation.
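
For whoever picks this up, here is a minimal sketch of how counts like the ones above could be reproduced from a saved processlist dump. It assumes pipe-delimited rows in the same column order as the row quoted above (Id, User, Host, db, Command, Time, State). The function names, the 120 second threshold, and the second and third sample rows are hypothetical and only there for illustration; they are not taken from db1065.

```python
from collections import Counter

# Assumed column order, matching the row quoted in this task.
COLUMNS = ("id", "user", "host", "db", "command", "time", "state")


def parse_row(line):
    """Split one pipe-delimited processlist row into a dict."""
    text = line.strip()
    if text.startswith("|"):
        text = text[1:]
    # Limit the split so pipes inside a trailing Info column are left alone.
    cells = [cell.strip() for cell in text.split("|", len(COLUMNS))]
    row = dict(zip(COLUMNS, cells))
    # Time is NULL for connections that are still authenticating.
    raw_time = row.get("time", "")
    row["time"] = int(raw_time) if raw_time.isdigit() else None
    return row


def summarize(lines, slow_after=120):
    """Count connections per (command, state) and collect long-running queries."""
    rows = [parse_row(line) for line in lines if line.strip()]
    by_state = Counter((row.get("command"), row.get("state")) for row in rows)
    slow = [
        row for row in rows
        if row.get("command") == "Query" and (row["time"] or 0) >= slow_after
    ]
    return by_state, slow


# First row is the one quoted above; the other two are made-up examples.
SAMPLE = [
    "| 13549850842 | unauthenticated user | connecting host | NULL | Connect | NULL | login",
    "| 1 | exampleuser | host1:1234 | exampledb | Query | 311 | Copying to tmp table",
    "| 2 | exampleuser | host2:1234 | exampledb | Sleep | 3 | idle",
]

by_state, slow = summarize(SAMPLE)
```

Here `by_state` maps each (command, state) pair to a connection count, and `slow` lists the queries that have been running for at least `slow_after` seconds, which would make it easier to compare a future occurrence against the numbers reported here.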