db1065 paged for NRPE timeout
Closed, Resolved (Public)

Description

I took a quick look at db1065 right after it paged. The page was an NRPE timeout, but:

  • there were 2224 open connections
    • of which many were not yet connected: | 13549850842 | unauthenticated user | connecting host | NULL | Connect | NULL | login
    • of which 36 connections had been open for a few minutes (between 125 and 311 seconds) and were in the Copying to tmp table state
      • of which most were of the type
SELECT /* ApiQueryContributors::execute  */  rev_page AS `page`,rev_user AS `user`,MAX(rev_user_text) AS `username`  FROM `revision` WHERE rev_page = '16283969' AND (rev_user != 0) AND ((rev_deleted & 4) = 0) GROUP BY rev_page, rev_user ORDER BY rev_user LIMIT 501
  • when I checked Icinga, many checks for this host had notifications disabled; that doesn't look right to me if the host is in production, as it appears to be from mediawiki-config (weight 50 and role API)
  • it could all be unrelated, given that there were a couple of other NRPE timeouts at the same time (times in UTC+1):
Mon 22:29:53   icinga-wm| PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
Mon 22:30:44   icinga-wm| PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
Mon 22:31:38   icinga-wm| PROBLEM - MariaDB Slave IO: s1 on db1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
Mon 22:31:53   icinga-wm| RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
Mon 22:31:53   icinga-wm| RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
Mon 22:32:12          * | volans looking
Mon 22:32:29   icinga-wm| RECOVERY - MariaDB Slave IO: s1 on db1065 is OK: OK slave_io_state Slave_IO_Running: Yes

But given the anomalous number of connected clients and the long-running queries, I'm opening this for further investigation.
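
For anyone repeating the triage, here is a rough sketch of how the processlist observations above could be tallied from saved SHOW PROCESSLIST output. It assumes pipe-delimited rows in the standard column order (Id, User, Host, db, Command, Time, State, Info). Only the first sample row comes from this task; the other rows, the `wikiuser` account, the addresses, and the 120-second threshold are made up for illustration.

```python
from collections import Counter

# Standard SHOW PROCESSLIST column order (assumed; MariaDB may append Progress).
COLUMNS = ("id", "user", "host", "db", "command", "time", "state", "info")


def parse_row(line):
    """Split one pipe-delimited processlist row into a dict."""
    body = line.strip()
    if body.startswith("|"):
        body = body[1:]
    if body.endswith("|"):
        body = body[:-1]
    # Limit the split so a '|' inside the query text stays in the Info cell.
    cells = [c.strip() for c in body.split("|", len(COLUMNS) - 1)]
    row = dict(zip(COLUMNS, cells))
    # Time is NULL for connections that have not finished logging in.
    raw_time = row.get("time")
    row["time"] = int(raw_time) if raw_time and raw_time.isdigit() else None
    return row


def summarize(lines, slow_after=120):
    """Count total, unauthenticated, and slow 'Copying to tmp table' rows."""
    rows = [parse_row(l) for l in lines if l.strip().startswith("|")]
    unauthenticated = sum(1 for r in rows if r.get("user") == "unauthenticated user")
    slow_tmp = sum(
        1
        for r in rows
        if r.get("state") == "Copying to tmp table"
        and r["time"] is not None
        and r["time"] >= slow_after
    )
    return {
        "total": len(rows),
        "unauthenticated": unauthenticated,
        "slow_tmp_table": slow_tmp,
        "by_state": Counter(r.get("state") for r in rows),
    }


# First row is copied from this task; the rest are hypothetical.
SAMPLE = [
    "| 13549850842 | unauthenticated user | connecting host | NULL | Connect | NULL | login",
    "| 101 | wikiuser | 10.0.0.1:5000 | enwiki | Query | 311 | Copying to tmp table | SELECT ...",
    "| 102 | wikiuser | 10.0.0.2:5000 | enwiki | Query | 3 | Sending data | SELECT ...",
    "| 103 | wikiuser | 10.0.0.3:5000 | enwiki | Query | 125 | Copying to tmp table | SELECT ... |",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

The threshold is only a starting point; the queries seen here had been running for between 125 and 311 seconds, so anything near two minutes would have caught them.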

Event Timeline

jcrespo claimed this task.

There have been several API issues in recent weeks.

While the report is certainly helpful, I am resolving this because the long-term fixes have been identified and are being worked on in both T132416 and T149421. While the current state is far from ideal, it rarely has user impact (redundancy and automatic depooling mitigate issues until the underlying problems are solved).