db1065 paged for NRPE timeout
Closed, ResolvedPublic

Assigned To

Authored By

	Volans
	Oct 31 2016, 9:50 PM

Description

I took a quick look at db1065 right after it paged. The page was an NRPE timeout, but:

there were 2224 open connections
- many of which were not yet authenticated: | 13549850842 | unauthenticated user | connecting host | NULL | Connect | NULL | login
- 36 of which had been open for a few minutes (between 125 and 311 seconds) and were in the Copying to tmp table state
  - most of which were of the type:

SELECT /* ApiQueryContributors::execute  */  rev_page AS `page`,rev_user AS `user`,MAX(rev_user_text) AS `username`  FROM `revision` WHERE rev_page = '16283969' AND (rev_user != 0) AND ((rev_deleted & 4) = 0) GROUP BY rev_page, rev_user ORDER BY rev_user LIMIT 501

When I checked Icinga, many checks had notifications disabled for this host. That doesn't look right to me if the host is in production, as it appears to be from mediawiki-config (weight 50 and role API).
It could all be unrelated, given that there were a couple of other NRPE timeouts at the same time (times in UTC+1):

Mon 22:29:53   icinga-wm| PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
Mon 22:30:44   icinga-wm| PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
Mon 22:31:38   icinga-wm| PROBLEM - MariaDB Slave IO: s1 on db1065 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
Mon 22:31:53   icinga-wm| RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy
Mon 22:31:53   icinga-wm| RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
Mon 22:32:12          * | volans looking
Mon 22:32:29   icinga-wm| RECOVERY - MariaDB Slave IO: s1 on db1065 is OK: OK slave_io_state Slave_IO_Running: Yes

But given the anomalous number of connected clients and the long-running queries, I'm opening this for further investigation.
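
For anyone repeating this kind of triage, the connection breakdown above can be tallied with something like the following. This is only a rough sketch: it assumes the rows of SHOW FULL PROCESSLIST have already been fetched into dicts keyed by the standard column names, and the function name, the 120-second threshold and the sample rows are all made up for illustration (only the 125 and 311 second values and the unauthenticated row shape come from this report).

```python
from collections import Counter

def summarize_processlist(rows, slow_threshold=120):
    """Count connections per (Command, State) and list long-running queries.

    `rows` is assumed to be an iterable of dicts using the column names of
    SHOW FULL PROCESSLIST (Id, User, Host, db, Command, Time, State, Info).
    `slow_threshold` is a hypothetical cut-off in seconds.
    """
    rows = list(rows)
    by_state = Counter((r["Command"], r["State"]) for r in rows)
    # Time is NULL (None) for connections that are still logging in.
    slow = [
        r for r in rows
        if r["Command"] == "Query" and (r["Time"] or 0) >= slow_threshold
    ]
    return by_state, slow

# Illustrative rows only, not real data from db1065. The Ids, user names
# and hosts are placeholders.
SAMPLE_ROWS = [
    {"Id": 1, "User": "unauthenticated user", "Host": "connecting host",
     "db": None, "Command": "Connect", "Time": None, "State": "login",
     "Info": None},
    {"Id": 2, "User": "example", "Host": "example-host", "db": "enwiki",
     "Command": "Query", "Time": 125, "State": "Copying to tmp table",
     "Info": "SELECT /* ApiQueryContributors::execute */ ..."},
    {"Id": 3, "User": "example", "Host": "example-host", "db": "enwiki",
     "Command": "Query", "Time": 311, "State": "Copying to tmp table",
     "Info": "SELECT /* ApiQueryContributors::execute */ ..."},
]

if __name__ == "__main__":
    by_state, slow = summarize_processlist(SAMPLE_ROWS)
    for (command, state), count in by_state.most_common():
        print(f"{count:5d}  {command}  {state}")
    print(f"{len(slow)} queries at or above the threshold")
```

Grouping by Command and State makes it quick to see whether the connection count is dominated by clients stuck in login or by queries stuck in a single state such as Copying to tmp table.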

Related Objects

Mentioned Here: T132416: Rampant differences in indexes on enwiki.revision across the DB cluster
T149421: Long running mediawiki web requests impacts service availability, specially databases

Event Timeline

Volans created this task. Oct 31 2016, 9:50 PM

Restricted Application added a subscriber: Aklapper. Oct 31 2016, 9:50 PM

There have been several API issues in the last few weeks.

While the report is certainly helpful, I am resolving this because the long-term fixes have been identified and are being worked on in both T132416 and T149421. While the current state is far from ideal, it rarely has user impact (redundancy and automatic depooling mitigate issues until the underlying problems are solved).
