
clouddb1017 wmf-pt-kill dying, unauthed connections and other mysteries.
Open, Needs Triage · Public

Description

Icinga alerted on replication lag on clouddb1017 (analytics s1) and on the wmf-pt-kill unit for the same section:
Notification Type: PROBLEM

Service: Check systemd state
Host: clouddb1017
Address: 10.64.32.61
State: CRITICAL

Date/Time: Thu Sept 9 02:28:37 UTC 2021

Notes URLs: https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state

Acknowledged by :

Additional Info:

CRITICAL - degraded: The following units failed: wmf-pt-kill@s1.service

Event Timeline

CRITICAL - degraded: The following units failed: wmf-pt-kill@s1.service
WARNING slave_io_state Slave_IO_Running: No
CRITICAL slave_sql_lag Replication lag: 983.20 seconds
WARNING slave_sql_state Slave_SQL_Running: No

For posterity, pt-kill failed with: Error getting SHOW PROCESSLIST: DBD::mysql::st execute failed: MySQL server has gone away [for Statement "SHOW FULL PROCESSLIST"] at /usr/bin/wmf-pt-kill line 6973.
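That failure mode is the kill loop holding a single connection that the server drops (plausible here given the memory pressure discussed below) and then dying instead of reconnecting. A minimal sketch of a reconnect-and-retry wrapper; the helper names are hypothetical and this is not the actual wmf-pt-kill code:

```python
import time

class ServerGoneAway(Exception):
    """Stand-in for the 'MySQL server has gone away' client error."""

def poll_processlist(connect, max_retries=3, backoff=5.0):
    """Fetch SHOW FULL PROCESSLIST, reopening the connection on failure
    instead of letting the whole kill loop die.
    `connect` returns a connection object exposing a .query(sql) method."""
    conn = connect()
    for attempt in range(max_retries):
        try:
            return conn.query("SHOW FULL PROCESSLIST")
        except ServerGoneAway:
            time.sleep(backoff)
            conn = connect()  # the old handle is dead; open a fresh one
    raise RuntimeError(f"processlist still failing after {max_retries} attempts")
```

With that wrapping, a one-off server hiccup costs one backoff interval rather than a dead systemd unit that pages.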

The logs for s1 are full of:

Aug 10 02:37:52 clouddb1017 mysqld[18655]: 2021-08-10  2:37:52 26333357 [Warning] Aborted connection 26333357 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:37:52 clouddb1017 mysqld[18655]: 2021-08-10  2:37:52 26333359 [Warning] Aborted connection 26333359 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:37:55 clouddb1017 mysqld[18655]: 2021-08-10  2:37:55 26333372 [Warning] Aborted connection 26333372 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:37:55 clouddb1017 mysqld[18655]: 2021-08-10  2:37:55 26333376 [Warning] Aborted connection 26333376 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:37:58 clouddb1017 mysqld[18655]: 2021-08-10  2:37:58 26333385 [Warning] Aborted connection 26333385 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:37:58 clouddb1017 mysqld[18655]: 2021-08-10  2:37:58 26333386 [Warning] Aborted connection 26333386 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:38:01 clouddb1017 mysqld[18655]: 2021-08-10  2:38:01 26333394 [Warning] Aborted connection 26333394 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:38:01 clouddb1017 mysqld[18655]: 2021-08-10  2:38:01 26333398 [Warning] Aborted connection 26333398 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:38:04 clouddb1017 mysqld[18655]: 2021-08-10  2:38:04 26333409 [Warning] Aborted connection 26333409 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:38:04 clouddb1017 mysqld[18655]: 2021-08-10  2:38:04 26333411 [Warning] Aborted connection 26333411 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:38:07 clouddb1017 mysqld[18655]: 2021-08-10  2:38:07 26333419 [Warning] Aborted connection 26333419 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug

And replication has stopped.
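When triaging a flood like this, tallying the warnings by user and host quickly shows whether it is one client (a proxy, a single misbehaving service) or many. A throwaway sketch that parses journal lines of the shape pasted above:

```python
import re
from collections import Counter

# Matches the mariadbd aborted-connection warning format seen in the journal.
ABORTED = re.compile(
    r"\[Warning\] Aborted connection \d+ to db: '(?P<db>[^']*)' "
    r"user: '(?P<user>[^']*)' host: '(?P<host>[^']*)'"
)

def count_aborted_by_host(lines):
    """Tally aborted-connection warnings per (user, host) pair."""
    counts = Counter()
    for line in lines:
        m = ABORTED.search(line)
        if m:
            counts[(m.group("user"), m.group("host"))] += 1
    return counts
```

Feeding it `journalctl -u mariadb@s1` output would show whether, as here, everything comes from a single source address.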

MariaDB [(none)]> show slave status \G
*************************** 1. row ***************************
                Slave_IO_State:
                   Master_Host: db1154.eqiad.wmnet
                   Master_User: repl
                   Master_Port: 3311
                 Connect_Retry: 60
               Master_Log_File: db1154-bin.000626
           Read_Master_Log_Pos: 129229134
                Relay_Log_File: clouddb1017-relay-bin.000630
                 Relay_Log_Pos: 304
         Relay_Master_Log_File: db1154-bin.000940
              Slave_IO_Running: No
             Slave_SQL_Running: No
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 4
               Relay_Log_Space: 286564727
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: Yes
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: NULL
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 0
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos:
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State:
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 0
1 row in set (0.000 sec)
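The alerts that fired above reduce to a few fields of this output: both replication threads stopped and Seconds_Behind_Master NULL. A simplified sketch of that check logic; the thresholds are illustrative, not the actual Icinga configuration:

```python
def replication_state(status, warn=60, crit=300):
    """Map SHOW SLAVE STATUS fields (dict of column -> value) to an
    Icinga-style state. Thresholds in seconds are illustrative."""
    if status.get("Slave_IO_Running") != "Yes" or \
       status.get("Slave_SQL_Running") != "Yes":
        return "CRITICAL", "replication threads not running"
    lag = status.get("Seconds_Behind_Master")
    if lag is None:  # NULL: lag is unknowable while threads are flapping
        return "CRITICAL", "replication lag unknown (NULL)"
    if lag >= crit:
        return "CRITICAL", f"Replication lag: {lag} seconds"
    if lag >= warn:
        return "WARNING", f"Replication lag: {lag} seconds"
    return "OK", f"Replication lag: {lag} seconds"
```

With the values pasted above (both threads "No"), this lands in the first CRITICAL branch, matching the slave_io_state/slave_sql_state alerts.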

Mentioned in SAL (#wikimedia-cloud) [2021-09-09T03:08:33Z] <andrewbogott> stopping maintain-dbusers on labstore1004 for help diagnosing T290630

Mentioned in SAL (#wikimedia-operations) [2021-09-09T03:12:29Z] <bstorm> attempting to start replication on clouddb1017 s1 T290630

Mentioned in SAL (#wikimedia-cloud) [2021-09-09T03:15:45Z] <bstorm> resetting swap on clouddb1017 T290630

OK, replication is caught up on clouddb1017, but wmf-pt-kill for s1 died again. We are still seeing an endless stream of those unauthenticated connections closing on, apparently, all instances.

Bstorm renamed this task from "clouddb1017 replag alerts" to "clouddb1017 wmf-pt-kill dying, unauthed connections and other mysteries." · Thu, Sep 9, 3:20 AM
Bstorm updated the task description.
Bstorm added a subscriber: Marostegui.

So this may simply be a matter of swap getting used heavily, for the reasons discussed in T281732#7115588.

Change 719758 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] wikireplicas: reduce the innodb_buffer_pool_size for s1 analytics

https://gerrit.wikimedia.org/r/719758
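Lowering innodb_buffer_pool_size trades cache hit rate for headroom: on a multi-instance host the pools plus per-connection memory must fit in RAM, or mysqld spills into swap and eventually into the crash territory seen below. A back-of-the-envelope sizing helper; all numbers here are hypothetical, not clouddb1017's actual values:

```python
def buffer_pool_budget(total_ram_gib, n_instances, os_reserve_gib=4,
                       per_instance_overhead_gib=2):
    """Rough per-instance innodb_buffer_pool_size (GiB) for a host running
    several mariadbd instances. Reserves headroom for the OS and for each
    instance's per-connection/session memory so the host stays out of swap.
    Hypothetical policy, not the wikireplicas puppet configuration."""
    usable = total_ram_gib - os_reserve_gib - n_instances * per_instance_overhead_gib
    if usable <= 0:
        raise ValueError("not enough RAM for this many instances")
    return usable / n_instances
```

E.g. a hypothetical 64 GiB host with two sections would get 28 GiB per pool under these assumptions.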

The pile of connections that keep showing up on all replicas somehow feels self-inflicted, but I haven't sorted out exactly where that's coming from yet.

Change 719758 merged by Bstorm:

[operations/puppet@production] wikireplicas: reduce the innodb_buffer_pool_size for s1

https://gerrit.wikimedia.org/r/719758

Found this from the time of these alerts, btw:

[Thu Sep  9 02:16:00 2021] mysqld[15794]: segfault at 18 ip 0000564af899b6f5 sp 00007f8cd56e2f20 error 4
[Thu Sep  9 02:16:00 2021] Code: 5d c3 90 66 90 55 48 89 e5 41 55 41 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 08 48 8b 07 ff 90 00 01 00 00 80 7b 76 00 75 3b <49> 8b 54 24 18 48 85 d2 74 0d 41 0f b6 8c 24 96 00 00 00 f7 d1 20

Mentioned in SAL (#wikimedia-cloud) [2021-09-09T22:03:18Z] <bstorm> restarted the prometheus-mysqld-exporter@s1 service as it was not working T290630

@Marostegui do you know where all those connections I mentioned in T290630#7341031 might be coming from? I'm still not sure. We stopped maintain-dbusers and the log entries kept coming, so it seems that at least that particular service is not causing them. From what I can tell it's fairly constant chatter on *all* the wikireplicas instances.

I wonder if they are some kind of messy health check from a proxy layer...
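That hypothesis would fit the log signature: a plain TCP health check opens a connection and drops it before the MySQL handshake completes, which mariadbd logs as exactly these 'unauthenticated'/'unconnected' aborted connections, and the ~3 s cadence in the warnings above looks like a check interval. A hypothetical haproxy.cfg fragment (invented backend/server names) illustrating the difference between a raw TCP check and HAProxy's protocol-aware mysql-check:

```
# Plain TCP check: connects and disconnects mid-handshake, so mariadbd
# logs "Aborted connection ... user: 'unauthenticated'" on every probe.
backend mariadb_tcp_check
    option tcp-check
    server clouddb1017 10.64.32.61:3311 check inter 3s

# mysql-check completes the handshake as a dedicated check user, so the
# probes are authenticated and quiet in the error log.
backend mariadb_mysql_check
    option mysql-check user haproxy_check
    server clouddb1017 10.64.32.61:3311 check inter 3s
```

If the dbproxy10xx configs use the first form, that would account for the constant chatter on every instance; this is a guess to verify against the actual puppet-managed config, not a confirmed cause.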

Change 720187 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbproxy1018: Depool clouddb1017

https://gerrit.wikimedia.org/r/720187

Change 720187 merged by Marostegui:

[operations/puppet@production] dbproxy1018: Depool clouddb1017

https://gerrit.wikimedia.org/r/720187

Change 720189 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbproxy1019: Depool clouddb1013

https://gerrit.wikimedia.org/r/720189

Change 720189 merged by Marostegui:

[operations/puppet@production] dbproxy1019: Depool clouddb1013

https://gerrit.wikimedia.org/r/720189

s1 restarted on clouddb1017 and clouddb1013.