
clouddb1017 wmf-pt-kill dying, unauthed connections and other mysteries.
Open, Needs Triage · Public

Description

Icinga alerted on replication lag on clouddb1017 (analytics s1) and on the wmf-pt-kill unit for the same section:
Notification Type: PROBLEM

Service: Check systemd state
Host: clouddb1017
Address: 10.64.32.61
State: CRITICAL

Date/Time: Thu Sept 9 02:28:37 UTC 2021

Notes URLs: https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state

Acknowledged by :

Additional Info:

CRITICAL - degraded: The following units failed: wmf-pt-kill@s1.service

Event Timeline

CRITICAL - degraded: The following units failed: wmf-pt-kill@s1.service
WARNING slave_io_state Slave_IO_Running: No
CRITICAL slave_sql_lag Replication lag: 983.20 seconds
WARNING slave_sql_state Slave_SQL_Running: No

For posterity, pt-kill failed with: Error getting SHOW PROCESSLIST: DBD::mysql::st execute failed: MySQL server has gone away [for Statement "SHOW FULL PROCESSLIST"] at /usr/bin/wmf-pt-kill line 6973.
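That failure mode is the kill loop holding a single connection that the server drops (plausible here given the memory pressure discussed below) and then dying instead of reconnecting. A minimal sketch of a reconnect-and-retry wrapper; the helper names are hypothetical and this is not the actual wmf-pt-kill code:

```python
import time

class ServerGoneAway(Exception):
    """Stand-in for the 'MySQL server has gone away' client error."""

def poll_processlist(connect, max_retries=3, backoff=5.0):
    """Fetch SHOW FULL PROCESSLIST, reopening the connection on failure
    instead of letting the whole kill loop die.
    `connect` returns a connection object exposing a .query(sql) method."""
    conn = connect()
    for attempt in range(max_retries):
        try:
            return conn.query("SHOW FULL PROCESSLIST")
        except ServerGoneAway:
            time.sleep(backoff)
            conn = connect()  # the old handle is dead; open a fresh one
    raise RuntimeError(f"processlist still failing after {max_retries} attempts")
```

With that wrapping, a one-off server hiccup costs one backoff interval rather than a dead systemd unit that pages.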

The logs for s1 are full of:

Aug 10 02:37:52 clouddb1017 mysqld[18655]: 2021-08-10  2:37:52 26333357 [Warning] Aborted connection 26333357 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:37:52 clouddb1017 mysqld[18655]: 2021-08-10  2:37:52 26333359 [Warning] Aborted connection 26333359 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:37:55 clouddb1017 mysqld[18655]: 2021-08-10  2:37:55 26333372 [Warning] Aborted connection 26333372 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:37:55 clouddb1017 mysqld[18655]: 2021-08-10  2:37:55 26333376 [Warning] Aborted connection 26333376 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:37:58 clouddb1017 mysqld[18655]: 2021-08-10  2:37:58 26333385 [Warning] Aborted connection 26333385 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:37:58 clouddb1017 mysqld[18655]: 2021-08-10  2:37:58 26333386 [Warning] Aborted connection 26333386 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:38:01 clouddb1017 mysqld[18655]: 2021-08-10  2:38:01 26333394 [Warning] Aborted connection 26333394 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:38:01 clouddb1017 mysqld[18655]: 2021-08-10  2:38:01 26333398 [Warning] Aborted connection 26333398 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:38:04 clouddb1017 mysqld[18655]: 2021-08-10  2:38:04 26333409 [Warning] Aborted connection 26333409 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:38:04 clouddb1017 mysqld[18655]: 2021-08-10  2:38:04 26333411 [Warning] Aborted connection 26333411 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug 10 02:38:07 clouddb1017 mysqld[18655]: 2021-08-10  2:38:07 26333419 [Warning] Aborted connection 26333419 to db: 'unconnected' user: 'unauthenticated' host: '10.64.37.27' (This conn
Aug

And replication has stopped.
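When triaging a flood like this, tallying the warnings by user and host quickly shows whether it is one client (a proxy, a single misbehaving service) or many. A throwaway sketch that parses journal lines of the shape pasted above:

```python
import re
from collections import Counter

# Matches the mariadbd aborted-connection warning format seen in the journal.
ABORTED = re.compile(
    r"\[Warning\] Aborted connection \d+ to db: '(?P<db>[^']*)' "
    r"user: '(?P<user>[^']*)' host: '(?P<host>[^']*)'"
)

def count_aborted_by_host(lines):
    """Tally aborted-connection warnings per (user, host) pair."""
    counts = Counter()
    for line in lines:
        m = ABORTED.search(line)
        if m:
            counts[(m.group("user"), m.group("host"))] += 1
    return counts
```

Feeding it `journalctl -u mariadb@s1` output would show whether, as here, everything comes from a single source address.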

MariaDB [(none)]> show slave status \G
*************************** 1. row ***************************
                Slave_IO_State:
                   Master_Host: db1154.eqiad.wmnet
                   Master_User: repl
                   Master_Port: 3311
                 Connect_Retry: 60
               Master_Log_File: db1154-bin.000626
           Read_Master_Log_Pos: 129229134
                Relay_Log_File: clouddb1017-relay-bin.000630
                 Relay_Log_Pos: 304
         Relay_Master_Log_File: db1154-bin.000940
              Slave_IO_Running: No
             Slave_SQL_Running: No
               Replicate_Do_DB:
           Replicate_Ignore_DB:
            Replicate_Do_Table:
        Replicate_Ignore_Table:
       Replicate_Wild_Do_Table:
   Replicate_Wild_Ignore_Table:
                    Last_Errno: 0
                    Last_Error:
                  Skip_Counter: 0
           Exec_Master_Log_Pos: 4
               Relay_Log_Space: 286564727
               Until_Condition: None
                Until_Log_File:
                 Until_Log_Pos: 0
            Master_SSL_Allowed: Yes
            Master_SSL_CA_File:
            Master_SSL_CA_Path:
               Master_SSL_Cert:
             Master_SSL_Cipher:
                Master_SSL_Key:
         Seconds_Behind_Master: NULL
 Master_SSL_Verify_Server_Cert: No
                 Last_IO_Errno: 0
                 Last_IO_Error:
                Last_SQL_Errno: 0
                Last_SQL_Error:
   Replicate_Ignore_Server_Ids:
              Master_Server_Id: 0
                Master_SSL_Crl:
            Master_SSL_Crlpath:
                    Using_Gtid: Slave_Pos
                   Gtid_IO_Pos:
       Replicate_Do_Domain_Ids:
   Replicate_Ignore_Domain_Ids:
                 Parallel_Mode: conservative
                     SQL_Delay: 0
           SQL_Remaining_Delay: NULL
       Slave_SQL_Running_State:
              Slave_DDL_Groups: 0
Slave_Non_Transactional_Groups: 0
    Slave_Transactional_Groups: 0
1 row in set (0.000 sec)
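The alerts that fired above reduce to a few fields of this output: both replication threads stopped and Seconds_Behind_Master NULL. A simplified sketch of that check logic; the thresholds are illustrative, not the actual Icinga configuration:

```python
def replication_state(status, warn=60, crit=300):
    """Map SHOW SLAVE STATUS fields (dict of column -> value) to an
    Icinga-style state. Thresholds in seconds are illustrative."""
    if status.get("Slave_IO_Running") != "Yes" or \
       status.get("Slave_SQL_Running") != "Yes":
        return "CRITICAL", "replication threads not running"
    lag = status.get("Seconds_Behind_Master")
    if lag is None:  # NULL: lag is unknowable while threads are flapping
        return "CRITICAL", "replication lag unknown (NULL)"
    if lag >= crit:
        return "CRITICAL", f"Replication lag: {lag} seconds"
    if lag >= warn:
        return "WARNING", f"Replication lag: {lag} seconds"
    return "OK", f"Replication lag: {lag} seconds"
```

With the values pasted above (both threads "No"), this lands in the first CRITICAL branch, matching the slave_io_state/slave_sql_state alerts.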

Mentioned in SAL (#wikimedia-cloud) [2021-09-09T03:08:33Z] <andrewbogott> stopping maintain-dbusers on labstore1004 for help diagnosing T290630

Mentioned in SAL (#wikimedia-operations) [2021-09-09T03:12:29Z] <bstorm> attempting to start replication on clouddb1017 s1 T290630

Mentioned in SAL (#wikimedia-cloud) [2021-09-09T03:15:45Z] <bstorm> resetting swap on clouddb1017 T290630

OK, replication is caught up on clouddb1017, but wmf-pt-kill for s1 died again. We are still seeing an endless stream of those unauthenticated connections closing on, apparently, all instances.

Bstorm renamed this task from "clouddb1017 replag alerts" to "clouddb1017 wmf-pt-kill dying, unauthed connections and other mysteries." · Thu, Sep 9, 3:20 AM
Bstorm updated the task description.
Bstorm added a subscriber: Marostegui.

So this may simply be a matter of swap getting used heavily, for the reasons discussed in T281732#7115588.

Change 719758 had a related patch set uploaded (by Bstorm; author: Bstorm):

[operations/puppet@production] wikireplicas: reduce the innodb_buffer_pool_size for s1 analytics

https://gerrit.wikimedia.org/r/719758
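Lowering innodb_buffer_pool_size trades cache hit rate for headroom: on a multi-instance host the pools plus per-connection memory must fit in RAM, or mysqld spills into swap and eventually into the crash territory seen below. A back-of-the-envelope sizing helper; all numbers here are hypothetical, not clouddb1017's actual values:

```python
def buffer_pool_budget(total_ram_gib, n_instances, os_reserve_gib=4,
                       per_instance_overhead_gib=2):
    """Rough per-instance innodb_buffer_pool_size (GiB) for a host running
    several mariadbd instances. Reserves headroom for the OS and for each
    instance's per-connection/session memory so the host stays out of swap.
    Hypothetical policy, not the wikireplicas puppet configuration."""
    usable = total_ram_gib - os_reserve_gib - n_instances * per_instance_overhead_gib
    if usable <= 0:
        raise ValueError("not enough RAM for this many instances")
    return usable / n_instances
```

E.g. a hypothetical 64 GiB host with two sections would get 28 GiB per pool under these assumptions.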

The pile of connections that keep showing up on all replicas somehow feels self-inflicted, but I haven't sorted out exactly where that's coming from yet.

Change 719758 merged by Bstorm:

[operations/puppet@production] wikireplicas: reduce the innodb_buffer_pool_size for s1

https://gerrit.wikimedia.org/r/719758

Found this from the time of these alerts, btw:

[Thu Sep  9 02:16:00 2021] mysqld[15794]: segfault at 18 ip 0000564af899b6f5 sp 00007f8cd56e2f20 error 4
[Thu Sep  9 02:16:00 2021] Code: 5d c3 90 66 90 55 48 89 e5 41 55 41 89 d5 41 54 49 89 f4 53 48 89 fb 48 83 ec 08 48 8b 07 ff 90 00 01 00 00 80 7b 76 00 75 3b <49> 8b 54 24 18 48 85 d2 74 0d 41 0f b6 8c 24 96 00 00 00 f7 d1 20

Mentioned in SAL (#wikimedia-cloud) [2021-09-09T22:03:18Z] <bstorm> restarted the prometheus-mysqld-exporter@s1 service as it was not working T290630

@Marostegui do you know where all those connections I mentioned in T290630#7341031 might be coming from? I'm still not sure. We stopped maintain-dbusers and the log entries kept coming, so it seems that at least that particular service is not causing them. From what I can tell it's fairly constant chatter on *all* the wikireplicas instances.

I wonder if they are some kind of messy health check from a proxy layer...
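That hypothesis would fit the log signature: a plain TCP health check opens a connection and drops it before the MySQL handshake completes, which mariadbd logs as exactly these 'unauthenticated'/'unconnected' aborted connections, and the ~3 s cadence in the warnings above looks like a check interval. A hypothetical haproxy.cfg fragment (invented backend/server names) illustrating the difference between a raw TCP check and HAProxy's protocol-aware mysql-check:

```
# Plain TCP check: connects and disconnects mid-handshake, so mariadbd
# logs "Aborted connection ... user: 'unauthenticated'" on every probe.
backend mariadb_tcp_check
    option tcp-check
    server clouddb1017 10.64.32.61:3311 check inter 3s

# mysql-check completes the handshake as a dedicated check user, so the
# probes are authenticated and quiet in the error log.
backend mariadb_mysql_check
    option mysql-check user haproxy_check
    server clouddb1017 10.64.32.61:3311 check inter 3s
```

If the dbproxy10xx configs use the first form, that would account for the constant chatter on every instance; this is a guess to verify against the actual puppet-managed config, not a confirmed cause.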

Change 720187 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbproxy1018: Depool clouddb1017

https://gerrit.wikimedia.org/r/720187

Change 720187 merged by Marostegui:

[operations/puppet@production] dbproxy1018: Depool clouddb1017

https://gerrit.wikimedia.org/r/720187

Change 720189 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] dbproxy1019: Depool clouddb1013

https://gerrit.wikimedia.org/r/720189

Change 720189 merged by Marostegui:

[operations/puppet@production] dbproxy1019: Depool clouddb1013

https://gerrit.wikimedia.org/r/720189

s1 restarted on clouddb1017 and clouddb1013.