
db1140 s2 replication disconnected temporarily
Closed, ResolvedPublic

Description

db1140 is a backup source host:

[07:22:56]  <+icinga-wm>	PROBLEM - MariaDB Slave IO: s2 on db1140 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db1122.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error: error:00000000:lib(0):func(0):reason(0)

Looks like there was a glitch:

Feb 26 07:18:08 db1140 mysqld[3685]: 2020-02-26  7:18:08 139851380549376 [ERROR] Error reading packet from server: Lost connection to MySQL server during query (server_errno=2013)
Feb 26 07:18:08 db1140 mysqld[3685]: 2020-02-26  7:18:08 139851380549376 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'db1122-bin.001369' at position 786291425; GTID position '171966670-171966670-2410812544,0-180359173-4858865027,180359173-180359173-70817914,171966574-171966574-2221092918,171970567-17
Feb 26 07:18:11 db1140 mysqld[3685]: 2020-02-26  7:18:11 139851380549376 [ERROR] Slave I/O: error reconnecting to master 'repl@db1122.eqiad.wmnet:3306' - retry-time: 60  maximum-retries: 86400  message: SSL connection error: error:00000000:lib(0):func(0):reason(0), Internal MariaDB error code: 2026
Feb 26 07:23:25 db1140 mysqld[3685]: 2020-02-26  7:23:25 139851380549376 [Note] Slave: connected to master 'repl@db1122.eqiad.wmnet:3306',replication resumed in log 'db1122-bin.001369' at position 786291425
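
For reference, a minimal sketch of how to confirm the IO thread reconnected on the s2 instance; the multi-instance socket path below is an assumption and may need adjusting to the host's layout:

# Check the s2 replica thread state on db1140 (socket path assumed)
sudo mysql --socket=/run/mysqld/mysqld.s2.sock \
  -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running|Last_IO_Errno|Last_IO_Error|Seconds_Behind_Master'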

I am surprised by that internal error; I think this requires some investigation.

Event Timeline

Marostegui triaged this task as Medium priority. Feb 26 2020, 7:30 AM
Marostegui moved this task from Triage to In progress on the DBA board.

Something happened to its master (db1122); there is a glitch visible across all its metrics: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=1582659145265&to=1582702345265&var-dc=eqiad%20prometheus%2Fops&var-server=db1122&var-port=9104

Maybe a network issue there? I am not able to see anything in the logs on the master (i.e. the interface going down temporarily or something like that).
@ayounsi can you check the switch port graphs/logs to see if there is something there?
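
While waiting on the switch side, a quick host-side sanity check for link flaps or NIC error counters could look like the sketch below (the interface name is an assumption):

# Look for link state changes and per-interface error counters on db1122 (interface name assumed)
sudo dmesg -T | grep -i eno1
ip -s link show dev eno1
sudo ethtool -S eno1 | grep -iE 'err|drop|discard'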

From those graphs there is another glitch that apparently happened around 22:06 yesterday, but from what I can see that one didn't cause s2 replication to stall on db1140.

So, given that db1122 is the primary master, it is notable that I cannot see any slave other than db1140 failing to connect:

Feb 26 07:18:11 db1140 mysqld[3685]: 2020-02-26  7:18:11 139851380549376 [ERROR] Slave I/O: error reconnecting to master 'repl@db1122.eqiad.wmnet:3306' - retry-time: 60  maximum-retries: 86400  message: SSL connection error: error:00000000:lib(0):func(0):reason(0), Internal MariaDB error code: 2026
Feb 26 07:18:22 icinga2001 icinga: SERVICE ALERT: db1140;MariaDB Slave IO: s2;CRITICAL;SOFT;1;CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master 'repl@db1122.eqiad.wmnet:3306' - retry-time: 60  maximum-retries: 86400  message: SSL connection error: error:00000000:lib(0):func(0):reason(0)
Feb 26 07:18:44 icinga1001 icinga: SERVICE ALERT: db1140;MariaDB Slave IO: s2;CRITICAL;SOFT;1;CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master 'repl@db1122.eqiad.wmnet:3306' - retry-time: 60  maximum-retries: 86400  message: SSL connection error: error:00000000:lib(0):func(0):reason(0)
Feb 26 07:20:10 icinga2001 icinga: SERVICE ALERT: db1140;MariaDB Slave IO: s2;CRITICAL;SOFT;2;CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master 'repl@db1122.eqiad.wmnet:3306' - retry-time: 60  maximum-retries: 86400  message: SSL connection error: error:00000000:lib(0):func(0):reason(0)
Feb 26 07:20:52 icinga1001 icinga: SERVICE ALERT: db1140;MariaDB Slave IO: s2;CRITICAL;SOFT;2;CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master 'repl@db1122.eqiad.wmnet:3306' - retry-time: 60  maximum-retries: 86400  message: SSL connection error: error:00000000:lib(0):func(0):reason(0)
Feb 26 07:21:56 icinga2001 icinga: SERVICE ALERT: db1140;MariaDB Slave IO: s2;CRITICAL;HARD;3;CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master 'repl@db1122.eqiad.wmnet:3306' - retry-time: 60  maximum-retries: 86400  message: SSL connection error: error:00000000:lib(0):func(0):reason(0)
Feb 26 07:22:56 icinga1001 icinga: SERVICE ALERT: db1140;MariaDB Slave IO: s2;CRITICAL;HARD;3;CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master 'repl@db1122.eqiad.wmnet:3306' - retry-time: 60  maximum-retries: 86400  message: SSL connection error: error:00000000:lib(0):func(0):reason(0)
Feb 26 07:22:56 icinga1001 icinga: SERVICE NOTIFICATION: irc;db1140;MariaDB Slave IO: s2;CRITICAL;notify-service-by-irc;CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db1122.eqiad.wmnet:3306 - retry-time: 60  maximum-retries: 86400  message: SSL connection error: error:00000000:lib(0):func(0):reason(0)
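
For completeness, connected replicas can also be cross-checked on db1122 itself, roughly like this:

# On db1122: list replication client threads currently streaming binlogs
sudo mysql -e "SHOW SLAVE HOSTS;"
sudo mysql -e "SELECT ID, USER, HOST, TIME, STATE FROM information_schema.PROCESSLIST WHERE COMMAND LIKE 'Binlog Dump%';"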

The first error was:

Feb 26 07:15:09 db1140 mysqld[3685]: 2020-02-26  7:15:09 139851380549376 [ERROR] Error reading packet from server: Lost connection to MySQL server during query (server_errno=2013)

Nothing of note in systemd, except a root login from cumin a few minutes before, but nothing out of the ordinary (the snapshots of the other instances, s3 and x1, were running).
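
A sketch of the journal check around the incident window (the per-section unit name is an assumption):

# Review everything logged around the incident window on db1140
sudo journalctl --since "2020-02-26 07:00" --until "2020-02-26 07:30"
# Narrow it down to the s2 mysqld instance (unit name assumed)
sudo journalctl -u mariadb@s2 --since "2020-02-26 07:00" --until "2020-02-26 07:30"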

There was an increase in load at that time, which led to an increase in network errors: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=db1140&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&fullscreen&panelId=11&from=1582695612845&to=1582706412845 Maybe it is just a consequence of "speeding up" the s3 backups, leading to too much work being done by the host.

I will reschedule the backups so that no two run at the same time on db1140; that is my best guess at the cause and my mitigation.
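
A minimal sketch of the staggering idea, with hypothetical crontab entries and a hypothetical wrapper script (the actual scheduling is handled by the backup tooling, see T244958):

# Stagger the section dumps on db1140 so no two run concurrently (times and script are illustrative)
0 0 * * 2  /usr/local/bin/dump-section s3    # hypothetical wrapper script
0 6 * * 2  /usr/local/bin/dump-section x1    # started 6h later to avoid overlapping load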

Thanks @ayounsi - it looks like this confirms that the backup generation caused some temporary issues here.

jcrespo claimed this task.

Closing this as resolved, as this backup overlap should not happen again thanks to the work on T244958.

As a general idea, we should purchase all new DB hosts with 10G cards, and work towards putting at least future source hosts on a 10G network (CC @ayounsi). That will be around 5 more 1U hosts per datacenter "short term" (for some definition of short term in DBA time :-P). Backup workflows really do need the extra bandwidth in bursts.