Page MenuHomePhabricator

labsdb1004 replication broken: in memory tables from s51290__dpl_p
Closed, ResolvedPublic

Description

Hi,

labsdb1004 has replication broken with:

mysql:root@localhost [s51290__dpl_p]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: labsdb1005.eqiad.wmnet
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: log.164307
          Read_Master_Log_Pos: 3443652
               Relay_Log_File: labsdb1004-relay-bin.000002
                Relay_Log_Pos: 689
        Relay_Master_Log_File: log.164197
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table: s51412\_\_data.%,s51071\_\_templatetiger\_p.%,s52721\_\_pagecount\_stats\_p.%
                   Last_Errno: 1114
                   Last_Error: Could not execute Write_rows_v1 event on table s51290__dpl_p.i_psub; The table 'i_psub' is full, Error_code: 1114; handler error HA_ERR_RECORD_FILE_FULL; the event's master log log.164197, end_log_pos 72676728
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 72676577
              Relay_Log_Space: 12850941488
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: Yes
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1114
               Last_SQL_Error: Could not execute Write_rows_v1 event on table s51290__dpl_p.i_psub; The table 'i_psub' is full, Error_code: 1114; handler error HA_ERR_RECORD_FILE_FULL; the event's master log log.164197, end_log_pos 72676728
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 171975945
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-171975945-12881250735

This is the affected table:

mysql:root@localhost [s51290__dpl_p]> show create table s51290__dpl_p.i_psub;
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table  | Create Table                                                                                                                                                                                                                                                          |
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| i_psub | CREATE TABLE `i_psub` (
  `ps_id` int(10) unsigned NOT NULL DEFAULT '0',
  `ps_title` varchar(255) COLLATE utf8mb4_bin NOT NULL DEFAULT '',
  PRIMARY KEY (`ps_id`),
  UNIQUE KEY `u_i_psub` (`ps_title`)
) ENGINE=MEMORY DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin |
+--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Current disk status:

root@labsdb1004:~# df -hT
Filesystem                          Type      Size  Used Avail Use% Mounted on
udev                                devtmpfs   10M     0   10M   0% /dev
tmpfs                               tmpfs     6.3G  638M  5.7G  10% /run
/dev/dm-0                           ext3      9.1G  4.5G  4.1G  53% /
tmpfs                               tmpfs      16G  4.0K   16G   1% /dev/shm
tmpfs                               tmpfs     5.0M     0  5.0M   0% /run/lock
tmpfs                               tmpfs      16G     0   16G   0% /sys/fs/cgroup
/dev/sda1                           ext3      268M  105M  150M  42% /boot
/dev/mapper/labsdb1004--vg-postgres ext4     1008G  258G  700G  27% /srv/postgres
/dev/mapper/labsdb1004--vg-labsdb   xfs       2.3T  847G  1.5T  37% /srv/labsdb

This is the event:

#171114 19:42:51 server id 171975945  end_log_pos 72676499      Write_rows: table id 760737 flags: STMT_END_F
### INSERT INTO `s51290__dpl_p`.`i_psub`
### SET
###   @1=3077441
###   @2='99riverstreet.jpg'
# at 72676499
#171114 19:42:51 server id 171975945  end_log_pos 72676577      Query   thread_id=58909691      exec_time=0     error_code=0
SET TIMESTAMP=1510688571/*!*/;
COMMIT
/*!*/;
# at 72676577
#171114 19:42:51 server id 171975945  end_log_pos 72676615      GTID 0-171975945-12852166892
/*!100001 SET @@session.gtid_seq_no=12852166892*//*!*/;
BEGIN
/*!*/;
# at 72676615
# at 72676672
#171114 19:42:51 server id 171975945  end_log_pos 72676672      Table_map: `s51290__dpl_p`.`i_psub` mapped to number 760737
#171114 19:42:51 server id 171975945  end_log_pos 72676728      Write_rows: table id 760737 flags: STMT_END_F
### INSERT INTO `s51290__dpl_p`.`i_psub`
### SET
###   @1=33592610
###   @2='99th_Dream_cover.jpg'
# at 72676728
#171114 19:42:51 server id 171975945  end_log_pos 72676806      Query   thread_id=58909691      exec_time=0     error_code=0
SET TIMESTAMP=1510688571/*!*/;

Event Timeline

Marostegui moved this task from Backlog to ToolsDB on the Data-Services board.

First attempts to convert the table to InnoDB/Aria are not successful:

mysql:root@localhost [s51290__dpl_p]> alter table i_psub engine=InnoDB;
ERROR 1709 (HY000): Index column size too large. The maximum column size is 767 bytes.

mysql:root@localhost [s51290__dpl_p]> alter table i_psub engine=aria;
ERROR 1071 (42000): Specified key was too long; max key length is 1000 bytes

So the only way to solve this issue is: truncating the table or creating it with InnoDB.
Given this is a MEMORY table, what I have done is:

  • stop mysql
  • temporary set: max_heap_table_size = 256M (it was 16M)
  • start mysql

Replication flow till the table got full again. Which took around 20 seconds or less

cloud-services-team any way you can contact this user and let him/her know that he either has to re-create the table as InnoDB or sooner or later the limit will be hit and we'd need to truncate the table, as otherwise replication will break again.

In order to avoid affecting all the users in this host, I have truncated the table.
However, I have saved the records and took a mysqldump left at:

root@labsdb1004:~# ls -lh /root/i_psub.sql
-rw-r--r-- 1 root root 9.5M Nov 15 06:50 /root/i_psub.sql

Even if the table gets truncated, it gets full super quickly, affecting replication again.

               Last_SQL_Error: Could not execute Write_rows_v1 event on table s51290__dpl_p.i_psub; The table 'i_psub' is full, Error_code: 1114; handler error HA_ERR_RECORD_FILE_FULL; the event's master log log.164200, end_log_pos 37503205
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 171975945
               Master_SSL_Crl:
           Master_SSL_Crlpath:
                   Using_Gtid: Slave_Pos
                  Gtid_IO_Pos: 0-171975945-12882627823
1 row in set (0.00 sec)

mysql:root@localhost [s51290__dpl_p]> select count(*) from i_psub;
+----------+
| count(*) |
+----------+
|   248620 |
+----------+
1 row in set (0.00 sec)

So, I have taken another mysqldump of those new records.

Truncated the table, and recreated it with InnoDB. The UNIQUE index, is obviously restricted to a lower width, so there are chances of duplicate entries as only the first 191 out of 255 chars are getting checked.
This is the current schema of the final table

mysql:root@localhost [s51290__dpl_p]> show create table i_psub\G
*************************** 1. row ***************************
       Table: i_psub
Create Table: CREATE TABLE `i_psub` (
  `ps_id` int(10) unsigned NOT NULL DEFAULT '0',
  `ps_title` varchar(255) COLLATE utf8mb4_bin NOT NULL DEFAULT '',
  PRIMARY KEY (`ps_id`),
  UNIQUE KEY `u_i_psub_1` (`ps_title`(191))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin
1 row in set (0.00 sec)

I have loaded all the records onto that table. But this needs the user s51290 to fix it in a good way, this is just a work around to stop affecting everyone using that server and getting replication broken all the time.

And replication got broken with a different in memory table from the same user:

mysql:root@localhost [s51290__dpl_p]> show create table s51290__dpl_p.i_orphan_candidates;
+---------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table               | Create Table                                                                                                                                                                                                                                          |
+---------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| i_orphan_candidates | CREATE TABLE `i_orphan_candidates` (
  `oc_id` int(10) unsigned NOT NULL DEFAULT '0',
  `oc_title` varchar(255) COLLATE utf8mb4_bin NOT NULL DEFAULT '',
  UNIQUE KEY `u_oc` (`oc_title`)
) ENGINE=MEMORY DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin |
+---------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

With this new table, the procedure of recreating it as innodb doesn't work, as the code is dropping it and recreating it all the time, so it comes back as Memory table again, breaking replication.
So the solution for now is to ignore that table while replicating.
So I have set the replication filter until this gets fixed:

replicate_wild_ignore_table    = s51290__dpl_p.i_orphan_candidates

If the same issue comes back again with i_psub table I will do the same, as we cannot let a single user affect all the ones using this server.

Change 391508 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] tools.my.cnf: Replication filter: s51290__dpl_p

https://gerrit.wikimedia.org/r/391508

Change 391508 merged by Marostegui:
[operations/puppet@production] tools.my.cnf: Replication filter: s51290__dpl_p

https://gerrit.wikimedia.org/r/391508

Change 391513 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] tools.my.cnf: Replication filter for i_psub

https://gerrit.wikimedia.org/r/391513

Change 391513 merged by Marostegui:
[operations/puppet@production] tools.my.cnf: Replication filter for i_psub

https://gerrit.wikimedia.org/r/391513

Marostegui renamed this task from labsdb1004 replication broken: table s51290__dpl_p.i_psub; The table 'i_psub' is full to labsdb1004 replication broken: in memory tables from s51290__dpl_p.Nov 15 2017, 8:00 AM

And as expected, a drop and a recreation of i_psub also happened, recreating the table with MEMORY engine, so it got full again and replication broke.
So I am going to add this table to the replication ignore list until the user fixes the logic.

Also the following table i_redirects also broke. So adding it to the replication ignore filter too.

Change 391514 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] tools.my.cnf: Replication filter for i_redirects

https://gerrit.wikimedia.org/r/391514

Change 391514 merged by Marostegui:
[operations/puppet@production] tools.my.cnf: Replication filter for i_redirects

https://gerrit.wikimedia.org/r/391514

u_pagelinks also broke

Last_Error: Could not execute Write_rows_v1 event on table s51290__dpl_p.u_pagelinks; The table 'u_pagelinks' is full, Error_code: 1114; handler error HA_ERR_RECORD_FILE_FULL; the event's master log log.164218, end_log_pos 45342262

Looks like it is the last MEMORY table from this user:

mysql:root@localhost [s51290__dpl_p]> SELECT table_name FROM INFORMATION_SCHEMA.TABLES   WHERE engine = 'memory' and TABLE_SCHEMA="s51290__dpl_p";
+---------------------+
| table_name          |
+---------------------+
| i_orphan_candidates |
| i_psub              |
| i_redirects         |
| u_pagelinks         |
+---------------------+
4 rows in set (0.00 sec)

Change 391516 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] tools.my.cnf.erb: Ignore filter for u_pagelinks

https://gerrit.wikimedia.org/r/391516

Change 391516 merged by Marostegui:
[operations/puppet@production] tools.my.cnf.erb: Ignore filter for u_pagelinks

https://gerrit.wikimedia.org/r/391516

Unfortunately, new memory tables keep coming, so I am going to filter all the tables called i_ and u_ because I wouldn't want to filter the whole user.

Last_SQL_Error: Could not execute Write_rows_v1 event on table s51290__dpl_p.u_imagelinks; The table 'u_imagelinks' is full, Error_code: 1114; handler error HA_ERR_RECORD_FILE_FULL; the event's master log log.164218, end_log_pos 81268337
mysql:root@localhost [s51290__dpl_p]> SELECT table_name FROM INFORMATION_SCHEMA.TABLES   WHERE engine = 'memory' and TABLE_SCHEMA="s51290__dpl_p";
+---------------------+
| table_name          |
+---------------------+
| i_orphan_candidates |
| i_pl                |
| i_psub              |
| i_redirects         |
| u_imagelinks        |
| u_pagelinks         |
+---------------------+
6 rows in set (0.01 sec)

Change 391517 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] tools.my.cnf.erb: Ignore memory tables for s51290

https://gerrit.wikimedia.org/r/391517

Change 391517 merged by Marostegui:
[operations/puppet@production] tools.my.cnf.erb: Ignore memory tables for s51290

https://gerrit.wikimedia.org/r/391517

Ignore the full database, notify the breakage to the user (and that he will not have backups/redundancy, as done with the other tools (no plan B, no backups, database will not exist while on failover/maintenance).

Marostegui claimed this task.

I am going to fix this as resolved, as the scope of the ticket is finished.
As nothing else broke in 24h, I am going to leave the replication filters as they are, that is: not replicating in memory tables.
If something else happens, and as Jaime said, we will need to ban the whole user from this host, meaning no backups/redundancy for it.
Once this is fixed from code side, feel free to reopen this ticket or create a new one so we can remove the replication filters.

No one contacted me during this whole process, until after everything had been "resolved". I won't run the job with MEMORY tables again, so you can remove the filters.

Change 391789 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] tools.my.cnf: Remove filters for s51290

https://gerrit.wikimedia.org/r/391789

Change 391789 merged by Marostegui:
[operations/puppet@production] tools.my.cnf: Remove filters for s51290

https://gerrit.wikimedia.org/r/391789

They have been removed from config and from live replication.