
s51187 and p50380g50692 database users are generating excessive lag on replica service
Closed, Resolved · Public

Description

Screenshot_20170809_150309.png (227×646 px, 14 KB)

------------------------
LATEST DETECTED DEADLOCK
------------------------
2017-08-09 09:13:13 7f742a17f700
*** (1) TRANSACTION:
TRANSACTION 78008195605, ACTIVE 41 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 24 lock struct(s), heap size 2936, 24 row lock(s), undo log entries 23
MySQL thread id 165027767, OS thread handle 0x7f71e2ea8700, query id 16558091672 executing
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 723 page no 729779 n bits 200 index `PRIMARY` of table `enwiki`.`page` trx id 78008195605 lock_mode X locks rec
*** (2) TRANSACTION:
TRANSACTION 78008145646, ACTIVE 104 sec starting index read
mysql tables in use 1, locked 1
117655 lock struct(s), heap size 12105256, 5973 row lock(s)
MySQL thread id 165151888, OS thread handle 0x7f742a17f700, query id 16558035348 10.68.18.18 s51290 Sending data
DELETE FROM ch_pl
      WHERE NOT EXISTS
           (
            SELECT 1
              FROM enwiki_p.page
             WHERE pl_from = page_id
               AND page_namespace = 0
               AND page_is_redirect = 0
           )
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 723 page no 729779 n bits 200 index `PRIMARY` of table `enwiki`.`page` trx id 78008145646 lock mode S locks rec
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 723 page no 12786 n bits 208 index `PRIMARY` of table `enwiki`.`page` trx id 78008145646 lock mode S locks rec 
*** WE ROLL BACK TRANSACTION (1)

These are the main tools involved:

  • Xtools:
INSERT HIGH_PRIORITY IGNORE INTO `s51187__xtools_tmp`.`20170809125445e77252c210e090f34d0d1308
                                        SELECT 
                                                  page_namespace
                                                , a.rev_page
                                                , a.rev_timestamp
                                                , a.rev_minor_edit
                                                , IFNULL(a.rev_parent_id,0)
                                                , IFNULL(a.rev_len,0)
                                                , CONCAT( LEFT(SUBSTRING_INDEX( a.rev_comment, '*/', -1),20), RIGHT(a.rev_comment,20)
                                        FROM revision_userindex a
                                        JOIN page ON page_id = a.rev_page
                                        WHERE   a.rev_user = '7181920' AND  a.rev_timestamp > 20080526
  • dplbot (I was told):
CREATE TABLE n_dab_pl (
      pl_from int unsigned NOT NULL default '0',
      pl_title varchar(255) binary NOT NULL default ''
    ) ENGINE=MyISAM AS 
         SELECT pl_from,
                oc_title AS pl_title
           FROM enwiki_p.pagelinks, n_dab_link_candidates
          WHERE pl_namespace = 0
            AND pl_title = oc_title

Because they are using MyISAM (a heavy-locking engine, unsupported and discouraged on the replica hosts), combined with DDL/DML statements (INSERT/UPDATE/DELETE/CREATE ... SELECT), they are causing replication delay for all users. This has been happening for a while, but normally it only caused small delays; it has now grown to several hours of lag (perhaps because of the interaction between the two tools?).
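For reference, a quick way to audit which tables in these user databases still use a locking engine (a sketch only; it assumes the account can read information_schema on the replica host):

-- Sketch: list the tables in the two user databases that are not yet on InnoDB.
SELECT table_schema, table_name, engine
  FROM information_schema.TABLES
 WHERE table_schema IN ('s51187__xtools_tmp', 'p50380g50692__DPL_p')
   AND engine <> 'InnoDB';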

The easiest way to solve this is to create or convert the user tables with the InnoDB engine (ENGINE=InnoDB), which does not require locking the original replicated table. Alternatively, the SELECT can be run separately from the insertion to avoid blocking writes (see the sketch after the quoted documentation below). Please correct this issue as soon as possible. As documented:

Caution: Writing (INSERT, UPDATE, DELETE) tables on the replica servers leads to the replication being stopped for all users on that server until the query finishes. So you need to make sure that such queries can be executed in a time frame that does not disrupt other users' work too much, for example by processing data in smaller batches or rethinking your data flow. As a rule of thumb, queries should finish in less than a minute. -- https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Steps_to_create_a_user_database_on_the_replica_servers
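A minimal sketch of both suggestions, using the dplbot statement quoted above as the example (it assumes the tool can tolerate a disk-backed InnoDB table; the two blocks are alternatives, not a sequence):

-- Suggestion 1: convert the existing user table in place.
ALTER TABLE n_dab_pl ENGINE=InnoDB;

-- Suggestion 2: create the table as InnoDB up front and keep the SELECT separate
-- from the CREATE, so the long read is not tied to the DDL; if a single
-- INSERT ... SELECT is still too slow, the tool can fetch the SELECT result
-- client-side and insert it back in smaller batches.
CREATE TABLE n_dab_pl (
      pl_from int unsigned NOT NULL default '0',
      pl_title varchar(255) binary NOT NULL default ''
    ) ENGINE=InnoDB;

INSERT INTO n_dab_pl (pl_from, pl_title)
SELECT pl_from,
       oc_title AS pl_title
  FROM enwiki_p.pagelinks, n_dab_link_candidates
 WHERE pl_namespace = 0
   AND pl_title = oc_title;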

Because this impacts other users so badly, I will either convert the tables or block these queries if action is not taken reasonably soon (otherwise we will get more complaints from other users about lag).

Event Timeline

@madhuvishy @bd808, can you help me reach the owner(s) of p50380g50692__DPL_p? I don't know which tool corresponds to that database/username.

jcrespo added subscribers: Dispenser, JaGa, russblau.

I am told the DPL database may be dplbot tool.

We are very close to killing the old edit counter which creates these temporary tables. In the meantime, please feel free to block those queries or drop the tables, whatever you need to do. If the old edit counter is malfunctioning, that will only generate more interest in the new one :) And sorry for the disruption!

Thanks. Converting them to InnoDB will stop them from locking the database; I will do that for Xtools.

Well, this came as quite a surprise, but I've gone ahead and converted all of dplbot's user databases to use InnoDB. (Hard to believe we were causing all this lag, since the lag was continuing to be a problem even when none of dplbot's jobs were running, but whatever.)

Change 370992 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] labsdb: Set InnoDB as the default storage engine

https://gerrit.wikimedia.org/r/370992

Change 370992 merged by Jcrespo:
[operations/puppet@production] labsdb: Set InnoDB as the default storage engine

https://gerrit.wikimedia.org/r/370992

Hard to believe we were causing all this lag, since the lag was continuing to be a problem even when none of dplbot's jobs were running, but whatever.

It was not the only cause: there is heavy querying affecting all of production (T164173). However, while other hosts recover quickly (including labsdb1003, which replicates from the same source), on labsdb1001 these locks were preventing a full recovery. I also had not yet taken action on Xtools, which my previous patch handles. With all those fixes, lag should now be getting closer to 0.

I will close this when that happens. Thanks everybody for reacting quickly!

This keeps happening:

p50380g50692__DPL_p
 Command: Query
    Time: 79
   State: Sending data
    Info: CREATE TABLE ch_pl (
      pl_from int unsigned NOT NULL default '0',
      pl_to int unsigned NOT NULL default '0',
      pl_r int unsigned NOT NULL default '0'
    ) ENGINE=MEMORY AS

MEMORY, Aria and MyISAM all use table-level locking, not row-level locking or MVCC:

DELETE FROM ch_pl
      WHERE NOT EXISTS
           (
            SELECT 1
              FROM enwiki_p.page
             WHERE pl_from = page_id
               AND page_namespace = 0
               AND page_is_redirect = 0
           )

root@labsdb1001[p50380g50692__DPL_p]> show create table ch_pl\G
*************************** 1. row ***************************
       Table: ch_pl
Create Table: CREATE TABLE `ch_pl` (
  `pl_from` int(10) unsigned NOT NULL DEFAULT '0',
  `pl_to` int(10) unsigned NOT NULL DEFAULT '0',
  `pl_r` int(10) unsigned NOT NULL DEFAULT '0'
) ENGINE=MEMORY DEFAULT CHARSET=binary
1 row in set (0.00 sec)
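For illustration, a hypothetical batched variant of the DELETE above, assuming ch_pl is converted to InnoDB (row-level locking) and the job can loop over it:

-- Each short statement releases its locks on enwiki_p.page rows quickly instead
-- of holding them for the whole run; repeat until ROW_COUNT() returns 0.
DELETE FROM ch_pl
 WHERE NOT EXISTS
       (
        SELECT 1
          FROM enwiki_p.page
         WHERE pl_from = page_id
           AND page_namespace = 0
           AND page_is_redirect = 0
       )
 LIMIT 5000;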

Also:

CREATE TABLE IF NOT EXISTS `s51187__xtools_tmp`.`201708101025461935804d7bc9bfde231b57ce8330830b_parent` (
                                        `rev_id` int(8) unsigned DEFAULT '0',
                                        `rev_len` int(8) unsigned DEFAULT '0',
                                        KEY `revid` (`rev_id`)
                                        ) ENGINE=MEMORY

ENGINE=MEMORY changed to ENGINE=InnoDB for all dplbot jobs.

jcrespo claimed this task.

Right now the lag is 0 everywhere; maybe it is just a coincidence, but that was the main goal anyway: https://tools.wmflabs.org/replag/

Also:

CREATE TABLE IF NOT EXISTS `s51187__xtools_tmp`.`201708101025461935804d7bc9bfde231b57ce8330830b_parent` (
                                        `rev_id` int(8) unsigned DEFAULT '0',
                                        `rev_len` int(8) unsigned DEFAULT '0',
                                        KEY `revid` (`rev_id`)
                                        ) ENGINE=MEMORY

I've changed this to ENGINE=InnoDB, and did a quick test and everything still seems to work. Hopefully that addresses the issues on our end. We are all looking forward to putting the old XTools to rest!

Thank you, both of you. Even if this wasn't the cause, it certainly helps replication catch up faster and avoids unnecessary noise.

This is still ongoing: the CREATE ... SELECT on p50380g50692__DPL_p is the main reason enwiki is lagging 5 hours behind. As evidence, when those queries are killed, replication catches up.
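For reference, a sketch of how such long-running statements can be found and killed on the replica (the thread id used here is the one from the deadlock log at the top of this task, shown only as an example):

-- Find user queries that have been running for more than a minute.
SELECT id, user, time, state, LEFT(info, 80) AS query_start
  FROM information_schema.PROCESSLIST
 WHERE command = 'Query'
   AND time > 60
 ORDER BY time DESC;

-- Kill one of them by thread id.
KILL 165151888;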

What is the specific query that seems to be causing the problem now?

CREATE TABLE ch_pl SELECT ... was running at the time; I am not sure whether it was the only one.

dplbot/s51290 seems to keep creating issues. In this case the lag was caused by a different issue (another user running heavy queries, not dplbot itself), but it was making it difficult for replication to catch up right now. I have banned s51290 from labsdb1001 (not from the other replicas or other database hosts, which can still be used) until the lag goes back to 0 or close to it. Once I did that, replication lag started decreasing.

Is it possible to give s51290 read-only access to labsdb1001 until I have a chance to update the troublesome scripts?

Is it possible to give s51290 read-only access to labsdb1001 until I have a chance to update the troublesome scripts?

Done.
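For context, one way such a temporary read-only restriction can be expressed at the grant level (a sketch only; the ticket does not record the exact statements used), applied on labsdb1001 alone:

-- Hypothetical: remove write privileges on the tool's databases for this account
-- while leaving SELECT in place; the patterns match the grants restored below.
REVOKE INSERT, UPDATE, DELETE, CREATE, DROP, ALTER
    ON `s51290\_\_%`.* FROM 's51290'@'%';
REVOKE INSERT, UPDATE, DELETE, CREATE, DROP, ALTER
    ON `p50380g50692\_\_%`.* FROM 's51290'@'%';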

jcrespo closed this task as Resolved. (Edited Sep 30 2017, 9:47 AM)

I've restored GRANT ALL PRIVILEGES ON `s51290\_\_%`.* TO 's51290'@'%' and GRANT ALL PRIVILEGES ON `p50380g50692\_\_%`.* TO 's51290'@'%' for this account on labsdb1001.