Page MenuHomePhabricator

Possibly a big update going to: u2815__p`.`all_articles
Closed, ResolvedPublic

Description

@Dispenser are you aware of a big transaction (looks like an UPDATE) being done to: u2815__p.all_articles.
Apparently it started yesterday around 23:09:34
That run on the toolforge db master and it has now being executed on the slave (labsdb1004) - it has been running now for around 8 hours)

The slave is stuck processing: Exec_Master_Log_Pos: 16403525 for log.186637

Which looks like it has a massive amount of UPDATES like:

### UPDATE `u2815__p`.`all_articles`
### WHERE
###   @1=REMOVED /* INT meta=0 nullable=0 is_null=0 */
###   @2='REMOVED' /* VARSTRING(1020) meta=1020 nullable=0 is_null=0 */
###   @3='1' /* VARSTRING(1024) meta=1024 nullable=0 is_null=0 */
###   @4=0 /* TINYINT meta=0 nullable=0 is_null=0 */
###   @5=-1 (255) /* TINYINT meta=0 nullable=0 is_null=0 */
###   @6=-1 (255) /* TINYINT meta=0 nullable=0 is_null=0 */
### SET
###   @1=REMOVED /* INT meta=0 nullable=0 is_null=0 */
###   @2='REMOVED' /* VARSTRING(1020) meta=1020 nullable=0 is_null=0 */
###   @3='1' /* VARSTRING(1024) meta=1024 nullable=0 is_null=0 */
###   @4=0 /* TINYINT meta=0 nullable=0 is_null=0 */
###   @5=0 /* TINYINT meta=0 nullable=0 is_null=0 */
###   @6=0 /* TINYINT meta=0 nullable=0 is_null=0 */

There is no PK on that key.

       Table: all_articles
Create Table: CREATE TABLE `all_articles` (
  `aa_page` int(10) unsigned NOT NULL,
  `aa_title` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL,
  `title_normalized` varchar(256) COLLATE utf8mb4_unicode_ci NOT NULL,
  `aa_is_dab` tinyint(1) NOT NULL,
  `linkedFromDab` tinyint(1) NOT NULL,
  `linkedFromSI` tinyint(1) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Event Timeline

This has been running now for 10h:

---TRANSACTION 19459252027, ACTIVE 36528 sec fetching rows
mysql tables in use 1, locked 1
1829 lock struct(s), heap size 226856, 315107 row lock(s), undo log entries 313280
MySQL thread id 447721, OS thread handle 0x7f11f0084700, query id 3523868229 executing

And it is degrading the service for the rest of the users, I am considering if we should ignore and filter out u2815__p.all_articles for replication.

Kill it and drop the table. The script should only take 10 minutes for LOAD DATA

@Dispenser- the proposal would be to remove redundancy of that table, if you say it is (easy) to automatically generated data, we will ignore it on replication- with the consequence of it will not have redundancy and you will have to regenerate it manually in case of a hardware failure or other kind of maintenance. I will deploy now such change.

Mentioned in SAL (#wikimedia-operations) [2018-03-23T10:59:09Z] <jynus> deployed new replication filter for labsdb1004 on u2815__p.all_articles T190488

@jcrespo I was developing the script last night, so something might have gone astray. I'll add a PRIMARY KEY so it doesn't happen in the future.

That would be cool, thank you! I can remove the filter and reimport the table to the (hidden) replica when you are done. I will claim it meanwhile for other related breakages on replication.

jcrespo added a project: DBA.
jcrespo moved this task from Triage to In progress on the DBA board.

Change 421509 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] toolsdb: Deploy replication filter to avoid replica lag

https://gerrit.wikimedia.org/r/421509

Change 421509 merged by Jcrespo:
[operations/puppet@production] toolsdb: Deploy replication filter to avoid replica lag

https://gerrit.wikimedia.org/r/421509

That would be cool, thank you! I can remove the filter and reimport the table to the (hidden) replica when you are done. I will claim it meanwhile for other related breakages on replication.

Done and I've dropped u2815__p.all_articles for consistency.