
db1151, db2144 X2 masters error: Could not execute Delete_rows_v1 event on table mainstash.objectstash
Closed, Resolved · Public

Description

Aug 16 03:08:12 db2144 mysqld[2821]: 2022-08-16  3:08:12 7252901 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table mainstash.objec>
Aug 16 03:08:12 db2144 mysqld[2821]: 2022-08-16  3:08:12 7252901 [Warning] Slave: Can't find record in 'objectstash' Error_code: 1032
Aug 16 03:08:12 db2144 mysqld[2821]: 2022-08-16  3:08:12 7252901 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restar>
Aug 16 03:08:12 db2144 mysqld[2821]: 2022-08-16  3:08:12 7252901 [Note] Slave SQL thread exiting, replication stopped in log 'db1151-bin.000810' at>
Aug 16 03:08:12 db2144 mysqld[2821]: 2022-08-16  3:08:12 7252901 [Note] master was db1151.eqiad.wmnet:3306
Aug 16 03:08:12 db1151 mysqld[1739]: 2022-08-16  3:08:12 2023045947 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table mainstash.ob>
Aug 16 03:08:12 db1151 mysqld[1739]: 2022-08-16  3:08:12 2023045947 [Warning] Slave: Can't find record in 'objectstash' Error_code: 1032
Aug 16 03:08:12 db1151 mysqld[1739]: 2022-08-16  3:08:12 2023045947 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and res>
Aug 16 03:08:12 db1151 mysqld[1739]: 2022-08-16  3:08:12 2023045947 [Note] Slave SQL thread exiting, replication stopped in log 'db2144-bin.000813'>
Aug 16 03:08:12 db1151 mysqld[1739]: 2022-08-16  3:08:12 2023045947 [Note] master was db2144.codfw.wmnet:3306

Replication is stopped.


The affected x2 cluster is the MediaWiki "main stash", containing session data which was moved out of Redis in T212129.


Event Timeline

@jcrespo did some magic

db1151 and db2144 show normal status again

Next steps based on my understanding of what went wrong:

  • Switch x2 to statement-based replication
  • Restore replication on codfw replicas
  • Validate MW’s concept of multi-master conflict resolution by performing simultaneous writes and simultaneous purges on both DCs
  • Fix uncaught exception from LoadBalancer::approvePrimaryChanges() which caused total failure rather than graceful failure
  • Re-enable multi-DC mode on testwiki, test2wiki and mediawikiwiki

Before any of those:

  • Debug exactly what triggered the replication breakage, using the still-broken codfw replicas, so we can base any later change on hard facts by reconstructing the exact sequence of transactions and server actions.

Assuming the issue is what it looked like, at the same time as the switch to STATEMENT:

  • Disable GTID (at the very least on any host that is going to be directly writable, but probably on all, if we are going to have different data on different hosts), or, more generally, apply parsercache-like config to x2.
  • Consider extreme config such as setting slave_exec_mode to IDEMPOTENT, or skipping certain errors automatically, if replication has to keep running despite consistency issues even under weird states (although this seems like a really bad idea: one could potentially end up with 0 writes being transmitted in certain circumstances). A sketch of both knobs follows this list.
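For illustration only, roughly what those two knobs look like on a MariaDB replica. This is a sketch, not a command sequence that was run or agreed on here; whether to apply either of them was exactly the open question:

-- Sketch: stop replication, switch the replication connection off GTID
-- (back to binlog file/position coordinates), and resume.
STOP SLAVE;
CHANGE MASTER TO MASTER_USE_GTID = no;
START SLAVE;

-- Sketch of the "extreme" option: have the SQL thread resolve missing or duplicate
-- rows itself instead of stopping on errors such as 1032. Flagged above as a bad idea.
SET GLOBAL slave_exec_mode = 'IDEMPOTENT';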

Mentioned in SAL (#wikimedia-operations) [2022-08-16T18:37:12Z] <jynus> restore x2 codfw replication T315271

I've done the first bullet point, to the extent possible: because "reset..." had been run on both primaries, and I think on the eqiad replicas, all coordinates related to eqiad had been lost. However, the relay log of codfw was intact, and I was able to find out the root cause with a high level of confidence.

I will discuss it with @Marostegui tomorrow and seek his OK to paste here the link to the preliminary report on Wikitech.

I "fixed" (where fixed means purged binlog events) from the codfw replicas, making replication restart. I haven't yet repooled or deleted downtimes and silences from any host, but all are now "green" on icinga and orchestrator.

I was able to reproduce a replication failure locally. I set up two instances of MariaDB 10.3.34 with ring replication and binlog-format=row. One of them had master_delay=5, simulating network latency. Writing different values with the same key to both instances simultaneously led to permanently inconsistent data, with db1 having the value written to db2 and vice versa. A simultaneous DELETE query, achieved by calling SqlBagOStuff::set() with purgePeriod=1, caused replication to fail.
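For reference, a sketch of that local setup. The ports, credentials, and toy table below are illustrative assumptions rather than the exact commands run:

-- Two local MariaDB 10.3 instances, both with binlog_format=ROW and log_slave_updates=ON.
-- On instance A (port 3311), replicate from instance B:
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3312, MASTER_USER='repl', MASTER_PASSWORD='repl';
START SLAVE;
-- On instance B (port 3312), replicate from instance A, delayed 5 seconds to simulate latency:
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3311, MASTER_USER='repl', MASTER_PASSWORD='repl', MASTER_DELAY=5;
START SLAVE;

-- A toy key-value table stands in for mainstash.objectstash:
CREATE TABLE test.kv (keyname VARBINARY(255) PRIMARY KEY, value BLOB) ENGINE=InnoDB;

-- At the same moment, write different values for the same key on each instance:
REPLACE INTO test.kv VALUES ('test', 'written on A');  -- on A
REPLACE INTO test.kv VALUES ('test', 'written on B');  -- on B
-- Each ROW event also applies on the peer, so A ends up holding B's value and vice versa:
-- permanently inconsistent, but replication keeps running.

-- A simultaneous delete on both sides (what SqlBagOStuff::set() with purgePeriod=1 triggers)
-- then breaks it: each Delete_rows event reaches the peer after the local DELETE has already
-- removed the row, so the SQL thread stops with error 1032, "Can't find record".
DELETE FROM test.kv WHERE keyname = 'test';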

Change 823793 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] SqlBagOStuff: Fix modtoken comparison

https://gerrit.wikimedia.org/r/823793

  • Validate MW’s concept of multi-master conflict resolution by performing simultaneous writes and simultaneous purges on both DCs

It turned out to be broken; I fixed it in the patch linked above.
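For context, the conflict resolution here is conceptually last-write-wins, keyed on a per-write modification token stored alongside the value. Purely as an illustration of that concept (this is not the SQL MediaWiki actually issues, and the table/column names and token format are assumptions), an upsert that only lets a write through if its token is at least as new as the stored one would look like:

-- Conceptual sketch only: the newest modification token wins on conflict.
INSERT INTO objectstash (keyname, value, exptime, modtoken)
VALUES ('test', 'value-from-this-dc', '20220823000000', '0001660000000-example')
ON DUPLICATE KEY UPDATE
  value    = IF(VALUES(modtoken) >= modtoken, VALUES(value), value),
  exptime  = IF(VALUES(modtoken) >= modtoken, VALUES(exptime), exptime),
  modtoken = IF(VALUES(modtoken) >= modtoken, VALUES(modtoken), modtoken);

The real mechanism lives in SqlBagOStuff; per its title, the patch above fixes the comparison step of that modtoken logic.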

  • Fix uncaught exception from LoadBalancer::approvePrimaryChanges() which caused total failure rather than graceful failure

This is done in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/823791

Change 824037 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Set binlog_format=STATEMENT on x2 servers

https://gerrit.wikimedia.org/r/824037

Change 824039 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Re-enable multi-DC mode on testwiki, test2wiki and mediawiki.org

https://gerrit.wikimedia.org/r/824039

Change 824037 merged by Marostegui:

[operations/puppet@production] Set binlog_format=STATEMENT on x2 servers

https://gerrit.wikimedia.org/r/824037

Change 823793 merged by jenkins-bot:

[mediawiki/core@master] SqlBagOStuff: Fix modtoken comparison

https://gerrit.wikimedia.org/r/823793

Change 824445 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@wmf/1.39.0-wmf.25] SqlBagOStuff: Fix modtoken comparison

https://gerrit.wikimedia.org/r/824445

Change 824445 merged by jenkins-bot:

[mediawiki/core@wmf/1.39.0-wmf.25] SqlBagOStuff: Fix modtoken comparison

https://gerrit.wikimedia.org/r/824445

Mentioned in SAL (#wikimedia-operations) [2022-08-22T00:25:23Z] <tstarling@deploy1002> Synchronized php-1.39.0-wmf.25/includes/objectcache/SqlBagOStuff.php: fix modtoken comparison T315271 (duration: 03m 45s)

Change 824039 merged by Tim Starling:

[operations/puppet@production] Re-enable multi-DC mode on testwiki, test2wiki and mediawiki.org

https://gerrit.wikimedia.org/r/824039

Krinkle claimed this task.
Krinkle reassigned this task from Krinkle to tstarling.
Krinkle subscribed.
  • Validate MW’s concept of multi-master conflict resolution by performing simultaneous writes and simultaneous purges on both DCs

I tested this in production with eval.php, running simultaneously on mwmaint1002 and mwmaint2002:

// Simultaneous writes: each DC writes its own name to the same key at an agreed-upon time.
$dc = 'codfw'; // or 'eqiad' on the other host
$c = MediaWiki\MediaWikiServices::getInstance()->getMainObjectStash();
// Busy-wait until a shared 10-second boundary so both hosts write at the same moment,
// then poll the key to watch which value wins.
$startTime = round(time() + 20,-1); print microtime(true) . "\t$startTime\n"; while ( microtime(true) < $startTime ); $c->set('test', $dc); for ( $i = 0; $i < 100; $i++) { print $c->get('test') . "\n"; usleep(10000); }

// Simultaneous purges: force the garbage-collection DELETE to run on every set().
$tc = Wikimedia\TestingAccessWrapper::newFromObject($c);
$tc->purgePeriod = 1;
$startTime = round(time() + 20,-1); print microtime(true) . "\t$startTime\n"; while ( microtime(true) < $startTime ); $c->set('test', $dc); $c->set('test', $dc, 10);

I have added some more info at:
https://wikitech.wikimedia.org/wiki/MariaDB#x2 and https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#x2_special_topology

At https://wikitech.wikimedia.org/wiki/MariaDB#x2, @Krinkle or @tstarling, could you add the procedure for disabling a DC in any of those scenarios? (i.e. the codfw master dies and we want to stop writes to codfw).

The incident report (still a draft) for this outage is at: https://wikitech.wikimedia.org/wiki/Incidents/2022-08-16_x2_databases_replication_breakage