
db1151, db2144 X2 masters error: Could not execute Delete_rows_v1 event on table mainstash.objectstash
Closed, Resolved · Public

Description

Aug 16 03:08:12 db2144 mysqld[2821]: 2022-08-16  3:08:12 7252901 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table mainstash.objec>
Aug 16 03:08:12 db2144 mysqld[2821]: 2022-08-16  3:08:12 7252901 [Warning] Slave: Can't find record in 'objectstash' Error_code: 1032
Aug 16 03:08:12 db2144 mysqld[2821]: 2022-08-16  3:08:12 7252901 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restar>
Aug 16 03:08:12 db2144 mysqld[2821]: 2022-08-16  3:08:12 7252901 [Note] Slave SQL thread exiting, replication stopped in log 'db1151-bin.000810' at>
Aug 16 03:08:12 db2144 mysqld[2821]: 2022-08-16  3:08:12 7252901 [Note] master was db1151.eqiad.wmnet:3306
Aug 16 03:08:12 db1151 mysqld[1739]: 2022-08-16  3:08:12 2023045947 [ERROR] Slave SQL: Could not execute Delete_rows_v1 event on table mainstash.ob>
Aug 16 03:08:12 db1151 mysqld[1739]: 2022-08-16  3:08:12 2023045947 [Warning] Slave: Can't find record in 'objectstash' Error_code: 1032
Aug 16 03:08:12 db1151 mysqld[1739]: 2022-08-16  3:08:12 2023045947 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and res>
Aug 16 03:08:12 db1151 mysqld[1739]: 2022-08-16  3:08:12 2023045947 [Note] Slave SQL thread exiting, replication stopped in log 'db2144-bin.000813'>
Aug 16 03:08:12 db1151 mysqld[1739]: 2022-08-16  3:08:12 2023045947 [Note] master was db2144.codfw.wmnet:3306

Replication is stopped.


The affected x2 cluster is the MediaWiki "main stash", containing session data which was moved out of Redis in T212129.


Event Timeline

@jcrespo did some magic

db1151 and db2144 show normal status again

Next steps based on my understanding of what went wrong:

  • Switch x2 to statement-based replication
  • Restore replication on codfw replicas
  • Validate MW’s concept of multi-master conflict resolution by performing simultaneous writes and simultaneous purges on both DCs
  • Fix uncaught exception from LoadBalancer::approvePrimaryChanges() which caused total failure rather than graceful failure
  • Re-enable multi-DC mode on testwiki, test2wiki and mediawikiwiki

Before any of those:

  • Debug exactly what triggered the replication breakage, using the still-broken codfw replicas, so we can base any later change on hard facts by reconstructing the exact sequence of transactions and server actions.

Assuming the issue is what it looked like, at the same time as the switch to STATEMENT:

  • Disable GTID (at the very least on any host that is going to be directly writable, but probably on all, if we are going to have different data on different hosts), or, more generally, apply parsercache-like config to x2.
  • Consider extreme config such as setting slave_exec_mode to IDEMPOTENT, or skipping certain errors automatically, if replication has to keep running despite consistency issues even under weird states (although this seems like a really bad idea: one could potentially end up with 0 writes being transmitted in certain circumstances). A sketch of both knobs follows this list.
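For illustration only, roughly what those two knobs look like on a MariaDB replica. This is a sketch, not a command sequence that was run or agreed on here; whether to apply either of them was exactly the open question:

-- Sketch: stop replication, switch the replication connection off GTID
-- (back to binlog file/position coordinates), and resume.
STOP SLAVE;
CHANGE MASTER TO MASTER_USE_GTID = no;
START SLAVE;

-- Sketch of the "extreme" option: have the SQL thread resolve missing or duplicate
-- rows itself instead of stopping on errors such as 1032. Flagged above as a bad idea.
SET GLOBAL slave_exec_mode = 'IDEMPOTENT';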

Mentioned in SAL (#wikimedia-operations) [2022-08-16T18:37:12Z] <jynus> restore x2 codfw replication T315271

I've done the first bullet point, to the extent possible: because "reset..." had been run on both primaries, and I think on the eqiad replicas, all coordinates related to eqiad had been lost. However, the relay log of codfw was intact, and I was able to find out the root cause with a high level of confidence.

I will discuss it with @Marostegui tomorrow and seek his OK to paste here the link to the preliminary report on Wikitech.

I "fixed" (where fixed means purged binlog events) from the codfw replicas, making replication restart. I haven't yet repooled or deleted downtimes and silences from any host, but all are now "green" on icinga and orchestrator.

I was able to reproduce a replication failure locally. I set up two instances of MariaDB 10.3.34 with ring replication and binlog-format=row. One of them had master_delay=5, simulating network latency. Writing different values with the same key to both instances simultaneously led to permanently inconsistent data, with db1 having the value written to db2 and vice versa. A simultaneous DELETE query, achieved by calling SqlBagOStuff::set() with purgePeriod=1, caused replication to fail.
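For reference, a sketch of that local setup. The ports, credentials, and toy table below are illustrative assumptions rather than the exact commands run:

-- Two local MariaDB 10.3 instances, both with binlog_format=ROW and log_slave_updates=ON.
-- On instance A (port 3311), replicate from instance B:
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3312, MASTER_USER='repl', MASTER_PASSWORD='repl';
START SLAVE;
-- On instance B (port 3312), replicate from instance A, delayed 5 seconds to simulate latency:
CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3311, MASTER_USER='repl', MASTER_PASSWORD='repl', MASTER_DELAY=5;
START SLAVE;

-- A toy key-value table stands in for mainstash.objectstash:
CREATE TABLE test.kv (keyname VARBINARY(255) PRIMARY KEY, value BLOB) ENGINE=InnoDB;

-- At the same moment, write different values for the same key on each instance:
REPLACE INTO test.kv VALUES ('test', 'written on A');  -- on A
REPLACE INTO test.kv VALUES ('test', 'written on B');  -- on B
-- Each ROW event also applies on the peer, so A ends up holding B's value and vice versa:
-- permanently inconsistent, but replication keeps running.

-- A simultaneous delete on both sides (what SqlBagOStuff::set() with purgePeriod=1 triggers)
-- then breaks it: each Delete_rows event reaches the peer after the local DELETE has already
-- removed the row, so the SQL thread stops with error 1032, "Can't find record".
DELETE FROM test.kv WHERE keyname = 'test';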

Change 823793 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] SqlBagOStuff: Fix modtoken comparison

https://gerrit.wikimedia.org/r/823793

  • Validate MW’s concept of multi-master conflict resolution by performing simultaneous writes and simultaneous purges on both DCs

It turned out to be broken; I fixed it in the patch linked above.
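For context, the conflict resolution here is conceptually last-write-wins, keyed on a per-write modification token stored alongside the value. Purely as an illustration of that concept (this is not the SQL MediaWiki actually issues, and the table/column names and token format are assumptions), an upsert that only lets a write through if its token is at least as new as the stored one would look like:

-- Conceptual sketch only: the newest modification token wins on conflict.
INSERT INTO objectstash (keyname, value, exptime, modtoken)
VALUES ('test', 'value-from-this-dc', '20220823000000', '0001660000000-example')
ON DUPLICATE KEY UPDATE
  value    = IF(VALUES(modtoken) >= modtoken, VALUES(value), value),
  exptime  = IF(VALUES(modtoken) >= modtoken, VALUES(exptime), exptime),
  modtoken = IF(VALUES(modtoken) >= modtoken, VALUES(modtoken), modtoken);

The real mechanism lives in SqlBagOStuff; per its title, the patch above fixes the comparison step of that modtoken logic.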

  • Fix uncaught exception from LoadBalancer::approvePrimaryChanges() which caused total failure rather than graceful failure

This is done in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/823791

Change 824037 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Set binlog_format=STATEMENT on x2 servers

https://gerrit.wikimedia.org/r/824037

Change 824039 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Re-enable multi-DC mode on testwiki, test2wiki and mediawiki.org

https://gerrit.wikimedia.org/r/824039

Change 824037 merged by Marostegui:

[operations/puppet@production] Set binlog_format=STATEMENT on x2 servers

https://gerrit.wikimedia.org/r/824037

Change 823793 merged by jenkins-bot:

[mediawiki/core@master] SqlBagOStuff: Fix modtoken comparison

https://gerrit.wikimedia.org/r/823793

Change 824445 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@wmf/1.39.0-wmf.25] SqlBagOStuff: Fix modtoken comparison

https://gerrit.wikimedia.org/r/824445

Change 824445 merged by jenkins-bot:

[mediawiki/core@wmf/1.39.0-wmf.25] SqlBagOStuff: Fix modtoken comparison

https://gerrit.wikimedia.org/r/824445

Mentioned in SAL (#wikimedia-operations) [2022-08-22T00:25:23Z] <tstarling@deploy1002> Synchronized php-1.39.0-wmf.25/includes/objectcache/SqlBagOStuff.php: fix modtoken comparison T315271 (duration: 03m 45s)

Change 824039 merged by Tim Starling:

[operations/puppet@production] Re-enable multi-DC mode on testwiki, test2wiki and mediawiki.org

https://gerrit.wikimedia.org/r/824039

Krinkle claimed this task.
Krinkle reassigned this task from Krinkle to tstarling.
Krinkle subscribed.
  • Validate MW’s concept of multi-master conflict resolution by performing simultaneous writes and simultaneous purges on both DCs

I tested this in production with eval.php, running simultaneously on mwmaint1002 and mwmaint2002:

// Simultaneous writes: each DC writes its own name to the same key at an agreed-upon time.
$dc = 'codfw'; // or 'eqiad' on the other host
$c = MediaWiki\MediaWikiServices::getInstance()->getMainObjectStash();
// Busy-wait until a shared 10-second boundary so both hosts write at the same moment,
// then poll the key to watch which value wins.
$startTime = round(time() + 20,-1); print microtime(true) . "\t$startTime\n"; while ( microtime(true) < $startTime ); $c->set('test', $dc); for ( $i = 0; $i < 100; $i++) { print $c->get('test') . "\n"; usleep(10000); }

// Simultaneous purges: force the garbage-collection DELETE to run on every set().
$tc = Wikimedia\TestingAccessWrapper::newFromObject($c);
$tc->purgePeriod = 1;
$startTime = round(time() + 20,-1); print microtime(true) . "\t$startTime\n"; while ( microtime(true) < $startTime ); $c->set('test', $dc); $c->set('test', $dc, 10);

I have added some more info at:
https://wikitech.wikimedia.org/wiki/MariaDB#x2 and https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#x2_special_topology

At https://wikitech.wikimedia.org/wiki/MariaDB#x2, @Krinkle or @tstarling, could you add the procedure for disabling a DC in any of those scenarios? (i.e. the codfw master dies and we want to stop writes to codfw).

The incident report (still a draft) for this outage is at: https://wikitech.wikimedia.org/wiki/Incidents/2022-08-16_x2_databases_replication_breakage