
FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW
Closed, Declined · Public · 0 Estimated Story Points

Description

Problem

In order to be able to serve read requests from both DCs, on the MediaWiki side of things we need to ensure the replication lag is minimised between the two DCs. In this context, using GTID has proven problematic, so the current thinking is to deprecate it in favour of pt-heartbeat.

Proposal

Currently, pt-heartbeat is used only for monitoring the lag. We would need to switch to using it in MW as a mechanism for waiting on replicas to catch up. Moreover, we would need to introduce the notion of multiple DCs and "distant replicas" so that MW code can decide/check whether it needs to wait for all replicas to catch up or only for the ones in the local DC. This would then be used in background update jobs, such as refreshLinks, so that they finish execution only once all of the replicas have been updated. This allows us to ensure minimal differences for reads between the two DCs.
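
As a rough illustration of what such a wait could look like at the SQL level (a sketch only; the heartbeat.heartbeat table layout, the server_id value and the timestamps are assumptions, not MediaWiki's actual implementation):

```sql
-- Sketch: wait for a replica to catch up using pt-heartbeat timestamps.
-- Assumes a standard pt-heartbeat table (heartbeat.heartbeat with ts and
-- server_id columns); server_id and timestamps are made-up values.

-- 1) On the master, note the (database) time at which the write committed:
SELECT UTC_TIMESTAMP(6) AS write_time;

-- 2) On a candidate replica (local or in the remote DC), poll until the
--    heartbeat row replicated from that master has passed the write time.
--    Heartbeat updates travel through the same binlog stream, so a replica
--    that has applied a later heartbeat has also applied the earlier write:
SELECT ts >= '2019-04-23T21:16:05.000000' AS caught_up
FROM heartbeat.heartbeat
WHERE server_id = 180359242;
-- MW would retry with a short sleep until caught_up = 1 or a timeout expires;
-- "distant replicas" in the remote DC could get a different timeout policy.
```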

Open Questions

  1. Would it be acceptable for Chronology Protector to have a longer wait for content in cases where GETs for the current user are served from the secondary DC?
  2. How to address the failure scenarios, such as split brain or network connectivity issues between the two DCs?

Event Timeline

mobrovac created this task.

@aaron @jcrespo @Marostegui the task desc is a draft, please feel free to update it appropriately.

Krinkle subscribed.

(Collapsed a T-ref in the task description because it overflowed with a scrollbar; the hovercard does handle multi-line, however.)

For those not familiar with the Chronology Protector, see https://www.mediawiki.org/wiki/Manual:MediaWiki_architecture#Database_and_text_storage:

MediaWiki's "chronology protector" ensures that replication lag never causes a user to see a page that claims an action they've just performed hasn't happened yet. This is done by storing the master's position in the user's session if a request they made resulted in a write query. The next time the user makes a read request, the load balancer read this position from the session, and tries to select a slave that has caught up to that replication position to serve the request. If none is available, it will wait until one is. It may appear to other users as though the action hasn't happened yet, but the chronology remains consistent for each user.

Moving the old task description here into a comment for posterity:

We use both GTID and pt-heartbeat, which is problematic if we want to do cross-DC DB requests (since both mechanisms require actions from the master DB process). For more context cf. T172497 and in particular T172497#4905268 .

Jaime's summary: I am not saying GTID should be removed, but we should have a discussion to understand it and either fix it or consider a replacement. More details about issues:

  • GTID has many practical problems
    • MySQL and MariaDB have different GTID implementations, which is a big deal because it means we cannot transition from one to the other. While this is not a blocker for cross-DC itself, it should be taken into account if alternatives are considered
    • GTID is causing operational issues: in theory it allows easy handling of databases, but in reality it is a mess that we actually stopped using operationally because of how many bugs it has
    • GTID is causing MediaWiki issues: every time a master failover happens (the very thing GTID promises to handle nicely), jobs fail because they are unable to locate the GTID position, which varies from master to master. This causes complaints from other developers and production errors.
    • The GTID string keeps accumulating coordinates in a useless way, while only one master (one id) is actually writing at a time (see the sketch after this list)
  • The Chronology Protector doesn't work well on a master switch (probably due to the above issues). A master change (be it within a DC or cross-DC) should be taken into account in the logic. Ideally, a replication checker would also avoid querying the master
  • Replication control has issues cross-DC: spikes of lag are very likely between datacenters, due to the added latency, the limited bandwidth or the extra unreliability. This was not an issue until now because the passive database could just fall behind. But what will happen when we have reads in the non-primary DC? Should we stop writes until the secondary datacenter catches up? Should we consider the secondary DC "down"? Should we be permissive about delayed reads? Should we switch datacenters? All options have problems: stopping writes (and reads) may impact not only the throughput of the replicated DC but the primary DC too (many batched actions happen every week that would become 10x slower or worse). Can the secondary DC (or some queries) accept more delay? How should MediaWiki behave in case of a split brain (network lost to the secondary DC): should it wait forever, or is a timeout OK, and could that impact the number of open connections to the primary master?
  • pt-heartbeat has been working nicely for what we need (it may need changes, T172497, but that is on the DBAs); however, it cannot at the moment be a proper substitute for GTID because it is not "continuous": it just generates events every second (currently). Should the frequency be changed, or should a new event be generated within the transaction each time the chronology position needs to advance, to make it a viable alternative? Is pt-heartbeat used by MediaWiki in a transactionally consistent way? (As far as I know, it isn't; it tries to use the real-world clock rather than "database time", which is certainly leading to problems.)
  • What are the proper logical, architectural and operational solutions to the above issues?
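
To make the "growing GTID string" point concrete, here is a sketch using MariaDB-style GTIDs (domain_id-server_id-sequence) and the coordinates quoted later in this task; the behaviour described in the comments is an interpretation of the points above, not an authoritative description of the production setup:

```sql
-- The binlog position keeps one entry per replication domain, so entries
-- from former masters linger after each failover:
SELECT @@gtid_binlog_pos;
-- -> 171970645-171970645-288070551,171978778-171978778-3298185533,
--    171978924-171978924-255115225,180359242-180359242-170963125

-- Only one component belongs to the currently writing master, yet a naive
-- wait blocks on the whole accumulated set:
SELECT MASTER_GTID_WAIT('171970645-171970645-288070551,171978778-171978778-3298185533,171978924-171978924-255115225,180359242-180359242-170963125', 1);
```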
mobrovac renamed this task from FY18/19 TEC1.6 Q4: Improve or replace GTID + pt-heartbeat logic for cross-DC to FY18/19 TEC1.6 Q4: Replace the usage of GTID_WAIT with pt-heartbeat in MW. (Apr 23 2019, 9:16 PM)
mobrovac updated the task description. (Show Details)

Thanks @mobrovac for the summary!
I would like to also mention that point #2 is not only about failures; it is also important for general maintenance (i.e. when an intermediate master has to go down for HW maintenance). As discussed in the meeting, a way to depool a DC (or at least the secondary one, the one that is not writable) needs to be provided. But that is implementation, which is yet to be discussed :-)

Change 517830 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: fix active GTID filtering in DatabaseMysqlBase

https://gerrit.wikimedia.org/r/517830

aaron renamed this task from FY18/19 TEC1.6 Q4: Replace the usage of GTID_WAIT with pt-heartbeat in MW to FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW. (Jun 19 2019, 4:54 PM)
daniel subscribed.

Patch was merged, removing the patch for review tag.

On the patch, @jcrespo said:

I believe this is not working as intended, not because code- I can see now SELECT MASTER_GTID_WAIT('171978924-171978924-255115225', 1) on the logs, but errors out as:
"Timed out waiting for replication to reach 171970645-171970645-288070551,171978778-171978778-3298185533,171978924-171978924-255115225,180359242-180359242-170963125"

1.34.0-wmf.25

It may need some shaking on infra, maybe, to make it work? Otherwise I don't understand where those extra coords come from.

@Marostegui replied:

I believe this is not working as intended, not because code- I can
see now SELECT MASTER_GTID_WAIT('171978924-171978924-255115225', 1)
on the logs, but errors out as:
"Timed out waiting for replication to reach 171970645-171970645-288070551,171978778-171978778-3298185533,171978924-171978924-255115225,180359242-180359242-170963125"

1.34.0-wmf.25

It may need some shaking on infra, maybe, to make it work?
Otherwise I don't understand where those extra coords come from.

https://phabricator.wikimedia.org/T224422#5558330

Not sure what CPT can do here. Tagging for triage.

Note that the error message includes the raw, unfiltered GTID set. I can probably just change the SPI logging to {wait_pos} instead of {raw_pos}.
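
For context, here is how those two placeholders map onto the values from the quoted error (an interpretation of the comment above, not the exact patch):

```sql
--   {raw_pos}  = 171970645-171970645-288070551,171978778-171978778-3298185533,
--                171978924-171978924-255115225,180359242-180359242-170963125
--   {wait_pos} = 171978924-171978924-255115225
-- i.e. the replica only actually waited on the filtered position:
SELECT MASTER_GTID_WAIT('171978924-171978924-255115225', 1);
```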

Change 618425 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: make DatabaseMysqlBase::masterPosWait() error messages prefer filtered GTID list

https://gerrit.wikimedia.org/r/618425

@Marostegui FYI - I'm landing the above assuming you'd like that change as well, but let me know if not :)

Change 618425 merged by jenkins-bot:
[mediawiki/core@master] rdbms: make DatabaseMysqlBase::masterPosWait() error messages prefer filtered GTID list

https://gerrit.wikimedia.org/r/618425

@Marostegui FYI - I'm landing the above assuming you'd like that change as well, but let me know if not :)

I do! :)

Change 641574 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: cleanup erroneous and duplicated logging in LoadBalancer::waitForMasterPos()

https://gerrit.wikimedia.org/r/641574

Change 641574 merged by jenkins-bot:
[mediawiki/core@master] rdbms: cleanup erroneous and duplicated logging in LoadBalancer::waitForMasterPos()

https://gerrit.wikimedia.org/r/641574

I don't think it would be worth using pt-heartbeat for LoadBalancer::waitFor() unless the precision was much higher (likely problematically high in terms of spammy heartbeat table updates).
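
A back-of-the-envelope sketch of that trade-off (the interval, precision target and values are illustrative assumptions):

```sql
-- With the default --interval=1, pt-heartbeat issues one UPDATE per second
-- per master, so a heartbeat-based wait can overshoot by up to ~1 second.
-- Reaching ~10 ms precision would need ~100 UPDATEs per second per master
-- section, which is the "spammy" write load referred to above.
-- The statement pt-heartbeat effectively repeats (values illustrative):
UPDATE heartbeat.heartbeat
SET ts = '2019-04-23T21:16:05.001000'
WHERE server_id = 180359242;
```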

Closing this given all the tagged cleanup patches that were merged (note that a lot of errors were just mistaken logging).