Page MenuHomePhabricator

FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW
Open, HighPublic0 Story Points

Description

Problem

In order to be able to serve read requests from both DCs, on the MediaWiki side of things we need to ensure the replication lag is minimised between the two DCs. In this context, using GTID has proven problematic, so the current thinking is to deprecate it in favour of pt-heartbeat.

Proposal

Currently, pt-heartbeat is used only for monitoring the lag. We would need to switch to using it MW as a mechanism for waiting on replicas to catch up. Moreover, we would need to introduce the notion of multiple DCs and "distant replicas" so that MW code can decide/check whether it needs to wait on all replicas to catch up or only the ones in the local DC. This would then be used in background update jobs, such as refreshlinks, to finish the execution only once all of the replicas have been updated. This allows us to ensure minimal differences for reads between the two DCs.

Open Questions

  1. Would it be acceptable for Chronology Protector to have a longer wait for content in cases where GETs for the current user are served from the secondary DC?
  2. How to address the failure scenarios, such as split brain or network connectivity issues between the two DCs?

Event Timeline

mobrovac triaged this task as High priority.Apr 16 2019, 11:49 PM
mobrovac created this task.
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptApr 16 2019, 11:49 PM

@aaron @jcrespo @Marostegui the task desc is a draft, please feel free to update it appropriately.

Krinkle updated the task description. (Show Details)EditedApr 16 2019, 11:55 PM
Krinkle added a subscriber: Krinkle.

(collapsed a T-ref in task description as it overflew with scroll, the hovercard does multi-line, however)

jcrespo updated the task description. (Show Details)Apr 17 2019, 6:59 AM
Marostegui moved this task from Triage to Meta/Epic on the DBA board.Apr 18 2019, 6:32 PM

For those not familiar with the chronology's protector https://www.mediawiki.org/wiki/Manual:MediaWiki_architecture#Database_and_text_storage:

MediaWiki's "chronology protector" ensures that replication lag never causes a user to see a page that claims an action they've just performed hasn't happened yet. This is done by storing the master's position in the user's session if a request they made resulted in a write query. The next time the user makes a read request, the load balancer read this position from the session, and tries to select a slave that has caught up to that replication position to serve the request. If none is available, it will wait until one is. It may appear to other users as though the action hasn't happened yet, but the chronology remains consistent for each user.
Gilles moved this task from Inbox to Radar on the Performance-Team board.Apr 23 2019, 7:51 PM
Gilles edited projects, added Performance-Team (Radar); removed Performance-Team.
mobrovac updated the task description. (Show Details)Apr 23 2019, 8:37 PM

Moving the old task description here into a comment for posterity:

We use both GTID and pt-heartbeat, which is problematic if we want to do cross-DC DB requests (since both mechanisms require actions from the master DB process). For more context cf. T172497 and in particular T172497#4905268 .

Jaime's summary: I am not saying GTID should be removed, but we should have a discussion to understand it and either fix it or consider a replacement. More details about issues:

  • GTID has many practical problems
    • MySQL and MariaDB had different implementations, which is a big deal because it means we cannot transition from one to the other. While this is not a blocker for Cross-dc itself, if alternatives are searched, this should be taken into account
    • GTID is causing operational issues- in theory it allows easy handling of databases, but in reality it is a mess we actually stopped using operationally because how many bugs it has
    • GTID is causing mediawiki issues- every time a master failover happens (what GTID promises it does nicely), jobs fail unable to locate gtid position because it varies from master to master. This causues complains by other developers and production errors.
    • GTID string keeps growing and growing coordinates, in a useless way, while only a master (an id) is actually writing at a time
  • Chronology protector doesn't work well on master switch. A master change (be it within dc or cross-dc) should be taken into account in the logic, (probably due to the above issues). Ideally, also, a replication checker would avoid querying the master
  • Replication control has issues cross-dc- spikes of lag are very likely between datacenters, either due to the added latency, the limited bandwidth or the extra unreliability. This was not an issue until now because the passive database could just get delayed. But what will it happen when we have reads on the non-primary dc? Should we stop writes until the secondary datacenter catches up? Should we consider the secondary dc as "down". Should we be permissive against delayed reads. Should we switch datacenter? All options have problems- stopping writes (and reads), may impact not only the throughput of the replicated dc, but the primary dc too (many batched actions happens every week that would be 10x or slower. Can the secondary dc (or some queries) accept more delay? How should mediawiki behave in case of a split brain (network lost to secondary dc), should it wait forever, or is it a timeout ok- could that impact the number of open connections to the primary master?
  • pt-heartbeat has been working nicely for what we need (it may need changes T172497, but that is on the DBAs), however it cannot be at the moment a proper substitute for GTID because it is not "continuous", it just generate events every second (currently), so it is not a proper substitute. Should frequency be changed or a new event be generated within the transaction each time chronology wants to be changed to be a viable alternative? Is pt-hearbeat being used transactionally consistent by Mediawiki (as far as I know, it isn't, it tries to use real-world clock rather than "database time", which is certainly leading to problems).
  • What are proper logical, architecture and operational solutions to the above issues?
mobrovac renamed this task from FY18/19 TEC1.6 Q4: Improve or replace GTID + pt-heartbeat logic for cross-DC to FY18/19 TEC1.6 Q4: Replace the usage of GTID_WAIT with pt-heartbeat in MW.Apr 23 2019, 9:16 PM
mobrovac updated the task description. (Show Details)

Thanks @mobrovac for the summary!
I would like to also mention that point #2 is not only for failures, is also important for general maintenance (ie: the intermediate master has to go down for HW maintenance) - as spoken on the meeting a way to depool a DC (or at least, the secondary, the one not writable) needs to be provided. But that is implementation, which is yet to be discussed :-)

Change 517830 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: fix active GTID filtering in DatabaseMysqlBase

https://gerrit.wikimedia.org/r/517830

aaron renamed this task from FY18/19 TEC1.6 Q4: Replace the usage of GTID_WAIT with pt-heartbeat in MW to FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW.Jun 19 2019, 4:54 PM
daniel added a subscriber: daniel.

Patch was merged, removing the patch for review tag.

Pn the patch, @jcrespo said:

I believe this is not working as intended, not because code- I can see now SELECT MASTER_GTID_WAIT('171978924-171978924-255115225', 1) on the logs, but errors out as:
"Timed out waiting for replication to reach 171970645-171970645-288070551,171978778-171978778-3298185533,171978924-171978924-255115225,180359242-180359242-170963125"
1.34.0-wmf.25
It may need some shaking on infra, maybe, to make it work? Otherwise I don't understand where those extra coords come from.

@Marostegui replied:

I believe this is not working as intended, not because code- I can
see now SELECT MASTER_GTID_WAIT('171978924-171978924-255115225', 1)
on the logs, but errors out as:
"Timed out waiting for replication to reach 171970645-171970645-288070551,171978778-171978778-3298185533,171978924-171978924-255115225,180359242-180359242-170963125"
1.34.0-wmf.25
It may need some shaking on infra, maybe, to make it work?
Otherwise I don't understand where those extra coords come from.

https://phabricator.wikimedia.org/T224422#5558330

Not sure what CPT can do here. Tagging for triage.