Fix mediawiki heartbeat model, change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the real master without puppet dependency
Open, Stalled, Medium, Public

Assigned To

None

Authored By

	jcrespo
	Aug 4 2017, 9:07 AM

Description

pt-heartbeat uses super-user to write to the database, even if it is in read-only mode. This works great to maintain the symmetry between eqiad and codfw (where both send heartbeat events everywhere else). After being like that for over a year, this may not be the best model: it is good because a SPOF is unlikely, it allows for dc <-> dc link checking, and it makes failovers easy, but super-user mode has issues (not respecting read-only, having to use a root account), and it makes database master failovers more complex because of its dependency on puppet, while masters are controlled in mediawiki.

One proposed model change is to take the pt-heartbeat client outside of the master, duplicating it to avoid SPOFs, making it run from 2 separate places pointing to the deployed mediawiki master (controlled, for example, with https://noc.wikimedia.org/db.php?format=json) and writing only on the real master, which will be switched automatically no matter the puppet or mediawiki state. The above config will be changed to etcd when it is ready. The db will contain the same structure (maybe dc and other fields are no longer needed?), only the method to write it would change.

This is something that MediaWiki-Platform-Team and Performance-Team should be aware of, but probably no action is needed from them, as it should be transparent with the current application lag checking.

So this started with an infrastructure-focused problem, but the more I think about this, the more I am thinking of increasing the scope of the solution.

GTID has become difficult to work with, and is not the once-and-for-all solution we thought it might be.

One proposal would be to drop its support for replication checking, and use a heartbeat-like solution (some people call it pseudo-gtid), and integrate it into mediawiki code, so it is no longer a wmf-specific setup.

The change would not be without problems (it would require a polling model, which has some disadvantages), but polling the database is by itself already a problem, as seen on T180918.

The fundamentals of the solution would be:

Migrate pt-heartbeat to a mediawiki script so it is on the application layer (we can keep it at the infrastructure layer for non-mediawiki services). E.g. a maintenance script run from a maintenance server, which automatically reads the etcd configuration and switches to the configured master based on mediawiki config, and not, like now, based on puppet config
Migrate chronology protector, lag checking and other replication-based checks to heartbeat-based
Increase the heartbeat frequency (0.1 s between updates? 0.5?)
Coordinate a way to check (poll) heartbeat to avoid cache issues (large transactions, overload, cache stampede) and fail correctly on network and hardware issues. This will solve most of T180918
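
The first point above (a writer that follows the configured master instead of a puppet-defined one) could be sketched roughly as follows. This is only an illustration under assumptions: `resolve_master` and `write_heartbeat` are hypothetical callables standing in for "read the current master from etcd or the mediawiki config" and "write one heartbeat row to that host"; neither is an existing pt-heartbeat or MediaWiki API, and the 0.5 s default is just one of the intervals floated above.

```python
import time
from typing import Callable


def run_heartbeat(
    resolve_master: Callable[[], str],            # hypothetical: returns the current master host
    write_heartbeat: Callable[[str, float], None],  # hypothetical: writes one heartbeat row
    interval_s: float = 0.5,
    ticks: int = 3,
    now: Callable[[], float] = time.time,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Write `ticks` heartbeats, re-resolving the master before every write.

    Returns the number of heartbeats that were written successfully.
    """
    written = 0
    for _ in range(ticks):
        try:
            # Look the master up on every tick so that a switchover in the
            # configuration is followed without restarting the writer.
            master = resolve_master()
            write_heartbeat(master, now())
            written += 1
        except Exception:
            # A failed lookup or write is skipped; the next tick retries
            # against whatever master is configured by then.
            pass
        sleep(interval_s)
    return written
```

Running two such writers from separate hosts would cover the redundancy concern, at the cost of having to decide how readers treat two heartbeat sources.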

Related Objects

Status	Assigned	Task
Resolved	Reedy	T206777 Create Wikipedia Shan
Resolved	Reedy	T205710 Create Wikinews Limburgish
Resolved	Reedy	T205546 Create Wiktionary Cantonese
Resolved	Ladsgroup	T209820 Add Wikidata support to new wikis
Resolved	Ladsgroup	T211530 Cannot add yue.wt sitelinks onto Wikidata items
Resolved	Ladsgroup	T214400 Add yue.wikt to Cognate
Open	None	T214402 populateCognatePages.php query keeps timing out while waiting for replication
Stalled	None	T172497 Fix mediawiki heartbeat model, change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the real master without puppet dependency

Event Timeline

jcrespo created this task.Aug 4 2017, 9:07 AM

Restricted Application added a subscriber: Aklapper. Aug 4 2017, 9:07 AM

greg added a project: Wikimedia-Incident.Aug 4 2017, 9:18 PM

greg moved this task from Active investigation to Follow-up prevention on the Wikimedia-Incident board.

Marostegui moved this task from Triage to Backlog on the DBA board.Aug 7 2017, 4:56 AM

So this started with an infrastructure-focused problem, but the more I think about this, the more I am thinking of increasing the scope of the solution.

GTID has become difficult to work with, and is not the once-and-for-all solution we thought it might be.

The change would not be without problems (it would require a polling model, which has some disadvantages), but polling the database is by itself already a problem, as seen on T180918.

The fundamentals of the solution would be:

Migrate pt-heartbeat to a mediawiki script so it is on the application layer (we can keep it at the infrastructure layer for non-mediawiki services). E.g. a maintenance script run from a maintenance server, which automatically reads the etcd configuration and switches to the configured master based on mediawiki config, and not, like now, based on puppet config
Migrate chronology protector, lag checking and other replication-based checks to heartbeat-based
Increase the heartbeat frequency (0.1 s between updates? 0.5?)
Coordinate a way to check (poll) heartbeat to avoid cache issues (large transactions, overload, cache stampede) and fail correctly on network and hardware issues. This will solve most of T180918

This is a very preliminary proposal that needs lots of discussion. CC @aaron

• Imarlier edited projects, added Performance-Team (Radar); removed Performance-Team.Jun 25 2018, 7:53 PM

tstarling moved this task from Inbox to Watching on the MediaWiki-Platform-Team-Archived board.Jul 3 2018, 1:26 AM

CCicalese_WMF edited projects, added Core-Platform-Team-Old; removed MediaWiki-Platform-Team-Archived.Jul 11 2018, 11:09 PM

CCicalese_WMF moved this task from Inbox to Watching on the Core-Platform-Team-Old board.Jul 11 2018, 11:28 PM

jcrespo mentioned this in T193226: Test MySQL 8.0 with production data and evaluate its fit for WMF databases.Jul 13 2018, 8:39 AM

Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.Jul 18 2018, 12:15 AM

Does it have to use the same table definition? To measure lag, MediaWiki uses

SELECT ts FROM heartbeat.heartbeat WHERE shard='s1' AND datacenter='eqiad' ORDER BY ts DESC LIMIT 1

which is not even indexed. Is there a reason for keeping the positions of previous masters in the heartbeat table? Should we just have a table with a single row? I see that we have pt-heartbeat running on the local masters in codfw, is there a reason for this? MediaWiki just reads the row associated with the ultimate master, does anything need to know the local lag?

Can the relay_master_log_file and exec_master_log_pos fields be removed?

I think shard should be renamed to section if it is really still needed, for consistency with terminology in the MW core.

With interval=1 the lag measured by this method will oscillate between 0 and 1, since we're not using the client system clock to round to zero the way pt-heartbeat --monitor does. Could we write the interval to the heartbeat table so that MediaWiki can correctly round down the lag?
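
One possible reading of "round down the lag" is to truncate the measured value to a whole number of heartbeat intervals, so that anything younger than one interval counts as zero. The sketch below assumes that reading and assumes both values are integer microseconds (matching the proposed interval column); it does not reproduce what pt-heartbeat --monitor actually does internally.

```python
def rounded_lag_us(raw_lag_us: int, interval_us: int) -> int:
    """Round a measured lag down to a whole number of heartbeat intervals.

    Both arguments are integer microseconds (an assumption of this sketch).
    """
    if interval_us <= 0:
        raise ValueError("interval_us must be positive")
    # Clock skew can produce a negative raw value; treat that as no lag.
    raw = max(0, raw_lag_us)
    return (raw // interval_us) * interval_us
```

With a 1 s interval, a raw reading of 0.7 s becomes 0 and a reading of 1.2 s becomes 1 s, so a healthy replica no longer appears to oscillate between 0 and 1.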

In summary, could we have something like:

CREATE TABLE heartbeat (
  row_id int unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ts varbinary(26) NOT NULL,
  file varbinary(255),
  position bigint,
  `interval` int unsigned -- integer microseconds; backticks because INTERVAL is a reserved word
);

could we have something like

Yes, all those changes are possible, and in a way, they were already part of the proposal. I pushed for some of those, but I got disagreements from some people, so I stopped pushing for those. I would however not rename some of the fields because pt-heartbeat is mostly a standardized tool. If we make ourselves incompatible, I would reimplement it as part of mediawiki itself (even if reimplementing just means adding the perl codebase to the mediawiki core / mediawiki maintenance repositories and tracking it there), so that the non-trivial patches you propose are tracked together with the codebase that uses it.

Note that:

SELECT ts FROM heartbeat.heartbeat WHERE shard='s1' AND datacenter='eqiad' ORDER BY ts DESC LIMIT 1

Is actually wrong, and the right way to calculate lag is to do something like:
TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) because queries work in their own frozen time (when a transaction starts, time is frozen), which led to some custom fixes to mediawiki tracking (using cache) that I am not sure should be there.
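
For illustration, the arithmetic behind that expression can be sketched outside the database. This assumes the ts column holds an ISO-like UTC string with microseconds and no offset (for example "2017-08-04T09:07:00.000510", which fits the 26-byte column), and it compares against the caller's clock rather than the server's, so it is only a stand-in for running the expression on the replica itself. `lag_microseconds` is a hypothetical helper, not an existing MediaWiki function.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def lag_microseconds(heartbeat_ts: str, now: Optional[datetime] = None) -> int:
    """Microseconds elapsed between a heartbeat timestamp and `now` (both UTC)."""
    # Assumed format: UTC string without offset, e.g. "2017-08-04T09:07:00.000510".
    written = datetime.fromisoformat(heartbeat_ts).replace(tzinfo=timezone.utc)
    if now is None:
        # Take the current time at the moment of the check, not a cached value.
        now = datetime.now(timezone.utc)
    # Integer division of two timedeltas yields a whole number of microseconds.
    return (now - written) // timedelta(microseconds=1)
```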

I think section can be removed now that we will have only one instance per section. We used to have multi-source hosts, which required differentiating between sections (several sections sharing the same table). The only ones left are labsdbs, which are not part of mediawiki and which maybe we could handle with triggers or something at the cloud infrastructure level (unrelated to mediawiki).

I am not sure the PK is the right way (replace is used, which would increase the row_id), but of course we can index it in any way we want. Having the past masters helped us debug in case of an emergency failover with the master down, but it is not strictly needed at all and can be dropped.

No disagreement on the rest of the proposals; those were things I either already wanted to do, or were bad decisions taken at the time because we didn't know how this was going to be used when first implemented.

does anything need to know the local lag

The idea (at the time) was to make it possible to detect datacenter split brains, but if it is not used, of course it can be removed. We will need some way to ensure high availability for the process, because if it fails, wikis will be read only. The clearest example: imagine we run it from a maintenance host so it can switch automatically to the right master. The maintenance host is rebooted frequently, so we need some redundancy on multiple hosts. Again, nothing worrying, we just need to get the architecture right. It would be nice to talk on IRC to get the full details right.

My proposal: let's talk further. We (and I mainly include myself) did a bad job when this was first implemented, and I now want a lot of consensus to get it right this time. However, I think the underlying idea was OK (just the implementation wasn't great, especially because we didn't know the full use case at the time).

jcrespo renamed this task from Change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the real master without puppet dependency to Fix mediawiki heartbeat model, change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the real master without puppet dependency.Aug 22 2018, 7:14 AM

jcrespo updated the task description.

CCicalese_WMF edited projects, added Platform Team Legacy (Watching / External); removed Core-Platform-Team-Old.Oct 1 2018, 4:51 PM

Ladsgroup subscribed.Dec 23 2018, 12:40 PM

A ping here to remind that:

pt-heartbeat (or the current way we use it) has architecture and functional issues (e.g. lack of HA, our "cached value used" model and the things Tim mentioned above)
GTID and chronology checker have architecture, administration and functional issues (in this case, MariaDB's support of it is mostly to blame, although I think MySQL has the same issues; GTID_WAIT issues, plus we need proper multi-dc support, which GTID was supposed to give automatically, but it has many shortcomings, nor did it deliver on the promise of easy topology changes).

Nikerabbit mentioned this in T203059: Fourth manual run of unpublished draft purge script.Jan 21 2019, 2:12 PM

Marostegui mentioned this in T214402: populateCognatePages.php query keeps timing out while waiting for replication.Jan 22 2019, 4:46 PM

From a perspective of layered architecture and separation of concern, I'm not sure I like the idea of a MediaWiki script. But some script that reads from etcd and does the updates to the table seems reasonable.

Lag functions in MW are basically split into two categories: (a) lag estimates and (b) synchronization barriers. I get the issues with pt-heartbeat, which, in turn, affect the lag estimates that use it. For sync barriers, how are they supposed to work using heartbeats (given their limited precision vs GTIDs)? Also, ignoring that, I'm not sure what is gained over GTIDs+MASTER_GTID_WAIT for barriers (which do not use heartbeats now). Maybe we can split barriers into three cases:

b1: barriers in maintenance scripts that grab the master position and GTID_WAIT with the replicas just to make sure they are not too behind
b2: barriers in maintenance scripts or jobs that want to move I/O to replicas but want certain changes to be seen (RefreshLinks,CategoryMembershipChangeJob)
b3: chronology protector, which stores positions for a minute or so, checking against replicas used for site access by a user after updates were caused by that user recently

I think b1 could use a heartbeat loop easily enough. The b2/b3 cases would have some risk of either undershooting the MIN timestamp needed to stop the wait loop (hurting correctness) or overshooting by rounding up the timestamp (causing delay). The tighter the heartbeat interval, the less of a concern this is. It's probably doable, but I'd want to know the advantages first. Is there a task describing the mediawiki-affecting GTID problems (not heartbeat ones)?
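
A heartbeat-based barrier of the kind discussed here could look roughly like the sketch below. It is an illustration under assumptions, not the existing MediaWiki wait logic: `read_replica_heartbeat` is a hypothetical callable returning the newest heartbeat timestamp visible on the replica (as seconds), and `target_ts` is the master-side time of the write that must be visible. The comparison is strict, which errs on the overshooting side: the loop only stops once a heartbeat written after the target time has replicated.

```python
import time
from typing import Callable


def wait_for_heartbeat(
    read_replica_heartbeat: Callable[[], float],  # hypothetical: newest heartbeat ts on the replica
    target_ts: float,
    timeout_s: float = 10.0,
    poll_s: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the replica shows a heartbeat newer than `target_ts`.

    Returns True if the replica caught up, False if the timeout expired first.
    """
    deadline = clock() + timeout_s
    while True:
        # Strictly greater: a heartbeat with exactly the target timestamp may
        # have been written before the change we are waiting for.
        if read_replica_heartbeat() > target_ts:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

The extra wait is bounded by one heartbeat interval plus replication delay, which is why a tighter interval makes the overshoot less of a concern.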

Also, see the comment in getHeartbeatData() about UTC_TIMESTAMP. A DIFF function would be better as long as there are no support issues with that given our deployed maria versions. For third parties, we probably don't have to care. The required mysql for MW is now 5.5.8 and the precision field was added in 5.6, unfortunately. It would be easy to patch.

I'm not sure what is gained over GTIDs+MASTER_GTID_WAIT

Given that GTID is broken and does not work, I think we need at least *something that works* :-) I am open for alternatives!

I don't necessarily want to push for heartbeat, but we have to either patch around the (fundamentally flawed) broken mariadb GTID implementation, which reveals itself every time there is a datacenter or master switch, or search for an alternative. GTID is great on paper but it doesn't work in practice as a practical replacement of binlog coordinates. See:

https://phabricator.wikimedia.org/P8014
https://phabricator.wikimedia.org/P8021

One possible way would be to assume every host has its gid and domain id equal to that of the ip (which is on config) and only check that.

There are also some issues with mediawiki, which you are trying to make work as READ COMMITTED for those barriers, while the current database model is REPEATABLE READ mode, which is not necessarily bad, but cannot be done without a different arch.

I will create a separate ticket, I think I am confusing many people with 3 separate issues:

The DBA architecture problem (this is our problem to solve, and we just ask for feedback) - mostly with heartbeat
The mediawiki architecture problem - mostly related to GTID+cross dc
The actual solution/implementation for the previous problem, which needs coding but may also require DBA rearchitecting

Let's talk some time soon in person to sync up. :-)

@Addshore let us know today that there is a "new" error that started happening today which looks related to this thread (I think):
https://logstash.wikimedia.org/goto/018d06f1ac178c272964fa71b76702e1

Addshore added a parent task: T214402: populateCognatePages.php query keeps timing out while waiting for replication.Feb 11 2019, 4:35 PM

Liuxinyu970226 subscribed.Feb 15 2019, 1:45 PM

• mobrovac added projects: Platform Engineering (Multi-DC (TEC1)), Services (watching), User-mobrovac.Apr 15 2019, 8:19 PM

• mobrovac added a parent task: T88445: MediaWiki active/active datacenter investigation and work (tracking).

Krinkle unsubscribed.Apr 15 2019, 9:11 PM

@mobrovac I think this task is confusing, we should use separate tasks for T172497#4905268 instead, and separate the non-cross dc work away.

• mobrovac mentioned this in T221159: FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW.Apr 16 2019, 11:49 PM

• mobrovac removed a parent task: T88445: MediaWiki active/active datacenter investigation and work (tracking).

In T172497#5113999, @jcrespo wrote:

@mobrovac I think this task is confusing, we should use separate tasks for T172497#4905268 instead, and separate the non-cross dc work away.

Good point. I created T221159: FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW for that purpose. Please feel free to edit the task and chime in there.

Krinkle moved this task from Untriaged to Rdbms library on the MediaWiki-libs-Rdbms board.Jul 11 2019, 2:50 PM

Krinkle triaged this task as High priority.Jul 23 2019, 5:30 PM

Krinkle edited projects, added Sustainability (Incident Followup); removed Wikimedia-Incident.Apr 28 2020, 9:50 PM

TK-999 subscribed.Jul 10 2020, 9:49 AM

LSobanski lowered the priority of this task from High to Medium.Apr 26 2021, 1:36 PM

Joe added a project: SRE-Sprint-Week-Sustainability-March2023.Mar 20 2023, 11:35 AM

This task seems to have seen no activity in years, and from past comments it seems that most of the stuff in here has either been spun off to separate tasks or done. @jcrespo do you think this should stay open?

For database development questions, I think it is for @Ladsgroup to decide (I think he asked precisely for a list of open issues). Personally, I think it is no longer a super huge priority but it is still valid- whether it should be kept open to capture the issues or closed, it is for the DBAs to decide.

Ok, I'll set it to stalled and wait for @Ladsgroup to give feedback when available.

Thanks for asking my opinion. I read this. pt-heartbeat has a lot of limitations, and the task description doesn't cover many of them. I personally have two more concerns: T327852: MediaWiki replication lags are inflated with average of half a second, and time drift between replica and primary (do we use NTPs? hosts can't talk to outside so that's not very likely). Part of it is due to the nature of distributed systems and there is no way around it, just trade-offs.

So my suggestion is to have a more concrete problem in mind and try to solve or mitigate that instead. Or at least split it to multiple more concrete tickets. For example "Make pt-heartbeat work out of the box in switchovers". Or "pt-heartbeat shouldn't use root" (Isn't that fixed?) and possibly a dedicated ticket for caching of replag in mw. That is a complicated problem I ran into multiple times, the caching of replag actually happens in LoadMonitor so T314020: LoadMonitor connection weighting reimagined might address it to some degree.

Regardless, I advise against writing something from scratch in mw to replace pt-heartbeat; there is probably something written outside that we can reuse.

Addshore unsubscribed.Apr 1 2023, 3:41 PM

Krinkle removed projects: Services (watching), Platform Team Legacy (Watching / External), Performance-Team (Radar).Aug 7 2023, 1:46 AM

Krinkle removed a subscriber: • mobrovac.
