
Fix mediawiki heartbeat model, change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the real master without puppet dependency
Open, Stalled, Medium, Public

Description

pt-heartbeat uses a super-user to write to the database, even if it is in read-only mode. This works great to maintain the symmetry between eqiad and codfw (where both send heartbeat events everywhere else). After being like that for over a year, this may not be the best model: it is good because a SPOF is unlikely, it allows dc <-> dc link checking, and it makes failovers easy, but super-user mode has issues (not respecting read-only), requires a root account, and makes database master failovers more complex because of its dependency on puppet, while masters are controlled from MediaWiki.

One proposed model change is to take the pt-heartbeat client outside of the master, duplicating it to avoid SPOFs and making it run from 2 separate places that point to the deployed MediaWiki master (controlled, for example, with https://noc.wikimedia.org/db.php?format=json), writing only to the real master, which will be switched automatically regardless of the puppet or MediaWiki state. The above config will be changed to etcd when it is ready. The db will keep the same structure (maybe dc and other fields are no longer needed?); only the method of writing to it would change.

This is something that MediaWiki-Platform-Team and Performance-Team should be aware of, but probably no action is needed from them, as it should be transparent with the current application lag checking.

So this started as an infrastructure-focused problem, but the more I think about it, the more I am inclined to increase the scope of the solution.

GTID has become difficult to work with, and is not the once-and-for-all solution we thought it might be.

One proposal would be to drop GTID support for replication checking, use a heartbeat-like solution (some people call it pseudo-GTID), and integrate it into MediaWiki code, so it is no longer a WMF-specific setup.

The change would not be without problems (it would require a polling model, which has some disadvantages), but polling the database is already a problem in itself, as seen in T180918.

The fundamentals of the solution would be:

  • Migrate pt-heartbeat to a MediaWiki script so it lives at the application layer (we can keep it at the infrastructure layer for non-MediaWiki services). E.g. a maintenance script on a maintenance server which automatically reads the etcd configuration and switches to the configured master based on MediaWiki config, and not, like now, based on puppet config (see the sketch after this list)
  • Migrate chronology protector, lag checking and other replication-based checks to be heartbeat-based
  • Increase the heartbeat frequency (0.1 s between updates? 0.5 s?)
  • Coordinate a way to check (poll) heartbeat to avoid cache issues (large transactions, overload, cache stampede) and fail correctly on network and hardware issues. This will solve most of T180918
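
As a minimal sketch of the write side under this model (assuming the row is keyed by shard and datacenter, as in the lag queries later in this task; the production table layout may differ), the relocated process would issue something like the following against whichever host the MediaWiki/etcd config currently names as master:

-- Hypothetical write issued every interval by the relocated heartbeat process,
-- using a normal (non-super) account, so a read-only replica would reject it.
REPLACE INTO heartbeat.heartbeat (shard, datacenter, ts)
VALUES ('s1', 'eqiad', UTC_TIMESTAMP(6));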

Event Timeline

This is a very preliminary proposal that needs lots of discussion. CC @aaron

Does it have to use the same table definition? To measure lag, MediaWiki uses

SELECT ts FROM heartbeat.heartbeat WHERE shard='s1' AND datacenter='eqiad' ORDER BY ts DESC LIMIT 1

which is not even indexed. Is there a reason for keeping the positions of previous masters in the heartbeat table? Should we just have a table with a single row? I see that we have pt-heartbeat running on the local masters in codfw; is there a reason for this? MediaWiki just reads the row associated with the ultimate master; does anything need to know the local lag?

Can the relay_master_log_file and exec_master_log_pos fields be removed?

I think shard should be renamed to section if it is really still needed, for consistency with terminology in the MW core.

With interval=1 the lag measured by this method will oscillate between 0 and 1, since we're not using the client system clock to round to zero the way pt-heartbeat --monitor does. Could we write the interval to the heartbeat table so that MediaWiki can correctly round down the lag?

In summary, could we have something like:

CREATE TABLE heartbeat (
  row_id int unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
  ts varbinary(26) NOT NULL,
  file varbinary(255),
  position bigint,
  `interval` int unsigned -- integer microseconds
);
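
To make the interval question above concrete, a reader could discount the write interval from the raw server-side difference, assuming that is what "rounding down" means here (a sketch against the proposed table, not existing code):

-- Estimate lag while discounting the sampling interval, flooring at zero.
SELECT GREATEST(TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) - `interval`, 0) AS lag_us
FROM heartbeat
ORDER BY ts DESC
LIMIT 1;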

could we have something like

Yes, all those changes are possible, and in a way, they were already part of the proposal. I pushed for some of those, but I got disagreements from some people, so I stopped pushing for them. I would however not rename some of the fields, because pt-heartbeat is mostly a standardized tool. If we make ourselves incompatible, I would reimplement it as part of MediaWiki itself (even if reimplementing just means adding the perl codebase to the mediawiki core / mediawiki maintenance repositories and tracking it there), so that the non-trivial patches you propose are tracked together with the codebase that uses them.

Note that:

SELECT ts FROM heartbeat.heartbeat WHERE shard='s1' AND datacenter='eqiad' ORDER BY ts DESC LIMIT 1

Is actually wrong; the right way to calculate lag is to do something like
TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)), because queries work in their own frozen time (when a transaction starts, time is frozen), which led to some custom fixes to MediaWiki tracking (using cache) that I am not sure should be there.
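
Combining the existing query with that expression, the lag read could look roughly like this (the filtering columns stay as today):

-- Compute the lag on the database server itself, in the same statement that reads ts,
-- instead of comparing ts against the client's clock.
SELECT TIMESTAMPDIFF(MICROSECOND, ts, UTC_TIMESTAMP(6)) / 1e6 AS lag_seconds
FROM heartbeat.heartbeat
WHERE shard = 's1' AND datacenter = 'eqiad'
ORDER BY ts DESC
LIMIT 1;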

I think section can be removed now that we will have only one instance per section. We used to have multi-source hosts, which required differentiating between sections (several sections sharing the same table). The only ones left are the labsdbs, which are not part of MediaWiki, and maybe we could handle those with triggers or something at the cloud infrastructure level (unrelated to MediaWiki).

I am not sure the PK is the right way (REPLACE is used, which would increase the row_id), but of course we can index it in any way we want. Having the past masters helped us debug in case of an emergency failover with the master down, but it is not strictly needed at all and can be dropped.

No disagreement on the rest of the proposals; those were things I either already wanted to do, or were bad decisions taken at the time because we didn't know how this was going to be used when it was first implemented.

does anything need to know the local lag

The idea (at the time) was to make it possible to detect datacenter split-brains, but if it is not used, of course it can be removed. We will need some way to ensure high availability for the process, because if it fails, the wikis will be read-only. The clearest example: imagine we run it from a maintenance host so it can switch automatically to the right master; the maintenance host is rebooted frequently, so we need some redundancy across multiple hosts. Again, nothing worrying, we just need to get the architecture right. It would be nice to talk on IRC to get the full details right.

My proposal: let's talk further. We (and I include mainly me) did a bad job when I first implemented this, and I now want a lot of consensus to get it right this time; however, I think the underlying idea was ok (just the implementation wasn't great, especially because we didn't know the full use case at the time).

jcrespo renamed this task from "Change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the real master without puppet dependency" to "Fix mediawiki heartbeat model, change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the real master without puppet dependency". Aug 22 2018, 7:14 AM
jcrespo updated the task description.

A ping here to remind that:

  • pt-heartbeat (or the current way we use it) has architecture and functional issues (e.g. lack of HA, our "cached value" model, and the things Tim mentioned above)
  • GTID and the chronology checker have architecture, administration and functional issues (in this case, MariaDB's support of it is mostly to blame, although I think MySQL has the same issues; GTID_WAIT issues, plus we need proper multi-dc support, which GTID was supposed to give automatically but has many shortcomings, nor did it deliver on the promise of easy topology changes).

From a perspective of layered architecture and separation of concern, I'm not sure I like the idea of a MediaWiki script. But some script that reads from etcd and does the updates to the table seems reasonable.

Lag functions in MW are basically split into two categories: (a) lag estimates and (b) synchronization barriers. I get the issues with pt-heartbeat, which in turn affect the lag estimates that use it. For sync barriers, how are they supposed to work using heartbeats (given their limited precision vs GTIDs)? Also, ignoring that, I'm not sure what is gained over GTIDs+MASTER_GTID_WAIT for barriers (which do not use heartbeats now). Maybe we can split barriers into a few cases:

  • b1: barriers in maintenance scripts that grab the master position and GTID_WAIT on the replicas just to make sure they are not too far behind
  • b2: barriers in maintenance scripts or jobs that want to move I/O to replicas but want certain changes to be seen (RefreshLinks, CategoryMembershipChangeJob)
  • b3: chronology protector, which stores positions for a minute or so, checking them against replicas used for site access by a user who recently caused updates

I think b1 could use a heartbeat loop easily enough. The b2/b3 cases would have some risk of either undershooting the MIN timestamp needed to stop the wait loop (hurting correctness) or overshooting by rounding up the timestamp (causing delay). The tighter the heartbeat interval, the less of a concern this is. It's probably doable, but I'd want to know the advantages first. Is there a task describing the mediawiki-affecting GTID problems (not the heartbeat ones)?
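
For reference, a heartbeat-based barrier would boil down to polling a condition like the following on each replica until it returns 1 or a timeout is hit (a sketch; the literal timestamp stands in for the value captured on the master right after the write, and the under/overshoot risk above comes from how that value is chosen):

-- Has the replica applied the heartbeat written at or after the captured master timestamp?
SELECT ts >= '2018-08-22T07:14:00.000000' AS caught_up
FROM heartbeat.heartbeat
WHERE shard = 's1' AND datacenter = 'eqiad'
ORDER BY ts DESC
LIMIT 1;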

Also, see the comment in getHeartbeatData() about UTC_TIMESTAMP. A DIFF function would be better as long as there are no support issues with that given our deployed MariaDB versions. For third parties, we probably don't have to care. The required MySQL for MW is now 5.5.8 and the precision field was added in 5.6, unfortunately. It would be easy to patch.

I'm not sure what is gained over GTIDs+MASTER_GTID_WAIT

Given that GTID is broken and does not work, I think we need at least *something that works* :-) I am open for alternatives!

I don't necessarily want to push for heartbeat, but we have to either patch around the (fundamentally flawed) broken MariaDB GTID implementation, which reveals itself every time there is a datacenter or master switch, or search for an alternative. GTID is great on paper, but in practice it doesn't work as a replacement for binlog coordinates. See:

https://phabricator.wikimedia.org/P8014
https://phabricator.wikimedia.org/P8021

One possible way would be to assume every host has its gid and domain id equal to that of the ip (which is on config) and only check that.

There are also some issues with MediaWiki: you are trying to make it work as READ COMMITTED for those barriers, while the current database model is REPEATABLE READ, which is not necessarily bad, but it cannot be done without a different architecture.
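
To illustrate the isolation-level point (a sketch, not how MediaWiki currently issues these reads): under REPEATABLE READ, every consistent read in a transaction sees the snapshot established by the first read, so a barrier that polls heartbeat inside one transaction never observes progress.

START TRANSACTION;
SELECT MAX(ts) FROM heartbeat.heartbeat; -- snapshot established here
-- ... replication catches up in the meantime ...
SELECT MAX(ts) FROM heartbeat.heartbeat; -- still returns the old snapshot value
COMMIT;

-- Each poll only sees newly replicated rows with statement-level snapshots, e.g.:
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;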

I will create a separate ticket; I think I am confusing many people with 3 separate issues:

  • The DBA architecture problem (this is our problem to solve, and we just ask for feedback) - mostly with heartbeat
  • The mediawiki architecture problem - mostly related to GTID+cross dc
  • The actual solution/implementation for the previous problem, which needs coding but may also require DBA rearchitecting

Let's talk some time soon in person to sync up. :-)

@Addshore let us know that there is a "new" error that started happening today which looks related to this thread (I think):
https://logstash.wikimedia.org/goto/018d06f1ac178c272964fa71b76702e1

@mobrovac I think this task is confusing, we should use separate tasks for T172497#4905268 instead, and separate the non-cross dc work away.

Good point. I created T221159: FY18/19 TEC1.6 Q4: Improve or replace the usage of GTID_WAIT with pt-heartbeat in MW for that purpose. Please feel free to edit the task and chime in there.

Krinkle triaged this task as High priority. Jul 23 2019, 5:30 PM
LSobanski lowered the priority of this task from High to Medium. Apr 26 2021, 1:36 PM
Joe subscribed.

This task seems to have seen no activity in years, and from past comments it seems that most of the stuff in here has either been spun off to separate tasks or done. @jcrespo do you think this should stay open?

For database development questions, I think it is for @Ladsgroup to decide (I think he asked precisely for a list of open issues). Personally, I think it is no longer a super huge priority, but it is still valid; whether it should be kept open to capture the issues or be closed is for the DBAs to decide.

Joe changed the task status from Open to Stalled. Mar 21 2023, 10:25 AM

Ok, I'll set it to stalled and wait for @Ladsgroup to give feedback when available.

Thanks for asking my opinion. I read this. pt-heartbeat has a lot of limitations; the task description doesn't even cover many of them. I personally have two more concerns: T327852: MediaWiki replication lags are inflated with average of half a second, and time drift between replica and primary (do we use NTP? hosts can't talk to the outside, so that's not very likely). Part of it is due to the nature of distributed systems and there is no way around it, just trade-offs.

So my suggestion is to have a more concrete problem in mind and try to solve or mitigate that instead, or at least split it into multiple more concrete tickets. For example "Make pt-heartbeat work out of the box in switchovers", or "pt-heartbeat shouldn't use root" (isn't that fixed?), and possibly a dedicated ticket for caching of replag in MW. That is a complicated problem I ran into multiple times; the caching of replag actually happens in LoadMonitor, so T314020: LoadMonitor connection weighting reimagined might address it to some degree.

Regardless, I advise against writing something from scratch in MW to replace pt-heartbeat; there is probably something written outside that we can reuse.