
Make LoadBalancer slave lag check and read-only mode more robust (for example, using pt-heartbeat)
Closed, Resolved, Public

Description

We rely on the stock MySQL slave status functions (SHOW SLAVE STATUS) to detect slave lag, but these only give relay log -> DB lag, not the "lag behind the master DB". We want to properly detect lag even when:
a) The slave can't talk to the master
b) Binlogs relay at a trickle due to network issues
c) Replication uses chaining (e.g. master -> slave1 -> slave2)
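
For reference, the stock check referred to above is the Seconds_Behind_Master field of SHOW SLAVE STATUS, which can read 0 while binlogs are still trickling in or the I/O thread cannot reach the master:

-- The old-style check (run from the mysql client); Seconds_Behind_Master only
-- measures how far the SQL thread is behind the relay log already received,
-- not how far the slave is behind the actual master.
SHOW SLAVE STATUS\G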

A few possibilities:
a) We have pt-heartbeat set up, and it would be nice to use something like that. We can change how DatabaseMysql::getLag() works using a config flag. I'd prefer this.
b) Use some scheme (post-send lazy loading and local memcached) to check slave vs. master position/timestamp instead of only using the MySQL slave lag functions, which only check the slave's apply lag (not the full relay lag).

Related Objects

Event Timeline

aaron raised the priority of this task to Needs Triage.
aaron updated the task description.
aaron subscribed.
aaron added a project: DBA.
aaron set Security to None.
aaron moved this task from Tag to Doing on the Sustainability board.
aaron updated the task description.

This is already being done in heartbeat.heartbeat (maybe non-working in some cases).

As there were no current consumers, this service was a bit abandoned.

To use it, you need to query that table, like this:

root@iron:~$ mysql -h db1073 -e "SELECT * FROM heartbeat.heartbeat WHERE server_id = 103222"
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+
| ts                         | server_id | file              | position  | relay_master_log_file | exec_master_log_pos |
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+
| 2015-09-25T16:48:10.000510 |    103222 | db1052-bin.002419 | 921714571 | NULL                  |                NULL |
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+

103222 is the server_id (unique identifier) of the master:

root@iron:~$ mysql -h db1052 -e "SELECT @@server_id"
+-------------+
| @@server_id |
+-------------+
|      103222 |
+-------------+

This is because, with multi-source replication, a server could have multiple masters (or any server above it in the chain).

To calculate the lag, subtract this value from the current date. However, the write only happens every 0.5 seconds, so the result has that error resolution (the actual lag can be the calculated value, or that minus up to 0.5 seconds).
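
A minimal sketch of that subtraction, using the master server_id 103222 from the example above (pt-heartbeat writes UTC timestamps, hence UTC_TIMESTAMP()):

-- Approximate replication lag, in whole seconds, from the master's heartbeat row.
SELECT TIMESTAMPDIFF(SECOND, ts, UTC_TIMESTAMP()) AS lag_seconds
FROM heartbeat.heartbeat
WHERE server_id = 103222;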

Gotchas: this increases the writes on our servers. 2 extra QPS are not an issue for our hardware, but combined with long-running transactions (like the ones done while taking backups or XML dumps), it may produce fragmentation issues and lower performance in extreme cases, because transactions are not able to purge their old data.

I am implementing this as an agent on every mysql box, as I need it for icinga checks (T112473#1650739), but I am still unsure about how it should be implemented. Feedback is welcome.

Change 241133 had a related patch set uploaded (by Aaron Schulz):
Added pt-heartbeat support to DatabaseMysqlBase

https://gerrit.wikimedia.org/r/241133

Checked that this is working on all active wiki slaves, including both eqiad and codfw, except dbstore*, db1069 and labsdb (those are not part of the mediawiki production servers). It is not working there due to replication rules, which I may want to fix for monitoring reasons.

<s>@aaron, did you do something to the existing pt-heartbeat installation? I do not remember it being in a good state, after being unmaintained.</s>

Looking at the execution timestamps, dating to 2013, let's just say "I have worked very very hard to give you a working installation", even if that is not true.

Good. I guess wikiadmin/wikiuser just need read access to the table.
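
Something along these lines would do; the user@host patterns here are hypothetical and would follow whatever the production grant templates use:

-- Hypothetical grants; only SELECT on the heartbeat table is needed.
GRANT SELECT ON heartbeat.heartbeat TO 'wikiuser'@'%';
GRANT SELECT ON heartbeat.heartbeat TO 'wikiadmin'@'%';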

Change 243116 had a related patch set uploaded (by Aaron Schulz):
[WIP] Switched to pt-heartbeat lag detection

https://gerrit.wikimedia.org/r/243116

Change 241133 merged by jenkins-bot:
Added pt-heartbeat support to DatabaseMysqlBase

https://gerrit.wikimedia.org/r/241133

jcrespo renamed this task from Make LoadBalancer slave lag check and read-only mode more robust to Make LoadBalancer slave lag check and read-only mode more robust (for example, using pt-heartbeat). Oct 5 2015, 9:40 AM

Changing title to make it easier to find by searching the tool used.

Because of the offsite, I had the whole pt-heartbeat work frozen; I intend to continue this coming week.

The second part of this ticket is

Also make sure the read only mode triggers when remote DC slaves are lagged too

Should cross-datacenter slaves be configured, for example by reading the configuration from the other datacenter? Should all slaves be automatically discovered? What are some (design) thoughts on that? Is this desirable (for example, if we want to do some maintenance on all servers of the passive datacenter, or if a split brain happens)?

I'd like to explore it, but from T114271 and some IRL discussions it's not obvious that everyone would want to do that. I think that's a topic for the next meeting.

I'll remove that from the description of this bug to keep reasonable scope.

I'd like to explore it too; I think it is important to acknowledge geographical replication issues. I was only bringing it up because I do not know how to answer some of those questions yet.

I agree with keeping the scope of this ticket small, and later think about those issues.

aaron triaged this task as Medium priority. Jan 13 2016, 5:00 PM

@aaron We just hit a special case that we may not have taken into account. While preparing an s2-master failover, I set up a multi-tier replication topology, with db1024 as the original master and db1018 as an intermediate master for half of the nodes.

It turns out that the load balancer checks a slave's immediate master for lag and read-only status (not the master set in the configuration), which leads to errors on API queries, as it thinks writes cannot go through. I consider this a bug, even though we will usually want to avoid multi-tier setups.

See: https://logstash.wikimedia.org/#dashboard/temp/AVLBwZpkptxhN1XaL4z1

At that time the topology was db1024 -> db1018 -> half of the slaves.

Change 274178 had a related patch set uploaded (by Jcrespo):
Upgrade mariadb module to allow new heartbeat updates

https://gerrit.wikimedia.org/r/274178

Change 274178 merged by Jcrespo:
Upgrade mariadb module to allow new heartbeat updates

https://gerrit.wikimedia.org/r/274178

New canonical way to check lag (available cluster-wide):

SELECT timestampdiff(SECOND, max(`heartbeat`.`heartbeat`.`ts`), utc_timestamp()) AS `lag`
FROM `heartbeat`.`heartbeat`
WHERE shard=?
GROUP BY `shard`;

where ? can be s1-7, m1-5, es2-3, tools, and x1

Change 275908 had a related patch set uploaded (by Aaron Schulz):
Support channel based lag detection

https://gerrit.wikimedia.org/r/275908

Aaron, looks great! Yay!

There are a couple of things that bother me now, not for our setup (which I think is the way to go), but for mediawiki as a product (easy to use, self-contained, etc.):

  • Should mediawiki, in the future, absorb the pt-heartbeat functionality so there is no external dependency (I have already imported the pt-heartbeat code into our repo, so we make sure we do not depend on external code), or can we justify it being optional, so people can set it up if they want? pt-heartbeat, despite having a simple task, gets very complex with checks and timing (e.g. writes are 0-aligned), which is why I chose to modify it instead of implementing it on my own.
  • We should probably make the table name "heartbeat.heartbeat" an option rather than hardcoding it, so that 1) people who want to use it but do not have a multi-instance setup can put the table in the same DB, and 2) it is explicitly shown in configuration files, making it more visible.

Thinking about the next steps, active-active (multi-tier) chronology solutions: should we try to integrate heartbeat with them for slave delay checking, or should we focus on GTIDs (pt-heartbeat would serve as a multi-platform pseudo-GTID)? GTIDs make for better one-liner checks, but mariadb and mysql have different implementations, and technically we cannot use them with our current baseline (5.0).

Since it's optional, I'd rather stick with having it work with the existing (as well as custom) pt-heartbeat rather than reinventing that in MediaWiki.

Making the table configurable is potentially OK but is complicated a bit by the fact that LoadBalancer/LoadMonitor assume there is one lag value per server. We already implicitly assume either no parallel replication or, at most, in-order parallel replication. If the heartbeat table were allowed to be per-DB, that would need to be refactored. It would make caching more complex by increasing the number of cache keys and misses...though on the other hand the long tail of less-used wikis would have low lag-query rates, and it's pretty fast anyway (unlike the global SHOW SLAVE STATUS locks with the old method). It could be added at some point in the future (like when we want inter-DB parallel replication).

Thank you Aaron, everything you said makes sense. I think we can stick to the current setup (the scope of this ticket) until we need a fully active-active setup. I predict myself talking with you at some point about parallel replication, but we are not there yet. The next issue will be multi-tier replication, and we are not there yet either.

Change 275908 merged by jenkins-bot:
Support custom query in pt-heartbeat lag detection

https://gerrit.wikimedia.org/r/275908

aaron raised the priority of this task from Medium to High. Apr 20 2016, 7:50 PM

During the failover, we detected 3 problems with the pt-heartbeat implementation:

  • If the server goes read-only, measured replication lag will increase even though no writes are being done (this is technically fixable by executing pt-heartbeat with SUPER rights, which we already do, but we may want a special exception in lag detection while running mediawiki in read-only mode, or to combine it with binlog/GTID accounting).
  • On master failover, pt-heartbeat must stop on one server and start on the new master in a transactionally safe way. This is an orchestration problem we have not yet solved.
  • pt-heartbeat itself is a process separate from the database server, which means it can fail without the server failing. It is a SPOF. Maybe we should run 2 processes, one on each datacenter master?

Those issues are more infrastructure-oriented than software-oriented, but they are being discussed and tracked on T133339, and I wanted to share them with you.

Since our current setup now has pt-heartbeat on each local master, lag between DCs won't show up (as noted in T134480). For active/active DCs this means that requests to view page edit forms might not show the read-only message but the user will get it when they POST the form. It also means that the automatic TTL lowering of CDN expiry and WAN object cache expiry might not apply when it should, leading to stale values stuck in cache. I wonder if there is some way to get lag measurements to work correctly in all DCs at once.

The reason for that model was the expectation that in the future we will want dual masters, with a 100% mirrored setup and no active/passive distinction.

There are several options here, some at the ops layer and some at the mediawiki layer:

  • Assuming there is still a primary datacenter and a secondary datacenter, we can detect the original server by making a slightly different query (through server_id, binlog name, or a new column called "datacenter"). In the last case the query could be changed into "SELECT max(ts) ... WHERE datacenter = 'eqiad'", or "SELECT max(ts) ... GROUP BY datacenter", and there you have the 2 values (a local and a remote datacenter lag), and the logic could be changed depending on that (for example, a primary datacenter will want the min of these 2 values; a secondary would want only the remote one); see the sketch after this list. The problem with just checking one value is that we again have a SPOF, until some kind of master-master setup is created as proposed here: T119626
  • Assuming we still have to live with a master SPOF, the two writes could be set up by using 2 separate proxies on 2 separate hosts, both writing to the same host (the primary datacenter master). On datacenter failover, the proxies switch the master they are writing to over to the active one.
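
A sketch of the per-datacenter query from the first option, following the pattern of the canonical query above; the "datacenter" column is the proposed addition and the shard value is illustrative:

-- One row per datacenter: local and remote master lag for a given shard.
SELECT `datacenter`,
       timestampdiff(SECOND, max(`ts`), utc_timestamp()) AS `lag`
FROM `heartbeat`.`heartbeat`
WHERE shard = 's2'
GROUP BY `datacenter`;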

The second approach will take some time, because no proxying has yet been set up in production (although it is planned). The first one gives more flexibility with active-active in mind: it will allow detecting different issues (local master vs. remote master replication lag, and datacenter link loss). I could even add a GTID parameter to work nicely with chronology detection. I like the first approach more because it makes both datacenters mirrors of each other (except for the real-time datacenter config), which is potentially more interesting.

I think a separate datacenter column makes sense.

So, I need an extra brain on heartbeat architecture, and despite you being on performance, I think you have a foot in HA, and after all this is blocking you. 0:-)

  1. In the beginning, we had pt-heartbeat on a single host: the master.
  2. That raised my concerns regarding availability: if this is used on all wikis and the single pt-heartbeat process fails, everything fails, or I assume it goes back to read-only mode, even though everything was in theory still usable (it is not as if the database itself had crashed).
  3. To address those concerns, and because it made the datacenter failover much easier, I applied the master=true role to both eqiad (the real master) and codfw (the candidate master). Now both datacenters are 100% symmetrical because each is writing its own heartbeats, and if one fails or is under maintenance, the heartbeats keep coming from the other master.
  4. Now the issue is that, by being symmetrical, it is not possible to detect the effective lag between the real master and the remote datacenter. I proposed keeping the heartbeat, but adding a "datacenter" parameter.

The question is, how do we architect that so that heartbeat is still redundant? Maybe we should have a 3rd process, sending heartbeats from (let's say) terbium (so it is fully independent) to both servers (or maybe only to the "real" master?). That would already be 4 QPS, which would start to become significant. Maybe just some systemd watchdog that restarts the service (in addition to puppet, which already restarts it if it is not running)?

I will start by coding the extra column and an offset option, and continue thinking about what the best option is.
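
On the table side, the extra column could be as small as this (a sketch only; the shard column already exists at this point, and the varbinary(5) type is inferred from the later note that datacenter is a 5-char binary string):

-- Hypothetical schema change for the proposed "datacenter" parameter.
ALTER TABLE heartbeat.heartbeat
  ADD COLUMN datacenter varbinary(5) DEFAULT NULL;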

I think I have an alternative solution, without requiring external scripts. Instead of using pt-heartbeat, reuse the same table, but update it with an event: http://code.openark.org/blog/mysql/seconds_behind_master-vs-absolute-slave-lag
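
A minimal sketch of what such an event might look like (not what was deployed; the event name and 1-second interval are illustrative, and the event scheduler would have to be enabled):

-- Illustrative only: an in-server replacement for the external pt-heartbeat writer.
CREATE EVENT heartbeat.heartbeat_update
  ON SCHEDULE EVERY 1 SECOND
  DO
    REPLACE INTO heartbeat.heartbeat (ts, server_id)
    VALUES (UTC_TIMESTAMP(6), @@server_id);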

I hate events: they are difficult to write (pseudo-SQL, no debugging, no tools to use them, bad logging) and to manage, but this would avoid most of the issues with the process failing, because the event is nothing but another thread inside mysqld. While it can still fail or be disabled, that is more difficult to do, as it is part of the same process.

The largest problem with this approach is that it cannot be controlled by puppet anymore.

I've chosen to implement the datacenter field; masters will continue to be a SPOF until we have automatic failover or dual master-master.

I will apply the new pt-heartbeat and checks tomorrow during my morning.

This is now done at database level. One can check:

  • The lag to the local master
  • The lag to the primary datacenter master
  • The lag to the secondary datacenter master

individually.

MariaDB PRODUCTION s2 localhost (none) > SELECT * FROM heartbeat.heartbeat;
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+------------+
| ts                         | server_id | file              | position  | relay_master_log_file | exec_master_log_pos | shard | datacenter |
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+------------+
| 2014-01-10T13:06:36.001010 |     10167 | db1018-bin.000187 | 662124998 | db1036-bin.000403     |           241948950 | NULL  | NULL       |
| 2014-02-10T03:09:14.500840 |    101625 | db1036-bin.000478 |  13957399 | NULL                  |                NULL | NULL  | NULL       |
| 2014-12-05T02:43:23.500810 |    103230 | db1060-bin.000784 |  70301170 | db1024-bin.000709     |           786882435 | NULL  | NULL       |
| 2016-02-09T23:38:52.501100 |    101613 | db1024-bin.002072 | 759190624 | db1018-bin.000975     |           344078894 | NULL  | NULL       |
| 2016-08-03T08:53:07.001170 | 171970567 | db1018-bin.001634 | 208005732 | db2017-bin.002132     |           310795262 | s2    | eqiad      |
| 2016-08-03T08:53:07.000970 | 180359173 | db2017-bin.002132 | 310795798 | db1018-bin.001634     |           208005464 | s2    | codfw      |
+----------------------------+-----------+-------------------+-----------+-----------------------+---------------------+-------+------------+
6 rows in set (0.00 sec)

Rows 1-4 are leftovers from previous masters/previous server_id values (that is why server_id or log file names are not reliable methods: they can change). They can be deleted, but I left them as a way to keep track of past masters.

Row 5 comes from the current eqiad master; row 6 comes from the current codfw master. Shard is a binary string of up to 10 chars; datacenter is a 5-char binary string. Note how, despite being aligned, the write from eqiad seemed to come first; this is probably a combination of writes being slower on the active master and the write lock on the heartbeat table.

The check I use on icinga is shard = ? and datacenter = ? ORDER BY ts DESC LIMIT 1
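
Spelled out with illustrative values (only the WHERE/ORDER/LIMIT part above is the actual check; the SELECT list here is a reconstruction using the same timestampdiff pattern as the canonical query):

-- Illustrative reconstruction of the per-shard, per-datacenter check.
SELECT timestampdiff(SECOND, ts, utc_timestamp()) AS lag
FROM heartbeat.heartbeat
WHERE shard = 's2' AND datacenter = 'eqiad'
ORDER BY ts DESC
LIMIT 1;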

Note that the current setup has a random error of [0, 1 second + δ] (where δ is the time it takes for a 1-row replace query to be committed, normally milliseconds or microseconds or less). However, the updates are 0-aligned, meaning that you could get a better approximation by performing the checks 0 + δ aligned too, or that, if you want a max lag of X, you have to check for X+1 seconds of lag to be accurate.

Change 243116 merged by jenkins-bot:
Switched to pt-heartbeat lag detection on s6

https://gerrit.wikimedia.org/r/243116