Page MenuHomePhabricator

Rdbms library is too aggressive in changing to read-only
Closed, ResolvedPublic

Description

We are doing way more master switchovers and part of the switchover is the replicas topology changes. Basically move all of the replicas (except the-to-be-master) under the-to-be-master. For example:

image.png (995×1 px, 241 KB)

(and during the actual switchover, you just bring the old master under the new one)

This happens around half an hour before the actual switchover (taking around ten to twenty minutes) and only deals with replicas. Yet MediaWiki is acting too smart and decides to go read-only a lot during that time, causing read-only errors way more than the actual switchover: https://logstash.wikimedia.org/goto/580e0362614b1ffd4df5d71749a681a6

This needs fixing.

Event Timeline

Aklapper renamed this task from MediaWiki is too aggersive in changing to read-only to MediaWiki is too aggressive in changing to read-only.Aug 11 2022, 7:43 AM

I think the easiest would be to increase the max lag before topology change and reset it right after.

I don't know where this comment is coming from though:

// should be safely less than $wgCdnReboundPurgeDelay
'max lag'            => 6,

We have semi-sync enabled in production, there is not much point in going read-only when replication is lagging behind.

I don't see enough lag to hit "max lag" read-only mode for s2 in https://grafana-rw.wikimedia.org/d/G9kbQdRVz/mediawiki-loadbalancer?orgId=1&from=1660194000000&to=1660197600000&kiosk . I see the "all replicas lagged" logs though (I guess the statsd gauge wasn't updated frequently enough to see >6s). The tree diagram shows 0s lag and MW showed ~4s. Wait positions timed out too. Is the diagram just using seconds_behind_master? Is there really high lag during those 20min (pt-heartbeat)? The "Mysql Replication" dashboard on Grafana just uses seconds_behind_master too.

I think it's pt-heartbeat. Again, I said it in the patch and I say it here again. MediaWiki should not do the job replication management. It's job of mysql cluster/semi-sync/pt-heartbeat/alerting/etc. The whole concept of going read-only if there is lag > 6s in all replicas is useless and harmful.

Change 832270 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Get rid of concept of lagged replica protection

https://gerrit.wikimedia.org/r/832270

Krinkle renamed this task from MediaWiki is too aggressive in changing to read-only to Rdbms library is too aggressive in changing to read-only.Oct 17 2022, 7:47 PM

Change 874940 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Stop going read-only if all replicas are lagged

https://gerrit.wikimedia.org/r/874940

Change 874940 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Stop going read-only if all replicas are lagged

https://gerrit.wikimedia.org/r/874940

Change 880995 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Remove "read-only primary" and restore "short cache" in lagged mode

https://gerrit.wikimedia.org/r/880995

Change 880996 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Remove getLaggedReplicaMode() in favour of laggedReplicaUsed()

https://gerrit.wikimedia.org/r/880996

We still don't know why the pt-heartbeat would show high lag in this scenario. Is pt-heartbeat stopped during this time on the primary?

Change 874940 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Stop going read-only if all replicas are lagged

https://gerrit.wikimedia.org/r/874940

I did several switchovers and can confirm this fixes the issue.

We still don't know why the pt-heartbeat would show high lag in this scenario. Is pt-heartbeat stopped during this time on the primary?

We only kill pt-hearbeat for like 2 seconds when we do the primary switchover itself (that is: changing the master, the last step of the switchover).
When we move the replicas around, we stop replication on two hosts:

  • The future primary master
  • The replica that is going to be moved under it.

We only move one replica at the time, so if anything, we should only show lag on two hosts (the future master and the replica that is being moved around). Obviously, as more and more replicas get moved under the future new master, this pt-heartbeat lag might show up on more hosts).
For instance, say we have 10 replicas, once 9 of them are already under the new master, when moving the last one, stopping replication on the future new master will make those 9 replicas to show pt-heartbeat-lag.

Hope this helps

Thanks @Marostegui. Some stuff to add:

  • The all replicas are lagged happens actually quite early during the topology changes, e.g. when only two replicas are actually lagged.
    • So I think there might be something with how mw reads lag, orchestrator doesn't show any lag during the time the mw goes read-only.
  • The whole operation is quite quick (in total one minute per moving one replica and it includes a lot of operations that are outside of the stopped replication timespan), the actual time replicas are lagged is quite short.
  • Generally, the LB/LBF going read-only here is quite an overreaction to lagged replicas. MW shouldn't do flow control for databases.

I spent most of yesterday and today reading code and looking at pt-hearbeat and I'm fairly certain it's not an issue with pt-heartbeat (I also found an issue with handling of pt-heartbeat in mw, ticket coming). Otherwise it would show up in grafana which it doesn't as you said in T314975#8209689.

My bet would be on LoadMonitor and how it stores and retrieves cached values. I think something makes it add lags to the cached value instead of replacing it, it's an issue with our caching that I have seen with Special:LintErrors too. If you refresh the page in certain times (not back to back), all the values it in the page would be doubled (these values are coming from a cache, look at TotalsLookup class), probably the eager recalculate is buggy.

I spent most of yesterday and today reading code and looking at pt-hearbeat and I'm fairly certain it's not an issue with pt-heartbeat (I also found an issue with handling of pt-heartbeat in mw, ticket coming). Otherwise it would show up in grafana which it doesn't as you said in T314975#8209689.

Also T141968: Display lag on grafana (prometheus) from pt-heartbeat instead (or in addition) of Seconds_Behind_Master

Change 880995 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Remove "read-only primary" and restore "short cache" in lagged mode

https://gerrit.wikimedia.org/r/880995

Change 880996 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Remove getLaggedReplicaMode() in favour of laggedReplicaUsed()

https://gerrit.wikimedia.org/r/880996

Ladsgroup updated Other Assignee, added: Krinkle.

Change 832270 abandoned by Krinkle:

[mediawiki/core@master] Get rid of concept of lagged replica protection

Reason:

Superseded by alternate patches that trim down the feature a lot, but keeps some of it.

https://gerrit.wikimedia.org/r/832270