
Notify DBA prior to sending db traffic to x2
Closed, ResolvedPublic

Description

x2 hardware is ready which is great, unfortunately Multi-DC work was postponed again. To avoid unneeded work, paging is currently disabled for this (idle) cluster.

Before we send db traffic to it, this task is to remind us to give the DBA team a heads up before we do.

Event Timeline

Krinkle changed the task status from Open to Stalled. Apr 13 2022, 5:27 PM
Krinkle triaged this task as Medium priority.
Krinkle changed the task status from Stalled to Open (Edited). May 31 2022, 11:36 PM
Krinkle added a project: DBA.

This is now ready per T212129#7972534: the schema is ready to be provisioned, and MW core and wmf-config are ready on our side as well. To be scheduled for this or next week.

cc @Ladsgroup @Marostegui

Ok I am going to enable notifications on the hosts.

Change 801841 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] x2: Enable notifications

https://gerrit.wikimedia.org/r/801841

Change 801841 merged by Marostegui:

[operations/puppet@production] x2: Enable notifications

https://gerrit.wikimedia.org/r/801841

Notifications enabled.
@Krinkle I am sure this was answered when we first set up these hosts, but I cannot find the answer; it's been more than a year :-) So, quick question: in case one of the masters dies, or if we need to do maintenance on any of them, how should we proceed?
I.e. let's say we need to do maintenance on the codfw master (db2142) or it crashes (https://orchestrator.wikimedia.org/web/cluster/alias/x2): would just depooling those two replicas be enough for MW to continue to operate normally?

Thanks!

> […] So, quick question: in case one of the masters dies, or if we need to do maintenance on any of them, how should we proceed?
> I.e. let's say we need to do maintenance on the codfw master (db2142) or it crashes (https://orchestrator.wikimedia.org/web/cluster/alias/x2): would just depooling those two replicas be enough for MW to continue to operate normally?

I'm not sure which "two replicas" you mean; I assume the codfw ones in that scenario. But I'm not sure why you'd want to depool the replicas in addition to the master being down: that leaves nothing and would likely cause fatals for all web traffic, since db configs must have at least the [0] index defined, pointing to a writable master or a pretend/read-only master. Replicas are logically optional as far as db config server arrays go.
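For illustration only, a minimal sketch of that shape using MediaWiki's $wgDBservers format; the hostnames are placeholders, not the actual wmf-config layout for x2:

```php
// Minimal sketch of a MediaWiki DB server array: index [0] must exist and point
// at the (possibly read-only) master; any further entries are optional replicas.
$wgDBservers = [
	[
		'host'     => 'db2142.example', // placeholder master hostname
		'dbname'   => 'mainstash',
		'user'     => $wgDBuser,
		'password' => $wgDBpassword,
		'type'     => 'mysql',
		'load'     => 0,
	],
	// Replicas are logically optional; dropping them still leaves a valid config.
	[
		'host'     => 'db2143.example', // placeholder replica hostname
		'dbname'   => 'mainstash',
		'user'     => $wgDBuser,
		'password' => $wgDBpassword,
		'type'     => 'mysql',
		'load'     => 100,
	],
];
```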

I'm assuming I misunderstood your scenario — please clarify :)

The dataset currently fits on a single host, with 6 (2x3) effectively identical hosts across the two DCs. Failing over to either of the other two local hosts would be fine. And if we prepare a commented-out piece of config for cross-DC connections with TLS, we can even fail over to the other DC during maintenance, as we already do with ElasticSearch and Kask/Cassandra at times.
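To make that concrete, a hypothetical sketch of what such a commented-out fallback could look like (placeholder hostname; the real wmf-config mechanism and host naming may differ):

```php
// Hypothetical, normally commented-out fallback: point the local DC at the
// remote-DC x2 master over TLS during maintenance. DBO_SSL is MediaWiki's flag
// for encrypted DB connections; cross-DC traffic must not go over plain TCP.
# $wgDBservers = [
# 	[
# 		'host'     => 'db1151.example', // placeholder: master in the other DC
# 		'dbname'   => 'mainstash',
# 		'user'     => $wgDBuser,
# 		'password' => $wgDBpassword,
# 		'type'     => 'mysql',
# 		'load'     => 0,
# 		'flags'    => DBO_DEFAULT | DBO_SSL, // TLS for the cross-DC connection
# 	],
# ];
```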

While the main stash service generally expects higher availability than memcached, it is implemented such that, apart from configuration errors, any runtime errors are tolerated and do not result in fatal errors. That is, even if every connection/read/write query fails, the abstraction layer masks this and acts as if it returned null and continues on, similar to what we do with other graceful layers like Memcached.

Unlike cache (Memcached), however, unavailability of the stash does result in user-visible impact and unrecoverable loss of information that can't be recomputed or retried. But the general category of data here is secondary data, e.g. tokens, drafts, notification read markers; fairly minor things compared to the core DBs.
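As a minimal sketch of the degradation behaviour described above (the key and example data are made up; real callers vary):

```php
use MediaWiki\MediaWikiServices;

// The main stash is a BagOStuff: on connection or query failure its read and
// write methods return false instead of throwing, so callers degrade gracefully.
$stash = MediaWikiServices::getInstance()->getMainObjectStash();

$userId = 123;                // hypothetical example data
$draftText = 'Lorem ipsum';
$key = $stash->makeKey( 'example-feature', 'draft', $userId );

if ( !$stash->set( $key, $draftText, $stash::TTL_DAY ) ) {
	// Write failed (e.g. x2 master unreachable): the draft is simply not saved.
}

$value = $stash->get( $key );
if ( $value === false ) {
	// Miss or backend failure: treat it as "no draft available" and continue.
}
```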

>> […] So, quick question: in case one of the masters dies, or if we need to do maintenance on any of them, how should we proceed?
>> I.e. let's say we need to do maintenance on the codfw master (db2142) or it crashes (https://orchestrator.wikimedia.org/web/cluster/alias/x2): would just depooling those two replicas be enough for MW to continue to operate normally?

> I'm not sure which "two replicas" you mean; I assume the codfw ones in that scenario. But I'm not sure why you'd want to depool the replicas in addition to the master being down: that leaves nothing and would likely cause fatals for all web traffic, since db configs must have at least the [0] index defined, pointing to a writable master or a pretend/read-only master. Replicas are logically optional as far as db config server arrays go.

If any of the masters crashes (or needs to go down for maintenance), the replicas under it would have replication broken and would be delayed.

> I'm assuming I misunderstood your scenario — please clarify :)

> The dataset currently fits on a single host, with 6 (2x3) effectively identical hosts across the two DCs. Failing over to either of the other two local hosts would be fine. And if we prepare a commented-out piece of config for cross-DC connections with TLS, we can even fail over to the other DC during maintenance, as we already do with ElasticSearch and Kask/Cassandra at times.

That's good, but how would MW handle replication being broken or delayed? Do we need to "promote" one of the replicas to master in the local DC where the master has crashed?

To clarify, if you check the above orchestrator link, the scenario I am thinking of would be db2142 being down.
db2143 and db2144 would have replication broken (and would have no up-to-date data).

  • What would MW do there?
  • Do we need to depool those two hosts?

Thanks!

> What would MW do there? Do we need to depool those two hosts?

As it is currently configured, even when nothing is wrong, all reads go to the configured master. The configured replicas are never used even if the master is down. So when db2142 goes down, all codfw reads and writes will fail, returning false, and the callers are expected to degrade. So it is not necessary to depool the two hosts; everything will be safely broken without doing that. We can test this if it is a point of concern.

Recovery can be done by promoting a replica to the master role and then updating the MediaWiki config.
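As a rough sketch of the MediaWiki-config side of such a failover (hostnames are placeholders and the entries are abbreviated; the real change goes through wmf-config and the usual DBA switchover procedure):

```php
// Before the failover, db2142 sits at index [0]. After promoting db2143 at the
// MariaDB level, index [0] is updated to point at the new master.
// (user, password and other per-server settings omitted for brevity.)
$wgDBservers = [
	[ 'host' => 'db2143.example', 'dbname' => 'mainstash', 'type' => 'mysql', 'load' => 0 ],   // new master
	[ 'host' => 'db2144.example', 'dbname' => 'mainstash', 'type' => 'mysql', 'load' => 100 ], // remaining replica
	// db2142 is dropped here, and re-added as a replica once it has been rebuilt.
];
```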

A netsplit-type situation, with long-term loss of cross-DC replication, could be dealt with by depooling the affected DC at the CDN level.

@tstarling thanks for the explanation, it is all clear now. We (at least I) didn't know what would need to be done in case of master failure. It is clear now :)
Thanks again

tstarling claimed this task.
jcrespo subscribed.

Could you clarify this:

> As it is currently configured, even when nothing is wrong, all reads go to the configured master.
> So it is not necessary to depool the two hosts; everything will be safely broken without doing that

In light of T315274?

MediaWiki seemed to be quite broken when replication on the replicas broke, so either the instructions were not accurate, this was not tested, or the conditions changed?

T315274 did expose a lack of testing. The replicas are indeed contacted and checked for replication lag, so that a DBReadOnlyError can be thrown in "lagged replica mode". In a meeting, @aaron suggested removing the replication lag check for x2, but I don't think we turned that into a task or patch.
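For context, a minimal sketch of the general mechanism being described: when all replicas of a section exceed the lag threshold, the load balancer enters lagged-replica (read-only) mode and write attempts throw DBReadOnlyError. This sketch uses the default load balancer and a made-up table for simplicity, not the actual x2 code path:

```php
use MediaWiki\MediaWikiServices;
use Wikimedia\Rdbms\DBReadOnlyError;

$dbw = MediaWikiServices::getInstance()->getDBLoadBalancer()->getConnection( DB_PRIMARY );

try {
	// Hypothetical write; in lagged-replica mode the section is read-only.
	$dbw->insert( 'example_table', [ 'et_value' => 'demo' ], __METHOD__ );
} catch ( DBReadOnlyError $e ) {
	// Before change 823791 an error like this could surface to users from the
	// x2 stash path; with the fix the stash layer swallows it and degrades.
}
```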

The fact that the exception was user-visible was fixed in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/823791 . So the currently deployed situation is that stopping replication on all x2 replicas in a DC will cause silent read and write failure within that DC.

I mean, theoretically. I've tested it locally with two MariaDB instances with broken replication. I reproduced the bug and confirmed the fix. But production is more complicated than that test setup.

Thank you, @tstarling. +1 to testing on production, in the safest way possible that DBAs feel comfortable with, before documenting something that may not be 100% right. E.g. I'd prefer to document the status quo ("depool replicas on local replication breakage") rather than "it is ok to kill replicas", if the latter is not right.

Filed as T312809: Avoid x2-mainstash replica connections (ChronologyProtector).

Same issue but a different side-effect (a portion of traffic needlessly tries to connect to replicas). I imagine the fix will turn off both sides of this, e.g. CP and RO mode equally.

I echo Tim's point that in terms of isolation this is handled by the last patch already. It would be better if replication failure didn't even silently degrade the stash but simply kept going, as short-term writability is more valuable than preserving replication.

I also agree we should verify the isolation barrier in prod.

I am going to resolve this now that we are aware of the issue, as the original scope was not about documentation. Will amend the docs at https://wikitech.wikimedia.org/wiki/MariaDB#x2 with the current status, and update them later when T312809 is fixed. This understanding makes me think the outage was more related to surprising behavior from MW (unknown to the operator) than to an admin error; if this had been a known issue, I doubt it would have happened in the first place.