
Production test of x2 failure modes
Closed, Resolved · Public

Description

As discussed at T306118#8177779, as a followup to T315274, it would be nice to do a test of x2 failure modes in production in order to gain confidence with the multi-DC deployment.

After deployment of the dbctl patch which will remove the x2 replicas from MediaWiki's configuration, there should be no need to test replication failure on the leaf nodes, since MW will have no way to connect to them. But we can test stopped and delayed replication on the codfw x2 master (db2142). We could also test connection failures.

The idea would be:

  • Set multi-DC mode to testwiki only
  • Disable paging alerts for db2142, db2143, db2144
  • Stop replication on db2142
  • Try some page views on testwiki, monitor the logs
  • Start replication on db2142 with MASTER_DELAY=30, repeat tests.
  • Restore normal replication on db2142.
  • Simulate db2142 failure with iptables -I INPUT -p tcp --syn --src 10.192.0.0/16 --dport 3306 -j DROP. This fights ferm, but the rule only needs to be in place for a few minutes. Repeat tests. Restore with iptables -D INPUT 1.
  • Repeat the test with -j REJECT.
  • Restore alerts.
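The replication and firewall steps above can be sketched as shell commands. This is a hedged sketch only: it assumes root on db2142, standard MariaDB replica syntax, and that the plain iptables rules shown in the plan are used as-is; none of the exact invocations are confirmed in the task.

```shell
# On db2142 (codfw x2 master) -- a sketch of the planned steps, not the
# exact commands run.

# Stop replication for the first round of tests:
mysql -e "STOP SLAVE;"

# Restart replication with a 30-second delay, then repeat the tests
# (replication is already stopped at this point, so CHANGE MASTER is allowed):
mysql -e "CHANGE MASTER TO MASTER_DELAY=30; START SLAVE;"

# Restore normal replication:
mysql -e "STOP SLAVE; CHANGE MASTER TO MASTER_DELAY=0; START SLAVE;"

# Simulate master failure by silently dropping inbound MySQL SYNs from the
# codfw private network, then remove the rule again:
iptables -I INPUT -p tcp --syn --src 10.192.0.0/16 --dport 3306 -j DROP
iptables -D INPUT 1

# Same test, but answering with an immediate "connection refused" instead:
iptables -I INPUT -p tcp --syn --src 10.192.0.0/16 --dport 3306 -j REJECT
iptables -D INPUT 1
```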

This could be done in the AU/EU overlap on Monday 5 Sep, assuming the dbctl patch is deployed by then.

Event Timeline

Change 828677 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Multi-DC: go back to testwiki only

https://gerrit.wikimedia.org/r/828677

Marostegui moved this task from Triage to Ready on the DBA board.
Marostegui added a subscriber: CDanis.

That sounds OK to me, Tim. I am off Monday the 5th, but @Ladsgroup can probably help; if not, I can help when I am back (the 6th). Not sure when @CDanis is planning to merge that dbctl patch, though.

As a followup to this, here or in a separate task, it would be wise to reevaluate the paging strategy for x2: if something no longer creates an outage, remove it from paging (not sure if that means just the replicas or also the primaries; it all depends on the test results, I guess). Currently x2 pages for everything, which I believe was wise during T315274, but it might be relaxed if the tests work as expected.

Here are a few scenarios I can think of that could be tested: iptables with REJECT, iptables with DROP, regular lag, replication breaking, and high query latency / max_connections overload.

Yeah, the replicas will definitely have notifications disabled.

The dbctl patch (https://gerrit.wikimedia.org/r/c/operations/software/conftool/+/828606) was merged yesterday, and x2 is now running with the new flag:

root@cumin1001:~#  dbctl -s codfw section x2 get
{
    "tags": "datacenter=codfw",
    "x2": {
        "flavor": "external",
        "master": "db2142",
        "min_replicas": 0,
        "omit_replicas_in_mwconfig": true,
        "readonly": false,
        "ro_reason": "test"
    }
}
root@cumin1001:~#  dbctl -s eqiad section x2 get
{
    "tags": "datacenter=eqiad",
    "x2": {
        "flavor": "external",
        "master": "db1151",
        "min_replicas": 1,
        "omit_replicas_in_mwconfig": true,
        "readonly": false,
        "ro_reason": "test"
    }
}
root@cumin1001:~#

So, per dbctl, the replicas aren't supposed to be used anymore, and this is visible here: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-job=All&var-server=db1152&var-port=9104&from=1661993853796&to=1662069003226

Change 828677 merged by Tim Starling:

[operations/puppet@production] Multi-DC: go back to testwiki only

https://gerrit.wikimedia.org/r/828677

Mentioned in SAL (#wikimedia-operations) [2022-09-05T11:15:57Z] <tstarling@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on db[2142-2144].codfw.wmnet with reason: T316847 x2 failure test

Mentioned in SAL (#wikimedia-operations) [2022-09-05T11:16:12Z] <tstarling@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2142-2144].codfw.wmnet with reason: T316847 x2 failure test

There are log entries like:

2022-09-05 10:22:36.951647 [764c3497-8ff4-4ca2-a60c-1a7122c6943f] mw2320 testwiki 1.39.0-wmf.27 DBReplication DEBUG: Wikimedia\Rdbms\ChronologyProtector::applySessionReplicationPosition: extension2 (db2142) has no position

Not sure why it's checking the replication position, but there's no error, just a debug message. Viewing and editing work just fine, using codfw with stopped replication on x2.

Mentioned in SAL (#wikimedia-operations) [2022-09-05T11:29:58Z] <TimStarling> on db2142: set master_delay=30 and restarted replication T316847

Mentioned in SAL (#wikimedia-operations) [2022-09-05T11:37:36Z] <TimStarling> on db2142: dropping inbound mysql traffic T316847

With -j DROP there were plenty of errors logged in the DBConnection channel, and from the log timestamps it looks like the connection timeout was 3s. No problem with page views or edits. They seemed slow, but I tested action=edit again after I dropped the iptables rule and it was about as slow (~3.5 second response time). There was no spike in the exception channel.

Mentioned in SAL (#wikimedia-operations) [2022-09-05T11:55:22Z] <TimStarling> on db2142: rejecting inbound mysql traffic T316847

In the -j REJECT test I found a better manual test: clicking on the Echo notification bell. It triggered a DBConnection log entry, but no user-visible error.

2022-09-05 11:56:21.663667 [7f29d1c3-a634-4d13-b4ab-f13b7ef46af9] mw2366 testwiki 1.39.0-wmf.27 DBConnection ERROR: Error connecting to db2142 as user wikiuser202206: :real_connect(): (HY000/2002): Connection refused {"db_server":"db2142","db_name":"mainstash","db_user":"wikiuser202206","error":":real_connect(): (HY000/2002): Connection refused","db_log_category":"connection"}

As with -j DROP, there was a spike of DBConnection log entries but no associated exception log entries.
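The difference between the two iptables tests is visible from any client: -j DROP silently discards the SYN, so the connect attempt hangs until the client's own timeout fires (the ~3s seen in the logs), while -j REJECT answers immediately, which is why MediaWiki logged "Connection refused" at once. A minimal client-side illustration, assuming nothing listens on local port 1 and that 203.0.113.1 (a TEST-NET-3 documentation address) does not respond on the local network:

```shell
#!/bin/bash
# REJECT-like behaviour: connecting to a closed port is answered
# immediately with a TCP RST, so the client fails at once.
bash -c 'exec 3<>/dev/tcp/127.0.0.1/1' 2>/dev/null
echo "refused: exit=$?"

# DROP-like behaviour: packets to a black-holed address are silently
# discarded, so the client only gives up when its own timeout expires.
# On some networks an intermediate router may answer instead, making
# this fail fast rather than time out.
timeout 3 bash -c 'exec 3<>/dev/tcp/203.0.113.1/3306' 2>/dev/null
echo "dropped: exit=$?"   # exit 124 indicates the 3-second timeout fired
```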

In summary, nothing unexpected. Stopping or delaying replication causes a split brain; there are probably subtle consequences, but the site didn't go down, and that's all we were really checking for here. With iptables simulating failure of the x2 master, the logs showed graceful failure after a 3-second connect timeout.

tstarling claimed this task.
tstarling updated the task description. (Show Details)

Change 830052 had a related patch set uploaded (by Marostegui; author: Marostegui):

[operations/puppet@production] x2 replicas: Disable notifications

https://gerrit.wikimedia.org/r/830052

Change 830052 merged by Marostegui:

[operations/puppet@production] x2 replicas: Disable notifications

https://gerrit.wikimedia.org/r/830052

I have disabled notifications on all x2 replicas. They won't page, but they'll still show up in Icinga if something breaks.