
Progressive Multi-DC roll out
Closed, Resolved · Public

Description

Proposed plan for discussion

Stage 0: X-Wikimedia-Debug

After the data store patches are deployed (811394, 809326), we can use an X-Wikimedia-Debug header to test basic operations for correctness. This does not use multi-dc.lua, so even POST requests will go to the secondary DC, but it should still work, with a latency penalty.

  • CentralAuth explicit login
  • CentralAuth auto login (Special:CentralAutoLogin)
  • Edit token, page save
  • AbuseFilter profile update
  • Upload stash
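
As a convenience, here is a minimal sketch of how one of the checks above could be scripted, using PHP's curl bindings. The X-Wikimedia-Debug value syntax and the mwdebug host name are assumptions to verify against the WikimediaDebug documentation, not something this plan prescribes:

<?php
// Hedged sketch: pin a request to a codfw debug backend via X-Wikimedia-Debug.
// The "backend=..." syntax and mwdebug2001.codfw.wmnet are assumptions; check
// the WikimediaDebug docs for the current values.
$ch = curl_init( 'https://test.wikipedia.org/wiki/Special:BlankPage' );
curl_setopt_array( $ch, [
	CURLOPT_HTTPHEADER => [ 'X-Wikimedia-Debug: backend=mwdebug2001.codfw.wmnet' ],
	CURLOPT_RETURNTRANSFER => true,
	CURLOPT_HEADER => true,
] );
$response = curl_exec( $ch );
// Inspect the status line and response headers to confirm which backend served the request.
printf( "HTTP %d\n%s\n", curl_getinfo( $ch, CURLINFO_RESPONSE_CODE ),
	substr( $response, 0, strpos( $response, "\r\n\r\n" ) ) );
curl_close( $ch );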

Stage 1: testwiki

Configure multi-dc.lua to use the local mode for test.wikipedia.org. This is the first test of multi-dc.lua in production.

  • Method-dependent (GET/POST) routing including page save (see the routing sketch after this list)
  • UseDC=master cookie including MW integration (hasOrMadeRecentPrimaryChanges)
  • Rollback
  • API action=centralauthtoken
  • Echo user talk bell icon, page view dismiss
  • Echo user talk email
  • API edit (e.g. HotCat), high latency
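
Not the deployed Lua, but a rough PHP paraphrase of the routing decision described by the first two bullets above, to make the intent concrete (the function and parameter names are invented for illustration, not taken from multi-dc.lua):

// Illustrative only: route unsafe methods and clients with a recent write
// (UseDC=master, set via hasOrMadeRecentPrimaryChanges) to the primary DC,
// and serve everything else locally.
function chooseDatacenter( string $method, array $cookies, string $localDc, string $primaryDc ): string {
	if ( !in_array( $method, [ 'GET', 'HEAD', 'OPTIONS' ], true ) ) {
		return $primaryDc; // writes must go to the primary DC
	}
	if ( ( $cookies['UseDC'] ?? '' ) === 'master' ) {
		return $primaryDc; // sticky routing so a client can read its own writes
	}
	return $localDc; // read traffic stays in the local DC
}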

Stage 1.5: test2wiki

  • Page view performance
  • Parser cache miss performance
  • API edit worst case test rig
  • pywikibot

Stage 2: mediawikiwiki

  • Beta test period

Stage 3: traffic percentage

Sending 2% of eligible traffic to the local DC allows capacity modelling.

  • Observe cross-DC database connection rate, analyse sources
  • Observe cross-DC mcrouter queries
  • Observe load on local services (DB replicas, mcrouter, etc.)

Stage 4: full deployment

This is the first stage that gives us reproducible testing of loginwiki routing. So the CentralAuth tests should be repeated at this stage.

  • CentralAuth explicit login
  • CentralAuth auto login
  • CentralAuth account creation, cpPosIndex query parameter (appendShutdownCPIndexAsQuery)

Considerations

Some things to think about:

  • By edge DC (e.g. esams, ulsfo, eqsin).
  • By type of traffic (e.g. no-session logged-outs first, then sessioned users who use a Beta Feature cookie, then all sessioned/logged-ins).
  • By wiki (e.g. test wikis, then internal wikis, then select early adopter wikis).
  • By wiki project (e.g. Wiktionary, then Wikidata, then Commons, then Wikipedia).
  • By percentage of traffic.

Of course these can be combined as well. For example:

  1. testwikis logged-outs (100% of logged-outs from any DC).
  2. Wiktionary logged-outs (one DC at a time, slowly ramp up percentage one after the other).
  3. etc.

Note that WikimediaDebug already allows early testing of this today (both logged-out and logged-in) at the app layer, though it bypasses traffic routing, so it does not exercise user flows such as a POST going to the primary DC, the sticky DC cookie being set, and the client then returning to the secondary DC.

In particular, rolling out to no-session users before sessioned/logged-in users will, I think, give us early learnings and progress while other parts are still being worked on, since it involves fewer moving parts (no session storage, no significant DB writes) and allows teams to work in parallel rather than being blocked until everything is "done" at once.

Event Timeline

tstarling renamed this task from Decide on details of progressive Multi-DC roll out to Progressive Multi-DC roll out.Jul 6 2022, 2:18 AM
tstarling updated the task description.

I edited the task description with a proposed rollout plan, and I renamed the task to encompass the actual work, not just deciding on the work.

Change 815403 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Switch testwiki to multi-DC active/active mode

https://gerrit.wikimedia.org/r/815403

Change 815834 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Don't send debug log from test2wiki to testwiki.log

https://gerrit.wikimedia.org/r/815834

Change 815834 merged by jenkins-bot:

[operations/mediawiki-config@master] Don't send debug log from test2wiki to testwiki.log

https://gerrit.wikimedia.org/r/815834

Change 815403 merged by Tim Starling:

[operations/puppet@production] Switch testwiki to multi-DC active/active mode

https://gerrit.wikimedia.org/r/815403

Change 818652 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Discovery: codfw should be pooled for api-ro and appservers-ro

https://gerrit.wikimedia.org/r/818652

Initial ab run:

[0118][tstarling@mwmaint2002:~]$ ab -n100 -X mw2377.codfw.wmnet:80 -H 'X-Forwarded-Proto: https' http://test2.wikipedia.org/wiki/Foobar
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking test2.wikipedia.org [through mw2377.codfw.wmnet:80] (be patient).....done


Server Software:        mw2377.codfw.wmnet
Server Hostname:        test2.wikipedia.org
Server Port:            80

Document Path:          /wiki/Foobar
Document Length:        28902 bytes

Concurrency Level:      1
Time taken for tests:   14.359 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      2946300 bytes
HTML transferred:       2890200 bytes
Requests per second:    6.96 [#/sec] (mean)
Time per request:       143.592 [ms] (mean)
Time per request:       143.592 [ms] (mean, across all concurrent requests)
Transfer rate:          200.38 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   120  143  24.7    141     367
Waiting:      120  143  24.7    141     366
Total:        121  144  24.7    142     367

Percentage of the requests served within a certain time (ms)
  50%    142
  66%    146
  75%    149
  80%    151
  90%    155
  95%    158
  98%    164
  99%    367
 100%    367 (longest request)

[0118][tstarling@mwmaint1002:~]$ ab -n100 -X mw1441.eqiad.wmnet:80 -H 'X-Forwarded-Proto: https' http://test2.wikipedia.org/wiki/Foobar
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking test2.wikipedia.org [through mw1441.eqiad.wmnet:80] (be patient).....done


Server Software:        mw1441.eqiad.wmnet
Server Hostname:        test2.wikipedia.org
Server Port:            80

Document Path:          /wiki/Foobar
Document Length:        28902 bytes

Concurrency Level:      1
Time taken for tests:   12.460 seconds
Complete requests:      100
Failed requests:        3
   (Connect: 0, Receive: 0, Length: 3, Exceptions: 0)
Total transferred:      2946294 bytes
HTML transferred:       2890197 bytes
Requests per second:    8.03 [#/sec] (mean)
Time per request:       124.600 [ms] (mean)
Time per request:       124.600 [ms] (mean, across all concurrent requests)
Transfer rate:          230.92 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:    96  124  16.1    124     173
Waiting:       95  124  16.2    124     172
Total:         96  125  16.1    124     173

Percentage of the requests served within a certain time (ms)
  50%    124
  66%    130
  75%    132
  80%    135
  90%    150
  95%    153
  98%    157
  99%    173
 100%    173 (longest request)

A possible reason for the slightly slower times on codfw is cross-DC connections for LoadBalancer::isPrimaryRunningReadOnly(). While running ab -n100, I saw 17 cross-DC database connections, i.e. about one per second. I looked more closely at one of them with tcpdump, a connection to db1157 due to an empty read-only cache, and it looked like it added about 200ms of latency.

I looked more closely at one of them with tcpdump

So are cross-DC connections happening in plain text?

@jcrespo No, cross-DC DB connections are encrypted but you can figure out what's going on by looking at surrounding (DC-local) memcached traffic.

All cross-DC connections except the first had an associated statsd metric MediaWiki.wanobjectcache.rdbms_server_readonly.hit.refresh, which implies that the queries were deferred until post-send, and the latency should have been hidden from the benchmark.

@jcrespo No, cross-DC DB connections are encrypted but you can figure out what's going on by looking at surrounding (DC-local) memcached traffic.

Ok, thank you, I got worried for a second :-D.

A possible reason for the slightly slower times on codfw is cross-DC connections for LoadBalancer::isPrimaryRunningReadOnly(). While running ab -n100, I saw 17 cross-DC database connections, i.e. about one per second. I looked more closely at one of them with tcpdump, a connection to db1157 due to an empty read-only cache, and it looked like it added about 200ms of latency.

Do we expect that to happen regularly on a high percentage of requests? If 17% of all requests need to make a cross-dc connection then it would need some thought.

Are you proposing to do away with the concept of "active" DC, then? e.g. currently swiftrepl runs from the active DC to fix up where MW failed to create / delete objects in both ms swift clusters. This enables us to answer "bucket a in codfw has object X, bucket a in eqiad does not - should X be added to eqiad, or removed from codfw?"
Without that, I'm not sure what we can do to work around the fact that MW doesn't reliably write/delete to both swift clusters...

Are you proposing to do away with the concept of "active" DC, then? e.g. currently swiftrepl runs from the active DC to fix up where MW failed to create / delete objects in both ms swift clusters. […]

In short:

  • No, there remains a "primary" DC for write actions.
  • Yes, "active" in the sense of "pooled" will no longer be limited to one DC, as both DCs will generally be pooled for read traffic.

I have not empirically verified this, but it is my understanding that, today, when someone views a URL on upload.wikimedia.org, their nearby cache POP has a cache miss for that URL, and the nearest application DC is codfw rather than eqiad, we serve that cache miss from Swift in codfw, not eqiad. Thus, I believe Swift is already active-active for reads.

The MediaWiki Multi-DC project is specifically about read traffic, to do essentially the same thing. It took many years of preparation because a great deal of operational, coordination and caching logic assumed MediaWiki was operated from one DC only.

There remains a primary DC, which is where HTTP/database/Swift writes take place. We do not plan to do uploads and writes in a distributed fashion; that is quite a different project, and one that would require many years of effort (think auto-increment primary keys in MySQL, eventual consistency, split-brain). It is not something I expect us to consider for a good long while, if at all. The RTT between data centers is fairly small compared to everything else under our control that we can optimise or whose latency we can mask, so the investment and ongoing maintenance cost of something that complex would not be appealing for the ~30ms gain. As an example, saving an edit currently takes ~1300ms from the client side.

Having said that, it's good that you're raising the issue with swiftrepl. While MediaWiki does not rely on or assume this about Swift, and it is separate from the MW Multi-DC project, I had actually thought that Swift was active-active today even for writes to upload originals and thumbnails. I understand SRE wanted that, as it saves work around primary DC switchovers when a service supports active-active writes. Based on T125791 and T204245 and past switchovers (where Swift and MediaWiki switch two days apart from each other), I got the impression that "Swift active-active for writes" was essentially complete. My guess is that swiftrepl uses the notion of a primary DC to inform how it breaks ties and resolves conflicts, but that those familiar with both MediaWiki and Swift (not me) have enough confidence in it working correctly even when it's "the wrong way around" that we switch these services two days apart. Feel free to correct and fill in the gaps! :-)

Do we expect that to happen regularly on a high percentage of requests? If 17% of all requests need to make a cross-dc connection then it would need some thought.

No. If I understand it correctly (it's Aaron's code), worthRefreshExpiring() causes the probability of a refresh for a given request to increase linearly from 0 to 100% over the 5 seconds since the last refresh. With equally spaced requests, I derive a refresh rate as follows:

| Request rate (req/s) | Refresh rate (req/s) |
| 10 | 1.2 |
| 20 | 1.6 |
| 40 | 2.3 |
| 80 | 3.2 |
| 160 | 4.6 |
| 320 | 6.4 |
| 640 | 9.1 |
| 1280 | 12.8 |
| 2560 | 18.1 |
| 5120 | 25.6 |
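
In symbols (my restatement of what the code below computes): with equally spaced requests at rate r, request k arrives at age t_k = k/r after the last refresh and triggers a refresh with probability p(t_k) = min(t_k/5, 1), so the expected refresh interval is

\[
\mathbb{E}[T] = \sum_{k \ge 1} t_k \, p(t_k) \prod_{j=1}^{k-1} \bigl( 1 - p(t_j) \bigr),
\qquad \text{refresh rate} = \frac{1}{\mathbb{E}[T]}.
\]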

I used the following code to derive the expected value of the refresh period. It's just a probability tree, nothing too advanced:

// Expected refresh interval for equally spaced requests, where the chance of a
// refresh on a given request ramps linearly from 0 to 1 over the 5 seconds
// since the last refresh.
for ( $rate = 10; $rate <= 10000; $rate *= 2 ) {
	$P = 1;   // probability that no refresh has happened yet
	$sum = 0; // expected time of the next refresh
	for ( $t = 0; $P > 0; $t += 1 / $rate ) {
		$p = $t / 5;          // chance of a refresh for a request at age $t
		$sum += $t * $p * $P; // weight this outcome by its probability
		$P *= 1 - $p;         // descend the "no refresh yet" branch
	}
	printf( "| %d | %.1f |\n", $rate, 1 / $sum );
}

I validated it against a simulation:

<?php

// Simulate 1,000,000 equally spaced requests at $rate req/s, refreshing with a
// probability that ramps from 0 to 1 over $effectiveLowTTL seconds since the
// last refresh, then report the resulting refresh rate.
$rate = 160;
$numRefreshes = 0;
$lastRefreshTime = 0;
$effectiveLowTTL = 5;

for ( $t = 0; $t < 1000000 / $rate; $t += 1 / $rate ) {
	$age = $t - $lastRefreshTime;
	$curTTL = max( $effectiveLowTTL - $age, 0 );
	$chance = ( 1 - $curTTL / $effectiveLowTTL );
	if ( mt_rand( 1, 1000000000 ) <= 1000000000 * $chance ) {
		$numRefreshes++;
		$lastRefreshTime = $t;
	}
}

printf( "%.1f refreshes/s\n", $numRefreshes / ( 1000000 / $rate ) );

Without that, I'm not sure what we can do to work around the fact that MW doesn't reliably write/delete to both swift clusters...

filebackend.php configures the way MediaWiki writes to Swift. It says

	# DO NOT change the master backend unless it is fully trusted or autoRsync is off
	'backends'    => [
		[ 'template' => 'local-swift-eqiad', 'isMultiMaster' => true ],
	],
...
$localMultiWriteFileBackend['backends'][] = [ 'template' => 'local-swift-codfw' ];

Well, we're not changing the master backend at least. Referring to FileBackendMultiWrite.php line 184:

		// Propagate the operations to the clone backends if there were no unexpected errors
		// and everything didn't fail due to predicted errors. If $ops only had one operation,
		// this might avoid backend sync inconsistencies.
		if ( $masterStatus->isOK() && $masterStatus->successCount > 0 ) {

So if MediaWiki, running in codfw, attempts to do a Swift write, it will write to eqiad first, and if the eqiad write fails, it will be counted as a failure of the whole operation, and the write to codfw will not continue. So the consistency of the two Swift clusters should not be affected by this deployment.

I also reviewed the situation with reads. filebackend.php has:

		// When used by FileBackendMultiWrite, read from this cluster if it's the local one
		'readAffinity'       => ( $specificDC === $wmgDatacenter ),

So MW running in codfw should have readAffinity=true for the codfw backends, which should cause reads to go to the codfw replicas.

I tried a warmup request followed by another request for the same page view, the second having MW logging enabled with microsecond resolution. I saw 45ms spent in LoadMonitor::getServerStates() in the second request, pre-send. It uses WANObjectCache with a TTL of 1s, with an extra layer of ramp-up over 500ms implemented in the caller:

		$ageStaleSec = mt_rand( 1, self::POLL_PERIOD_MS ) / 1e3;
		$minAsOfTime = $this->getCurrentTime() - $ageStaleSec;

which I think gives it the following refresh rates:

| Request rate (req/s) | Refresh rate (req/s) |
| 10 | 2.2 |
| 20 | 2.8 |
| 40 | 3.4 |
| 80 | 4.3 |
| 160 | 5.4 |
| 320 | 6.7 |
| 640 | 8.5 |
| 1280 | 10.6 |
| 2560 | 13.4 |
| 5120 | 16.9 |

In general, this kind of cluster maintenance complicates comparative benchmarking. It would be easier if we just had a single process monitoring DB states in a loop.

I tested parse times with ab -n10 -H'X-Forwarded-Proto: https' -X mw1441.eqiad.wmnet:80 'http://test2.wikipedia.org/w/api.php?action=parse&format=json&page=Greece&prop=limitreportdata&wrapoutputclass=test2perf'. While I did the test in codfw, I kept the cluster warm with page view requests to a different server. Median service times were 2749ms in eqiad and 2887ms in codfw, which I think is close enough.

The main point of these tests is to detect configuration errors which cause cross-DC requests. It's too early to detect an end-user latency benefit or to debug the lack of such a benefit. In any case, the best argument for multi-DC is that it doubles our appserver capacity, not that it improves end-user latency.

It looks like pywikibot is a decent test rig for session replication races. I installed pywikibot on a Linode instance in the Dallas region, which has a ping time to codfw of 3ms. I added millisecond timestamps and a raw request log (P32319). A typical invocation starts with centralauth_Token and centralauth_Session cookies from the persistent cookie jar, but no local session cookies, so a local session is created each time. The API edit request is posted about 1ms after the CSRF tokens are received.

Now I'm just waiting for codfw to be repooled in appservers-ro.

I haven't figured out a way to do a full integration test of appendShutdownCPIndexAsQuery() in stage 1.5. The only thing I've found that triggers it is CentralAuth account creation, so that may have to wait until stage 4. I can do an isolated test of that part of multi-dc.lua.

I did a pywikibot edit on testwiki from my Dallas test instance. The time between the completion of the last codfw sessionstore write and the eqiad sessionstore fetch was 109ms, so the edit was successful. Of that 109ms, about 50ms could be attributed to the time between pywikibot beginning its POST request and the first log entry appearing in testwiki.log, i.e. connection establishment, cross-DC HTTP proxying and MW startup. Then 20ms was consumed between the first log entry and the sessionstore fetch.

It's a race between Cassandra replication, and HTTP routing over the same distance plus MW startup time. Cassandra is going to win unless something goes wrong with it, like packet loss. We could make it more robust by writing the edit token to the session when the session is created, rather than waiting until the meta=tokens request arrives. But it's good enough for me for now.

Change 822496 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] SessionManager: write default edit token during session creation

https://gerrit.wikimedia.org/r/822496

Change 818652 merged by Tim Starling:

[operations/puppet@production] Discovery: codfw should be pooled for api-ro and appservers-ro

https://gerrit.wikimedia.org/r/818652

Change 822496 merged by jenkins-bot:

[mediawiki/core@master] SessionManager: write default edit token during session creation

https://gerrit.wikimedia.org/r/822496

The rollout was reverted back to stage 0 on August 15 due to T315271. I just reverted the revert, so it will be running on mediawiki.org once the puppet run finishes.

Stage 3 (traffic percentage) is useful for capacity modelling, but it's not expected to be optimal for data store consistency, since keeping a client's routing stable to one DC helps to hide session and mainstash replication delay.

I suggest deploying stage 3, doing the tests listed in the task description, and then proceeding to stage 4 on the same day. If stage 3 shows problems, we would depool codfw (i.e. revert to stage 0) rather than leaving it deployed.

Change 827616 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Multi-DC stage 3: send 2% of traffic to appservers-ro

https://gerrit.wikimedia.org/r/827616

Change 827617 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Multi-DC stage 4: send all traffic to appservers-ro

https://gerrit.wikimedia.org/r/827617

Planning for stage 3/4 capacity monitoring.

Observe cross-DC database connection rate, analyse sources

In the DBPerformance logs, we see a daily peak connection rate of around 4.7 master connections per second on GET requests. Around 27% of backend traffic comes from eqsin, ulsfo and codfw according to this dashboard, so we can expect to see around 1.3 cross-DC DB connections per second at full load, or 0.026 cross-DC DB connections per second (92 per hour) at 2% load.
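
As a quick arithmetic check of those figures (inputs are the numbers quoted above; the small differences from 1.3, 0.026 and 92 are rounding):

<?php
// Back-of-the-envelope check of the estimate above.
$primaryConnsPerSec = 4.7;  // daily peak primary DB connections/s on GETs (DBPerformance logs)
$remoteTrafficShare = 0.27; // share of backend traffic from eqsin, ulsfo and codfw

$fullLoad = $primaryConnsPerSec * $remoteTrafficShare; // cross-DC conns/s at 100%
$twoPercent = $fullLoad * 0.02;                        // cross-DC conns/s at 2%
printf( "full load: %.2f/s; 2%%: %.3f/s (%.0f/hour)\n",
	$fullLoad, $twoPercent, $twoPercent * 3600 );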

This will be dwarfed by connections from LoadBalancer::isPrimaryRunningReadOnly(), which apparently do not trigger a DBPerformance log entry. My modelling at T279664#8130148 indicates that we can expect around 13 connections per second at full load, or 3.7 connections per second at 2% load, from this source. These connections will be detectable using tcpdump.

Observe cross-DC mcrouter queries

I think this will be a Grafana panel similar to this graph of mcrouter "FOUND" responses.

Observe load on local services (DB replicas, mcrouter, etc.)

Connection count, CPU usage.

Change 828677 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Multi-DC: go back to testwiki only

https://gerrit.wikimedia.org/r/828677

I'm aiming to do stage 3 and 4 on September 6.

Change 828677 merged by Tim Starling:

[operations/puppet@production] Multi-DC: go back to testwiki only

https://gerrit.wikimedia.org/r/828677

Change 827616 merged by Tim Starling:

[operations/puppet@production] Multi-DC stage 3: send 2% of traffic to appservers-ro

https://gerrit.wikimedia.org/r/827616

Observe cross-DC database connection rate, analyse sources

It's not necessary to use tcpdump since we can just look at SSL connection counts. I added a dashboard panel for this. We're seeing an extra 7 req/s at 2%.

stage3 cross-dc mysql conns.png (image attachment)

Observe cross-DC mcrouter queries

I added a dashboard panel with a mean connection count metric. On codfw, it is noisy but not concerning.

stage3 codfw mcrouter.png (image attachment)

Observe load on local services (DB replicas, mcrouter, etc.)

  • Appserver CPU usage increased by about 0.1%
  • Appserver network TX increased by about 4 MB/s. Compare eqiad network usage 380 - 400 MB/s.
  • No obvious change in memcached or MySQL CPU usage, lost in the noise.

I made this all clusters utilization dashboard so that I could easily see if anything was running out of CPU or memory without having to check each cluster separately. AQS jumped up at around 01:33, which is not correlated with the deployment at 00:54 so probably unrelated. Similarly there was a jump on WDQS at 01:27. Nothing is showing a >2% jump in the puppet deployment window 00:54 - 01:24 so my conclusion is that there are no capacity blockers for stage 4 deployment at this time.

Diurnal variation: we are near the daily peak of codfw/ulsfo/eqsin traffic. So if it works now, it should keep working through the day.

stage3 diurnal model.png (image attachment)

Change 827617 merged by Tim Starling:

[operations/puppet@production] Multi-DC stage 4: send all traffic to appservers-ro

https://gerrit.wikimedia.org/r/827617

MySQL cross-DC traffic is higher than expected, with 110 conns/s.

Appserver CPU usage is fine. Mcrouter connection rates are fine.

I captured cross-DC queries on the s3 master (db1157) using SHOW PROCESSLIST in a loop, once per second for 20 minutes. Out of 10 captured queries, 7 were serverIsReadOnly() and the other 3 were the initial SET query from DatabaseMysqlBase::open(). So while my estimate might have been off, it looks like serverIsReadOnly() dominates cross-DC DB traffic, as expected.
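
For reproducibility, roughly what that capture amounts to; the credentials and the codfw source-address prefix below are placeholders/assumptions rather than the exact commands used:

<?php
// Sketch: poll SHOW PROCESSLIST once per second for 20 minutes and log any
// connections whose client address looks like it comes from the remote DC.
// The credentials and the 10.192. prefix (codfw private range) are assumptions.
$db = new mysqli( 'db1157.eqiad.wmnet', 'watch_user', 'secret' );
for ( $i = 0; $i < 1200; $i++ ) {
	foreach ( $db->query( 'SHOW PROCESSLIST' ) as $row ) {
		if ( strpos( $row['Host'], '10.192.' ) === 0 ) {
			printf( "%s\t%s\t%s\n", date( 'c' ), $row['Host'], $row['Info'] ?? '' );
		}
	}
	sleep( 1 );
}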

I did the CentralAuth tests mentioned in the task description.

The serverIsReadOnly() cache key includes the DB hostname, so I should have done my calculation per section rather than globally.

| Section | Cross-DC connection rate (req/s) |
| es4 | 0.00 |
| es5 | 0.00 |
| s1 | 9.21 |
| s2 | 19.7 |
| s3 | 53.2 |
| s4 | 7.02 |
| s5 | 4.33 |
| s6 | 7.41 |
| s7 | 21.4 |
| s8 | 2.1 |
| x1 | 0.25 |
| x2 | 0.00 |
tstarling claimed this task.

Here's a model of the benefit of the multi-DC project for users west of codfw. The servers are about 30ms closer, but codfw itself appears to be a bit slower, so taking 30ms minus the codfw slowdown gives a net benefit of roughly 15ms.

multi-dc benefit model.png (image attachment)

During deployment, we saw eqsin backend latency ramp down by about 10ms, which is close enough to count as rough support for this model.