
Progressive Multi-DC roll out
Closed, Resolved · Public

Description

Proposed plan for discussion

Stage 0: X-Wikimedia-Debug

After the data store patches are deployed (811394, 809326), we can use an X-Wikimedia-Debug header to test basic operations for correctness. This does not use multi-dc.lua, so even POST requests will go to the secondary DC, but it should still work, with a latency penalty.

  • CentralAuth explicit login
  • CentralAuth auto login (Special:CentralAutoLogin)
  • Edit token, page save
  • AbuseFilter profile update
  • Upload stash
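
As a convenience, here is a minimal sketch of how one of the checks above could be scripted, using PHP's curl bindings. The X-Wikimedia-Debug value syntax and the mwdebug host name are assumptions to verify against the WikimediaDebug documentation, not something this plan prescribes:

<?php
// Hedged sketch: pin a request to a codfw debug backend via X-Wikimedia-Debug.
// The "backend=..." syntax and mwdebug2001.codfw.wmnet are assumptions; check
// the WikimediaDebug docs for the current values.
$ch = curl_init( 'https://test.wikipedia.org/wiki/Special:BlankPage' );
curl_setopt_array( $ch, [
	CURLOPT_HTTPHEADER => [ 'X-Wikimedia-Debug: backend=mwdebug2001.codfw.wmnet' ],
	CURLOPT_RETURNTRANSFER => true,
	CURLOPT_HEADER => true,
] );
$response = curl_exec( $ch );
// Inspect the status line and response headers to confirm which backend served the request.
printf( "HTTP %d\n%s\n", curl_getinfo( $ch, CURLINFO_RESPONSE_CODE ),
	substr( $response, 0, strpos( $response, "\r\n\r\n" ) ) );
curl_close( $ch );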

Stage 1: testwiki

Configure multi-dc.lua to use the local mode for test.wikipedia.org. This is the first test of multi-dc.lua in production.

  • Method-dependent (GET/POST) routing including page save (see the routing sketch after this list)
  • UseDC=master cookie including MW integration (hasOrMadeRecentPrimaryChanges)
  • Rollback
  • API action=centralauthtoken
  • Echo user talk bell icon, page view dismiss
  • Echo user talk email
  • API edit (e.g. HotCat), high latency
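
Not the deployed Lua, but a rough PHP paraphrase of the routing decision described by the first two bullets above, to make the intent concrete (the function and parameter names are invented for illustration, not taken from multi-dc.lua):

// Illustrative only: route unsafe methods and clients with a recent write
// (UseDC=master, set via hasOrMadeRecentPrimaryChanges) to the primary DC,
// and serve everything else locally.
function chooseDatacenter( string $method, array $cookies, string $localDc, string $primaryDc ): string {
	if ( !in_array( $method, [ 'GET', 'HEAD', 'OPTIONS' ], true ) ) {
		return $primaryDc; // writes must go to the primary DC
	}
	if ( ( $cookies['UseDC'] ?? '' ) === 'master' ) {
		return $primaryDc; // sticky routing so a client can read its own writes
	}
	return $localDc; // read traffic stays in the local DC
}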

Stage 1.5: test2wiki

  • Page view performance
  • Parser cache miss performance
  • API edit worst case test rig
  • pywikibot

Stage 2: mediawikiwiki

  • Beta test period

Stage 3: traffic percentage

Sending 2% of eligible traffic to the local DC allows capacity modelling.

  • Observe cross-DC database connection rate, analyse sources
  • Observe cross-DC mcrouter queries
  • Observe load on local services (DB replicas, mcrouter, etc.)

Stage 4: full deployment

This is the first stage that gives us reproducible testing of loginwiki routing. So the CentralAuth tests should be repeated at this stage.

  • CentralAuth explicit login
  • CentralAuth auto login
  • CentralAuth account creation, cpPosIndex query parameter (appendShutdownCPIndexAsQuery)

Considerations

Some things to think about:

  • By edge DC (e.g. esams, ulsfo, eqsin).
  • By type of traffic (e.g. no-session logged-outs first, then sessioned users who use a Beta Feature cookie, then all sessioned/logged-ins).
  • By wiki (e.g. test wikis, then internal wikis, then select early adopter wikis).
  • By wiki project (e.g. Wiktionary, then Wikidata, then Commons, then Wikipedia).
  • By percentage of traffic.

Of course these can be combined as well. For example:

  1. testwikis logged-outs (100% of logged-outs from any DC).
  2. Wiktionary logged-outs (one DC at a time, slowly ramp up percentage one after the other).
  3. etc.

Note that WikimediaDebug already allows early testing of this today (both logged-out and logged-in) at the app layer, though it bypasses traffic routing, so it does not exercise user flows such as a POST going to the primary DC, the sticky DC cookie being set, and the client then returning to the secondary DC.

In particular, rolling out to no-session users before sessioned/logged-in users will, I think, give us early learnings and progress while other parts are still being worked on, since it involves fewer moving parts (no session storage, no significant DB writes) and allows teams to work in parallel rather than being blocked until everything is "done" at once.

Event Timeline

tstarling renamed this task from Decide on details of progressive Multi-DC roll out to Progressive Multi-DC roll out.Jul 6 2022, 2:18 AM
tstarling updated the task description.

I edited the task description with a proposed rollout plan, and I renamed the task to encompass the actual work, not just deciding on the work.

Change 815403 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Switch testwiki to multi-DC active/active mode

https://gerrit.wikimedia.org/r/815403

Change 815834 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/mediawiki-config@master] Don't send debug log from test2wiki to testwiki.log

https://gerrit.wikimedia.org/r/815834

Change 815834 merged by jenkins-bot:

[operations/mediawiki-config@master] Don't send debug log from test2wiki to testwiki.log

https://gerrit.wikimedia.org/r/815834

Change 815403 merged by Tim Starling:

[operations/puppet@production] Switch testwiki to multi-DC active/active mode

https://gerrit.wikimedia.org/r/815403

Change 818652 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Discovery: codfw should be pooled for api-ro and appservers-ro

https://gerrit.wikimedia.org/r/818652

Initial ab run:

[0118][tstarling@mwmaint2002:~]$ ab -n100 -X mw2377.codfw.wmnet:80 -H 'X-Forwarded-Proto: https' http://test2.wikipedia.org/wiki/Foobar
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking test2.wikipedia.org [through mw2377.codfw.wmnet:80] (be patient).....done


Server Software:        mw2377.codfw.wmnet
Server Hostname:        test2.wikipedia.org
Server Port:            80

Document Path:          /wiki/Foobar
Document Length:        28902 bytes

Concurrency Level:      1
Time taken for tests:   14.359 seconds
Complete requests:      100
Failed requests:        0
Total transferred:      2946300 bytes
HTML transferred:       2890200 bytes
Requests per second:    6.96 [#/sec] (mean)
Time per request:       143.592 [ms] (mean)
Time per request:       143.592 [ms] (mean, across all concurrent requests)
Transfer rate:          200.38 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:   120  143  24.7    141     367
Waiting:      120  143  24.7    141     366
Total:        121  144  24.7    142     367

Percentage of the requests served within a certain time (ms)
  50%    142
  66%    146
  75%    149
  80%    151
  90%    155
  95%    158
  98%    164
  99%    367
 100%    367 (longest request)

[0118][tstarling@mwmaint1002:~]$ ab -n100 -X mw1441.eqiad.wmnet:80 -H 'X-Forwarded-Proto: https' http://test2.wikipedia.org/wiki/Foobar
This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking test2.wikipedia.org [through mw1441.eqiad.wmnet:80] (be patient).....done


Server Software:        mw1441.eqiad.wmnet
Server Hostname:        test2.wikipedia.org
Server Port:            80

Document Path:          /wiki/Foobar
Document Length:        28902 bytes

Concurrency Level:      1
Time taken for tests:   12.460 seconds
Complete requests:      100
Failed requests:        3
   (Connect: 0, Receive: 0, Length: 3, Exceptions: 0)
Total transferred:      2946294 bytes
HTML transferred:       2890197 bytes
Requests per second:    8.03 [#/sec] (mean)
Time per request:       124.600 [ms] (mean)
Time per request:       124.600 [ms] (mean, across all concurrent requests)
Transfer rate:          230.92 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:    96  124  16.1    124     173
Waiting:       95  124  16.2    124     172
Total:         96  125  16.1    124     173

Percentage of the requests served within a certain time (ms)
  50%    124
  66%    130
  75%    132
  80%    135
  90%    150
  95%    153
  98%    157
  99%    173
 100%    173 (longest request)

A possible reason for the slightly slower times on codfw is cross-DC connections for LoadBalancer::isPrimaryRunningReadOnly(). While running ab -n100, I saw 17 cross-DC database connections, i.e. about one per second. I looked more closely at one of them with tcpdump, a connection to db1157 due to an empty read-only cache, and it looked like it added about 200ms of latency.

I looked more closely at one of them with tcpdump

So are cross-DC connections happening in plain text?

@jcrespo No, cross-DC DB connections are encrypted but you can figure out what's going on by looking at surrounding (DC-local) memcached traffic.

All cross-DC connections except the first had an associated statsd metric MediaWiki.wanobjectcache.rdbms_server_readonly.hit.refresh, which implies that the queries were deferred until post-send, and the latency should have been hidden from the benchmark.

@jcrespo No, cross-DC DB connections are encrypted but you can figure out what's going on by looking at surrounding (DC-local) memcached traffic.

Ok, thank you, I got worried for a second :-D.

A possible reason for the slightly slower times on codfw is cross-DC connections for LoadBalancer::isPrimaryRunningReadOnly(). While running ab -n100, I saw 17 cross-DC database connections, i.e. about one per second. I looked more closely at one of them with tcpdump, a connection to db1157 due to an empty read-only cache, and it looked like it added about 200ms of latency.

Do we expect that to happen regularly on a high percentage of requests? If 17% of all requests need to make a cross-dc connection then it would need some thought.

Are you proposing to do away with the concept of "active" DC, then? e.g. currently swiftrepl runs from the active DC to fix up where MW failed to create / delete objects in both ms swift clusters. This enables us to answer "bucket a in codfw has object X, bucket a in eqiad does not - should X be added to eqiad, or removed from codfw?"
Without that, I'm not sure what we can do to work around the fact that MW doesn't reliably write/delete to both swift clusters...

Are you proposing to do away with the concept of "active" DC, then? e.g. currently swiftrepl runs from the active DC to fix up where MW failed to create / delete objects in both ms swift clusters. […]

In short:

  • No, there remains a "primary" DC for write actions.
  • Yes, "active" in the sense of "pooled" will no longer be limited to one DC, as both DCs will generally be pooled for read traffic.

I have not empirically verified this, but it is my understanding that, today, when someone views a URL on upload.wikimedia.org, their nearby cache POP has a cache miss for that URL, and the nearest application DC is codfw rather than eqiad, we serve that cache miss from Swift in codfw, not eqiad. Thus, I believe Swift is already active-active for reads.

The MediaWiki Multi-DC project is specifically about read traffic, to do essentially the same thing. It took many years of preparation because a great deal of operational, coordination and caching logic assumed MediaWiki was operated from one DC only.

There remains a primary DC, which is where HTTP/database/Swift writes take place. We do not plan to do uploads and writes in a distributed fashion; that is quite a different project, and one that would require many years of effort (think auto-increment primary keys in MySQL, eventual consistency, split-brain). It is not something I expect us to consider for a good long while, if at all. The RTT between data centers is fairly small compared to everything else under our control that we can optimise or whose latency we can mask, so the investment and ongoing maintenance cost of something that complex would not be appealing for the ~30ms gain. As an example, saving an edit currently takes ~1300ms from the client side.

Having said that, it's good that you're raising the issue with swiftrepl. While MediaWiki does not rely on or assume this about Swift, and it is separate from the MW Multi-DC project, I had actually thought that Swift was active-active today even for writes to upload originals and thumbnails. I understand SRE wanted that, as it saves work around primary DC switchovers when a service supports active-active writes. Based on T125791 and T204245 and past switchovers (where Swift and MediaWiki switch two days apart from each other), I got the impression that "Swift active-active for writes" was essentially complete. My guess is that swiftrepl uses the notion of a primary DC to inform how it breaks ties and resolves conflicts, but that those familiar with both MediaWiki and Swift (not me) have enough confidence in it working correctly even when it's "the wrong way around" that we switch these services two days apart. Feel free to correct and fill in the gaps! :-)

Do we expect that to happen regularly on a high percentage of requests? If 17% of all requests need to make a cross-dc connection then it would need some thought.

No. If I understand it correctly (it's Aaron's code), worthRefreshExpiring() causes the probability of a refresh for a given request to increase linearly from 0 to 100% over the 5 seconds since the last refresh. With equally spaced requests, I derive a refresh rate as follows:

| Request rate (req/s) | Refresh rate (req/s) |
| 10 | 1.2 |
| 20 | 1.6 |
| 40 | 2.3 |
| 80 | 3.2 |
| 160 | 4.6 |
| 320 | 6.4 |
| 640 | 9.1 |
| 1280 | 12.8 |
| 2560 | 18.1 |
| 5120 | 25.6 |
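
In symbols (my restatement of what the code below computes): with equally spaced requests at rate r, request k arrives at age t_k = k/r after the last refresh and triggers a refresh with probability p(t_k) = min(t_k/5, 1), so the expected refresh interval is

\[
\mathbb{E}[T] = \sum_{k \ge 1} t_k \, p(t_k) \prod_{j=1}^{k-1} \bigl( 1 - p(t_j) \bigr),
\qquad \text{refresh rate} = \frac{1}{\mathbb{E}[T]}.
\]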

I used the following code to derive the expected value of the refresh period. It's just a probability tree, nothing too advanced:

// Expected refresh interval for equally spaced requests, where the chance of a
// refresh on a given request ramps linearly from 0 to 1 over the 5 seconds
// since the last refresh.
for ( $rate = 10; $rate <= 10000; $rate *= 2 ) {
	$P = 1;   // probability that no refresh has happened yet
	$sum = 0; // expected time of the next refresh
	for ( $t = 0; $P > 0; $t += 1 / $rate ) {
		$p = $t / 5;          // chance of a refresh for a request at age $t
		$sum += $t * $p * $P; // weight this outcome by its probability
		$P *= 1 - $p;         // descend the "no refresh yet" branch
	}
	printf( "| %d | %.1f |\n", $rate, 1 / $sum );
}

I validated it against a simulation:

<?php

// Simulate 1,000,000 equally spaced requests at $rate req/s, refreshing with a
// probability that ramps from 0 to 1 over $effectiveLowTTL seconds since the
// last refresh, then report the resulting refresh rate.
$rate = 160;
$numRefreshes = 0;
$lastRefreshTime = 0;
$effectiveLowTTL = 5;

for ( $t = 0; $t < 1000000 / $rate; $t += 1 / $rate ) {
	$age = $t - $lastRefreshTime;
	$curTTL = max( $effectiveLowTTL - $age, 0 );
	$chance = ( 1 - $curTTL / $effectiveLowTTL );
	if ( mt_rand( 1, 1000000000 ) <= 1000000000 * $chance ) {
		$numRefreshes++;
		$lastRefreshTime = $t;
	}
}

printf( "%.1f refreshes/s\n", $numRefreshes / ( 1000000 / $rate ) );

Without that, I'm not sure what we can do to work around the fact that MW doesn't reliably write/delete to both swift clusters...

filebackend.php configures the way MediaWiki writes to Swift. It says

	# DO NOT change the master backend unless it is fully trusted or autoRsync is off
	'backends'    => [
		[ 'template' => 'local-swift-eqiad', 'isMultiMaster' => true ],
	],
...
$localMultiWriteFileBackend['backends'][] = [ 'template' => 'local-swift-codfw' ];

Well, we're not changing the master backend at least. Referring to FileBackendMultiWrite.php line 184:

		// Propagate the operations to the clone backends if there were no unexpected errors
		// and everything didn't fail due to predicted errors. If $ops only had one operation,
		// this might avoid backend sync inconsistencies.
		if ( $masterStatus->isOK() && $masterStatus->successCount > 0 ) {

So if MediaWiki, running in codfw, attempts to do a Swift write, it will write to eqiad first, and if the eqiad write fails, it will be counted as a failure of the whole operation, and the write to codfw will not continue. So the consistency of the two Swift clusters should not be affected by this deployment.

I also reviewed the situation with reads. filebackend.php has:

		// When used by FileBackendMultiWrite, read from this cluster if it's the local one
		'readAffinity'       => ( $specificDC === $wmgDatacenter ),

So MW running in codfw should have readAffinity=true for the codfw backends, which should cause reads to go to the codfw replicas.

I tried a warmup request followed by another request for the same page view, the second having MW logging enabled with microsecond resolution. I saw 45ms spent in LoadMonitor::getServerStates() in the second request, pre-send. It uses WANObjectCache with a TTL of 1s, with an extra layer of ramp-up over 500ms implemented in the caller:

		$ageStaleSec = mt_rand( 1, self::POLL_PERIOD_MS ) / 1e3;
		$minAsOfTime = $this->getCurrentTime() - $ageStaleSec;

which I think gives it the following refresh rates:

| Request rate (req/s) | Refresh rate (req/s) |
| 10 | 2.2 |
| 20 | 2.8 |
| 40 | 3.4 |
| 80 | 4.3 |
| 160 | 5.4 |
| 320 | 6.7 |
| 640 | 8.5 |
| 1280 | 10.6 |
| 2560 | 13.4 |
| 5120 | 16.9 |

In general, this kind of cluster maintenance complicates comparative benchmarking. It would be easier if we just had a single process monitoring DB states in a loop.

I tested parse times with ab -n10 -H'X-Forwarded-Proto: https' -X mw1441.eqiad.wmnet:80 'http://test2.wikipedia.org/w/api.php?action=parse&format=json&page=Greece&prop=limitreportdata&wrapoutputclass=test2perf'. While I did the test in codfw, I kept the cluster warm with page view requests to a different server. Median service times were 2749ms in eqiad and 2887ms in codfw, which I think is close enough.

The main point of these tests is to detect configuration errors which cause cross-DC requests. It's too early to detect an end-user latency benefit or to debug the lack of such a benefit. In any case, the best argument for multi-DC is that it doubles our appserver capacity, not that it improves end-user latency.

It looks like pywikibot is a decent test rig for session replication races. I installed pywikibot on a Linode instance in the Dallas region, which has a ping time to codfw of 3ms. I added millisecond timestamps and a raw request log (P32319). A typical invocation starts with centralauth_Token and centralauth_Session cookies from the persistent cookie jar, but no local session cookies, so a local session is created each time. The API edit request is posted about 1ms after the CSRF tokens are received.

Now I'm just waiting for codfw to be repooled in appservers-ro.

I haven't figured out a way to do a full integration test of appendShutdownCPIndexAsQuery() in stage 1.5. The only thing I've found that triggers it is CentralAuth account creation, so that may have to wait until stage 4. I can do an isolated test of that part of multi-dc.lua.

I did a pywikibot edit on testwiki from my Dallas test instance. The time between the completion of the last codfw sessionstore write and the eqiad sessionstore fetch was 109ms, so the edit was successful. Of that 109ms, about 50ms could be attributed to the time between pywikibot beginning its POST request and the first log entry appearing in testwiki.log, i.e. connection establishment, cross-DC HTTP proxying and MW startup. Then 20ms was consumed between the first log entry and the sessionstore fetch.

It's a race between Cassandra replication, and HTTP routing over the same distance plus MW startup time. Cassandra is going to win unless something goes wrong with it, like packet loss. We could make it more robust by writing the edit token to the session when the session is created, rather than waiting until the meta=tokens request arrives. But it's good enough for me for now.

Change 822496 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] SessionManager: write default edit token during session creation

https://gerrit.wikimedia.org/r/822496

Change 818652 merged by Tim Starling:

[operations/puppet@production] Discovery: codfw should be pooled for api-ro and appservers-ro

https://gerrit.wikimedia.org/r/818652

Change 822496 merged by jenkins-bot:

[mediawiki/core@master] SessionManager: write default edit token during session creation

https://gerrit.wikimedia.org/r/822496

The rollout was reverted back to stage 0 on August 15 due to T315271. I just reverted the revert, so it will be running on mediawiki.org once the puppet run finishes.

Stage 3 (traffic percentage) is useful for capacity modelling, but it's not expected to be optimal for data store consistency, since keeping a client's routing stable to one DC helps to hide session and mainstash replication delay.

I suggest deploying stage 3, doing the tests listed in the task description, and then proceeding to stage 4 on the same day. If stage 3 shows problems, we would depool codfw (i.e. revert to stage 0) rather than leaving it deployed.

Change 827616 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Multi-DC stage 3: send 2% of traffic to appservers-ro

https://gerrit.wikimedia.org/r/827616

Change 827617 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Multi-DC stage 4: send all traffic to appservers-ro

https://gerrit.wikimedia.org/r/827617

Planning for stage 3/4 capacity monitoring.

Observe cross-DC database connection rate, analyse sources

In the DBPerformance logs, we see a daily peak connection rate of around 4.7 master connections per second on GET requests. Around 27% of backend traffic comes from eqsin, ulsfo and codfw according to this dashboard, so we can expect to see around 1.3 cross-DC DB connections per second at full load, or 0.026 cross-DC DB connections per second (92 per hour) at 2% load.
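
As a quick arithmetic check of those figures (inputs are the numbers quoted above; the small differences from 1.3, 0.026 and 92 are rounding):

<?php
// Back-of-the-envelope check of the estimate above.
$primaryConnsPerSec = 4.7;  // daily peak primary DB connections/s on GETs (DBPerformance logs)
$remoteTrafficShare = 0.27; // share of backend traffic from eqsin, ulsfo and codfw

$fullLoad = $primaryConnsPerSec * $remoteTrafficShare; // cross-DC conns/s at 100%
$twoPercent = $fullLoad * 0.02;                        // cross-DC conns/s at 2%
printf( "full load: %.2f/s; 2%%: %.3f/s (%.0f/hour)\n",
	$fullLoad, $twoPercent, $twoPercent * 3600 );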

This will be dwarfed by connections from LoadBalancer::isPrimaryRunningReadOnly(), which apparently do not trigger a DBPerformance log entry. My modelling at T279664#8130148 indicates that we can expect around 13 connections per second at full load, or 3.7 connections per second at 2% load, from this source. These connections will be detectable using tcpdump.

Observe cross-DC mcrouter queries

I think this will be a Grafana panel similar to this graph of mcrouter "FOUND" responses.

Observe load on local services (DB replicas, mcrouter, etc.)

Connection count, CPU usage.

Change 828677 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[operations/puppet@production] Multi-DC: go back to testwiki only

https://gerrit.wikimedia.org/r/828677

I'm aiming to do stage 3 and 4 on September 6.

Change 828677 merged by Tim Starling:

[operations/puppet@production] Multi-DC: go back to testwiki only

https://gerrit.wikimedia.org/r/828677

Change 827616 merged by Tim Starling:

[operations/puppet@production] Multi-DC stage 3: send 2% of traffic to appservers-ro

https://gerrit.wikimedia.org/r/827616

Observe cross-DC database connection rate, analyse sources

It's not necessary to use tcpdump since we can just look at SSL connection counts. I added a dashboard panel for this. We're seeing an extra 7 req/s at 2%.

stage3 cross-dc mysql conns.png (image attachment)

Observe cross-DC mcrouter queries

I added a dashboard panel with a mean connection count metric. On codfw, it is noisy but not concerning.

stage3 codfw mcrouter.png (image attachment)

Observe load on local services (DB replicas, mcrouter, etc.)

  • Appserver CPU usage increased by about 0.1%
  • Appserver network TX increased by about 4 MB/s. Compare eqiad network usage 380 - 400 MB/s.
  • No obvious change in memcached or MySQL CPU usage, lost in the noise.

I made this all clusters utilization dashboard so that I could easily see if anything was running out of CPU or memory without having to check each cluster separately. AQS jumped up at around 01:33, which is not correlated with the deployment at 00:54 so probably unrelated. Similarly there was a jump on WDQS at 01:27. Nothing is showing a >2% jump in the puppet deployment window 00:54 - 01:24 so my conclusion is that there are no capacity blockers for stage 4 deployment at this time.

Diurnal variation: we are near the daily peak of codfw/ulsfo/eqsin traffic. So if it works now, it should keep working through the day.

stage3 diurnal model.png (image attachment)

Change 827617 merged by Tim Starling:

[operations/puppet@production] Multi-DC stage 4: send all traffic to appservers-ro

https://gerrit.wikimedia.org/r/827617

MySQL cross-DC traffic is higher than expected, with 110 conns/s.

Appserver CPU usage is fine. Mcrouter connection rates are fine.

I captured cross-DC queries on the s3 master (db1157) using SHOW PROCESSLIST in a loop, once per second for 20 minutes. Out of 10 captured queries, 7 were serverIsReadOnly() and the other 3 were the initial SET query from DatabaseMysqlBase::open(). So while my estimate might have been off, it looks like serverIsReadOnly() dominates cross-DC DB traffic, as expected.
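
For reproducibility, roughly what that capture amounts to; the credentials and the codfw source-address prefix below are placeholders/assumptions rather than the exact commands used:

<?php
// Sketch: poll SHOW PROCESSLIST once per second for 20 minutes and log any
// connections whose client address looks like it comes from the remote DC.
// The credentials and the 10.192. prefix (codfw private range) are assumptions.
$db = new mysqli( 'db1157.eqiad.wmnet', 'watch_user', 'secret' );
for ( $i = 0; $i < 1200; $i++ ) {
	foreach ( $db->query( 'SHOW PROCESSLIST' ) as $row ) {
		if ( strpos( $row['Host'], '10.192.' ) === 0 ) {
			printf( "%s\t%s\t%s\n", date( 'c' ), $row['Host'], $row['Info'] ?? '' );
		}
	}
	sleep( 1 );
}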

I did the CentralAuth tests mentioned in the task description.

The serverIsReadOnly() cache key includes the DB hostname, so I should have done my calculation per section rather than globally.

| Section | Cross-DC connection rate (req/s) |
| es4 | 0.00 |
| es5 | 0.00 |
| s1 | 9.21 |
| s2 | 19.7 |
| s3 | 53.2 |
| s4 | 7.02 |
| s5 | 4.33 |
| s6 | 7.41 |
| s7 | 21.4 |
| s8 | 2.1 |
| x1 | 0.25 |
| x2 | 0.00 |
tstarling claimed this task.

Here's a model of the benefit of the multi-DC project for users west of codfw. The servers are about 30ms closer, but codfw itself appears to be a bit slower, so taking 30ms minus the codfw slowdown gives a net benefit of roughly 15ms.

multi-dc benefit model.png (image attachment)

During deployment, we saw eqsin backend latency ramp down by about 10ms, which is close enough to count as rough support for this model.