Run MediaWiki media originals active/active
Open, Normal, Public

Description

We have switched Thumbor/thumbnails to active/active (T201858); AFAIK we should be able to serve originals in an active/active fashion too. This relies on FileBackend being in "sync" mode between its eqiad and codfw Swift backends.

A list of things to test/verify for completion:

  • Verify all FileBackend activity is effectively sync (e.g. math/captcha/score renders)
  • Run active/active originals for a limited amount of time, verify uploads work as intended from both eqiad and codfw

Event Timeline

fgiunchedi triaged this task as Normal priority. Sep 13 2018, 2:37 PM
fgiunchedi created this task.
fgiunchedi removed aaron as the assignee of this task. Sep 13 2018, 2:40 PM
Gilles claimed this task. Sep 17 2018, 8:43 PM
Gilles moved this task from Inbox to Next In This Quarter on the Performance-Team board.

As far as I can tell, the math extension is serving images through the REST API (Mathoid) and as such doesn't use the "original" urls:

https://wikimedia.org/api/rest_v1/media/math/render/svg/b9f8284fcea4d88e1ef5816226f189c6b2d2d2ee

Maybe the background Mathoid stores those renders in Swift, but that's irrelevant to making the public-facing https://upload.wikimedia.org/* URLs active-active, imho.

Same for captchas, they are served through these URLs:

https://en.wikipedia.org/w/index.php?title=Special:Captcha/image&wpCaptchaId=290077603

Musical scores are the only special type from that list served from upload.wikimedia.org:

https://upload.wikimedia.org/score/e/g/egv3toazxv8fm3whmr3weuhlx6qgj15/egv3toaz.png
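The distinction drawn above is simply which host serves each URL. A minimal Python sketch of that check, where `served_from_upload` and `UPLOAD_HOST` are hypothetical names introduced only for illustration:

```python
from urllib.parse import urlsplit

# Host that this task wants to serve in active/active mode.
UPLOAD_HOST = "upload.wikimedia.org"


def served_from_upload(url):
    """Return True if the URL is served from upload.wikimedia.org."""
    return urlsplit(url).hostname == UPLOAD_HOST
```

Applied to the three example URLs in this comment, only the score URL matches.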

$wmgScoreFileBackend is configured as "global-multiwrite", which is actually the same backend used by Math and Captchas.

Here's its configuration:

$globalMultiWriteFileBackend = [
	'class'       => 'FileBackendMultiWrite',
	'name'        => 'global-multiwrite',
	'wikiId'      => "global-data",
	'lockManager' => 'redisLockManager',
	# DO NOT change the master backend unless it is fully trusted or autoRsync is off
	'backends'    => [
		[ 'template' => 'global-swift-eqiad', 'isMultiMaster' => true ],
	],
	'replication' => 'sync', // read-after-update for assets
	'syncChecks'  => ( 1 | 4 ) // (size & sha1)
];

if ( in_array( 'codfw', $datacenters ) ) {
	$localMultiWriteFileBackend['backends'][] = [ 'template' => 'local-swift-codfw' ];
	$sharedMultiwriteFileBackend['backends'][] = [ 'template' => 'shared-swift-codfw' ];
	$globalMultiWriteFileBackend['backends'][] = [ 'template' => 'global-swift-codfw' ];
	$sharedTestwikiMultiWriteFileBackend['backends'][] = [ 'template' => 'shared-testwiki-swift-codfw' ];
}

This confirms that all those 3 types are configured to write to both DCs when objects are created (even if for the purpose of this task only Score mattered). The replication is synchronous, meaning that if pushing a new object to the secondary DC fails, it should result in an error for the upload. And the web request will wait until the object is pushed to all DCs before completing.
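To make the synchronous semantics concrete, here is a minimal Python sketch. It is not MediaWiki's FileBackendMultiWrite code: `InMemoryBackend`, `BackendError` and `sync_multi_write` are hypothetical stand-ins, and the sketch deliberately does no rollback or repair when a write fails partway.

```python
class BackendError(Exception):
    """Raised when a backend cannot store an object."""


class InMemoryBackend:
    """Hypothetical stand-in for one datacenter's Swift cluster."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.objects = {}

    def put(self, path, data):
        if not self.healthy:
            raise BackendError(f"{self.name}: write failed for {path}")
        self.objects[path] = data


def sync_multi_write(backends, path, data):
    """Write to every backend, in order, before returning.

    The caller blocks until all backends have the object. If any write
    raises, the error propagates and the upload is reported as failed.
    """
    for backend in backends:
        backend.put(path, data)
    return True
```

Usage, with the master backend listed first as in the configuration above:

```python
eqiad = InMemoryBackend("eqiad")
codfw = InMemoryBackend("codfw")
sync_multi_write([eqiad, codfw], "score/example.png", b"png bytes")
```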

Gilles reassigned this task from Gilles to fgiunchedi. Sep 18 2018, 9:46 AM
Gilles updated the task description.
fgiunchedi moved this task from Backlog to Up next on the User-fgiunchedi board. Dec 20 2018, 10:22 AM
fgiunchedi moved this task from Up next to Radar on the User-fgiunchedi board. Jan 2 2019, 11:18 AM

As far as I can tell, the math extension is serving images through the REST API (Mathoid) and as such doesn't use the "original" urls:
https://wikimedia.org/api/rest_v1/media/math/render/svg/b9f8284fcea4d88e1ef5816226f189c6b2d2d2ee
Maybe the background Mathoid stores those renders in Swift, but that's irrelevant to making the public-facing https://upload.wikimedia.org/* URLs active-active, imho.

Math/Mathoid's renders are stored in Cassandra, so are already active/active.

@fgiunchedi / @Gilles, what is left to do here for us to be able to serve files from Swift in active/active (apart from config changes and potentially some quick VCL)?

As far as I can tell, all it would take is uncommenting one line of yaml in hieradata/role/common/cache/upload.yaml

Change 496872 had a related patch set uploaded (by Mobrovac; owner: Mobrovac):
[operations/puppet@production] Varnish: serve Swift traffic in active/active mode

https://gerrit.wikimedia.org/r/496872

As far as I can tell, all it would take is uncommenting one line of yaml in hieradata/role/common/cache/upload.yaml

Awesome, thank you @Gilles! I put up a patch for that (^) so that we don't forget it :)

Change 496872 had a related patch set uploaded (by Alexandros Kosiaris; owner: Mobrovac):
[operations/puppet@production] Varnish: serve Swift traffic in active/active mode

https://gerrit.wikimedia.org/r/496872

Change 496872 merged by Alexandros Kosiaris:
[operations/puppet@production] Varnish: serve Swift traffic in active/active mode

https://gerrit.wikimedia.org/r/496872

Change 502453 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] swift-rw: Mock it as a geo-resource

https://gerrit.wikimedia.org/r/502453

Change 502456 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] swift: Revert swift-rw to active/passive

https://gerrit.wikimedia.org/r/502456

Change 502457 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] swift: Introduce new swift.discovery.wmnet stanza

https://gerrit.wikimedia.org/r/502457

Change 502458 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] trafficserver: Switch to using swift.discovery.wmnet

https://gerrit.wikimedia.org/r/502458

Change 502456 merged by Alexandros Kosiaris:
[operations/puppet@production] swift: Revert swift-rw to active/passive

https://gerrit.wikimedia.org/r/502456

Change 502457 merged by Alexandros Kosiaris:
[operations/puppet@production] swift: Introduce new swift.discovery.wmnet stanza

https://gerrit.wikimedia.org/r/502457

Change 502453 merged by Alexandros Kosiaris:
[operations/dns@master] swift-rw: Mock it as a geo-resource

https://gerrit.wikimedia.org/r/502453

Change 503274 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Add a new swift.discovery.wmnet resource

https://gerrit.wikimedia.org/r/503274

Change 503274 merged by Alexandros Kosiaris:
[operations/dns@master] Add a new swift.discovery.wmnet resource

https://gerrit.wikimedia.org/r/503274

Change 504331 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] swift: new cert for ms-fe.svc.codfw.wmnet

https://gerrit.wikimedia.org/r/504331

Change 504331 merged by Ema:
[operations/puppet@production] swift: new cert for ms-fe.svc.codfw.wmnet

https://gerrit.wikimedia.org/r/504331

Mentioned in SAL (#wikimedia-operations) [2019-04-16T14:30:22Z] <ema> swift-fe-codfw: nginx reload for new TLS certificate T204245

Change 504340 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] swift: new cert for ms-fe.svc.eqiad.wmnet

https://gerrit.wikimedia.org/r/504340

Change 504340 merged by Ema:
[operations/puppet@production] swift: new cert for ms-fe.svc.eqiad.wmnet

https://gerrit.wikimedia.org/r/504340

Mentioned in SAL (#wikimedia-operations) [2019-04-16T14:56:29Z] <ema> swift-fe-eqiad: nginx reload for new TLS certificate T204245

Change 504349 had a related patch set uploaded (by Ema; owner: Ema):
[operations/puppet@production] Revert "ATS: use 'swift-ro' as the origin server for thumb traffic"

https://gerrit.wikimedia.org/r/504349

Change 504349 merged by Ema:
[operations/puppet@production] Revert "ATS: use 'swift-ro' as the origin server for thumb traffic"

https://gerrit.wikimedia.org/r/504349

Change 502458 merged by Ema:
[operations/puppet@production] trafficserver: Switch to using swift.discovery.wmnet

https://gerrit.wikimedia.org/r/502458

CDanis added a subscriber: CDanis. Aug 23 2019, 2:56 PM

This confirms that all those 3 types are configured to write to both DCs when objects are created (even if for the purpose of this task only Score mattered). The replication is synchronous, meaning that if pushing a new object to the secondary DC fails, it should result in an error for the upload. And the web request will wait until the object is pushed to all DCs before completing.

I think we have evidence that this does not always happen: see T231086 (Picture from Commons not found from Singapore).
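One way to look for such cases is to compare the two clusters object by object. The Python sketch below makes several assumptions: listings from both datacenters are available as plain `{path: bytes}` dicts, `find_inconsistencies` is a hypothetical helper rather than an existing tool, and the bit values follow the `( 1 | 4 ) // (size & sha1)` line in the config quoted earlier in this task, with 1 taken to mean size and 4 taken to mean sha1.

```python
import hashlib

CHECK_SIZE = 1  # assumed: bit 1 of syncChecks means "compare sizes"
CHECK_SHA1 = 4  # assumed: bit 4 of syncChecks means "compare SHA-1 hashes"


def find_inconsistencies(primary, secondary, checks=CHECK_SIZE | CHECK_SHA1):
    """Compare two {path: bytes} mappings and return {path: reason}."""
    problems = {}
    for path in sorted(set(primary) | set(secondary)):
        if path not in secondary:
            problems[path] = "missing in secondary"
        elif path not in primary:
            problems[path] = "missing in primary"
        else:
            a, b = primary[path], secondary[path]
            if checks & CHECK_SIZE and len(a) != len(b):
                problems[path] = "size mismatch"
            elif (checks & CHECK_SHA1
                  and hashlib.sha1(a).digest() != hashlib.sha1(b).digest()):
                problems[path] = "sha1 mismatch"
    return problems
```

An object that exists in eqiad but was never pushed to codfw would show up as "missing in secondary", which is the kind of divergence a read served from the other datacenter would expose.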