
Depool codfw swift cluster
Closed, Resolved, Public

Description

In order to do backups and performance testing, as well as other maintenance, we want to depool codfw Swift from serving production traffic.

Date: 22nd of February.
Revert: 3rd of March.

If no issues are found, the intention is to stay depooled for a minimum of 7-10 days.

Depooled means we will avoid end-user reads, but writes will continue normally, mirroring eqiad state.

Original description:

Because 260+ GB of images have to be downloaded in a few days, the initial backup needs an acceptable level of bandwidth. The number that has been mentioned is around 5Gbps.

In order to avoid user performance impact, try to achieve this download bandwidth by querying the Swift API in parallel and also by using the MediaWiki public API (or, if that is not available, PHP functions that abstract the download work needed).

While using the Swift API directly is likely to be more performant, it would also be more complex, because we would have to implement a lot of MediaWiki logic ourselves. A higher-level MediaWiki layer would therefore be ideal to abstract the storage details (especially knowing they could change in the future).

Depooling a DC for swift could be a dangerous action, so it will need careful preparation, but it may be worth it to avoid user impact during the initial download time. Involve as many people as possible during the test.

Event Timeline

Most likely this would mean a depool of codfw. Needs ok from at least @ayounsi, probably also Traffic?

The testing should last at least a few hours, to make sure the download rate stays stable over a longer period.
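One way to check that the rate holds over a multi-hour run is to sample the cumulative bytes downloaded at regular intervals and compare each interval against the target. The sketch below is only an illustration: the function names, the 10% tolerance and the sample values are assumptions, and the 5 Gbps target is the figure mentioned in the task description.

```python
from typing import List, Sequence, Tuple


def window_rates_gbps(samples: Sequence[Tuple[float, int]]) -> List[float]:
    """Convert (timestamp_seconds, cumulative_bytes) samples into Gbps per interval."""
    rates = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            raise ValueError("timestamps must be strictly increasing")
        # bytes -> bits, then per second, then to gigabits (decimal, 1e9)
        rates.append((b1 - b0) * 8 / elapsed / 1e9)
    return rates


def is_sustained(rates: Sequence[float], target_gbps: float, tolerance: float = 0.1) -> bool:
    """True if every interval stays at or above the target minus the tolerance fraction."""
    floor = target_gbps * (1 - tolerance)
    return bool(rates) and all(rate >= floor for rate in rates)


if __name__ == "__main__":
    # Hypothetical one-minute samples: (seconds since start, total bytes so far).
    samples = [(0, 0), (60, 37_500_000_000), (120, 75_000_000_000)]
    rates = window_rates_gbps(samples)
    print(rates, is_sustained(rates, target_gbps=5.0))
```

Checking every interval rather than the overall average means a long dip cannot be hidden by an earlier burst.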

Ok.

Let me know when you're doing it. Ideally be ready to repool if any saturation, but I'm not expecting any.

jcrespo renamed this task from "Depool an entire swift cluster for a datacenter and do performance testing of batch downloads of wiki media (querying swift and/or MediaWiki)" to "Depool codfw swift cluster". Jan 14 2021, 3:17 PM
jcrespo updated the task description.
jcrespo added a subscriber: BBlack.

Adding bblack (traffic) and ayounsi, as we have a tentative date for codfw, proposed by Filippo and Jaime: the week of the 15th. It may change, as persistence has to be ready by that date.

jcrespo updated the task description.
jcrespo triaged this task as High priority. Jan 15 2021, 9:35 AM

I asked Filippo to delay the maintenance by 1 week due to unexpected workload on my side, which would prevent me from being ready by next week.

Adding local DC ops on CC of this ticket. Things would have to go really badly for us to need him for this test (this should be a relatively boring process), but better to be safe and provide a heads up.

Clarifying expected duration and method of depooling for next week. CC @Joe

jcrespo added a subscriber: Joe.

Note that one of the eqiad/codfw links is still down due to Texas weather issues. I hope it will be back up by the 22nd, but if it's not, we should discuss the risks more.

Reassessing the situation early next week sounds good to me. We're not in a terrible rush to do this and might as well avoid unnecessary risks.

Links came back over the weekend, looks like we can proceed when ready

Just depooled swift from codfw (for reads):

confctl --object-type discovery select 'dnsdisc=swift,name=codfw' set/pooled=false
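Since the plan is to be ready to repool quickly, it can help to have both directions of the command at hand. The sketch below only builds and prints the argument lists; it does not run anything. The helper name is hypothetical, and the repool form (set/pooled=true) is assumed by symmetry with the depool command above.

```python
import shlex
from typing import List


def discovery_pool_command(service: str, datacenter: str, pooled: bool) -> List[str]:
    """Build the confctl argv for pooling or depooling a discovery record."""
    selector = f"dnsdisc={service},name={datacenter}"
    # Assumption: repooling mirrors the depool command with pooled=true.
    state = "true" if pooled else "false"
    return ["confctl", "--object-type", "discovery", "select", selector, f"set/pooled={state}"]


if __name__ == "__main__":
    # Print the commands for review; running them is left to the operator.
    print(shlex.join(discovery_pool_command("swift", "codfw", pooled=False)))
    print(shlex.join(discovery_pool_command("swift", "codfw", pooled=True)))
```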

I have now started 10 threads reading and retrieving commonswiki files to their temporary backup location on dbprov2003. dbprov2003 only has 8TB available, but I expect this run to end early so that we can tweak how we download the files.
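For reference, a fixed pool of 10 reader threads can be sketched roughly as below. This is not the actual backup tooling: download_all and fake_fetch are hypothetical names, and fake_fetch stands in for whatever call really retrieves a file from Swift or MediaWiki.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def download_all(
    names: Iterable[str],
    fetch: Callable[[str], bytes],
    workers: int = 10,
) -> Dict[str, int]:
    """Fetch every object with a pool of worker threads.

    Returns a mapping of object name -> bytes received. Failed objects
    are recorded with a size of -1 so they can be retried later.
    """
    sizes: Dict[str, int] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, name): name for name in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                sizes[name] = len(future.result())
            except Exception:
                # One failed file should not abort the whole run.
                sizes[name] = -1
    return sizes


def fake_fetch(name: str) -> bytes:
    # Hypothetical stand-in for a real Swift or MediaWiki download call.
    if name.endswith(".broken"):
        raise IOError("simulated download failure")
    return name.encode("utf-8") * 100


if __name__ == "__main__":
    result = download_all(["a.jpg", "b.png", "c.broken"], fake_fetch)
    for name in sorted(result):
        print(name, result[name])
```

Keeping the worker count as a parameter makes it easy to raise or lower concurrency between runs while watching the load on the cluster.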

Mentioned in SAL (#wikimedia-operations) [2021-02-23T15:51:11Z] <jynus> started swift codfw backup stress test at 14:38 with 10 threads T267338

Mentioned in SAL (#wikimedia-operations) [2021-02-23T15:52:26Z] <jynus> previous message should say 15:38 T267338

jcrespo updated the task description.
fgiunchedi claimed this task.

This has happened! We're back to swift active/active, tentatively resolving