Put ms-be20[62-65] in service
Closed, Resolved, Public

Description

The hosts were installed in T285809: (Need By: ASAP) rack/setup/install ms-be20[62-65], and must now be put in service:

  • Add hosts to puppet
  • Gradually put weight in swift

As of August 2021, codfw is the active site for both MediaWiki and Swift, so rebalancing with traffic on the cluster is going to be a tall order. The path forward is TBD, but we should have the hosts ready to go regardless.
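For illustration, the "gradually put weight" step can be thought of as a ramp of intermediate weights leading up to the final one. The sketch below is only a rough model: the function name `weight_ramp`, the final weight of 4000 and the five evenly spaced steps are all hypothetical and are not taken from this task or from the actual ring changes.

```python
from __future__ import annotations


def weight_ramp(final_weight: float, steps: int) -> list[float]:
    """Return `steps` evenly spaced weights, the last one being final_weight.

    Hypothetical helper: the real increments used for ms-be20[62-65]
    are not stated in this task.
    """
    if steps < 1:
        raise ValueError("steps must be >= 1")
    return [round(final_weight * i / steps, 2) for i in range(1, steps + 1)]


# Assumed numbers, for illustration only.
schedule = weight_ramp(4000, 5)
for n, weight in enumerate(schedule, start=1):
    print(f"step {n}: set weight to {weight}")
```

Each step would be applied only after the previous rebalance has settled, so that the amount of data moving at any one time stays bounded.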

Event Timeline

Change 710964 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] hieradata: add ms-be20[62-65]

https://gerrit.wikimedia.org/r/710964

Change 710964 merged by Filippo Giunchedi:

[operations/puppet@production] hieradata: add ms-be20[62-65]

https://gerrit.wikimedia.org/r/710964

Hosts are ready to go now, though with swift traffic fully on codfw I don't think we should rebalance now. I see at least two options:

  1. Move Swift traffic to eqiad and start rebalancing in codfw.
  2. Wait for the switchover date to start rebalancing. In the meantime we should have eqiad hosts ready to go at some point (hosts are scheduled to be delivered to eqiad soon, and the racking task is T285808), and we can start rebalancing in eqiad up until the switchover.

I'm tempted by option 1 so we at least get going with having the hosts in service ASAP. I don't know the exact ETA for the eqiad ms-be hosts, but it is a priority on the dcops side. If it is, say, a week until we can start rebalancing in eqiad, then we might as well wait.

cc @Legoktm for visibility and also an opinion wrt switchover and swift traffic, thank you!

CC @jcrespo as moving traffic to eqiad could influence media backup execution timelines.

Option 1 seems fine to me, you're effectively just doing the switchover for swift a bit earlier :) I just reviewed the cookbook logic, and it'll attempt to pool eqiad and depool codfw, which will just be a noop in this case.

SGTM! Thank you, I'll pool Swift in eqiad and start rebalancing the codfw cluster with the new hardware in 24-48 hours. cc Traffic and netops for visibility (though feel free to untag).

What is the best way to proceed for media backups? Should I stop operations on both DCs (because one will be rebalancing and the other will be in production)? Please advise.

I think it makes sense to keep using eqiad for now, at least with low concurrency. The "swift 4gs" dashboard gives a good indication of read latency and whether there's an impact (which I don't think there is with low concurrency): https://grafana.wikimedia.org/d/000000584/swift-4gs?orgId=1&from=now-3h&to=now-1m&var-DC=eqiad&var-prometheus=thanos&refresh=1m

What do you think? If something changes, then it is easy enough to switch media backups to read from codfw (?)

Thanks, that's a super-useful graph I didn't know about.

it is easy enough to switch media backups to read from codfw (?)

It is very easy: only a one-line configuration change away. But cross-DC backups, even if still using TLS, are super slow, as most of the time is spent retrieving very small files over the WAN. Also, we want to produce independent backups in both DCs, so rather than switching, we would stop backups in one DC and start them in the other, always reading locally. The priority is to have at least one backup in either of the two locations.

I think it makes sense to keep using eqiad for now, at least with low concurrency

That's OK with me. At the current low concurrency (8 threads), the full backup will take up to 20 days, which is a bit slow but not ridiculously so. Hopefully maintenance completes at one of the two locations soon and we can speed it up afterwards :-D.

That makes sense to me (site-local backups)

For the record, codfw is and remains available even during the rebalance, though it might be slower. I see at least two options:

  • continue using eqiad, possibly even with slightly higher concurrency
  • switch to codfw, and you can bump concurrency as much as you want even during rebalance and we can observe the effects

Mentioned in SAL (#wikimedia-operations) [2021-08-25T08:17:38Z] <godog> swift codfw add ms-be20[62-65] with initial weight - T288458

Mentioned in SAL (#wikimedia-operations) [2021-08-26T06:48:56Z] <godog> more weight to ms-be20[62-65] - T288458

Mentioned in SAL (#wikimedia-operations) [2021-08-30T06:38:28Z] <godog> more weight to ms-be20[62-65] - T288458

FYI I have now started backups of commonswiki with only 4 read threads on eqiad. So far I've seen no impact on latency, and not even on the total number of reads/s (it is very serial so far and has a lot of preprocessing).
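As a rough back-of-the-envelope check on the thread counts mentioned in this task: the only figure given is "up to 20 days" for a full backup at 8 threads. The sketch below assumes that duration scales inversely with the number of read threads, which is an assumption and may well not hold given how serial the process is so far. The function name `estimated_days` is hypothetical.

```python
def estimated_days(known_days: float, known_threads: int, new_threads: int) -> float:
    """Scale a known backup duration to a different thread count.

    Assumes perfectly linear scaling with thread count, which is an
    idealisation and not something measured in this task.
    """
    if known_threads < 1 or new_threads < 1:
        raise ValueError("thread counts must be >= 1")
    return known_days * known_threads / new_threads


# "Up to 20 days at 8 threads" is the figure quoted earlier in the task.
print(estimated_days(20, 8, 4))   # upper bound at 4 threads under this assumption
print(estimated_days(20, 8, 16))  # upper bound at 16 threads under this assumption
```

Under that assumption, halving the concurrency to 4 threads would roughly double the upper bound; the real figure depends on preprocessing time and Swift read latency.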

Mentioned in SAL (#wikimedia-operations) [2021-09-03T07:10:57Z] <godog> more weight to ms-be20[62-65] - T288458

Change 731878 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/software/swift-ring@master] codfw-prod: more weight to ms-be20[62-65]

https://gerrit.wikimedia.org/r/731878

Change 731878 merged by MVernon:

[operations/software/swift-ring@master] codfw-prod: more weight to ms-be20[62-65]

https://gerrit.wikimedia.org/r/731878

Change 732957 had a related patch set uploaded (by MVernon; author: MVernon):

[operations/software/swift-ring@master] codfw-prod: final weight to ms-be20[62-65]

https://gerrit.wikimedia.org/r/732957

Change 732957 merged by MVernon:

[operations/software/swift-ring@master] codfw-prod: final weight to ms-be20[62-65]

https://gerrit.wikimedia.org/r/732957

MatthewVernon updated the task description.