Switchover People and Planet services to codfw
Closed, Resolved (Public)

Description

A task to track the failover of Planet and People services during the March 2023 Datacenter Switchover (T327920).

  • add manual switchover instructions on https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Manual_switch for both services
  • switch *.planet.wikimedia.org sites from eqiad to codfw
  • switch people.wikimedia.org from eqiad to codfw
  • add metafo records for planet in DNS
  • add metafo records for people in DNS
  • add etcd data / be able to pool/depool DCs in confctl for planet
  • add etcd data / be able to pool/depool DCs in confctl for people
  • add planet to service::catalog
  • add people to service::catalog
  • add service IPs (which) to loopback interfaces
  • ... everything else that is relevant from https://wikitech-static.wikimedia.org/wiki/LVS ?
  • Add a switchover of the people.wikimedia.org entry to codfw to the DC switchover cookbook
  • Add a switchover of the planet.wikimedia.org entry to codfw to the DC switchover cookbook

Event Timeline

LSobanski moved this task from Incoming to Backlog on the collaboration-services board.

Change 891369 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] switch planet from eqiad to codfw

https://gerrit.wikimedia.org/r/891369

Hi @Clement_Goubert see my explanation in the commit message of the Gerrit change above.

I confirmed that the planet service does not need any puppet change or manual action to switch between DCs. Everything runs on both instances and either can be used at will. It's just that both are not used at the same time.

The only thing that is needed is a DNS change of a discovery record. We do have a discovery name but no geo DNS. So the work is minimal, but it still requires that one DNS merge. I can do this myself tomorrow or I can do it coordinated with you at a specific time.

What I was wondering about is the "add to cookbook" part in the ticket description. Would it in this case just mean "add geo DNS" (where this is not really "just" as in "easy to do")? Or does it mean "let a cookbook deploy the DNS change" or "add it to a manual to-do list"?

Cheers

Change 891381 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] switch peopleweb from eqiad to codfw

https://gerrit.wikimedia.org/r/891381

Change 891382 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] peopleweb: switch rsync source and dest between eqiad and codfw

https://gerrit.wikimedia.org/r/891382

Hi again, @Clement_Goubert and others,

this time about people.wikimedia.org.

Here it is a DNS change of a discovery record with no geo DNS, just like with planet, but additionally there is a concept of an "active server" in puppet as well. Or more specifically, there is a "source" and a "dest" server for rsyncing data. End users naturally need to upload files where the source is and be told not to use the destination for uploads.

We have also already made the server MOTD smart about that and based on the same setting it tells people this is the right or the wrong server, like we do on other machines like the deployment server.

So yeah, one DNS change and one puppet change, but also one manual rsync.

The puppet change and rsync can be done first and the actual switching of the ATS backend can be done afterwards. I can do both anytime, or coordinate one or both with you.

If you want it to be included in the cookbook that does the switch for services, it needs an entry in service::catalog and either a metafo or a geoip record.
I'd highly recommend, if not for this switchover, for the next, adding a metafo record (since it's not really active/active) and the corresponding service::catalog entry so it gets moved easily and automatically with the rest.

In the meantime, if you want me to switch it over during the services switchover, please add your service and its switchover procedure under https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Manual_switch

If it does not need to move at the same time as the other services, you can switch it over yourself, as you prefer.

Same as above for planet, except that, if I understand correctly, it can't be fully automatic since it needs a puppet change. I'd still recommend adding the proper switchover procedure in https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Manual_switch so that if it ever needs to be done in a hurry, all steps are documented.

As far as the switchover proper goes, it's your preference, keeping in mind that if I do it, it'll be very early for you if anything goes wrong and I need your help. It can also be switched over later in the day after the traffic switchover if you prefer.

@Clement_Goubert Alright, I am totally fine and even prefer to just do it like today or tomorrow on my own as long as that's fine with you. I am going to add the procedure to the wikitech page you linked right now, regardless. I will also look at the service::catalog entries.

Change 891369 merged by Dzahn:

[operations/dns@master] switch planet from eqiad to codfw

https://gerrit.wikimedia.org/r/891369

*.planet.wikimedia.org sites have been switched from eqiad to codfw just now. (DNS only)

Change 891730 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] add metafo records for planet

https://gerrit.wikimedia.org/r/891730

Change 891731 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/dns@master] add metafo records for people.wikimedia.org

https://gerrit.wikimedia.org/r/891731

A problem with people is that users have been using their home and/or public_html in both data centers. We have rsynced the last time we switched but since then both have drifted and users are not only using this as a webserver but also just for general file storage.

So we can't simply rsync all of /home/ with --delete and make them the same. It's more involved.

Something like "first make a backup of all public_html dirs on the passive server and move them; then rsync all public_html dirs from active to passive, ignoring the rest of the home dirs; switch backends between DCs; send an email to all users explaining what's going on: that we made sure all files that used to be actually served over https are still there, but that they now need to use the other server."

Then in another step we might want to make the rsync fully automatic but then people can't use the passive host anymore to just dump random files.

Or the automatic rsync needs to be limited to public_html and not all of /home.
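A sync limited to public_html could be expressed with rsync's include/exclude filter rules. The following is only a minimal sketch under assumptions: the function name `sync_public_html_only` is hypothetical, it copies between two local directory trees rather than between the people hosts, and it is not the configuration that was deployed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy only */public_html/ trees from one home root to
# another and leave everything else in the home directories alone.
# Usage: sync_public_html_only SRC_HOME_ROOT DEST_HOME_ROOT
sync_public_html_only() {
    local src="$1" dest="$2"
    # Filter rules are checked in order and the first match wins:
    #   /*/                 descend into each top-level user directory
    #   /*/public_html/***  take public_html itself and everything below it
    #   *                   skip everything else
    # --prune-empty-dirs avoids creating empty home dirs on the destination
    # for users who have no public_html on the source.
    rsync -a --prune-empty-dirs \
        --include='/*/' \
        --include='/*/public_html/***' \
        --exclude='*' \
        "${src%/}/" "${dest%/}/"
}
```

The sketch deliberately does not pass `--delete`, so it never removes anything on the destination; files that users keep outside public_html on either host are not touched.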

Mentioned in SAL (#wikimedia-operations) [2023-02-24T21:06:00Z] <mutante> ganeti2021 - adding a virtual 20G disk to people2002 - to temp get some space for backups and syncing T330091

Mentioned in SAL (#wikimedia-operations) [2023-02-24T23:15:49Z] <mutante> people2002 - for each user who has a public_html dir that is not empty (for pubdir in $(find . -name public_html -type d -not -empty); ..); rsync it from people1003 with --delete (rsync -avp rsync://people1003.eqiad.wmnet/people-home/${pubdiruser}/public_html/ /home/${pubdiruser}/public_html/); T330091
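The logged command above is abbreviated, so the following is not a reconstruction of what was actually run. It is a minimal sketch of the described pattern, with hypothetical function names, that only prints the per-user pull commands instead of executing them. The `-mindepth`/`-maxdepth` restriction is an assumption added to keep the walk at the `HOME_ROOT/USER/public_html` level.

```shell
#!/usr/bin/env bash
# Print the users under HOME_ROOT that have a non-empty public_html dir.
users_with_public_html() {
    local home_root="$1"
    find "$home_root" -mindepth 2 -maxdepth 2 -type d -name public_html -not -empty \
        | awk -F/ '{ print $(NF-1) }' | sort
}

# Print (do not run) one rsync pull command per such user.
# Usage: print_pull_commands HOME_ROOT SOURCE_HOST
print_pull_commands() {
    local home_root="$1" source_host="$2"
    local user
    users_with_public_html "$home_root" | while read -r user; do
        printf 'rsync -avp --delete rsync://%s/people-home/%s/public_html/ %s/%s/public_html/\n' \
            "$source_host" "$user" "$home_root" "$user"
    done
}
```

For example, `print_pull_commands /home people1003.eqiad.wmnet` would list the commands for review before anything is copied. Note that in this sketch the walk happens on the receiving host, so a user with no public_html there produces no command at all.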

Change 891927 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] httpbb: add /~cdanis/sremap to test URLs for people.wikimedia.org

https://gerrit.wikimedia.org/r/891927

Change 891927 merged by Dzahn:

[operations/puppet@production] httpbb: add /~cdanis/sremap to test URLs for people.wikimedia.org

https://gerrit.wikimedia.org/r/891927

Change 891382 merged by Dzahn:

[operations/puppet@production] peopleweb: switch rsync source and dest between eqiad and codfw

https://gerrit.wikimedia.org/r/891382

Change 891381 merged by Dzahn:

[operations/dns@master] switch peopleweb from eqiad to codfw

https://gerrit.wikimedia.org/r/891381

@LSobanski @Clement_Goubert Both *.planet.wikimedia.org and people.wikimedia.org have been switched to codfw.

So this ticket is resolved. All the remaining checkboxes are about making it more automatic for next time.

Dzahn closed this task as Resolved. Edited Feb 24 2023, 11:51 PM

Let's talk on Monday whether we put this on a new ticket or reopen this. But I wanted to make obvious both services ARE switched now and nothing has to be done by you.

(This included some rsync trickery, adding a new virtual disk to have enough space, adding a test, using httpbb to confirm everything works, adding a new path to Bacula backups.. and will have some follow-up to make the syncing automatic)

23:53 < mutante> people.wikimedia.org has been switched to codfw. all the public_html dirs have been synced and made identical. anything outside public_html is untouched.

23:54 < mutante> (I also added a test to httpbb to check the "SRE map" is behind IDP and you can see it works here: https://people.wikimedia.org/~cdanis/sremap/)

23:54 < mutante> contents of public_html on the passive one before syncing are backed up and will end up in Bacula next week

23:57 < mutante> the new host name to upload to is people2002 (peopleweb.discovery.wmnet to look it up). if you are new and have never heard of this.. any shell user can upload files to a public_html in their home dir there and they are public.

Thanks for handling it!

[22:41]  <kindrobot> My people.wikimedia.org seems to have stopped working. Any link to anything under ~/public_html is 404ing, including the index.html.
[22:45]  <kindrobot> Oh, I see we switched servers. My files weren't rsynced over. :<
[23:23]  <    bd808> kindrobot: :(( mutante did an rsync before cutting over. You might check with him to see if it is obvious why your content was missed.
[23:25]  <kindrobot> I rsynced my files over manually, so I'm back up-and-running, but yeah, it'd be interesting to know why I got passed-over. :o
[23:25]  <darkblueb> I see https://people.wikimedia.org/  working here
[23:25]  <darkblueb> (Berkeley) yay
[23:28]  <     zabe> not sure when the last rsync happened, when checking some home dirs, it seems that quite a few are missing
[23:29]  <    bd808> https://sal.toolforge.org/log/gz22hYYBtR_B8fLxGRT-
[23:30]  <    bd808> so about 2023-02-24T2315Z
[23:31]  <kindrobot> I definitely had a home dir with stuff in it before then.
[23:31]  <    bd808> kindrobot: I think that SAL log might explain it. It looks to me like he pulled to 2002 and did the walk based on what was already there
[23:32]  <    bd808> my hunch is that your $HOME/public_html was not on people2002 when that ran
[23:32]  <kindrobot> Ah! Yes, I did not have anything on people2002.
[23:33]  <    bd808> I'll add a comment on T330091

Dzahn reopened this task as In Progress. Mar 6 2023, 11:28 PM

Change 894744 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] peopleweb: ensure each user automatically gets a public_html dir

https://gerrit.wikimedia.org/r/894744

Hello @SDunlap, @bd808

thanks for reporting this and sorry for the trouble.

Yes, you guessed right. Affected were users who had a public_html dir in eqiad but not in codfw.

So my "for each user... rsync public_html" loops missed every user who had no target directory to sync to in codfw.

Seven additional users were affected in the same way: @bking, @EJoseph, @jnuche, mvernon (@MatthewVernon), @pfischer, samtar (@TheresNoTime) and @taavi.

I have now created the public_html for all of them in codfw and synced the files over, so everything should be fixed.

Additionally I uploaded a follow-up to ensure the public_html gets auto-created by puppet in the future to avoid this happening again.

Mentioned in SAL (#wikimedia-operations) [2023-03-07T00:23:08Z] <mutante> people* - determined which users did not have a public_html dir in codfw but did in eqiad. created that dir, rsynced via push from people1003 to people2002 for the 7 affected users. re-enabled temp disabled puppet to restore live-hacked rsync config. T330091
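One way to determine which users have a public_html dir on one host but not on the other is to list the users on each side and take the set difference. This is a minimal sketch, not the commands that were run: both function names are hypothetical, and it assumes each host's list has been saved to a file that is reachable from one place.

```shell
#!/usr/bin/env bash
# Print users under HOME_ROOT that have a public_html dir (empty or not).
list_public_html_users() {
    find "$1" -mindepth 2 -maxdepth 2 -type d -name public_html \
        | awk -F/ '{ print $(NF-1) }' | LC_ALL=C sort
}

# Print users listed in SOURCE_LIST but absent from DEST_LIST.
# Both files must hold one username per line, sorted with LC_ALL=C,
# because comm requires sorted input.
# Usage: missing_on_destination SOURCE_LIST DEST_LIST
missing_on_destination() {
    LC_ALL=C comm -23 "$1" "$2"
}
```

Running `list_public_html_users /home` on each host and comparing the two outputs with `missing_on_destination` yields the users whose directory would have to be created on the destination before a per-user sync can reach them.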

Change 894744 merged by Dzahn:

[operations/puppet@production] peopleweb: ensure each user automatically gets a public_html dir

https://gerrit.wikimedia.org/r/894744

Change 891731 abandoned by Dzahn:

[operations/dns@master] add metafo records for people.wikimedia.org

Reason:

https://phabricator.wikimedia.org/T263506#8996173

https://gerrit.wikimedia.org/r/891731

Change 891730 abandoned by Dzahn:

[operations/dns@master] add metafo records for planet

Reason:

https://phabricator.wikimedia.org/T263506#8996173

https://gerrit.wikimedia.org/r/891730