
Migrate doc hosts to Bullseye
Closed, Resolved (Public)

Assigned To
Authored By
Dzahn
Oct 5 2022, 9:00 PM

Description

Introduction:

The doc1002 and doc2001 hosts are on Debian Buster with PHP 7.3. They should be on Debian Bullseye with PHP 7.4.


Prerequisites:


Users:

As of now, we don't have standardized fleet-wide UID and GID mappings for the doc instances. This can cause issues like T314972, where one instance's UID is different from another instance's, causing commands like rsync to write files with a different USER:GROUP mapping.

Ex. In the doc1002 instance the UID 499 is mapped to the `doc-uploader` user, while in the doc1003 instance the UID 498 is mapped to doc-uploader. (A sketch of how to align these mappings follows the checklist below.)

  • Ensure that the UID/GID for the 'doc-uploader' user are consistent across the 'doc' instances.
    • Update the UID/GID of the 'doc-uploader' user in the 'doc1002' instance.
    • Update the UID/GID of the 'doc-uploader' user in the 'doc1003' instance.
    • Update the UID/GID of the 'doc-uploader' user in the 'doc2001' instance.
    • Update the UID/GID of the 'doc-uploader' user in the 'doc2002' instance.
  • Ensure that the UID/GID of the files belonging to the 'doc-uploader' user are consistent across the 'doc' instances.
    • Update UID/GID of the files in the 'doc1002' instance.
    • Update UID/GID of the files in the 'doc1003' instance.
    • Update UID/GID of the files in the 'doc2001' instance.
    • Update UID/GID of the files in the 'doc2002' instance.
  • doc: Reserve UID/GID for the doc-uploader system user.
  • Add the doc-uploader's UID/GID to the Reserved UIDs & GIDs documentation.
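
A minimal sketch of what aligning the UID/GID on a single host involves, assuming the reserved value is taken from the Puppet change (the value is passed in as an argument here, and the commands assume it is still free on the host):

```
# Sketch only: align doc-uploader to the reserved UID/GID on one host.
# The real reserved value is defined in operations/puppet; this assumes
# no other user/group already owns it on this host.
NEW_ID="${1:?usage: $0 <reserved-uid-gid>}"
OLD_UID=$(id -u doc-uploader)
OLD_GID=$(id -g doc-uploader)

groupmod -g "$NEW_ID" doc-uploader
usermod -u "$NEW_ID" -g "$NEW_ID" doc-uploader

# Re-own files that still carry the old numeric IDs (limited here to /srv/doc).
find /srv/doc -uid "$OLD_UID" -exec chown -h doc-uploader {} +
find /srv/doc -gid "$OLD_GID" -exec chgrp -h doc-uploader {} +
```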

Failover:


  1. Decommission

Event Timeline

LSobanski triaged this task as Medium priority. Nov 23 2022, 7:51 PM

May collaboration-services create the Bullseye Ganeti VMs to replace the hosts? I am guessing the hostnames will be:

doc1003.eqiad.wmnet
doc2002.codfw.wmnet

They will need the Puppet role doc and to be added to some Hiera settings, notably to make them scap targets (among other effects). Then, if they seem to work fine, the DNS service entry doc.discovery.wmnet will be switched to one of the new hosts.

Yes, the hostnames will be doc1003 and doc2002, confirmed. And yes, serviceops-collab will create them. We will soon assign it to a person.

contint2001 migration is higher prio though for sure, btw.

andrea.denisse renamed this task from "migrate doc hosts to bullseye" to "Migrate doc hosts to Bullseye". Mar 23 2023, 1:08 AM
andrea.denisse updated the task description. (Show Details)

Change 902222 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] doc: Add role::doc to doc1003

https://gerrit.wikimedia.org/r/902222

Change 902222 merged by Andrea Denisse:

[operations/puppet@production] doc: Add role::doc to doc1003

https://gerrit.wikimedia.org/r/902222

Change 902505 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] doc: Add role::doc to doc2002

https://gerrit.wikimedia.org/r/902505

Change 902825 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] doc: Add support for passive_hosts synchronization via rsync

https://gerrit.wikimedia.org/r/902825

Change 902505 merged by Andrea Denisse:

[operations/puppet@production] doc: Add role::doc to doc2002

https://gerrit.wikimedia.org/r/902505

Change 903319 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] doc: Reserve UID/GID for the doc-uploader system user

https://gerrit.wikimedia.org/r/903319

Change 903319 merged by Andrea Denisse:

[operations/puppet@production] doc: Reserve UID/GID for the doc-uploader system user

https://gerrit.wikimedia.org/r/903319

I do not know why that change was made, but it broke the documentation publishing since files under /srv/doc did not get changed to the new UID/GID. I noticed it this morning when some jobs started failing, and filed it as T333294. I will ask to have the chown issued.

I do not know why that change was made, but it broke the documentation publishing since files under /srv/doc did not get changed to the new UID/GID. I noticed it this morning when some jobs started failing, and filed it as T333294. I will ask to have the chown issued.

Fixed by issuing a recursive chown:

In T333294#8732722, @Clement_Goubert wrote:
```
cgoubert@cumin1001:~$ sudo cumin 'P:doc' 'chown -R doc-uploader:doc-uploader /srv/doc'
4 hosts will be targeted:
doc[2001-2002].codfw.wmnet,doc[1002-1003].eqiad.wmnet
OK to proceed on 4 hosts? Enter the number of affected hosts to confirm or "q" to quit: 4
===== NO OUTPUT =====
PASS |█████████████████████████████████████████████████████████████████| 100% (4/4) [01:11<00:00, 17.79s/hosts]
FAIL |                                                                         |   0% (0/4) [01:11<?, ?hosts/s]
100.0% (4/4) success ratio (>= 100.0% threshold) for command: 'chown -R doc-upl...ploader /srv/doc'.
100.0% (4/4) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
```

:)

Hi @hashar , I explained the rationale for changing the UID/GID (patch #903319) in the task:

As of now, we don't have standardized fleet-wide UID and GID mappings for the doc instances. This can cause issues like T314972, where one instance's UID is different from another instance's, causing commands like rsync to write files with a different USER:GROUP mapping.
Ex. In the doc1002 instance the UID 499 is mapped to the `doc-uploader` user, while in the doc1003 instance the UID 498 is mapped to doc-uploader.

To give more context: on doc1002 the UID 499 was mapped to the 'doc-uploader' user, but on the new hosts UID 499 is mapped to 'debmonitor':
debmonitor:x:499:499:DebMonitor system user:/nonexistent:/bin/bash
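
As an illustration only, the mismatch can be confirmed across all four hosts with a quick Cumin one-liner (same host list as the chown run quoted above):

```
# Sketch: show what UID 499 and the doc-uploader user resolve to on each host.
sudo cumin 'doc[1002-1003].eqiad.wmnet,doc[2001-2002].codfw.wmnet' 'getent passwd 499; id doc-uploader'
```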

The UID/GID had to be reserved; otherwise we would run into the issues explained above, where the UID/GID don't match on the new instances, and we would be unable to migrate to Bullseye without disrupting other services such as debmonitor.

Another reason this issue arose is that the code that syncs the doc hosts does not use the standardized way of synchronizing instances, which is quickdatacopy. It uses plain rsync nested inside conditionals, which makes it hard to read and generates these UID/GID problems.

Also, in patch #823748 I added support for quickdatacopy to receive username/groupname mappings when synchronizing files.

Best. :)

Thanks for the detailed explanation. Indeed the UID mismatch is a mess. Although rsync (running as root) is able to do UID translation between two hosts for a given user, that does not work here because our config has `use chroot = yes`, which disables the mechanism. I guess it is "easier" to hardcode the UID fleet-wide. I will keep that in mind for the contint* hosts, which require at least 5 different system users. Anyway, for the doc hosts the fix was simple.

Should we have profile::doc use rsync::quickdatacopy as well? At a quick glance it looks like it can be applied alongside the Rsync::Server::Module['doc'] which is used by CI to upload the artifacts.

Yes, the issue with mismatched UIDs on hosts that rsync to each other has come up multiple times before, and the preferred fix in SRE is definitely to use reserved UIDs. We have applied this to other hosts and it ends the problem once and for all.

And yes, we should use rsync::quickdatacopy wherever possible to reduce the number of special cases. We have chatted about this before and it would change the direction we sync:

now: single active host has one timer for each passive host it pushes to, each passive host has rsyncd.
then: single active host has rsyncd, each passive host has one timer to pull from it.

But I think that is just fine as long as we keep it in mind and check that things work after merging a change.

And since the doc hosts currently have it "the other way around" from most setups, which makes this more confusing, I would welcome it if we could replace that with standard quickdatacopy.
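
For clarity, a rough sketch of what the pull direction boils down to on a passive host (the rsync module path is illustrative only, not the actual units quickdatacopy would generate):

```
# Hypothetical payload of an hourly pull timer on a passive host.
# Source module name is illustrative only.
/usr/bin/rsync -avp --delete rsync://doc1003.eqiad.wmnet/doc/ /srv/doc/
```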

Mentioned in SAL (#wikimedia-operations) [2023-04-13T20:46:49Z] <mutante> doc2001 - systemctl stop php7.3-fpm; systemctl restart php7.4-fpm - needed because after gerrit:901612 we had BOTH PHP versions, 7.3 and 7.4 running their own php-fpm process, also packages for both versions are installed, so also manual package removal needed - apt-get remove php7.3* T322357 T319477

Mentioned in SAL (#wikimedia-operations) [2023-04-13T21:02:29Z] <mutante> doc1002 (doc.wikimedia.org) - switching from PHP 7.3 to 7.4 - systemctl stop php7.3-fpm, restart php7.4-fpm, apt-get remove --purge php7.3*, systemctl restart apache2. - all tests still working (on deployment server: httpbb --hosts doc1002.eqiad.wmnet /srv/deployment/httpbb-tests/doc/test_doc.yaml) T322357 T319477

With the original reason being T322357, PHP on the existing _buster_ doc* machines has now been upgraded from PHP 7.3 to PHP 7.4.

It solved that ticket, but after initial hesitation I got convinced that it's also good for unblocking this ticket, the Bullseye migration.

Because now we can say we have already tested on 7.4 and know it works, which should cover a big chunk of the entire upgrade.

The patch that did this was https://gerrit.wikimedia.org/r/c/operations/puppet/+/901612

But it wasn't all rosy: Puppet had errors on the first and second runs, but after the third run the issues fixed themselves (dependency issues / order of steps).

And after that we had both 7.3 and 7.4 packages installed in parallel and each version was running its own php-fpm at the same time.

So some additional steps were needed:

```
systemctl stop php7.3-fpm
systemctl restart php7.4-fpm
apt-get remove --purge php7.3*
systemctl restart apache2
```

The package removal reported:

```
The following additional packages will be installed:
  php-fpm php-xml
The following packages will be REMOVED:
  php7.3-cli php7.3-common php7.3-fpm php7.3-json php7.3-opcache php7.3-readline php7.3-xml
The following packages will be upgraded:
  php-fpm php-xml
2 upgraded, 0 newly installed, 7 to remove and 3 not upgraded.
```

And then I ran `httpbb --hosts doc1002.eqiad.wmnet /srv/deployment/httpbb-tests/doc/test_doc.yaml` on a deployment server to confirm everything still works.

That tests 10 different URLs (it actually has a relatively complex setup, that's why there are so many). They all passed before and after.

I did this first on the passive machine and then ran the same steps on the active machine.

@andrea.denisse @LSobanski @hashar This should make us more confident about switching to the Bullseye machines, since the PHP version upgrade is already done and gone from the "diff" when that happens.

Puppet had errors on the first and second runs, but after the third run the issues fixed themselves (dependency issues / order of steps).

Ideally the Puppet manifest would apply on the first try; I am guessing there was some low-level conflict due to PHP.

And after that we had both 7.3 and 7.4 packages installed in parallel and each version was running its own php-fpm at the same time.

Yup, I briefly mentioned the PHP 7.3 removal in the commit message, but should probably have highlighted that part more boldly.

httpbb --hosts doc1002.eqiad.wmnet /srv/deployment/httpbb-tests/doc/test_doc.yaml on a deployment server to confirm everything still works.

Those testing scenarios are priceless. They definitely help ensure that the critical paths of the application, and the bricks that sustain it, have at least basic coverage. That should catch most of the basic issues and builds confidence. I am quite happy to see those used for monitoring as well. Thanks for those!

Change 922487 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Switch doc host from doc1002 to doc1003

https://gerrit.wikimedia.org/r/922487

Change 922493 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/dns@master] Move doc.discovery.wmnet to new bullseye hosts

https://gerrit.wikimedia.org/r/922493

I think we're ready to switch over doc.discovery.wmnet to point at doc1003 (bullseye) instead of doc1002 (buster). I plan to do this on Wednesday, May 24th at 11am Irish time (UTC+1).

  1. Ensure that no jobs are writing new docs that would get lost during the transition (it's better that these jobs fail than succeed and have the data lost)
    • `systemctl stop rsync` on doc1002
  2. Ensure that doc1003 is up to date
    • `/usr/bin/rsync -avp --delete /srv/doc/ rsync://doc1003.eqiad.wmnet/doc-between-nodes`
  3. Merge the DNS change
  4. Merge the Puppet change
  5. Clear the DNS cache
    • ~sudo cookbook sre.dns.wipe-cache on cumin1001~ Not needed
  6. Check that publishing jobs pass
  7. Shut down the old hosts and remove them from the list

Rolling back in the event of a problem means reverting the DNS and Puppet changes (steps 3 and 4) and running Puppet.
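
For what it's worth, a quick sanity check after steps 3-5 could look like this (sketch only; the httpbb invocation is the same one already used during the PHP upgrade, run from a deployment server):

```
# Confirm the discovery record now points at the new host, then re-run the
# existing doc test scenarios against it.
dig +short doc.discovery.wmnet
httpbb --hosts doc1003.eqiad.wmnet /srv/deployment/httpbb-tests/doc/test_doc.yaml
```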

Sounds good! I will do my best to be around, though I have kids incoming at that time. We shall see.

A word of caution: there are systemd timers set on the primary (doc1002). They trigger every hour to rsync data to the three other hosts:

```
rsync-doc-doc1003.eqiad.wmnet.timer
rsync-doc-doc2002.codfw.wmnet.timer
rsync-doc-doc2001.codfw.wmnet.timer
```

Then, given rsync is stopped on doc1002, it will not receive any doc updates, so even if one of the timers kicks in it will not change any state.

At step 4, you need to run `sudo run-puppet-agent` on the old primary doc1002 to remove the timers.

When Puppet runs on the hosts, it will remove the timers from doc1002 and add them to doc1003.
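
For completeness, a minimal sketch of verifying that cleanup on the old primary afterwards (run on doc1002; `run-puppet-agent` as mentioned above):

```
# Apply the merged Puppet change, then confirm no rsync-doc-* timers remain.
sudo run-puppet-agent
systemctl list-timers 'rsync-doc-*'
```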

@hashar Thanks for that! Yeah, I'm aware of the timers. I had intended for puppet-agent to be run on both doc1003 and doc1002 at the same time to allow the timers to be removed. I thought I had covered that but must have forgotten. I've updated the plan to reflect it!

Mentioned in SAL (#wikimedia-releng) [2023-05-24T10:09:35Z] <eoghan> Switching doc.wikimedia.org from doc1002 -> doc1003, T319477

Change 922493 merged by EoghanGaffney:

[operations/dns@master] Move doc.discovery.wmnet to new bullseye hosts

https://gerrit.wikimedia.org/r/922493

Change 922487 merged by EoghanGaffney:

[operations/puppet@production] Switch doc host from doc1002 to doc1003

https://gerrit.wikimedia.org/r/922487

The CI publishing job is https://integration.wikimedia.org/ci/job/publish-to-doc/ and the sole failure it had was for the job which generates the mediawiki/core documentation (mediawiki-core-doxygen-docker).

It managed to rsync (targeting the discovery hostname):

```
+ rsync --archive --stats --compress --delete-after . rsync://doc.discovery.wmnet/doc/mediawiki-core/REL1_38/php
Number of files: 29,819 (reg: 29,817, dir: 2)
Number of created files: 0
Number of deleted files: 0
Number of regular files transferred: 29,817
Total file size: 3,015,516,833 bytes
Total transferred file size: 3,015,516,833 bytes
Literal data: 6,430,855 bytes
Matched data: 3,009,085,978 bytes
File list size: 1,572,500
File list generation time: 0.203 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 5,419,162
Total bytes received: 14,464,604

sent 5,419,162 bytes  received 14,464,604 bytes  418,605.60 bytes/sec
total size is 3,015,516,833  speedup is 151.66

Published at https://doc.wikimedia.org/mediawiki-core/REL1_38/php/
Done.
```

And on the host I can see updated files:

```
doc1003:~$ ls -lta /srv/doc/mediawiki-core/REL1_38/php|head  -n4
total 3002236
-rw-rw-r-- 1 doc-uploader doc-uploader      616 May 24 10:40 folderclosed.png
-rw-rw-r-- 1 doc-uploader doc-uploader      597 May 24 10:40 folderopen.png
-rw-rw-r-- 1 doc-uploader doc-uploader      314 May 24 10:40 splitbar.png
```

So I guess it is a success :]

The migration went smoothly. I'm going to remove the two buster hosts from service this afternoon, and we can decommission them later.

Change 922872 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Remove buster hosts from doc rotation

https://gerrit.wikimedia.org/r/922872

Change 922872 merged by EoghanGaffney:

[operations/puppet@production] Remove buster hosts from doc rotation

https://gerrit.wikimedia.org/r/922872

Change 922893 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] doc: replace doc1002 with doc1003 in test examples

https://gerrit.wikimedia.org/r/922893

The migration went smoothly. I'm going to remove the two buster hosts from service this afternoon, and we can decommission them later.

Great work! :)

Macro antoine-approve:

Change 922893 merged by Dzahn:

[operations/puppet@production] doc: update test example command to use 2 new hosts

https://gerrit.wikimedia.org/r/922893

cookbooks.sre.hosts.decommission executed by eoghan@cumin1001 for hosts: doc1002.eqiad.wmnet

  • doc1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

cookbooks.sre.hosts.decommission executed by eoghan@cumin1001 for hosts: doc2001.codfw.wmnet

  • doc2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

The two remaining Buster hosts were decommissioned and the old rsync timer jobs have been removed.

Change 902825 abandoned by Andrea Denisse:

[operations/puppet@production] doc: Add support for passive_hosts synchronization via rsync

Reason:

https://gerrit.wikimedia.org/r/902825