
Migrate doc hosts to Bullseye
Closed, Resolved (Public)

Assigned To
Authored By
Dzahn
Oct 5 2022, 9:00 PM

Description

Introduction:

The doc1002 and doc2001 hosts are on Debian Buster with PHP 7.3. They should be on Debian Bullseye with PHP 7.4.


Prerequisites:


Users:

As of now, we don't have standardized fleet-wide UID and GID mappings for the doc instances. This can cause issues like T314972, where one instance's UID is different from another instance's, causing commands like rsync to write files with a different USER:GROUP mapping.

Ex. In the doc1002 instance the UID 499 is mapped to the `doc-uploader` user, while in the doc1003 instance the UID 498 is mapped to doc-uploader. (A sketch of how to align these mappings follows the checklist below.)

  • Ensure that the UID/GID for the 'doc-uploader' user are consistent across the 'doc' instances.
    • Update the UID/GID of the 'doc-uploader' user in the 'doc1002' instance.
    • Update the UID/GID of the 'doc-uploader' user in the 'doc1003' instance.
    • Update the UID/GID of the 'doc-uploader' user in the 'doc2001' instance.
    • Update the UID/GID of the 'doc-uploader' user in the 'doc2002' instance.
  • Ensure that the UID/GID of the files belonging to the 'doc-uploader' user are consistent across the 'doc' instances.
    • Update UID/GID of the files in the 'doc1002' instance.
    • Update UID/GID of the files in the 'doc1003' instance.
    • Update UID/GID of the files in the 'doc2001' instance.
    • Update UID/GID of the files in the 'doc2002' instance.
  • doc: Reserve UID/GID for the doc-uploader system user.
  • Add the doc-uploader's UID/GID to the Reserved UIDs & GIDs documentation.
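
A minimal sketch of what aligning the UID/GID on a single host involves, assuming the reserved value is taken from the Puppet change (the value is passed in as an argument here, and the commands assume it is still free on the host):

```
# Sketch only: align doc-uploader to the reserved UID/GID on one host.
# The real reserved value is defined in operations/puppet; this assumes
# no other user/group already owns it on this host.
NEW_ID="${1:?usage: $0 <reserved-uid-gid>}"
OLD_UID=$(id -u doc-uploader)
OLD_GID=$(id -g doc-uploader)

groupmod -g "$NEW_ID" doc-uploader
usermod -u "$NEW_ID" -g "$NEW_ID" doc-uploader

# Re-own files that still carry the old numeric IDs (limited here to /srv/doc).
find /srv/doc -uid "$OLD_UID" -exec chown -h doc-uploader {} +
find /srv/doc -gid "$OLD_GID" -exec chgrp -h doc-uploader {} +
```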

Failover:


  1. Decommission

Event Timeline

LSobanski triaged this task as Medium priority. Nov 23 2022, 7:51 PM

May collaboration-services create the Bullseye Ganeti VMs to replace the hosts? I am guessing the hostnames will be:

doc1003.eqiad.wmnet
doc2002.codfw.wmnet

They will need the Puppet role doc and to be added to some Hiera settings, notably to make them scap targets (among other effects). Then, if they seem to work fine, the DNS service entry doc.discovery.wmnet will be switched to one of the new hosts.

Yes, the hostnames will be doc1003 and doc2002, confirmed. And yes, serviceops-collab will create them. We will soon assign it to a person.

contint2001 migration is higher prio though for sure, btw.

andrea.denisse renamed this task from "migrate doc hosts to bullseye" to "Migrate doc hosts to Bullseye". Mar 23 2023, 1:08 AM
andrea.denisse updated the task description. (Show Details)

Change 902222 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] doc: Add role::doc to doc1003

https://gerrit.wikimedia.org/r/902222

Change 902222 merged by Andrea Denisse:

[operations/puppet@production] doc: Add role::doc to doc1003

https://gerrit.wikimedia.org/r/902222

Change 902505 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] doc: Add role::doc to doc2002

https://gerrit.wikimedia.org/r/902505

Change 902825 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] doc: Add support for passive_hosts synchronization via rsync

https://gerrit.wikimedia.org/r/902825

Change 902505 merged by Andrea Denisse:

[operations/puppet@production] doc: Add role::doc to doc2002

https://gerrit.wikimedia.org/r/902505

Change 903319 had a related patch set uploaded (by Andrea Denisse; author: Andrea Denisse):

[operations/puppet@production] doc: Reserve UID/GID for the doc-uploader system user

https://gerrit.wikimedia.org/r/903319

Change 903319 merged by Andrea Denisse:

[operations/puppet@production] doc: Reserve UID/GID for the doc-uploader system user

https://gerrit.wikimedia.org/r/903319

I do not know why that change was made, but it broke the documentation publishing since files under /srv/doc did not get changed to the new UID/GID. I noticed it this morning when some jobs started failing, and filed it as T333294. I will ask to have the chown issued.

I do not know why that change was made, but it broke the documentation publishing since files under /srv/doc did not get changed to the new UID/GID. I noticed it this morning when some jobs started failing, and filed it as T333294. I will ask to have the chown issued.

Fixed by issuing a recursive chown:

In T333294#8732722, @Clement_Goubert wrote:
```
cgoubert@cumin1001:~$ sudo cumin 'P:doc' 'chown -R doc-uploader:doc-uploader /srv/doc'
4 hosts will be targeted:
doc[2001-2002].codfw.wmnet,doc[1002-1003].eqiad.wmnet
OK to proceed on 4 hosts? Enter the number of affected hosts to confirm or "q" to quit: 4
===== NO OUTPUT =====
PASS |█████████████████████████████████████████████████████████████████| 100% (4/4) [01:11<00:00, 17.79s/hosts]
FAIL |                                                                         |   0% (0/4) [01:11<?, ?hosts/s]
100.0% (4/4) success ratio (>= 100.0% threshold) for command: 'chown -R doc-upl...ploader /srv/doc'.
100.0% (4/4) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
```

:)

Hi @hashar , I explained the rationale for changing the UID/GID (patch #903319) in the task:

As of now, we don't have standardized fleet-wide UID and GID mappings for the doc instances. This can cause issues like T314972, where one instance's UID is different from another instance's, causing commands like rsync to write files with a different USER:GROUP mapping.
Ex. In the doc1002 instance the UID 499 is mapped to the `doc-uploader` user, while in the doc1003 instance the UID 498 is mapped to doc-uploader.

To give more context: on doc1002 the UID 499 was mapped to the 'doc-uploader' user, but on the new hosts UID 499 is mapped to 'debmonitor':
debmonitor:x:499:499:DebMonitor system user:/nonexistent:/bin/bash
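
As an illustration only, the mismatch can be confirmed across all four hosts with a quick Cumin one-liner (same host list as the chown run quoted above):

```
# Sketch: show what UID 499 and the doc-uploader user resolve to on each host.
sudo cumin 'doc[1002-1003].eqiad.wmnet,doc[2001-2002].codfw.wmnet' 'getent passwd 499; id doc-uploader'
```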

The UID/GID had to be reserved; otherwise we would run into the issues explained above, where the UID/GID don't match on the new instances, and we would be unable to migrate to Bullseye without disrupting other services such as debmonitor.

Another reason this issue arose is that the code that syncs the doc hosts does not use the standardized way of synchronizing instances, which is quickdatacopy. It uses plain rsync nested inside conditionals, which makes it hard to read and generates these UID/GID problems.

Also, in patch #823748 I added support for quickdatacopy to receive username/groupname mappings when synchronizing files.

Best. :)

Thanks for the detailed explanation. Indeed the UID mismatch is a mess. Although rsync (running as root) is able to do UID translation between two hosts for a given user, that does not work here because our config has `use chroot = yes`, which disables the mechanism. I guess it is "easier" to hardcode the UID fleet-wide. I will keep that in mind for the contint* hosts, which require at least 5 different system users. Anyway, for the doc hosts the fix was simple.

Should we have profile::doc use rsync::quickdatacopy as well? At a quick glance it looks like it can be applied alongside the Rsync::Server::Module['doc'] which is used by CI to upload the artifacts.

Yes, the issue with mismatched UIDs on hosts that rsync to each other has come up multiple times before, and the preferred fix in SRE is definitely to use reserved UIDs. We have applied this to other hosts and it ends the problem once and for all.

And yes, we should use rsync::quickdatacopy wherever possible to reduce the number of special cases. We have chatted about this before and it would change the direction we sync:

now: single active host has one timer for each passive host it pushes to, each passive host has rsyncd.
then: single active host has rsyncd, each passive host has one timer to pull from it.

But I think that is just fine as long as we keep it in mind and check that things work after merging a change.

And since the doc hosts currently have it "the other way around" from most setups, which makes this more confusing, I would welcome it if we could replace that with standard quickdatacopy.
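
For clarity, a rough sketch of what the pull direction boils down to on a passive host (the rsync module path is illustrative only, not the actual units quickdatacopy would generate):

```
# Hypothetical payload of an hourly pull timer on a passive host.
# Source module name is illustrative only.
/usr/bin/rsync -avp --delete rsync://doc1003.eqiad.wmnet/doc/ /srv/doc/
```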

Mentioned in SAL (#wikimedia-operations) [2023-04-13T20:46:49Z] <mutante> doc2001 - systemctl stop php7.3-fpm; systemctl restart php7.4-fpm - needed because after gerrit:901612 we had BOTH PHP versions, 7.3 and 7.4 running their own php-fpm process, also packages for both versions are installed, so also manual package removal needed - apt-get remove php7.3* T322357 T319477

Mentioned in SAL (#wikimedia-operations) [2023-04-13T21:02:29Z] <mutante> doc1002 (doc.wikimedia.org) - switching from PHP 7.3 to 7.4 - systemctl stop php7.3-fpm, restart php7.4-fpm, apt-get remove --purge php7.3*, systemctl restart apache2. - all tests still working (on deployment server: httpbb --hosts doc1002.eqiad.wmnet /srv/deployment/httpbb-tests/doc/test_doc.yaml) T322357 T319477

With the original reason being T322357, PHP on the existing _buster_ doc* machines has now been upgraded from PHP 7.3 to PHP 7.4.

It solved that ticket, but after initial hesitation I got convinced that it's also good for unblocking this ticket, the Bullseye migration.

Because now we can say we have already tested on 7.4 and know it works, which should cover a big chunk of the entire upgrade.

The patch that did this was https://gerrit.wikimedia.org/r/c/operations/puppet/+/901612

But it wasn't all rosy: Puppet had errors on the first and second runs, but after the third run the issues fixed themselves (dependency issues / order of steps).

And after that we had both 7.3 and 7.4 packages installed in parallel and each version was running its own php-fpm at the same time.

So some additional steps were needed:

```
systemctl stop php7.3-fpm
systemctl restart php7.4-fpm
apt-get remove --purge php7.3*
systemctl restart apache2
```

The package removal reported:

```
The following additional packages will be installed:
  php-fpm php-xml
The following packages will be REMOVED:
  php7.3-cli php7.3-common php7.3-fpm php7.3-json php7.3-opcache php7.3-readline php7.3-xml
The following packages will be upgraded:
  php-fpm php-xml
2 upgraded, 0 newly installed, 7 to remove and 3 not upgraded.
```

And then I ran `httpbb --hosts doc1002.eqiad.wmnet /srv/deployment/httpbb-tests/doc/test_doc.yaml` on a deployment server to confirm everything still works.

That tests 10 different URLs (it actually has a relatively complex setup, that's why there are so many). They all passed before and after.

I did this first on the passive machine and then ran the same steps on the active machine.

@andrea.denisse @LSobanski @hashar This should make us more confident about switching to the Bullseye machines, since the PHP version upgrade is already done and gone from the "diff" when that happens.

Puppet had errors on the first and second runs, but after the third run the issues fixed themselves (dependency issues / order of steps).

Ideally the Puppet manifest would apply on the first try; I am guessing there was some low-level conflict due to PHP.

And after that we had both 7.3 and 7.4 packages installed in parallel and each version was running its own php-fpm at the same time.

Yup, I briefly mentioned the PHP 7.3 removal in the commit message, but should probably have highlighted that part more boldly.

httpbb --hosts doc1002.eqiad.wmnet /srv/deployment/httpbb-tests/doc/test_doc.yaml on a deployment server to confirm everything still works.

Those testing scenarios are priceless. They definitely help ensure that the critical paths of the application, and the bricks that sustain it, have at least basic coverage. That should catch most of the basic issues and builds confidence. I am quite happy to see those used for monitoring as well. Thanks for those!

Change 922487 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Switch doc host from doc1002 to doc1003

https://gerrit.wikimedia.org/r/922487

Change 922493 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/dns@master] Move doc.discovery.wmnet to new bullseye hosts

https://gerrit.wikimedia.org/r/922493

I think we're ready to switch over doc.discovery.wmnet to point at doc1003 (bullseye) instead of doc1002 (buster). I plan to do this on Wednesday, May 24th at 11am Irish time (UTC+1).

  1. Ensure that no jobs are writing new docs that would get lost during the transition (it's better that these jobs fail than succeed and have the data lost)
    • `systemctl stop rsync` on doc1002
  2. Ensure that doc1003 is up to date
    • `/usr/bin/rsync -avp --delete /srv/doc/ rsync://doc1003.eqiad.wmnet/doc-between-nodes`
  3. Merge the DNS change
  4. Merge the Puppet change
  5. Clear the DNS cache
    • ~sudo cookbook sre.dns.wipe-cache on cumin1001~ Not needed
  6. Check that publishing jobs pass
  7. Shut down the old hosts and remove them from the list

Rolling back in the event of a problem means reverting the DNS and Puppet changes (steps 3 and 4) and running Puppet.
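
For what it's worth, a quick sanity check after steps 3-5 could look like this (sketch only; the httpbb invocation is the same one already used during the PHP upgrade, run from a deployment server):

```
# Confirm the discovery record now points at the new host, then re-run the
# existing doc test scenarios against it.
dig +short doc.discovery.wmnet
httpbb --hosts doc1003.eqiad.wmnet /srv/deployment/httpbb-tests/doc/test_doc.yaml
```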

Sounds good! I will do my best to be around, though I have kids incoming at that time. We shall see.

A word of caution: there are systemd timers set on the primary (doc1002). They trigger every hour to rsync data to the three other hosts:

```
rsync-doc-doc1003.eqiad.wmnet.timer
rsync-doc-doc2002.codfw.wmnet.timer
rsync-doc-doc2001.codfw.wmnet.timer
```

Then, given rsync is stopped on doc1002, it will not receive any doc updates, so even if one of the timers kicks in it will not change any state.

At step 4, you need to run `sudo run-puppet-agent` on the old primary doc1002 to remove the timers.

When Puppet runs on the hosts, it will remove the timers from doc1002 and add them to doc1003.
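
For completeness, a minimal sketch of verifying that cleanup on the old primary afterwards (run on doc1002; `run-puppet-agent` as mentioned above):

```
# Apply the merged Puppet change, then confirm no rsync-doc-* timers remain.
sudo run-puppet-agent
systemctl list-timers 'rsync-doc-*'
```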

@hashar Thanks for that! Yeah, I'm aware of the timers. I had intended for puppet-agent to be run on both doc1003 and doc1002 at the same time to allow the timers to be removed. I thought I had covered that but must have forgotten. I've updated the plan to reflect it!

Mentioned in SAL (#wikimedia-releng) [2023-05-24T10:09:35Z] <eoghan> Switching doc.wikimedia.org from doc1002 -> doc1003, T319477

Change 922493 merged by EoghanGaffney:

[operations/dns@master] Move doc.discovery.wmnet to new bullseye hosts

https://gerrit.wikimedia.org/r/922493

Change 922487 merged by EoghanGaffney:

[operations/puppet@production] Switch doc host from doc1002 to doc1003

https://gerrit.wikimedia.org/r/922487

The CI publishing job is https://integration.wikimedia.org/ci/job/publish-to-doc/ and the sole failure it had was for the job which generates the mediawiki/core documentation (mediawiki-core-doxygen-docker).

It managed to rsync (targeting the discovery hostname):

```
+ rsync --archive --stats --compress --delete-after . rsync://doc.discovery.wmnet/doc/mediawiki-core/REL1_38/php
Number of files: 29,819 (reg: 29,817, dir: 2)
Number of created files: 0
Number of deleted files: 0
Number of regular files transferred: 29,817
Total file size: 3,015,516,833 bytes
Total transferred file size: 3,015,516,833 bytes
Literal data: 6,430,855 bytes
Matched data: 3,009,085,978 bytes
File list size: 1,572,500
File list generation time: 0.203 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 5,419,162
Total bytes received: 14,464,604

sent 5,419,162 bytes  received 14,464,604 bytes  418,605.60 bytes/sec
total size is 3,015,516,833  speedup is 151.66

Published at https://doc.wikimedia.org/mediawiki-core/REL1_38/php/
Done.
```

And on the host I can see updated files:

```
doc1003:~$ ls -lta /srv/doc/mediawiki-core/REL1_38/php|head  -n4
total 3002236
-rw-rw-r-- 1 doc-uploader doc-uploader      616 May 24 10:40 folderclosed.png
-rw-rw-r-- 1 doc-uploader doc-uploader      597 May 24 10:40 folderopen.png
-rw-rw-r-- 1 doc-uploader doc-uploader      314 May 24 10:40 splitbar.png
```

So I guess it is a success :]

The migration went smoothly. I'm going to remove the two buster hosts from service this afternoon, and we can decommission them later.

Change 922872 had a related patch set uploaded (by EoghanGaffney; author: EoghanGaffney):

[operations/puppet@production] Remove buster hosts from doc rotation

https://gerrit.wikimedia.org/r/922872

Change 922872 merged by EoghanGaffney:

[operations/puppet@production] Remove buster hosts from doc rotation

https://gerrit.wikimedia.org/r/922872

Change 922893 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] doc: replace doc1002 with doc1003 in test examples

https://gerrit.wikimedia.org/r/922893

The migration went smoothly. I'm going to remove the two buster hosts from service this afternoon, and we can decommission them later.

Great work! :)

Macro antoine-approve:

Change 922893 merged by Dzahn:

[operations/puppet@production] doc: update test example command to use 2 new hosts

https://gerrit.wikimedia.org/r/922893

cookbooks.sre.hosts.decommission executed by eoghan@cumin1001 for hosts: doc1002.eqiad.wmnet

  • doc1002.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

cookbooks.sre.hosts.decommission executed by eoghan@cumin1001 for hosts: doc2001.codfw.wmnet

  • doc2001.codfw.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster codfw to Netbox

The two remaining Buster hosts were decommissioned and the old rsync timer jobs have been removed.

Change 902825 abandoned by Andrea Denisse:

[operations/puppet@production] doc: Add support for passive_hosts synchronization via rsync

Reason:

https://gerrit.wikimedia.org/r/902825