
replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent
Closed, ResolvedPublic

Description

doc1001.eqiad.wmnet is the backend for https://doc.wikimedia.org and is running stretch.

It should be replaced with doc1002 running on buster.

And possibly doc2001 should be created as well.

time frame: by the end of Q3 2020

Event Timeline

  1. It is using a document root that is not /srv/deployment/integration/docroot

https://gerrit.wikimedia.org/r/c/operations/puppet/+/625644/6/modules/profile/manifests/doc.pp

That's fine I think. This change moved the location of CI-generated publications from /srv/docroot/org/wikimedia/doc to /srv/doc.

The issue I'm reporting is that changes to the doc.wikimedia.org site itself (from integration/docroot) appear not to be reflected. This was (seemingly) moved to Scap in this puppet change, but for some reason deployments aren't working. The code is going to doc1001 correctly, and from what I can tell the /srv/deployment scap directory is indeed the document root. Yet the change is not reflected on doc.wikimedia.org.

There is definitely only doc1001 in the caching layer. The DNS name used by caching servers is doc.discovery.wmnet and that is hardcoded in DNS repo to point to doc1001, there is no geo DNS for it.

The document root of the webserver on doc1001 is really DocumentRoot /srv/deployment/integration/docroot/org/wikimedia/doc

And yes, /srv/deployment/integration/docroot/org/wikimedia/doc/opensource.yaml has the Minify entry.

regarding caching there is:

    # Lower caching length (T184255)
    Header set Cache-Control "s-maxage=3600, must-revalidate, max-age=0"
Got error 'PHP message: PHP Parse error:  syntax error, unexpected 'const' (T_CONST), expecting variable (T_VARIABLE) in /srv/deployment/integration/docroot-cache/revs/672e79ffce05ef863b0b508a89ecb5f67ca9b916/shared/Page.php on line 5\n'

This is a stretch server with PHP 7.0. But "The ability to specify the visibility of class constants was only added in PHP 7.1" source

The line 5 in Page.php has "public const INDEX_ALLOW_SKIP = 1;" but the "public" part won't work here?

Switching to doc1002 on buster should actually fix that issue.

Apparently it was just not noticed before due to caching.

There is definitely only doc1001 in the caching layer. The DNS name used by caching servers is doc.discovery.wmnet and that is hardcoded in DNS repo to point to doc1001, there is no geo DNS for it.

The document root of the webserver on doc1001 is really DocumentRoot /srv/deployment/integration/docroot/org/wikimedia/doc

And yes, /srv/deployment/integration/docroot/org/wikimedia/doc/opensource.yaml has the Minify entry.

regarding caching there is:

    # Lower caching length (T184255)
    Header set Cache-Control "s-maxage=3600, must-revalidate, max-age=0"

And yet:

krinkle@doc1001:~$ curl 'http://doc1001.eqiad.wmnet' -H 'Host: doc.wikimedia.org'

shows a perfectly fine, but outdated HTML response, with various libs, but not "Minify".

Got error 'PHP message: PHP Parse error:  syntax error, unexpected 'const' (T_CONST), expecting variable (T_VARIABLE) in /srv/deployment/integration/docroot-cache/revs/672e79ffce05ef863b0b508a89ecb5f67ca9b916/shared/Page.php on line 5\n'

Huh, and indeed it shows HTTP 500 now. But where was it cached? Does the local Apache have an HTTP cache proxy that serves stale responses if it gets 500 from PHP?

Change 663093 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[integration/docroot@master] shared: Unbreak Page.php for old php 7.0 doc1001 server

https://gerrit.wikimedia.org/r/663093

Change 663093 merged by jenkins-bot:
[integration/docroot@master] shared: Unbreak Page.php for old php 7.0 doc1001 server

https://gerrit.wikimedia.org/r/663093

after the changes above and rebooting the VM things are working again.

let's just switch to doc1002, Hashar

I cannot tell why the homepage of doc.wikimedia.org would be stale after a deployment of integration/docroot.git. I noticed that a few days ago, after the introduction of a change to Shellbox. I should have reported it, but I was on the MediaWiki train and focused on that.

The issue was no longer showing this morning. Maybe the unexpected 'const' (T_CONST) is what caused the page to be stale? The path refers to 672e79ffce05ef863b0b508a89ecb5f67ca9b916, which is the latest revision of integration/docroot, so at least it is using the proper version of the code. I haven't captured the response headers I had yesterday. Maybe the cache layer just kept serving them (there is a one-hour s-maxage), and maybe the cached page is not invalidated in ATS/Varnish if the backend serves a 500 (if it was indeed serving a 500 for that constant error).
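
To illustrate the caching explanation: a minimal Python sketch, assuming the RFC 9111 rule that a shared cache (such as ATS/Varnish) prefers s-maxage over max-age. The function names are hypothetical, and the parser is simplified (it does not handle quoted values that contain commas).

```python
from typing import Dict, Optional

# Header value taken from the Apache configuration quoted in this task.
DOC_HEADER = "s-maxage=3600, must-revalidate, max-age=0"


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control value into {directive: argument or None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


def freshness_lifetime(value: str, shared: bool) -> Optional[int]:
    """Seconds a cache may reuse the response without contacting the backend.

    Assumption: shared caches use s-maxage when present, otherwise max-age;
    private caches (browsers) only look at max-age.
    """
    directives = parse_cache_control(value)
    if shared and directives.get("s-maxage") is not None:
        return int(directives["s-maxage"])
    if directives.get("max-age") is not None:
        return int(directives["max-age"])
    return None
```

With the header configured on doc1001, `freshness_lifetime(DOC_HEADER, shared=True)` gives 3600 while `shared=False` gives 0, which is consistent with the CDN serving an older page for up to an hour without asking the backend.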

doc1001 does run php 7.0, we should have kept the php7.0 linting job on the repository to prevent the issue we had with a class constant being declared public. But the php70 linter got removed as part of phasing out 7.0 from CI entirely, I guess that is a shortcoming. I am pretty sure I reported about it on the related change but then given we really really had to drop php 7.0 I guess it was an acceptable risk to take.
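
A check along these lines could flag the construct before deployment. This is only a rough Python sketch, assuming a line-based regular expression is good enough; it is not a substitute for running `php -l` under PHP 7.0, and the function name and the sample source in the usage note are hypothetical.

```python
import re
from typing import List, Tuple

# Visibility modifiers on class constants require PHP 7.1 or later.
VISIBILITY_CONST = re.compile(r"^\s*(public|protected|private)\s+const\b")


def find_const_visibility(php_source: str) -> List[Tuple[int, str]]:
    """Return (line number, modifier) for each class constant with a visibility keyword."""
    hits = []
    for lineno, line in enumerate(php_source.splitlines(), start=1):
        match = VISIBILITY_CONST.match(line)
        if match:
            hits.append((lineno, match.group(1)))
    return hits
```

Run against a source containing `public const INDEX_ALLOW_SKIP = 1;`, the function reports that line together with the `public` modifier; a plain `const` declaration is not reported.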

What we can do in the interim is to upgrade php on doc.wikimedia.org. It is running Stretch so we can get the php 7.2 packages from component/php72. I think that needs:

  • some changes to the Apache configuration in case it points to php7.0-fpm
  • ensure that our integration/docroot webapp serving https://doc.wikimedia.org/ does work for php 7.2 (pretty sure it does)
  • check that Doxygen-generated documentation still has its search working (it was broken on Stretch because the doxygen version shipped did not support php 7.0). Hopefully it works fine for php 7.2. That can be checked after the upgrade.

As for this migration, I haven't commented on this task yet because I have no availability to conduct it myself. I stopped scaling ages ago, and that was clearly identified a year or so ago when I left for a couple of months. The reality is I simply have too many tasks to manage.

The good news is that it has been identified. The plan we have with Tyler is for me to pair with @LarsWirzenius to conduct the various Buster upgrades we have to do. But that needs some availability from both Lars and me, and a few pairing sessions to explain what doc.wikimedia.org is doing and capture that in documentation while at it ( https://wikitech.wikimedia.org/wiki/doc.wikimedia.org ). Then I do some iterations with him to prepare the migration and finally actually do it.

Tyler and I talked about that a few weeks ago, I have been upgrading Gerrit, running the MediaWiki train and other CI related maintenance burden and I haven't reached out to Lars yet about it.

The issue was no longer showing this morning.

As mentioned above, this was due to the hot fixes merged above and rebooting the VM.

Maybe the unexpected 'const' (T_CONST) is what caused the page to be stale?

Yes, that caused the 500 errors that popped up after it wasn't serving stale content anymore.

doc1001 does run php 7.0, we should have kept the php7.0 linting job on the repository

Yea, Krinkle is aware of that.

What we can do in the interim is to upgrade php on doc.wikimedia.org. It is running Stretch so we can get the php 7.2 packages from component/php72. I think that needs:

  • some changes to the Apache configuration in case it points to php7.0-fpm

Or we could switch to the buster replacement VM that has been sitting there for a while now, only waiting for an OK to switch. That doesn't need any new changes and isn't another interim solution.

Change 650625 abandoned by Dzahn:
[operations/dns@master] switch doc.wikimedia.org to doc1002 backend

Reason:
per chat with Krinkle, this isn't ready yet to be switched

https://gerrit.wikimedia.org/r/650625

I've not managed to do anything for this task yet, but I have a question: since doc.wikimedia.org seems to primarily be a site with static content, why do we run PHP on it?

Please see details on T211974 where @hashar requested it to have PHP.

I've not managed to do anything for this task yet, but I have a question: since doc.wikimedia.org seems to primarily be a site with static content, why do we run PHP on it?

PHP is used for:

  • Basic things that vary only based on stuff in the portal repo and its config (potentially statically generatable):
    • Home page, based on YAML/JSON.
    • Shared layout, header, and footer for all pages.
  • Dynamic things only known at run-time, which vary based on what is continuously published by various repos' CI post-merge to the "doc" directory (not the docroot):
    • navigation pages for browsing coverage reports.
    • navigation pages for browsing doc pages.
    • navigation pages for browsing versions and sub components of the same software (dirlist).
    • redirect to subdirs for discovery of coverage and api docs (dirlist/404).
  • Dynamic things used by individual microsites served from the "doc" directory, e.g. for a specific software/component/version such as "mediawiki/php/REL1_35"
    • Search (Doxygen sites).
Dzahn changed the task status from Open to Stalled. Apr 6 2021, 6:45 PM

Replacement VMs running buster, with the puppet role for doc applied and no puppet errors, have been on standby for quite some time.

contint-admins group is applied, granting shell access to all members

doc1002 and the new codfw equivalent doc2001 are there to replace the SPOF doc1001

I am not sure if there is actually a technical blocker here that keeps us from simply switching.

Maybe someone could run some tests from the deployment server to check for differences between doc1001 and doc1002/doc2001?

for example with httpbb (httpbb --hosts doc1002.eqiad.wmnet /srv/deployment/httpbb-tests/doc/test_doc.yaml ), httpbb with their own test yaml file, curl or something else?
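
As one possible "something else", a minimal Python sketch that compares what two backends return for the same path when sent the public Host header. The function names are hypothetical, plain HTTP on port 80 is assumed (as in the curl example earlier in this task), and byte-for-byte comparison is a deliberate simplification.

```python
import hashlib
import urllib.error
import urllib.request


def fetch(host, path="/", vhost="doc.wikimedia.org", timeout=10):
    """GET path from one backend host, sending the public Host header."""
    req = urllib.request.Request(f"http://{host}{path}", headers={"Host": vhost})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # Error responses (such as a 500) still carry a status and body to compare.
        return err.code, err.read()


def summarize(status, body):
    """Reduce a response to the fields being compared."""
    return {"status": status, "sha256": hashlib.sha256(body).hexdigest()}


def differences(old, new):
    """Names of the summary fields that differ between two responses."""
    return sorted(key for key in old if old[key] != new.get(key))
```

For example, `differences(summarize(*fetch("doc1001.eqiad.wmnet")), summarize(*fetch("doc1002.eqiad.wmnet")))` returns an empty list when both backends answer identically. Pages that embed timestamps or hostnames would need a looser comparison.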

Change 650306 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] scap/dsh: add doc1002/doc2001 to ci-docroot hosts

https://gerrit.wikimedia.org/r/650306

@Dzahn Just an idea, but if we create an alias of some kind (doc.svc?) we can reduce the amount of hardcoded references to doc1001, and thus make the switch a bit easier to do, and especially easier to revert.

The main one that I'm thinking about is the Jenkins jobs that hardcode doc1001 currently as their rsync target.

Also, should the new hosts be added to the scap target for docroot in hieradata/common/scap/dsh.yaml? (This is just the site homepage, not the main content. I'm guessing the spare host remains an empty/cold standby.)

@Krinkle sure, always a good idea to replace hardcoded host names. We already have this !:)

[deploy1002:~] $ host doc.discovery.wmnet
doc.discovery.wmnet is an alias for doc1001.eqiad.wmnet.

We use it in ATS and as cert name:

common/profile/trafficserver/backend.yaml:      replacement: https://doc.discovery.wmnet
role/common/doc.yaml:profile::tlsproxy::envoy::global_cert_name: "doc.discovery.wmnet"

should the new hosts be added to the scap target for docroot in hieradata/common/scap/dsh.yaml?

I don't know that, I'm not a deployer of it and releng isn't on this ticket anymore.

Change 650306 restored by Hashar:

[operations/puppet@production] scap/dsh: add doc1002/doc2001 to ci-docroot hosts

https://gerrit.wikimedia.org/r/650306

Change 741713 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P::doc: sync data to non-active servers

https://gerrit.wikimedia.org/r/741713

Change 741715 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] P::doc: use correct php_fpm path

https://gerrit.wikimedia.org/r/741715

should the new hosts be added to the scap target for docroot in hieradata/common/scap/dsh.yaml?

I don't know that, I'm not a deployer of it and releng isn't on this ticket anymore.

I'm not a deployer either but Stretch is starting to get really old and we should get rid of it, so yes if possible.

In T247653#7527732, @Majavah wrote:

Stretch is starting to get really old and we should get rid of it, so yes if possible.

Of course I agree, that's why I made this ticket.

taavi changed the task status from Stalled to Open. Nov 24 2021, 9:00 PM

I don't think this is stalled on anything, let's get it done.

Dzahn changed the task status from Open to Stalled. Nov 24 2021, 10:06 PM

It's stalled on bandwidth of releng, and per @Aklapper bug status should reflect reality, not wishes.

Change 650306 merged by Dzahn:

[operations/puppet@production] scap/dsh: add doc1002/doc2001 to ci-docroot hosts

https://gerrit.wikimedia.org/r/650306

Change 741715 merged by Dzahn:

[operations/puppet@production] P::doc: use correct php_fpm path

https://gerrit.wikimedia.org/r/741715

Change 741752 had a related patch set uploaded (by Majavah; author: Majavah):

[integration/config@master] publish: use discovery name

https://gerrit.wikimedia.org/r/741752

Change 741713 merged by Jbond:

[operations/puppet@production] P::doc: sync data to non-active servers

https://gerrit.wikimedia.org/r/741713

Change 743485 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] httpbb: fix doc tests

https://gerrit.wikimedia.org/r/743485

Change 743485 merged by RLazarus:

[operations/puppet@production] httpbb: fix doc tests

https://gerrit.wikimedia.org/r/743485

Change 741752 merged by jenkins-bot:

[integration/config@master] publish: use discovery name

https://gerrit.wikimedia.org/r/741752

Change 744762 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/dns@master] discovery: switchover doc to doc1002

https://gerrit.wikimedia.org/r/744762

Change 744763 had a related patch set uploaded (by Majavah; author: Majavah):

[operations/puppet@production] hieradata: switchover doc to doc1002

https://gerrit.wikimedia.org/r/744763

@hashar @Krinkle Content sync between instances, the jenkins publish job and integration/docroot scap setup are all now fixed. The site also seems to work fine when browsing it via an SSH tunnel (ssh -L 8083:doc1002.eqiad.wmnet:80 deployment.eqiad.wmnet). Are you aware of any remaining blockers or can we try switching over doc1001->doc1002?

Krinkle changed the task status from Stalled to Open. Dec 7 2021, 7:11 PM
Krinkle reassigned this task from Dzahn to hashar.

A few people poked me about that instance. Looks like the bulk of the work has been accomplished by @Dzahn and @Majavah

On the CI configuration side, we push artifacts via the Jenkins job publish-to-doc, which rsyncs to the DNS entry doc.discovery.wmnet. It is a CNAME with a 300-second time to live.

Thus the switch can be done by restoring and deploying the DNS change https://gerrit.wikimedia.org/r/c/operations/dns/+/650625/

The Jenkins job runs on contint2001.wikimedia.org so possibly the DNS cache will have to be flushed.

There is an rsync job triggered once per hour on the active doc server which rsyncs all published docs to the other servers. The switch can be done with https://gerrit.wikimedia.org/r/c/operations/puppet/+/744763

Thus the sole issues I see are the time based race conditions when doing:

  • the DNS change (some doc might end up published on the old host) and applying the puppet patch
  • the puppet patch, a doc published on the new host might be erased if the timer kicks in before puppet has been run on the old host

Those are probably not a big concern :)
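
The second race can be sketched as follows, assuming the hourly sync makes the other hosts an exact mirror of the sender (for example rsync with a delete option; the actual timer command is not shown in this task). Host names are from this task; the function and file names are hypothetical.

```python
from typing import Dict, Set


def hourly_sync(trees: Dict[str, Set[str]], source: str,
                delete_extraneous: bool = True) -> Dict[str, Set[str]]:
    """Return the file sets on every host after `source` pushes to all others.

    Assumption: with delete_extraneous=True the receivers end up as exact
    mirrors of the source, so files that exist only on a receiver are lost.
    """
    result = {}
    for host, files in trees.items():
        if host == source:
            result[host] = set(files)
        elif delete_extraneous:
            result[host] = set(trees[source])
        else:
            result[host] = set(files) | trees[source]
    return result
```

If doc1002 has already received a new publication but doc1001 still believes it is the active host when its timer fires, `hourly_sync(trees, "doc1001")` leaves doc1002 without that publication. Once puppet has run on both hosts, the sync direction is doc1002 to doc1001 and nothing is lost.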

I propose the following rollout:

  1. change 744763 (puppet), this sets doc::active_host to doc1002.
  2. apply or wait for, puppet run on doc hosts.

This has the effect of disallowing incoming rsync from Jenkins on doc1001, and opening up incoming on doc1002, and inverting the hourly rsync cron. It has the indirect effect of making these Jenkins jobs fail until the next step, but no publications will be lost, and the jobs can be retried if we want to, but there's generally only one every few minutes, not continuous.

  3. change 650625 (dns). this sets doc.discovery.wmnet to doc1002.

This has the effect of switching public read traffic for doc.wikimedia.org from doc1001 to doc1002, and within a few minutes, also for Jenkins jobs to write to doc1002.

If we want, we can clear the DNS cache on contint100x at this point to get the jobs to start passing or retrying sooner. Public traffic may be stale for up to 5 min, depending on where we are in the CDN server's DNS cache interval. This is harmless imho.
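
The effect of doing puppet first and DNS second can be sketched as a tiny Python model, assuming a Jenkins publish succeeds only when doc.discovery.wmnet resolves to the host that currently accepts incoming rsync. The phase labels and function names are hypothetical; DNS caching delays are ignored.

```python
def publish_result(dns_target: str, accepting_host: str) -> str:
    """Jenkins rsyncs to the DNS target; only the active host accepts it."""
    return "accepted" if dns_target == accepting_host else "rejected"


# (phase, host doc.discovery.wmnet points to, host accepting incoming rsync)
ROLLOUT = [
    ("before", "doc1001", "doc1001"),
    ("after puppet change 744763", "doc1001", "doc1002"),
    ("after dns change 650625", "doc1002", "doc1002"),
]


def simulate():
    """Outcome of a publish attempt in each phase of the rollout."""
    return [(phase, publish_result(dns, active)) for phase, dns, active in ROLLOUT]
```

In this model the only phase where a publish is rejected is the window between the puppet run and the DNS change. A rejected job fails visibly and can be retried, so nothing is written to a host that is about to be retired.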

@Krinkle Yep, that summary sounds right to me. That's what we had in mind. It's just that some time ago you had said it's not ready yet to be switched on that change https://gerrit.wikimedia.org/r/c/operations/dns/+/650625/. I don't recall what the specific reasons were for it not being ready. But if there is no concern anymore now then this should be ready to go anytime. Feel free to invite me via calendar to make this happen. I can deal with reasonably early time in my timezone.

As pointed out in T311732 (now merged as duplicate of this one), we are blocked on doc1001 due to some envoy upgrades which are difficult to do on stretch (and doc1001 is the only host that matters that is still stretch). It would be awesome if we could push this forward soon, let me know if I can somehow help (I can definitely perform the puppet/dns deploys if they have been greenlighted).

@Krinkle and I agreed on doing this tomorrow at 14:00 PST

Dzahn changed the task status from Open to In Progress. Jun 30 2022, 10:11 PM

Change 650625 restored by Dzahn:

[operations/dns@master] switch doc.wikimedia.org to doc1002 backend

https://gerrit.wikimedia.org/r/650625

  1. change 744763 (puppet), this sets doc::active_host to doc1002.

rebased

also: https://gerrit.wikimedia.org/r/q/topic:doc.wikimedia.org

  2. apply or wait for, puppet run on doc hosts.

will do that via cumin

This has the effect of disallowing incoming rsync from Jenkins on doc1001, and opening up incoming on doc1002, and inverting the hourly rsync cron.

Planning to manually run the rsync command instead of waiting for the timer (formerly cron), but with the exact same command the timer would run, as the same user (root).

It has the indirect effect of making these Jenkins jobs fail until the next step, but no publications will be lost, and the jobs can be retried if we want to, but there's generally only one every few minutes, not continuous.

great. thanks for confirming that

  3. change 650625 (dns). this sets doc.discovery.wmnet to doc1002.

This has the effect of switching public read traffic for doc.wikimedia.org from doc1001 to doc1002, and within a few minutes, also for Jenkins jobs to write to doc1002.

needed manual rebase. fixed. yea, TTL is 5 min already and it will be quick

If we want, we can clear the DNS cache on contint100x at this point to get the jobs to start passing or retrying sooner. Public traffic may be stale for up to 5 min, depending on where we are in the CDN server's DNS cache interval. This is harmless imho.

agree, I don't think we have to worry about that

Change 810399 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] doc: remove doc1001 from doc::all_hosts and scap dsh groups

https://gerrit.wikimedia.org/r/810399

Change 810400 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] site/DHCP: decom doc1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/810400

I propose the following rollout:

added 3 more steps:

  • remove doc1001 from scap dsh groups and from doc::all_hosts group (used to determine which hosts have rsync/ferm)

https://gerrit.wikimedia.org/r/c/operations/puppet/+/810399

  • run decom cookbook (this does not have to happen today)
  • remove doc1001 from DHCP and site.pp

https://gerrit.wikimedia.org/r/c/operations/puppet/+/810400

Change 810401 had a related patch set uploaded (by Dzahn; author: Dzahn):

[operations/puppet@production] doc: remove support for stretch / PHP7.0

https://gerrit.wikimedia.org/r/810401

I propose the following rollout:

I suggest:

Mentioned in SAL (#wikimedia-operations) [2022-07-01T21:09:01Z] <mutante> https://doc.wikimedia.org - scheduled maintenance period - switching to buster backend doc1002 (T247653)

Change 744763 merged by Dzahn:

[operations/puppet@production] hieradata: switchover doc to doc1002

https://gerrit.wikimedia.org/r/744763

Change 810399 merged by Dzahn:

[operations/puppet@production] doc: remove doc1001 from doc::all_hosts and scap dsh groups

https://gerrit.wikimedia.org/r/810399

Change 650625 merged by Dzahn:

[operations/dns@master] switch doc.wikimedia.org to doc1002 backend

https://gerrit.wikimedia.org/r/650625

Many thanks for the work on this one @Dzahn!

Same, thanks @Dzahn for making the OOUIPHP demos work again!

Mentioned in SAL (#wikimedia-operations) [2022-07-07T20:03:38Z] <mutante> destroying former strech backend of doc.wikimedia.org, replaced by doc1002 on buster (T247653)

cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for hosts: doc1001.eqiad.wmnet

  • doc1001.eqiad.wmnet (PASS)
    • Downtimed host on Icinga/Alertmanager
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster eqiad to Netbox

:) yw

doc1001.eqiad.wmnet has now been destroyed (via decom cookbook).

the original ticket is resolved. doc1001 is gone and doc2001 exists.

modulo questions like "now, can it be active-active or does it need to stay active-passive".

We do rsync data from 1002 to 2001 and it's a DNS flip of where we point doc.discovery.wmnet now.

When this was created we did not even have a discovery record, and now we use that in other places instead of hardcoding hostnames. so much better already.

Change 810400 merged by Dzahn:

[operations/puppet@production] site/DHCP: decom doc1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/810400

Change 810401 merged by Dzahn:

[operations/puppet@production] doc: remove support for stretch, add support for bullseye

https://gerrit.wikimedia.org/r/810401

Change 744762 abandoned by Majavah:

[operations/dns@master] discovery: switchover doc to doc1002

Reason:

not needed

https://gerrit.wikimedia.org/r/744762