doc1001.eqiad.wmnet is the backend for https://doc.wikimedia.org and is running Stretch.
It should be replaced with doc1002 running Buster.
Possibly doc2001 should be created as well.
Time frame: by the end of Q3 2020
Status | Subtype | Assigned | Task
---|---|---|---
Stalled | None | | T302086 Set scap minimum python version to 3.7
Resolved | None | | T247045 Migrate all of production metal and VMs to Buster or later
Resolved | BUG REPORT | Krinkle | T297035 Demos page for OOUI in php is broken
Resolved | | Dzahn | T247653 replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent
Resolved | | Dzahn | T269977 eqiad: 1 VM request for doc (doc1002)
Resolved | | Dzahn | T269978 codfw: 1 VM request for doc.wikimedia.org (doc2001)
There is definitely only doc1001 in the caching layer. The DNS name used by caching servers is doc.discovery.wmnet and that is hardcoded in DNS repo to point to doc1001, there is no geo DNS for it.
The document root of the webserver on doc1001 is really DocumentRoot /srv/deployment/integration/docroot/org/wikimedia/doc
And yes, /srv/deployment/integration/docroot/org/wikimedia/doc/opensource.yaml has the Minify entry.
Regarding caching there is:
# Lower caching length (T184255)
Header set Cache-Control "s-maxage=3600, must-revalidate, max-age=0"
Got error 'PHP message: PHP Parse error: syntax error, unexpected 'const' (T_CONST), expecting variable (T_VARIABLE) in /srv/deployment/integration/docroot-cache/revs/672e79ffce05ef863b0b508a89ecb5f67ca9b916/shared/Page.php on line 5\n'
This is a Stretch server with PHP 7.0. But "The ability to specify the visibility of class constants was only added in PHP 7.1" (source).
Line 5 in Page.php reads "public const INDEX_ALLOW_SKIP = 1;", so the "public" modifier won't parse here?
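Right: PHP 7.0 rejects visibility modifiers on class constants at parse time; they only became valid syntax in PHP 7.1. A minimal illustration (hypothetical class names, not the actual Page.php contents):

```php
<?php
// Parses on PHP 7.1+, but is a syntax error on PHP 7.0, because
// visibility modifiers on class constants were only added in 7.1:
class Example {
    public const INDEX_ALLOW_SKIP = 1;
}

// PHP 7.0-compatible equivalent: class constants without a modifier
// are implicitly public, so this declares the same thing.
class ExampleCompat {
    const INDEX_ALLOW_SKIP = 1;
}
```

Dropping the modifier is exactly the kind of change the hotfix below applies; the behavior is identical on 7.1+ since constants default to public.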
Switching to doc1002 on buster should actually fix that issue.
Apparently it was just not noticed before due to caching.
And yet:
shows a perfectly fine, but outdated HTML response, with various libs, but not "Minify".
Huh, and indeed it shows HTTP 500 now. But where was it cached? Does the local Apache have an HTTP cache proxy that serves stale responses if it gets 500 from PHP?
Change 663093 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[integration/docroot@master] shared: Unbreak Page.php for old php 7.0 doc1001 server
Change 663093 merged by jenkins-bot:
[integration/docroot@master] shared: Unbreak Page.php for old php 7.0 doc1001 server
after the changes above and rebooting the VM things are working again.
let's just switch to doc1002, Hashar
I cannot tell why the homepage of doc.wikimedia.org would be stale after a deployment of integration/docroot.git. I noticed that a few days ago after the introduction of a change to Shellbox. Surely I should have reported it, but I was on the MediaWiki train and focused on that.
The issue was no longer showing this morning. Maybe the unexpected 'const' (T_CONST) is what caused the page to be stale? At least the path refers to 672e79ffce05ef863b0b508a89ecb5f67ca9b916, which is the latest of integration/docroot, so at least it is using the proper version of the code. I haven't captured the response headers I had yesterday; maybe the cache layer just kept serving them (there is a one-hour max-age), and maybe the cached page is not invalidated in ATS/Varnish if the backend serves a 500 (if it was indeed serving a 500 for that constant error).
doc1001 does run PHP 7.0; we should have kept the php7.0 linting job on the repository to prevent the issue we had with a class constant being declared public. But the php70 linter got removed as part of phasing out 7.0 from CI entirely; I guess that is a shortcoming. I am pretty sure I reported it on the related change, but given we really had to drop PHP 7.0, I guess it was an acceptable risk to take.
What we can do in the interim is to upgrade PHP on doc.wikimedia.org. It is running Stretch, so we can get the PHP 7.2 packages from component/php72. I think that needs:
As for this migration, I haven't commented on this task yet because I have had no availability to conduct it myself. I stopped scaling ages ago, and that was clearly identified a year or so ago when I left for a couple of months. The reality is I simply have too many tasks to manage.
The good news is that it has been identified. The plan we have with Tyler is for me to pair with @LarsWirzenius to conduct the various Buster upgrades we have to do. But that needs some availability from both Lars and me: a few pairing sessions to explain what doc.wikimedia.org is doing and to capture that in documentation while at it (https://wikitech.wikimedia.org/wiki/doc.wikimedia.org). Then I do some iterations with him to prepare the migration and finally actually do it.
Tyler and I talked about that a few weeks ago. I have been upgrading Gerrit, running the MediaWiki train and handling other CI-related maintenance, and I haven't reached out to Lars yet about it.
As mentioned above, this was due to the hot fixes merged above and rebooting the VM.
Maybe the unexpected 'const' (T_CONST) is what caused the page to be stale?
Yes, that caused the 500 errors that popped up after it wasn't serving stale content anymore.
doc1001 does run php 7.0, we should have kept the php7.0 linting job on the repository
Yea, Krinkle is aware of that.
What we can do in the interim is to upgrade php on doc.wikimedia.org. It is running Stretch so we can get the php 7.2 packages from component/php72. I think that needs:
- some changes to the Apache configuration in case it points to php7.0-fpm
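If that interim route were taken, the Apache change would be along these lines. This is a hypothetical sketch only; the actual handler and socket path on doc1001 are defined by the puppet profile, and the new hosts made it moot:

```apache
# Hypothetical: repoint the PHP handler from the 7.0 FPM socket to 7.2.
# Real socket paths come from the puppet php::fpm configuration.
<FilesMatch "\.php$">
    SetHandler "proxy:unix:/run/php/php7.2-fpm.sock|fcgi://localhost"
</FilesMatch>
```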
Or we could switch to the Buster replacement VM that has been sitting there for a while now, only waiting for an OK to switch. That doesn't need any new changes and isn't another interim solution.
Change 650625 abandoned by Dzahn:
[operations/dns@master] switch doc.wikimedia.org to doc1002 backend
Reason:
per chat with Krinkle, this isn't ready yet to be switched
I've not managed to do anything for this task yet, but I have a question: since doc.wikimedia.org seems to primarily be a site with static content, why do we run PHP on it?
PHP is used for:
The replacement VMs with Buster have the puppet role for doc applied, show no puppet errors, and have been on standby for quite some time.
contint-admins group is applied, granting shell access to all members
doc1002 and the new codfw equivalent doc2001 are there to replace the SPOF doc1001
I am not sure if there is actually a technical blocker here that keeps us from simply switching.
Maybe someone could run some tests from the deployment server to check for differences between doc1001 and doc1002/doc2001?
for example with httpbb (httpbb --hosts doc1002.eqiad.wmnet /srv/deployment/httpbb-tests/doc/test_doc.yaml), httpbb with their own test YAML file, curl, or something else?
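A quick manual comparison could also be done with curl. This is a sketch that only works from inside the production network (e.g. a deployment host), using the hostnames mentioned above:

```shell
# Fetch the homepage from both backends with the public Host header
# and diff the responses; only runnable with production access.
curl -s -H 'Host: doc.wikimedia.org' http://doc1001.eqiad.wmnet/ > /tmp/doc1001.html
curl -s -H 'Host: doc.wikimedia.org' http://doc1002.eqiad.wmnet/ > /tmp/doc1002.html
diff /tmp/doc1001.html /tmp/doc1002.html && echo "backends match"
```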
Change 650306 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] scap/dsh: add doc1002/doc2001 to ci-docroot hosts
@Dzahn Just an idea, but if we create an alias of some kind (doc.svc?) we can reduce the amount of hardcoded references to doc1001, and thus make the switch a bit easier to do, and especially easier to revert.
The main one that I'm thinking about is the Jenkins jobs that hardcode doc1001 currently as their rsync target.
Also, should the new hosts be added to the scap target for docroot in hieradata/common/scap/dsh.yaml? (This is just the site homepage, not the main content. I'm guessing the spare host remains an empty/cold standby.)
@Krinkle sure, always a good idea to replace hardcoded host names. We already have this! :)
[deploy1002:~] $ host doc.discovery.wmnet
doc.discovery.wmnet is an alias for doc1001.eqiad.wmnet.
We use it in ATS and as cert name:
common/profile/trafficserver/backend.yaml: replacement: https://doc.discovery.wmnet
role/common/doc.yaml: profile::tlsproxy::envoy::global_cert_name: "doc.discovery.wmnet"
should the new hosts be added to the scap target for docroot in hieradata/common/scap/dsh.yaml?
I don't know that, I'm not a deployer of it and releng isn't on this ticket anymore.
Change 650306 restored by Hashar:
[operations/puppet@production] scap/dsh: add doc1002/doc2001 to ci-docroot hosts
Change 741713 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/puppet@production] P::doc: sync data to non-active servers
Change 741715 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/puppet@production] P::doc: use correct php_fpm path
I'm not a deployer either but Stretch is starting to get really old and we should get rid of it, so yes if possible.
It's stalled on bandwidth of releng, and per @Aklapper bug status should reflect reality, not wishes.
Change 650306 merged by Dzahn:
[operations/puppet@production] scap/dsh: add doc1002/doc2001 to ci-docroot hosts
Change 741715 merged by Dzahn:
[operations/puppet@production] P::doc: use correct php_fpm path
Change 741752 had a related patch set uploaded (by Majavah; author: Majavah):
[integration/config@master] publish: use discovery name
Change 741713 merged by Jbond:
[operations/puppet@production] P::doc: sync data to non-active servers
Change 743485 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/puppet@production] httpbb: fix doc tests
Change 743485 merged by RLazarus:
[operations/puppet@production] httpbb: fix doc tests
Change 741752 merged by jenkins-bot:
[integration/config@master] publish: use discovery name
Change 744762 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/dns@master] discovery: switchover doc to doc1002
Change 744763 had a related patch set uploaded (by Majavah; author: Majavah):
[operations/puppet@production] hieradata: switchover doc to doc1002
@hashar @Krinkle Content sync between instances, the jenkins publish job and integration/docroot scap setup are all now fixed. The site also seems to work fine when browsing it via an SSH tunnel (ssh -L 8083:doc1002.eqiad.wmnet:80 deployment.eqiad.wmnet). Are you aware of any remaining blockers or can we try switching over doc1001->doc1002?
A few people poked me about that instance. Looks like the bulk of the work has been accomplished by @Dzahn and @Majavah
On the CI configuration side, we push artifacts via the Jenkins job publish-to-doc, which rsyncs to the DNS entry doc.discovery.wmnet. It is a CNAME with a 300-second time to live.
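In the operations/dns zone data that record looks roughly like this (a sketch, not the literal zone file contents):

```
; Sketch of the discovery record. With a 300s TTL, clients follow a
; switchover within five minutes of the CNAME being repointed.
doc.discovery.wmnet.  300  IN  CNAME  doc1001.eqiad.wmnet.
```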
Thus the switch can be done by restoring and deploying the DNS change https://gerrit.wikimedia.org/r/c/operations/dns/+/650625/
The Jenkins job runs on contint2001.wikimedia.org so possibly the DNS cache will have to be flushed.
There is an rsync job triggered once per hour on the active doc server which rsyncs all published docs to the other servers. The switch can be done with https://gerrit.wikimedia.org/r/c/operations/puppet/+/744763
Thus the sole issues I see are the time-based race conditions when doing:
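The hourly sync amounts to something like the following. This is a hypothetical sketch; the real timer's exact flags, source path, and rsync module are defined in puppet:

```shell
# Hypothetical sketch of the hourly doc sync, run as root on the
# active server, mirroring published docs to the standby host.
rsync -avz --delete /srv/doc/ doc2001.codfw.wmnet::doc/
```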
Those are probably not a big concern :)
I propose the following rollout:
This has the effect of disallowing incoming rsync from Jenkins on doc1001, and opening up incoming on doc1002, and inverting the hourly rsync cron. It has the indirect effect of making these Jenkins jobs fail until the next step, but no publications will be lost, and the jobs can be retried if we want to, but there's generally only one every few minutes, not continuous.
This has the effect of switching public read traffic for doc.wikimedia.org from doc1001 to doc1002, and within a few minutes, also for Jenkins jobs to write to doc1002.
If we want, we can clear the DNS cache on contint100x at this point to get the jobs to start passing or retrying sooner. Public traffic may be stale for up to 5 min depending where we are in the CDN server's DNS cache interval. This is harmless imho.
@Krinkle Yep, that summary sounds right to me. That's what we had in mind. It's just that some time ago you said on https://gerrit.wikimedia.org/r/c/operations/dns/+/650625/ that it wasn't ready to be switched yet. I don't recall the specific reasons, but if there is no concern anymore, this should be ready to go anytime. Feel free to invite me via calendar to make this happen. I can deal with a reasonably early time in my time zone.
As pointed out in T311732 (now merged as duplicate of this one), we are blocked on doc1001 due to some envoy upgrades which are difficult to do on stretch (and doc1001 is the only host that matters that is still stretch). It would be awesome if we could push this forward soon, let me know if I can somehow help (I can definitely perform the puppet/dns deploys if they have been greenlighted).
Change 650625 restored by Dzahn:
[operations/dns@master] switch doc.wikimedia.org to doc1002 backend
rebased
also: https://gerrit.wikimedia.org/r/q/topic:doc.wikimedia.org
- apply, or wait for, a puppet run on doc hosts.
will do that via cumin
This has the effect of disallowing incoming rsync from Jenkins on doc1001, and opening up incoming on doc1002, and inverting the hourly rsync cron.
Planning to manually run the rsync command instead of waiting for the timer (formerly a cron job), but it will be the exact same command the timer would run, as the same user (root).
It has the indirect effect of making these Jenkins jobs fail until the next step, but no publications will be lost, and the jobs can be retried if we want to, but there's generally only one every few minutes, not continuous.
great. thanks for confirming that
- change 650625 (dns). this sets doc.discovery.wmnet to doc1002.
This has the effect of switching public read traffic for doc.wikimedia.org from doc1001 to doc1002, and within a few minutes, also for Jenkins jobs to write to doc1002.
Needed a manual rebase; fixed. Yes, the TTL is 5 min already, so it will be quick.
If we want, we can clear the DNS cache on contint100x at this point to get the jobs to start passing or retrying sooner. Public traffic may be stale for up to 5 min depending where we are in the CDN server's DNS cache interval. This is harmless imho.
agree, I don't think we have to worry about that
Change 810399 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] doc: remove doc1001 from doc::all_hosts and scap dsh groups
Change 810400 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] site/DHCP: decom doc1001.eqiad.wmnet
added 3 more steps:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/810399
Change 810401 had a related patch set uploaded (by Dzahn; author: Dzahn):
[operations/puppet@production] doc: remove support for stretch / PHP7.0
I suggest:
Mentioned in SAL (#wikimedia-operations) [2022-07-01T21:09:01Z] <mutante> https://doc.wikimedia.org - scheduled maintenance period - switching to buster backend doc1002 (T247653)
Change 744763 merged by Dzahn:
[operations/puppet@production] hieradata: switchover doc to doc1002
Change 810399 merged by Dzahn:
[operations/puppet@production] doc: remove doc1001 from doc::all_hosts and scap dsh groups
Change 650625 merged by Dzahn:
[operations/dns@master] switch doc.wikimedia.org to doc1002 backend
Mentioned in SAL (#wikimedia-operations) [2022-07-01T21:48:45Z] <mutante> https://doc.wikimedia.org switched to doc1002 backend on buster T247653
Mentioned in SAL (#wikimedia-operations) [2022-07-07T20:03:38Z] <mutante> destroying former stretch backend of doc.wikimedia.org, replaced by doc1002 on buster (T247653)
cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for hosts: doc1001.eqiad.wmnet
the original ticket is resolved. doc1001 is gone and doc2001 exists.
modulo questions like "now, can it be active-active or does it need to stay active-passive".
We rsync data from doc1002 to doc2001, and it's now just a DNS flip of where we point doc.discovery.wmnet.
When this was created we did not even have a discovery record; now we use that in other places instead of hardcoding hostnames. So much better already.
Change 810400 merged by Dzahn:
[operations/puppet@production] site/DHCP: decom doc1001.eqiad.wmnet
Change 810401 merged by Dzahn:
[operations/puppet@production] doc: remove support for stretch, add support for bullseye
Change 744762 abandoned by Majavah:
[operations/dns@master] discovery: switchover doc to doc1002
Reason:
not needed