Move the MW Beta appservers to Debian
Closed, Resolved (Public)

Description

Macro list of clusters:

  • Deployment servers - tin / mira
  • Appservers - deployment-mediawiki 01 02 03
  • Jobrunners - deployment-jobrunner01.deployment-prep.eqiad.wmflabs
  • Videoscalers (got removed, never set up in deployment-prep)

The hostnames can be checked in hieradata/labs/deployment-prep/common.yaml

scap::dsh::groups:
    mediawiki-installation:
        hosts:
            - deployment-jobrunner01.deployment-prep.eqiad.wmflabs
            - deployment-mediawiki01.deployment-prep.eqiad.wmflabs
            - deployment-mediawiki02.deployment-prep.eqiad.wmflabs
            - deployment-mediawiki03.deployment-prep.eqiad.wmflabs
            - deployment-tmh01.deployment-prep.eqiad.wmflabs
            - deployment-tin.deployment-prep.eqiad.wmflabs
            - mira.deployment-prep.eqiad.wmflabs
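
To check which hosts are currently in the group without reading the whole file, something like the following could be used. It is only a sketch: `list_dsh_hosts` is a hypothetical helper name, and it assumes the simple layout shown above (one `- host` item per line under `mediawiki-installation:`), not general YAML.

```shell
# Hypothetical helper: print the hosts listed under mediawiki-installation.
# Assumes the flat layout shown in the task description, not arbitrary YAML.
list_dsh_hosts() {
    # $1: path to the hiera file, e.g. hieradata/labs/deployment-prep/common.yaml
    awk '
        # Start collecting once the group key is seen.
        /^[ \t]*mediawiki-installation:/ { in_group = 1; next }
        # Inside the group, print each list item without its "- " prefix.
        in_group && /^[ \t]*-[ \t]/ { sub(/^[ \t]*-[ \t]*/, ""); print; next }
        # The "hosts:" key itself is skipped.
        in_group && /^[ \t]*hosts:/ { next }
        # Any other non-empty line ends the group.
        in_group && NF { in_group = 0 }
    ' "$1"
}
```

Usage, from a checkout of the puppet repository: `list_dsh_hosts hieradata/labs/deployment-prep/common.yaml`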

Deployment steps

Jenkins runs scap as user jenkins-deploy on deployment-tin. You will need to manually accept the new host's SSH fingerprint by running:

sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki04.deployment-prep.eqiad.wmflabs
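
When several new hosts are added at once, the same command has to be repeated per host. A possible sketch that only prints the commands, so they can be reviewed and then run one by one (the fingerprint prompt is interactive); `print_accept_commands` is a hypothetical helper name, and the domain suffix is copied from the hostnames above:

```shell
# Hypothetical helper: print (do not run) the fingerprint-acceptance
# command for each short hostname given as an argument.
print_accept_commands() {
    for host in "$@"; do
        # Domain suffix taken from the hostnames in this task.
        printf 'sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@%s.deployment-prep.eqiad.wmflabs\n' "$host"
    done
}
```

Usage: `print_accept_commands deployment-mediawiki04 deployment-mediawiki05`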

Manually refresh all CA certificate symlinks due to T145609:

update-ca-certificates --verbose --fresh
For MediaWiki web servers:

Once the puppet patch(es) are merged, the beta cluster puppetmaster needs a rebase:

ssh deployment-puppetmaster.deployment-prep.eqiad.wmflabs
sudo su -
cd /var/lib/git/operations/puppet
git pull

Then get the host pooled on the Varnish cache by running puppet on deployment-cache-text04.deployment-prep.eqiad.wmflabs. That will update the Varnish list of directors. Then: service varnish reload.

Test:

curl --silent "https://en.wikipedia.beta.wmflabs.org/wiki/Special:Version?date=$(date +%s)" | grep deployment

Alternatively, look at https://logstash-beta.wmflabs.org/
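
The grep in the curl test matches a long line of page configuration. To print only the backend name, the `wgHostname` value (the key that appears in the Special:Version output recorded later in this task) could be extracted with a small filter. This is a sketch; `extract_wg_hostname` is a hypothetical helper name:

```shell
# Hypothetical helper: read HTML on stdin and print the value of the
# "wgHostname" key for every line that contains it.
extract_wg_hostname() {
    sed -n 's/.*"wgHostname":"\([^"]*\)".*/\1/p'
}
```

Usage: `curl --silent "https://en.wikipedia.beta.wmflabs.org/wiki/Special:Version?date=$(date +%s)" | extract_wg_hostname`. Repeating the request a few times should show each pooled appserver in turn.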

Event Timeline

We had two 8 CPU / 16 G instances created to migrate the databases to Jessie (T138778), which is scheduled for Thursday. Once migrated, I guess they will be deleted, freeing up 16 CPU / 32 G of RAM :]

There is also an ongoing task to purge/consolidate instances on beta (T142288), which will free up even more of the quota.

Then I guess we can ask to lower the quota to keep the project under control.

Change 309999 had a related patch set uploaded (by Elukey):
Add mediawiki04 to the list of labs appservers in deployment-prep

https://gerrit.wikimedia.org/r/309999

Mentioned in SAL [2016-09-12T14:41:35Z] <elukey> applied base::firewall, beta::deployaccess, mediawiki::conftool, role::mediawiki::appserver to deployment-mediawiki04.deployment-prep.eqiad.wmflabs (Debian jessie instance) - T144006

Change 309999 abandoned by Elukey:
Add mediawiki04 to the list of labs appservers in deployment-prep

Reason:
Will split into two CRs

https://gerrit.wikimedia.org/r/309999

Change 310034 had a related patch set uploaded (by Elukey):
Add deployment-mediawiki04 to the deployment-prep scap dsh

https://gerrit.wikimedia.org/r/310034

Change 310035 had a related patch set uploaded (by Elukey):
Add deployment-mediawiki04 to the deployment-prep Varnish config

https://gerrit.wikimedia.org/r/310035

https://wikitech.wikimedia.org/wiki/HHVM/Troubleshooting has some interesting bits

furl http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page

Does fcgi requests on localhost \O/ bypassing apache

Change 310034 merged by Elukey:
Add deployment-mediawiki04 to the deployment-prep scap dsh

https://gerrit.wikimedia.org/r/310034

hashar updated the task description. Sep 13 2016, 9:05 AM

Change 310035 merged by Elukey:
Add deployment-mediawiki04 to the deployment-prep Varnish config

https://gerrit.wikimedia.org/r/310035

hashar updated the task description. Sep 13 2016, 9:16 AM

After some mess with the scap mwdeploy keys, which was solved by running the following on deployment-tin:

sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki04.deployment-prep.eqiad.wmflabs

deployment-mediawiki04 is in the pool:

$ curl --silent https://en.wikipedia.beta.wmflabs.org/wiki/Special:Version|grep deployment
... mw.config.set({"wgBackendResponseTime":674,"wgHostname":"deployment-mediawiki04"}) ...
                                                                        ^^^^^^^^^^^
hashar updated the task description. Sep 13 2016, 9:25 AM

Noticed in logstash:

Warning: failed to mkdir "/srv/mediawiki/php-master/images/thumb/2/20/Order_of_St_John_(UK)_ribbon.png"
mode 0777 [Called from wfMkdirParents in /srv/mediawiki/php-master/includes/Glob

That is unrelated to the Jessie reimaging. Filed T145496 about it.

Change 310256 had a related patch set uploaded (by Elukey):
Replace mediawiki01 with mediawiki04 (Debian Jessie) in deployment-prep

https://gerrit.wikimedia.org/r/310256

Change 310256 merged by Elukey:
Replace mediawiki01 with mediawiki04 (Debian Jessie) in deployment-prep

https://gerrit.wikimedia.org/r/310256

Change 310264 had a related patch set uploaded (by Hashar):
beta: change canary mw server from 01 to 04

https://gerrit.wikimedia.org/r/310264

Change 310264 merged by Elukey:
beta: change canary mw server from 01 to 04

https://gerrit.wikimedia.org/r/310264

Mentioned in SAL (#wikimedia-releng) [2016-09-14T09:27:50Z] <hashar> Deleting deployment-mediawiki01 , replaced by deployment-mediawiki04 T144006

elukey claimed this task. Sep 15 2016, 8:23 AM

Change 310749 had a related patch set uploaded (by Elukey):
Remove mediawiki03 from deployment-prep

https://gerrit.wikimedia.org/r/310749

Change 310749 merged by Elukey:
Remove mediawiki03 from deployment-prep

https://gerrit.wikimedia.org/r/310749

Change 310756 had a related patch set uploaded (by Elukey):
Add mediawiki06 to the deployment-prep scap dsh

https://gerrit.wikimedia.org/r/310756

Change 310756 merged by Elukey:
Add mediawiki06 to the deployment-prep scap dsh

https://gerrit.wikimedia.org/r/310756

Mentioned in SAL (#wikimedia-releng) [2016-09-15T09:33:36Z] <hashar> T144006 sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki06.deployment-prep.eqiad.wmflabs

Change 310773 had a related patch set uploaded (by Elukey):
Set mediawiki06 (replacement of mediawiki03) as security audit target

https://gerrit.wikimedia.org/r/310773

Change 310773 merged by Elukey:
Set mediawiki06 (replacement of mediawiki03) as security audit target

https://gerrit.wikimedia.org/r/310773

Change 310796 had a related patch set uploaded (by Elukey):
Remove mediawiki02 from deployment prep

https://gerrit.wikimedia.org/r/310796

Change 310796 merged by Elukey:
Remove mediawiki02 from deployment prep

https://gerrit.wikimedia.org/r/310796

Change 310818 had a related patch set uploaded (by Elukey):
Add mediawiki05 to deployment-prep

https://gerrit.wikimedia.org/r/310818

Change 310818 merged by Elukey:
Add mediawiki05 to deployment-prep

https://gerrit.wikimedia.org/r/310818

elukey updated the task description. Sep 15 2016, 1:36 PM

Mentioned in SAL (#wikimedia-releng) [2016-09-15T14:44:08Z] <hashar> T144006 sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki05.deployment-prep.eqiad.wmflabs

Mentioned in SAL (#wikimedia-releng) [2016-09-15T14:45:00Z] <hashar> T144006 sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@mira02.deployment-prep.eqiad.wmflabs

Mentioned in SAL (#wikimedia-releng) [2016-09-15T15:05:04Z] <hashar> T144006 Applying class role::labs::lvm::srv to mira02 (it is out of disk space :D )

Mentioned in SAL (#wikimedia-releng) [2016-09-15T15:08:21Z] <hashar> T144006 Disabled Jenkins job beta-scap-eqiad. On mira02 rm -fR /srv/* . Applying puppet for role::labs::lvm::srv

As @AlexMonk-WMF reported, we caused an issue when dealing with restbase configs: https://phabricator.wikimedia.org/T146053

As a follow-up, would it be worth writing a page for deployment-prep best practices? Ops maintains https://wikitech.wikimedia.org/wiki/Service_restarts, so we could either add a section there or create a separate page and link to it from there. It would be great to know:

  1. general tribal knowledge about how things are handled in there (self-hosted puppet master, auto-rebase cron, etc.)
  2. impact of an outage - who are the users? Who should be notified about this maintenance work?

Change 311681 had a related patch set uploaded (by Elukey):
Add jobrunner02 to deployment-prep

https://gerrit.wikimedia.org/r/311681

Change 311681 merged by Elukey:
Add jobrunner02 to deployment-prep

https://gerrit.wikimedia.org/r/311681

As @AlexMonk-WMF reported, we caused an issue when dealing with restbase configs: https://phabricator.wikimedia.org/T146053

As a follow-up, would it be worth writing a page for deployment-prep best practices?

It's not a deployment-prep specific thing, the same could happen in production.

Yes, OK, you are right. I'll also update the prod documentation, but this does not imply that deployment-prep shouldn't have its own documentation :)

Change 311710 had a related patch set uploaded (by Muehlenhoff):
mira02 "reimaged" as deployment-mira02

https://gerrit.wikimedia.org/r/311710

Change 311710 merged by Muehlenhoff:
mira02 "reimaged" as deployment-mira02

https://gerrit.wikimedia.org/r/311710

Change 311717 had a related patch set uploaded (by Elukey):
Remove jobrunner01 from deployment-prep

https://gerrit.wikimedia.org/r/311717

Mentioned in SAL (#wikimedia-releng) [2016-09-20T18:33:31Z] <hashar> on deployment-mira02 ran sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mediawiki04.deployment-prep.eqiad.wmflabs per T144006

Mentioned in SAL (#wikimedia-releng) [2016-09-20T18:38:10Z] <hashar> on tin: sudo -u jenkins-deploy -H SSH_AUTH_SOCK=/run/keyholder/proxy.sock ssh mwdeploy@deployment-mira02.deployment-prep.eqiad.wmflabs - T144006

Change 311717 merged by Elukey:
Remove jobrunner01 from deployment-prep

https://gerrit.wikimedia.org/r/311717

elukey updated the task description. Sep 21 2016, 8:11 AM
hashar updated the task description. Sep 26 2016, 10:36 AM

Change 312654 had a related patch set uploaded (by Hashar):
beta: drop deployment-tin add deployment-tin02

https://gerrit.wikimedia.org/r/312654

The deployment servers have been reimaged to Jessie:

  • deployment-mira
  • deployment-tin02

Last patch to land is https://gerrit.wikimedia.org/r/#/c/312654/

Mentioned in SAL (#wikimedia-releng) [2016-09-28T11:48:28Z] <hashar> Deleting deployment-tin Trusty instance and recreate one with same hostname as Jessie; Meant to replace deployment-tin02 T144006

Mentioned in SAL (#wikimedia-releng) [2016-09-28T19:49:30Z] <hasharAway> Dropping deployment-tin02 , replacing it with deployment-tin which has been rebuild to Jessie T144006

I have dropped deployment-tin02, since it was confusing people, and created a deployment-tin which is now the master: https://gerrit.wikimedia.org/r/#/c/312654/

Change 312654 merged by Elukey:
beta: update deployment-tin IP and make it master

https://gerrit.wikimedia.org/r/312654

hashar added a comment. Oct 4 2016, 5:34 PM

The deployment servers on beta cluster are now fully migrated to Jessie. We ended up keeping the same hostname and have:

  • deployment-tin.eqiad.wmflabs (primary, Jenkins slave)
  • deployment-mira.eqiad.wmflabs (secondary)

What is left is deployment-tmh01 which needs some packaging work for Jessie as I understood it.

elukey removed elukey as the assignee of this task. Dec 14 2016, 11:05 AM
elukey added a project: User-Elukey.
elukey moved this task from Backlog to Ops Backlog on the User-Elukey board. Dec 14 2016, 11:08 AM
elukey moved this task from Ops Backlog to Stalled on the User-Elukey board. Dec 14 2016, 5:42 PM
greg added a comment. Jul 7 2017, 11:39 PM

What is left is deployment-tmh01 which needs some packaging work for Jessie as I understood it.

That was Oct 2016 :)

This is still accurate, at least wrt tmh:

greg@x230  ~ % ssh deployment-tmh01.eqiad.wmflabs
Linux deployment-tmh01 3.13.0-121-generic #170-Ubuntu SMP Wed Jun 14 09:04:33 UTC 2017 x86_64
Ubuntu 14.04.5 LTS
deployment-tmh01 is mediawiki::videoscaler

Looks like there is recent activity on T145742, yay!

All the packaging work for jessie is complete (and the servers in production have been migrated). If deployment-tmh01 is still used it can be reimaged as well.

EddieGP added a subscriber: EddieGP. Edited Apr 4 2018, 10:29 PM

There are 4 Trusty instances left in deployment-prep:

  • deployment-tmh01 is to be deleted per T174477/T191293 (done)
  • deployment-redis0[12] are to be replaced by deployment-redis0[56] (stretch) per T179371
  • deployment-mx is to be replaced by deployment-mx02 (stretch) per T184244.

Also, these are still Trusty (labeled as "UNKNOWN" in the OpenStack browser, but confirmed by logging in and looking at /etc/os-release):

  • deployment-urldownloader
  • deployment-zotero01

Both were created by @akosiaris .
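
For reference, the /etc/os-release check mentioned above could be scripted roughly like this. It is a sketch only; `os_release_field` is a hypothetical helper name, and it assumes the usual `KEY=value` or `KEY="value"` lines of an os-release file:

```shell
# Hypothetical helper: print one field from an os-release style file.
os_release_field() {
    # $1: key to look up, e.g. ID or VERSION_ID
    # $2: file to read, defaults to /etc/os-release
    sed -n "s/^$1=//p" "${2:-/etc/os-release}" | tr -d '"'
}
```

Usage on an instance: `os_release_field ID; os_release_field VERSION_ID`. On a Trusty instance the second command would be expected to print 14.04.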

Joe added a subscriber: Joe. Apr 9 2018, 5:33 AM

I think this task is resolved as it's about the MediaWiki appservers and AFAICS they're all converted to jessie at least.

Joe closed this task as Resolved. Apr 9 2018, 5:34 AM