setup replacements for maintenance_server (terbium, wasat) on Stretch
Open, High, Public

Description

setup a new maintenance server to replace terbium

use stretch

<s>pick another element name</s> use "mwmaint1001" (see T192185#4152332)

Related Objects

Status      Assigned
Open        None
Open        None
Open        None
Resolved    MoritzMuehlenhoff
Resolved    MoritzMuehlenhoff
Resolved    MoritzMuehlenhoff
Resolved    None
Resolved    Quiddity
Resolved    Ladsgroup
Resolved    Joe
Resolved    Legoktm
Resolved    Legoktm
Resolved    hashar
Resolved    hashar
Resolved    ssastry
Resolved    Smalyshev
Resolved    Legoktm
Open        Krinkle
Open        None
Resolved    None
Open        None
Resolved    aaron
Open        Joe
Resolved    Jdforrester-WMF
Resolved    None
Open        None
Resolved    RobH
Resolved    Cmjohnson
Resolved    MoritzMuehlenhoff
Resolved    jcrespo
Resolved    Jdforrester-WMF
There are a very large number of changes, so older changes are hidden.

Change 430674 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mwmaint1001: add mediawiki-maintenance role

https://gerrit.wikimedia.org/r/430674

Change 430674 merged by Dzahn:
[operations/puppet@production] mwmaint1001: add mediawiki-maintenance role

https://gerrit.wikimedia.org/r/430674

Change 430817 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mw-maintenance: add PHP7/stretch support

https://gerrit.wikimedia.org/r/430817

Change 430939 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mwmaint: add mapped IPv6 address on mwmaint1001

https://gerrit.wikimedia.org/r/430939

Change 430939 merged by Dzahn:
[operations/puppet@production] mwmaint: add mapped IPv6 address on mwmaint1001

https://gerrit.wikimedia.org/r/430939

Change 430817 merged by Dzahn:
[operations/puppet@production] mw-maintenance: add PHP7 support, php-readline version

https://gerrit.wikimedia.org/r/430817

Change 430959 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add IPv6 records for mwmaint1001

https://gerrit.wikimedia.org/r/430959

Change 430959 merged by Dzahn:
[operations/dns@master] add IPv6 records for mwmaint1001

https://gerrit.wikimedia.org/r/430959

Change 430529 merged by Dzahn:
[operations/puppet@production] tcpircbot: add mwmaint1001 to ferm rules

https://gerrit.wikimedia.org/r/430529

Change 431039 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] switch mw-maintenance server from terbium to mwmaint1001

https://gerrit.wikimedia.org/r/431039

Change 431041 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] decom terbium: rm from scap,site,dhcp,network constants

https://gerrit.wikimedia.org/r/431041

Change 431042 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mariadb: remove grants for terbium (do not merge)

https://gerrit.wikimedia.org/r/431042

Change 431047 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mw-maintenance: enable crons based on fqdn, not mw_primary

https://gerrit.wikimedia.org/r/431047

Change 431047 merged by Dzahn:
[operations/puppet@production] mw-maintenance: enable crons based on fqdn, not mw_primary

https://gerrit.wikimedia.org/r/431047
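The idea named in that change title, keying cron activation on the host's FQDN, can be illustrated with a small shell sketch. This is not the Puppet code from the change, which is not shown here. The function name `crons_ensure` and the FQDN `terbium.eqiad.wmnet` are assumptions for illustration; `present` and `absent` mirror the usual Puppet `ensure` values.

```shell
# Hypothetical helper: print "present" if this host is the designated
# maintenance host, "absent" otherwise (mirrors Puppet's ensure values).
crons_ensure() {
    this_fqdn=$1
    maintenance_fqdn=$2
    if [ "$this_fqdn" = "$maintenance_fqdn" ]; then
        echo present
    else
        echo absent
    fi
}

# Assumed host names, for illustration only.
crons_ensure mwmaint1001.eqiad.wmnet mwmaint1001.eqiad.wmnet
crons_ensure terbium.eqiad.wmnet mwmaint1001.eqiad.wmnet
```

Comparing full host names means exactly one named host carries the crons, whatever other flags are set on it.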

Dzahn added a comment. May 5 2018, 12:20 AM
  • mwmaint1001 is now up and running with stretch

things that have already happened:

https://gerrit.wikimedia.org/r/#/q/status:merged+project:operations/puppet+branch:production+topic:terbium

things not working that need to be fixed:

  • nutcracker process not running, fails to start
  • proxysql processes not running

things yet to be done (not necessarily in that order)

things to be done for decom:

Change 431054 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mwmaint1001: ensure tendril crons are disabled

https://gerrit.wikimedia.org/r/431054

Change 431054 merged by Dzahn:
[operations/puppet@production] mwmaint1001: ensure tendril crons are disabled

https://gerrit.wikimedia.org/r/431054

Dzahn added a comment. Edited May 5 2018, 12:55 AM

why nutcracker fails:

[2018-05-05 00:50:23.047] nc.c:189 run, rabbit run / dig that hole, forget the sun / and when at last the work is done / don't sit down / it's time to dig another one
[2018-05-05 00:50:23.057] nc_proxy.c:148 bind on p 43 to addr '/var/run/nutcracker/nutcracker.sock 0666' failed: No such file or directory

FIX: mkdir /var/run/nutcracker ; chown nutcracker:nutcracker /var/run/nutcracker
review/merge: https://gerrit.wikimedia.org/r/#/c/431057/
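
The manual fix above can be written as a small idempotent helper. This is a minimal sketch, and the function name `ensure_runtime_dir` is hypothetical. On systemd hosts such as stretch, `/var/run` normally lives on a tmpfs, so a directory created by hand would likely have to be recreated after a reboot.

```shell
# Hypothetical helper: create a runtime directory and optionally set its owner.
# Usage: ensure_runtime_dir DIR [OWNER[:GROUP]]
ensure_runtime_dir() {
    dir=$1
    owner=${2:-}
    # mkdir -p succeeds if the directory already exists, so reruns are safe.
    mkdir -p "$dir" || return 1
    if [ -n "$owner" ]; then
        # Changing ownership to another user usually requires root.
        chown "$owner" "$dir" || return 1
    fi
}

# Example matching the fix above (run as root on the affected host):
# ensure_runtime_dir /var/run/nutcracker nutcracker:nutcracker
```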

why proxysql fails:

Process: 19140 ExecStart=/usr/bin/proxysql -f (code=exited, status=203/EXEC)
bash: /usr/bin/proxysql: No such file or directory

FIX: see T193919
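
systemd reports status=203/EXEC when it cannot execute the command given in ExecStart, which is consistent with the missing binary shown above. A minimal pre-flight check could look like this; `binary_available` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the path is a regular file
# (or a symlink to one) and is executable.
binary_available() {
    [ -f "$1" ] && [ -x "$1" ]
}

if binary_available /usr/bin/proxysql; then
    echo "proxysql binary present"
else
    echo "proxysql binary missing; the unit would fail with 203/EXEC"
fi
```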

Change 431057 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] nutcracker: puppetize missing /var/run/nutcracker dir

https://gerrit.wikimedia.org/r/431057

Change 430521 merged by Dzahn:
[operations/puppet@production] add mwmaint1001 to scap hosts

https://gerrit.wikimedia.org/r/430521

Change 430522 merged by Dzahn:
[operations/puppet@production] network: add mwmaint1001 to network constants

https://gerrit.wikimedia.org/r/430522

Change 426295 abandoned by Dzahn:
add mgmt DNS for nihonium, new eqiad maintenance server

Reason:
mwmaint1001 has been setup to replace terbium instead

https://gerrit.wikimedia.org/r/426295

Change 431057 abandoned by Dzahn:
nutcracker: puppetize missing /var/run/nutcracker dir

Reason:
per IRC chat with Moritz

https://gerrit.wikimedia.org/r/431057

Mentioned in SAL (#wikimedia-operations) [2018-05-08T16:36:53Z] <mutante> mwmaint1001 - reinstalling one more time after proxysql issues are resolved, PXE booting (T192092)

Change 431810 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mw-maintenance/wikidata: set $ensure for rebuildTermSqlIndex.log

https://gerrit.wikimedia.org/r/431810

Change 431810 merged by Dzahn:
[operations/puppet@production] mw-maintenance/wikidata: set $ensure for rebuildTermSqlIndex.log

https://gerrit.wikimedia.org/r/431810

Change 430524 merged by Marostegui:
[operations/puppet@production] mariadb: add mwmaint1001 to grants for production-m5

https://gerrit.wikimedia.org/r/430524

hoo added a subscriber: hoo. May 30 2018, 2:37 PM

Change 440070 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] rm mwmaint1001.yaml - activate mariadb::maintenance

https://gerrit.wikimedia.org/r/440070

Change 440070 merged by Dzahn:
[operations/puppet@production] rm mwmaint1001.yaml - activate mariadb::maintenance

https://gerrit.wikimedia.org/r/440070

Change 440099 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mw-maintenance: rsync home dirs from terbium to mwmaint1001

https://gerrit.wikimedia.org/r/440099

Change 440099 merged by Dzahn:
[operations/puppet@production] mw-maintenance: rsync home dirs from terbium to mwmaint1001

https://gerrit.wikimedia.org/r/440099

Change 440139 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mw-maintenance: require GNU time from time package

https://gerrit.wikimedia.org/r/440139

Change 440139 merged by Dzahn:
[operations/puppet@production] mw-maintenance: require GNU time from time package

https://gerrit.wikimedia.org/r/440139

Change 440142 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] mediawiki: Stop Wikidata dispatching

https://gerrit.wikimedia.org/r/440142

Mentioned in SAL (#wikimedia-operations) [2018-06-13T16:14:21Z] <mutante> rsyncing /home dirs from terbium to mwmaint1001, they will appear later in a subdir "home-terbium" like it was done for tin->deploy1001 (T192092)
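
A sketch of how such a copy might be invoked by hand. The destination parent (`/home`), the FQDN `terbium.eqiad.wmnet`, and the helper name are assumptions; the log entry only states that the subdirectory is called "home-terbium". The helper prints the rsync command instead of running it, so it can be reviewed first.

```shell
# Hypothetical helper: print an rsync invocation that copies /home from an old
# host into a "home-<oldhost>" subdirectory under DEST_PARENT on this host.
plan_home_sync() {
    old_host=$1
    dest_parent=$2
    # Strip the domain part to get the short host name (terbium.eqiad.wmnet -> terbium).
    short_name=${old_host%%.*}
    # -a preserves permissions, ownership and timestamps; the trailing
    # slashes copy directory contents rather than the directory itself.
    printf 'rsync -a %s:/home/ %s/home-%s/\n' "$old_host" "$dest_parent" "$short_name"
}

plan_home_sync terbium.eqiad.wmnet /home
```

Keeping the old home directories in a separate subdirectory avoids overwriting files that users have already created on the new host.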

Change 440267 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] cache::misc: switch backend for dbtree from terbium to mwmaint1001

https://gerrit.wikimedia.org/r/440267

Change 440268 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] tendril: add grants for tendril_web from mwmaint1001

https://gerrit.wikimedia.org/r/440268

Change 440268 merged by Dzahn:
[operations/puppet@production] tendril: add grants for tendril_web from mwmaint1001

https://gerrit.wikimedia.org/r/440268

Change 440267 merged by Dzahn:
[operations/puppet@production] cache::misc: switch backend for dbtree from terbium to mwmaint1001

https://gerrit.wikimedia.org/r/440267

Mentioned in SAL (#wikimedia-operations) [2018-06-14T08:08:54Z] <mutante> switch backend for dbtree.wikimedia.org away from terbium to mwmaint1001 (T192092)

Change 440328 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mw-maintenance: run wikidata maint jobs on old and new server

https://gerrit.wikimedia.org/r/440328

Change 440142 abandoned by Ladsgroup:
mediawiki: Stop Wikidata dispatching

Reason:
In favor of Daniel's patch

https://gerrit.wikimedia.org/r/440142

Change 440328 merged by Dzahn:
[operations/puppet@production] mw-maintenance: switch only wikidata maint jobs to mwmaint1001

https://gerrit.wikimedia.org/r/440328

Mentioned in SAL (#wikimedia-operations) [2018-06-14T14:40:09Z] <mutante> moving wikidata query dispatcher from terbium to mwmaint1001 - scheduled downtime - check turned into a WARN - disabling puppet on mwmaint1001, removing crons on terbium, waiting a couple minutes for them to finish, re-enabling puppet on mwmaint1001 (T192092)
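
The order of operations in that log entry can be sketched as a dry run. The helper name, the default wait time, and the wording of each step are assumptions. `puppet agent --disable` and `puppet agent --enable` are real subcommands, but how the crons were actually removed on terbium is not stated here.

```shell
# Hypothetical dry run: print the handover steps in the order described above.
plan_cron_handover() {
    old_host=$1
    new_host=$2
    wait_seconds=${3:-120}
    echo "on $new_host: puppet agent --disable 'cron handover from $old_host'"
    echo "on $old_host: remove the cron entries"
    echo "wait ${wait_seconds}s for running jobs on $old_host to finish"
    echo "on $new_host: puppet agent --enable"
}

plan_cron_handover terbium mwmaint1001
```

Disabling puppet on the new host first presumably keeps it from installing the crons before the old ones are gone, so the jobs never run on both hosts at once.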

Change 440542 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] noc/dbtree: require libapache-mod-php

https://gerrit.wikimedia.org/r/440542

Change 440542 merged by Dzahn:
[operations/puppet@production] noc/dbtree: require libapache-mod-php

https://gerrit.wikimedia.org/r/440542

Change 430527 merged by Dzahn:
[operations/puppet@production] cache::misc: switch noc.wm backend to mwmaint1001

https://gerrit.wikimedia.org/r/430527

Mentioned in SAL (#wikimedia-operations) [2018-06-15T15:37:35Z] <mutante> switching noc.wikimedia.org site from terbium to mwmaint1001 backend, running puppet on all cache::misc cp servers (T192092)

Dzahn removed Dzahn as the assignee of this task. Thu, Jun 21, 8:13 AM

Unassigning this ticket from me temporarily while I'm on vacation. I will take it back once I return, but I also want to make clear it's up for grabs by anyone while I'm gone; if you want to and can continue on it, that would be appreciated.

Dzahn added a comment. Edited Thu, Jun 21, 8:14 AM

status: wikidata related crons are moved, other mw crons are still to be moved (by switching maintenance server with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/441346)

Change 441346 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] switch mw_maintenance server to mwmaint1001

https://gerrit.wikimedia.org/r/441346

Dzahn added a comment. Thu, Jun 21, 8:17 AM

other pending changes, mostly to decom terbium once switch is complete:

https://gerrit.wikimedia.org/r/#/q/topic:terbium+(status:open)

Change 441381 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mw_maintenace: remove temp change for wikidata crons

https://gerrit.wikimedia.org/r/441381

Change 443792 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] mw-maintenance: switch to mwmaint1001

https://gerrit.wikimedia.org/r/443792

Change 443801 had a related patch set uploaded (by Giuseppe Lavagetto; owner: Giuseppe Lavagetto):
[operations/puppet@production] terbium: Add a decommission notice.

https://gerrit.wikimedia.org/r/443801

Change 443792 merged by Giuseppe Lavagetto:
[operations/puppet@production] mw-maintenance: switch to mwmaint1001.eqiad.wmnet

https://gerrit.wikimedia.org/r/443792

Mentioned in SAL (#wikimedia-operations) [2018-07-04T09:44:31Z] <_joe_> stopping all cronjobs via a puppet run on terbium, T192092

Change 443801 merged by Giuseppe Lavagetto:
[operations/puppet@production] terbium: Add a decommission notice.

https://gerrit.wikimedia.org/r/443801

Krinkle renamed this task from setup replacement for terbium (maintenance_server) on stretch to setup replacements for maintenance_server (terbium, wasat) on Stretch. Fri, Jul 6, 11:53 PM

Change 431039 abandoned by Muehlenhoff:
switch mw-maintenance server from terbium to mwmaint1001

Reason:
Superseded/replaced by a033370fbcd

https://gerrit.wikimedia.org/r/431039

Change 441346 abandoned by Muehlenhoff:
switch mw_maintenance server to mwmaint1001

Reason:
Superseded/replaced by a033370fdcb

https://gerrit.wikimedia.org/r/441346

Change 441381 abandoned by Muehlenhoff:
mw_maintenace: remove temp change for wikidata crons

Reason:
Replaced/superseded by a033370fdcb

https://gerrit.wikimedia.org/r/441381

Change 445118 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Remove terbium from allowed hosts/ferm rules

https://gerrit.wikimedia.org/r/445118

Change 445118 merged by Muehlenhoff:
[operations/puppet@production] Remove terbium from allowed hosts/ferm rules

https://gerrit.wikimedia.org/r/445118

Change 445149 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Reimage wasat with stretch and rename to mwmaint2001

https://gerrit.wikimedia.org/r/445149

Change 445149 merged by Muehlenhoff:
[operations/puppet@production] Reimage wasat with stretch and rename to mwmaint2001

https://gerrit.wikimedia.org/r/445149

Change 430530 abandoned by Muehlenhoff:
tcpircbot: remove terbium from ferm rules

Reason:
Obsoleted by 1e4e64dc67

https://gerrit.wikimedia.org/r/430530

Change 445421 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Update grants for terbium->mwmaint1001 migration and wasat rename

https://gerrit.wikimedia.org/r/445421

I have seen this error (one occurrence in the last 8 hours):

cli_argv    /srv/mediawiki/multiversion/MWScript.php maintenance/cleanupUploadStash.php --wiki=labtestwiki
db_name     labtestwiki
db_server   10.64.16.79
db_user     wikiadmin
error       Access denied for user 'wikiadmin'@'%' to database 'labtestwiki'
host        mwmaint1001
level       ERROR
message     Error connecting to 10.64.16.79: Access denied for user 'wikiadmin'@'%' to database 'labtestwiki'

So I assume this will be fixed once https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/445421/ is applied, right?

Change 445423 had a related patch set uploaded (by Muehlenhoff; owner: Muehlenhoff):
[operations/puppet@production] Decommission terbium

https://gerrit.wikimedia.org/r/445423

Change 445421 merged by Muehlenhoff:
[operations/puppet@production] Update grants for terbium->mwmaint1001 migration and wasat rename

https://gerrit.wikimedia.org/r/445421

There is an undocumented grant from californium.wikimedia.org to striker, @bd808. I will delete it if it is not puppetized. I will create a separate ticket if this is off-topic here.

Let's wait for confirmation by Bryan, but californium is up for decom (replaced by the labweb* hosts), so 99.9% sure this can go away.

I have created T199518.

No more grants on m5 referencing 10.64.32.13 (terbium):

$ ./software/dbtools/section m5 | while read host port; do mysql.py -BN -h$host:$port -e "select user, host from mysql.user WHERE host='10.64.32.13';"; done
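
The loop above reads one "host port" pair per line and queries each instance for accounts bound to the old IP address; empty output means no such grants remain. The same pattern is sketched below with the query built by a hypothetical helper, `plan_grant_check`, which prints each command instead of calling `mysql.py`. The host name `dbhost.example` is a placeholder, not a real m5 instance.

```shell
# Hypothetical helper: for each "host port" line on stdin, print the mysql.py
# command that would list accounts tied to the given client IP address.
plan_grant_check() {
    client_ip=$1
    query="SELECT user, host FROM mysql.user WHERE host='${client_ip}';"
    while read -r host port; do
        # Skip blank lines in the input.
        [ -n "$host" ] || continue
        printf 'mysql.py -BN -h%s:%s -e "%s"\n' "$host" "$port" "$query"
    done
}

# Placeholder instance list; in the command above it comes from the section tool.
printf 'dbhost.example 3306\n' | plan_grant_check 10.64.32.13
```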

Change 445597 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] dbtree: move dbtree outside of mwmaint hosts

https://gerrit.wikimedia.org/r/445597