
rack/setup/install mwmaint1002.eqiad.wmnet
Closed, Resolved; Public

Description

This task will track the racking, setup, and installation of mwmaint1002.eqiad.wmnet.

Please note that mwmaint1001.eqiad.wmnet was a temporary reallocation of an image scaler. Once this new host is fully online and in service, a new task needs to be created (or existing linked) to track the return/reimage of mwmaint1001 back to image scaler duties.

Racking Proposal: Any 1G rack will do; this host will have an internal VLAN/IP.

mwmaint1002:

  • - receive in system on procurement task T195418
  • - rack system with proposed racking plan (see above) & update racktables (include all system info plus location)
  • - bios/drac/serial setup/testing
  • - mgmt dns entries added for both asset tag and hostname
  • - network port setup (description, enable, vlan)
    • end on-site specific steps
  • - production dns entries added
  • - operations/puppet update (install_server at minimum, other files if possible)
  • - OS installation
  • - puppet accept/initial run
  • - handoff for service implementation

Details

Repo               Branch      Lines +/-
operations/puppet  production  +3 -9
operations/puppet  production  +0 -2
operations/puppet  production  +2 -2
operations/puppet  production  +0 -4
operations/puppet  production  +2 -2
operations/puppet  production  +4 -3
operations/puppet  production  +19 -3
operations/puppet  production  +1 -1
operations/puppet  production  +1 -1
operations/puppet  production  +1 -0
operations/puppet  production  +3 -2
operations/puppet  production  +1 -1
operations/puppet  production  +9 -3
operations/puppet  production  +1 -1
operations/puppet  production  +2 -0
operations/puppet  production  +3 -0
operations/dns     master      +2 -0
operations/puppet  production  +6 -0
operations/puppet  production  +5 -0
operations/dns     master      +4 -3
operations/dns     master      +32 -0
operations/dns     master      +185 -153

Event Timeline


Change 451680 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns for several new servers

https://gerrit.wikimedia.org/r/451680

Change 451680 abandoned by Cmjohnson:
Adding mgmt dns for several new servers

https://gerrit.wikimedia.org/r/451680

Change 452718 had a related patch set uploaded (by Cmjohnson; owner: Cmjohnson):
[operations/dns@master] Adding mgmt dns for newly racked servers

https://gerrit.wikimedia.org/r/452718

Change 452718 merged by Cmjohnson:
[operations/dns@master] Adding mgmt dns for newly racked servers

https://gerrit.wikimedia.org/r/452718

Cmjohnson updated the task description.
Cmjohnson moved this task from Racking Tasks to Blocked on the ops-eqiad board.

assigning to @RobH for installs

I'll take this. We should do it while eqiad is still not active; that will be a lot easier than doing it later, after the switch back.

Change 461202 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] mwmaint1002: add prod DNS entries (v4)

https://gerrit.wikimedia.org/r/461202

Change 461204 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: add mwmaint1002 MAC address

https://gerrit.wikimedia.org/r/461204

Change 461202 merged by Dzahn:
[operations/dns@master] mwmaint1002: add prod DNS entries (v4)

https://gerrit.wikimedia.org/r/461202

Change 461204 merged by Dzahn:
[operations/puppet@production] DHCP: add mwmaint1002 MAC address

https://gerrit.wikimedia.org/r/461204

Script wmf-auto-reimage was launched by dzahn on neodymium.eqiad.wmnet for hosts:

['mwmaint1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/201809190022_dzahn_9192.log.

Completed auto-reimage of hosts:

['mwmaint1002.eqiad.wmnet']

and were ALL successful.

Change 461261 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] add mwmaint1002 to site.pp with spare role

https://gerrit.wikimedia.org/r/461261

Change 461261 merged by Dzahn:
[operations/puppet@production] add mwmaint1002 to site.pp with spare role and IPv6

https://gerrit.wikimedia.org/r/461261

Change 461427 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add IPv6 records for mwmaint1002

https://gerrit.wikimedia.org/r/461427

Change 461427 merged by Dzahn:
[operations/dns@master] add IPv6 records for mwmaint1002

https://gerrit.wikimedia.org/r/461427

mwmaint1002.eqiad.wmnet has address 10.64.16.77
mwmaint1002.eqiad.wmnet has IPv6 address 2620:0:861:102:10:64:16:77

[mwmaint1001:~] $ host 2620:0:861:102:10:64:16:77
7.7.0.0.6.1.0.0.4.6.0.0.0.1.0.0.2.0.1.0.1.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa domain name pointer mwmaint1002.eqiad.wmnet.

[mwmaint1001:~] $ host 10.64.16.77
77.16.64.10.in-addr.arpa domain name pointer mwmaint1002.eqiad.wmnet.

[mwmaint1001:~] $ ping6 mwmaint1002.eqiad.wmnet
PING mwmaint1002.eqiad.wmnet(mwmaint1002.eqiad.wmnet (2620:0:861:102:10:64:16:77)) 56 data bytes
64 bytes from mwmaint1002.eqiad.wmnet (2620:0:861:102:10:64:16:77): icmp_seq=1 ttl=64 time=0.454 ms

[mwmaint1001:~] $ ping mwmaint1002.eqiad.wmnet
PING mwmaint1002.eqiad.wmnet(mwmaint1002.eqiad.wmnet (2620:0:861:102:10:64:16:77)) 56 data bytes
64 bytes from mwmaint1002.eqiad.wmnet (2620:0:861:102:10:64:16:77): icmp_seq=1 ttl=64 time=0.255 ms
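The PTR names in the `host` output above can be cross-checked offline. The following is a minimal sketch using only Python's standard `ipaddress` module; the helper names `reverse_names` and `embeds_ipv4` are hypothetical, and the second check only reflects what is visible in the two addresses above (the last four groups of the IPv6 address spell out the IPv4 octets), not a documented policy.

```python
import ipaddress


def reverse_names(ipv4: str, ipv6: str) -> dict:
    """Return the reverse-DNS (PTR) owner names for both addresses."""
    return {
        "v4_ptr": ipaddress.ip_address(ipv4).reverse_pointer,
        "v6_ptr": ipaddress.ip_address(ipv6).reverse_pointer,
    }


def embeds_ipv4(ipv4: str, ipv6: str) -> bool:
    """Check whether the last four IPv6 groups read like the IPv4 octets.

    Assumption: this mirrors the pattern seen in this task's addresses only.
    """
    groups = ipaddress.ip_address(ipv6).exploded.split(":")[-4:]
    # Strip zero padding so "0010" compares equal to "10"; keep a lone "0".
    return [g.lstrip("0") or "0" for g in groups] == ipv4.split(".")


if __name__ == "__main__":
    names = reverse_names("10.64.16.77", "2620:0:861:102:10:64:16:77")
    print(names["v4_ptr"])
    print(names["v6_ptr"])
    print(embeds_ipv4("10.64.16.77", "2620:0:861:102:10:64:16:77"))
```

Note that `reverse_pointer` returns the name without the trailing dot that `host` prints.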

Change 461486 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] tcpircbot: add mwmaint1002 to allowed hosts

https://gerrit.wikimedia.org/r/461486

Change 461486 merged by Dzahn:
[operations/puppet@production] tcpircbot: add mwmaint1002 to allowed hosts

https://gerrit.wikimedia.org/r/461486

Change 461487 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/admin: replace mwmaint1001 with 1002 in comments

https://gerrit.wikimedia.org/r/461487

Change 461488 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] network::constants:: add mwmaint1002 to maintenance servers

https://gerrit.wikimedia.org/r/461488

Change 461488 merged by Dzahn:
[operations/puppet@production] network::constants:: add mwmaint1002 to maintenance servers

https://gerrit.wikimedia.org/r/461488

Change 461489 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] scap/dsh: add mwmaint1002 to mw scap hosts

https://gerrit.wikimedia.org/r/461489

Change 461490 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] trafficserver: replace mwmaint1001 with 1002 as noc.wm.org backend

https://gerrit.wikimedia.org/r/461490

Change 461491 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add mediawiki_maintenance role to mwmaint1002

https://gerrit.wikimedia.org/r/461491

Change 461492 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: turn mwmaint1001 into a spare::system

https://gerrit.wikimedia.org/r/461492

Change 461493 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mariadb: add mwmaint1002 to grants for production-m5

https://gerrit.wikimedia.org/r/461493

Change 461494 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mediawiki_maintenance: allow rsyncing home dirs from 1001 to 1002

https://gerrit.wikimedia.org/r/461494

Change 461494 merged by Dzahn:
[operations/puppet@production] mediawiki_maintenance: allow rsyncing home dirs from 1001 to 1002

https://gerrit.wikimedia.org/r/461494

Change 461493 merged by Jcrespo:
[operations/puppet@production] mariadb: add mwmaint1002 to grants for production-m5

https://gerrit.wikimedia.org/r/461493

Change 461491 merged by Dzahn:
[operations/puppet@production] site: add mediawiki_maintenance role to mwmaint1002

https://gerrit.wikimedia.org/r/461491

Role applied; now blocked on a puppet error due to a missing mcrouter cert in the private repo. How to generate these certs isn't documented yet; working on changing that.

Mentioned in SAL (#wikimedia-operations) [2018-09-21T16:29:36Z] <mutante> puppetmaster: running mcrouter_generate_certs to add an mcrouter cert for mwmaint1002 (T201343) https://wikitech.wikimedia.org/wiki/Mcrouter#Generate_certs_for_a_new_host

Mentioned in SAL (#wikimedia-operations) [2018-09-21T22:13:38Z] <mutante> mwmaint1002 - re-enabling puppet, now that it has mcrouter certs, that works (T201343)

Dzahn changed the task status from Stalled to Open. (Sep 21 2018, 10:14 PM)

Change 461487 merged by Dzahn:
[operations/puppet@production] site/admin: replace mwmaint1001 with 1002 in comments

https://gerrit.wikimedia.org/r/461487

Change 461489 merged by Dzahn:
[operations/puppet@production] scap/dsh: add mwmaint1002 to mw scap hosts

https://gerrit.wikimedia.org/r/461489

Change 462036 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] cache::text: replace (commented) mwmaint1001 with mwmaint1002

https://gerrit.wikimedia.org/r/462036

Mentioned in SAL (#wikimedia-operations) [2018-09-21T23:47:15Z] <mutante> mwmaint1002 - rsyncing home dirs over from mwmaint1001 which synced them from terbium. still large old files from terbium here, like a NASA video :) (T201343)

Dzahn raised the priority of this task from Medium to High. (Sep 21 2018, 11:52 PM)

Disk space on the root partition of mwmaint1002 is depleted, which results in failing puppet runs.

The new root partition is smaller than the one on mwmaint1001 and on terbium, so copying the home dirs over as-is no longer works: there is not enough space. Fixed by deleting a large 60G videos directory.
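A pre-flight check of this kind can catch the problem before an rsync fills the root partition. The sketch below is only an illustration, not something that was run on these hosts: `dir_size`, `fits_on_target`, and the `reserve_bytes` parameter are hypothetical names, and it uses only the standard `os` and `shutil` modules.

```python
import os
import shutil


def dir_size(path: str) -> int:
    """Sum the sizes of regular files below path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def fits_on_target(source_bytes: int, target_path: str, reserve_bytes: int = 0) -> bool:
    """True if source_bytes plus a safety reserve fit into the free space
    of the filesystem that holds target_path."""
    free = shutil.disk_usage(target_path).free
    return source_bytes + reserve_bytes <= free


if __name__ == "__main__":
    # Example: would the current directory fit on the filesystem holding "/"?
    needed = dir_size(".")
    print(needed, fits_on_target(needed, "/"))
```

Comparing the source size against the target's free space is a coarse check; it ignores filesystem overhead, which is why a reserve parameter is included.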

Mentioned in SAL (#wikimedia-operations) [2018-09-25T00:38:05Z] <mutante> mwmaint1002 - created /var/run/nutcracker dir and fixed permissions on it, then started nutcracker with systemctl. this fixed icinga alerts T201343
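The manual fix in the log entry above amounts to making sure a runtime directory exists with usable permissions before the service starts. A minimal sketch of that step follows, assuming a mode of 0o755; the actual mode and owner used on mwmaint1002 are not stated in this task, and `ensure_runtime_dir` is a hypothetical helper.

```python
import os


def ensure_runtime_dir(path: str, mode: int = 0o755) -> str:
    """Create path if it is missing and set its permission bits.

    Assumption: 0o755 is a placeholder; the real mode and ownership for
    /var/run/nutcracker are not given in the task.
    """
    os.makedirs(path, exist_ok=True)
    # chmod explicitly, because makedirs applies the process umask.
    os.chmod(path, mode)
    return path


if __name__ == "__main__":
    import tempfile

    # Demonstrate in a scratch location rather than under /var/run.
    with tempfile.TemporaryDirectory() as scratch:
        print(ensure_runtime_dir(os.path.join(scratch, "nutcracker")))
```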

Change 461490 merged by Dzahn:
[operations/puppet@production] trafficserver: replace mwmaint1001 with 1002 as noc.wm.org backend

https://gerrit.wikimedia.org/r/461490

The nutcracker issue is T204450 and could also have been solved by rebooting, or by applying the role directly without using the spare role first.

Change 462036 merged by Dzahn:
[operations/puppet@production] cache::text: replace (commented) mwmaint1001 with mwmaint1002

https://gerrit.wikimedia.org/r/462036

Change 463563 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad

https://gerrit.wikimedia.org/r/463563

This should be all done now. The last open change is the one above, a temp hack that keeps both mwmaint servers in eqiad from becoming active at the same time when we switch back to eqiad.

Or we could remove mwmaint1001 completely beforehand, but I thought this way is nicer in case there is an unexpected issue with mwmaint1002: mwmaint1001 would then still be available to use and not already decom'ed.
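The idea behind the temp hack can be sketched as a small decision function. This is not the content of change 463563, which is not shown in this task; it is a hypothetical Python illustration of the rule "cron jobs are present only on the designated maintenance host in the active datacenter", and every name in it (`crons_ensure`, `designated_host`, the `"present"`/`"absent"` values) is an assumption.

```python
def crons_ensure(hostname: str, host_dc: str, active_dc: str, designated_host: str) -> str:
    """Return "present" only for the designated host in the active
    datacenter, and "absent" for every other host."""
    if host_dc == active_dc and hostname == designated_host:
        return "present"
    return "absent"


if __name__ == "__main__":
    # After a switch back to eqiad, only one of the two eqiad hosts is active.
    for host in ("mwmaint1001", "mwmaint1002"):
        print(host, crons_ensure(host, "eqiad", "eqiad", "mwmaint1002"))
```

Keeping the second host installed but with its jobs absent matches the reasoning above: mwmaint1001 stays available as a fallback without ever running the same crons in parallel.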

Change 463563 merged by Dzahn:
[operations/puppet@production] mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad

https://gerrit.wikimedia.org/r/463563

This server will become active when we switch MediaWiki back to eqiad on October 10th.

Change 465646 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mw_maintenance: use $ensure, not $motd_ensure for warning motd

https://gerrit.wikimedia.org/r/465646

Change 465646 merged by Dzahn:
[operations/puppet@production] mw_maintenance: let $motd_ensure be based on $ensure for warning motd

https://gerrit.wikimedia.org/r/465646

Change 461492 merged by Dzahn:
[operations/puppet@production] site: turn mwmaint1001 into a spare::system

https://gerrit.wikimedia.org/r/461492

Change 465681 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] scap/tcpircbot: remove mwmaint1001 from scap and allowed hosts

https://gerrit.wikimedia.org/r/465681

Change 465681 merged by Dzahn:
[operations/puppet@production] scap/tcpircbot: remove mwmaint1001 from scap and allowed hosts

https://gerrit.wikimedia.org/r/465681

Change 465685 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mariadb: remove mwmaint1001 from prod-m5 SQL grants

https://gerrit.wikimedia.org/r/465685

Change 465686 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] network::constants: remove mwmaint1001

https://gerrit.wikimedia.org/r/465686

Yes.

mariadb: remove mwmaint1001 from prod-m5 SQL grants - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465685/
network::constants: remove mwmaint1001 - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465686/
Revert "rename wmf6936 from mw1297 to mwmaint1001" - https://gerrit.wikimedia.org/r/#/c/operations/dns/+/465689/
Revert "mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad" https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465645/

But I can pretty much self-merge those, except the mariadb one, which needs deployment. Some need rebasing.

And then there is renaming mwmaint1001 in DHCP and DNS and reinstalling it.

(Yeah, I could have done this in a separate task, but I already started doing both here: 1002 setup and removing 1001.)

Yes.

mariadb: remove mwmaint1001 from prod-m5 SQL grants - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465685/
But i can pretty much self-merge those except the mariadb one that needs deployment. Some need rebasing.

Regarding that one: we were added to the patch yesterday night (EU time), and today we've had two pretty massive incidents, so we have not been able to deal with it. We probably won't be able to until late next week, once things have stabilized.

There is no urgency at all, and it wasn't expected. I only listed what is left to be done. Please don't worry about this at all, especially in the current situation.

Dzahn lowered the priority of this task from High to Medium. (Oct 11 2018, 6:34 PM)

Lowering priority because mwmaint1002 is in production and the remaining steps are all just cleanup.

Ok, let's keep the ticket within its original focus: setting up mwmaint1002. That is done.

Normally there would be a separate "decom mwmaint1001" ticket, but in this case there is only reclaiming it to be an appserver again, and that is already T192457.

So never mind. Resolving here.

Change 466731 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] mediawiki_maintenance: switch home rsync to 1002->2001

https://gerrit.wikimedia.org/r/466731

Change 466731 merged by Dzahn:
[operations/puppet@production] mediawiki_maintenance: switch home rsync to 1002->2001

https://gerrit.wikimedia.org/r/466731

Change 465686 merged by Dzahn:
[operations/puppet@production] network::constants: remove mwmaint1001

https://gerrit.wikimedia.org/r/465686

Change 465685 merged by Marostegui:
[operations/puppet@production] mariadb: remove mwmaint1001 from prod-m5 SQL grants

https://gerrit.wikimedia.org/r/465685

Yes.

mariadb: remove mwmaint1001 from prod-m5 SQL grants - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465685/

Change merged and grants removed on MySQL.
Thanks for the patience!