
serve tftpboot environment from the install servers and create one in each edge POP
Closed, Resolved (Public)

Description

Per T242602, T252382#6125029, and IRC discussion, the install servers (install*) should run a webserver, like the apt* servers, and serve the tftpboot environment.

Then we should create an install server in each of the edge POPs.


install3001.wikimedia.org has address 91.198.174.63
install3001.wikimedia.org has IPv6 address 2620:0:862:1:91:198:174:63

install4001.wikimedia.org has address 198.35.26.12
install4001.wikimedia.org has IPv6 address 2620:0:863:1:198:35:26:12

install5001.wikimedia.org has address 103.102.166.13
install5001.wikimedia.org has IPv6 address 2001:df2:e500:1:103:102:166:13
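
The three address pairs above appear to follow one convention: the last four groups of the IPv6 address spell out the IPv4 octets as decimal digits. A minimal Python sketch of that mapping follows. The function name `embed_ipv4` and the /64 prefix length are assumptions made for illustration; the task itself only lists the resulting addresses.

```python
import ipaddress


def embed_ipv4(prefix: str, ipv4: str) -> ipaddress.IPv6Address:
    """Build an IPv6 address whose last four groups spell the IPv4 octets.

    The /64 prefix passed in by callers is an assumption; the task only
    shows the final addresses, not the subnet size.
    """
    # Validate the IPv4 address, then split it into its decimal octets.
    octets = str(ipaddress.IPv4Address(ipv4)).split(".")
    # Take the first four 16-bit groups of the IPv6 network address.
    network = ipaddress.IPv6Network(prefix)
    head = network.network_address.exploded.split(":")[:4]
    # Each octet is written as-is into one group, so "174" becomes 0x0174.
    return ipaddress.IPv6Address(":".join(head + octets))


for prefix, v4 in [
    ("2620:0:862:1::/64", "91.198.174.63"),      # install3001
    ("2620:0:863:1::/64", "198.35.26.12"),       # install4001
    ("2001:df2:e500:1::/64", "103.102.166.13"),  # install5001
]:
    print(v4, "->", embed_ipv4(prefix, v4))
```

Because every octet consists only of the digits 0 to 9, it is always a valid hexadecimal group, which is what makes this readable convention possible.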

all changes: https://gerrit.wikimedia.org/r/q/topic:%22installserver-pops%22+(status:open%20OR%20status:merged)

☑️ eqsin tested - reinstalled testvm5001 and confirmed in syslog that install5001 is serving both DHCP and TFTP now; not bast5001 (tftp) or install2003 (dhcp) anymore; and OS install is working

☑️ ulsfo tested - reinstalled testvm4001 and confirmed in syslog that install4001 is serving both DHCP and TFTP now; not bast4002 or install2003 (dhcp) anymore; and OS install is working

☑️ esams tested - reinstalled testvm3001 and confirmed in syslog that install3001 is serving both DHCP and TFTP now; not bast3004 or install1003 (dhcp) anymore; and OS install is working

Details

Project                  Branch      Lines +/-
operations/homer/public  master      +3 -3
operations/puppet        production  +0 -2
operations/puppet        production  +0 -9
operations/puppet        production  +1 -4
operations/puppet        production  +3 -3
operations/puppet        production  +5 -4
operations/puppet        production  +4 -3
operations/puppet        production  +2 -1
operations/puppet        production  +1 -0
operations/puppet        production  +2 -3
operations/puppet        production  +1 -1
operations/puppet        production  +1 -0
operations/puppet        production  +5 -0
operations/dns           master      +4 -0
operations/puppet        production  +1 -1
operations/puppet        production  +15 -0
operations/puppet        production  +2 -2
operations/puppet        production  +1 -0
operations/puppet        production  +10 -0
operations/dns           master      +12 -1
operations/puppet        production  +11 -0
operations/puppet        production  +3 -2
operations/puppet        production  +48 -0

Event Timeline

Dzahn triaged this task as Medium priority.

Change 597781 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: add an nginx also on servers with "light" role

https://gerrit.wikimedia.org/r/597781

Change 597781 merged by Dzahn:
[operations/puppet@production] install_server: add an nginx also on servers with "light" role

https://gerrit.wikimedia.org/r/597781

The WMCS team reported that they could not run installs from the cloudvirt VLAN (cloudnet1004).

While debugging, I saw that DHCP worked, but afterwards the firewall was dropping connections from there to port 80 on install1003.

So I took the nginx class used on apt.wm.org, removed the whole HTTPS part, adjusted the server name etc., and applied it here, knowing we wanted that anyway for install_servers in POPs.

That fixed the issue and installs worked.

What I don't know for sure yet is whether this affected everything (unlikely, since it has been like this for weeks and there must have been other installs) or just cloud, and if so, why just cloud.

But it fixed the problem, and we wanted it anyway.

Change 598012 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver/light: limit access to webserver to DOMAIN_NETWORKS

https://gerrit.wikimedia.org/r/598012

Change 598012 merged by Dzahn:
[operations/puppet@production] installserver/light: limit access to webserver to DOMAIN_NETWORKS

https://gerrit.wikimedia.org/r/598012
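
The idea behind limiting webserver access to DOMAIN_NETWORKS can be illustrated with a short Python sketch: a client is served only if its address falls inside one of the allowed networks. This is not the Puppet or firewall code from the change above. `ALLOWED_NETWORKS` and `is_allowed` are hypothetical names, and the networks are documentation ranges (RFC 5737 and RFC 3849), not the real contents of DOMAIN_NETWORKS.

```python
import ipaddress

# Hypothetical stand-ins for DOMAIN_NETWORKS; these are documentation
# ranges, not the real production networks.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_allowed(client: str, networks=ALLOWED_NETWORKS) -> bool:
    """Return True if the client address lies inside any allowed network."""
    addr = ipaddress.ip_address(client)
    # Compare only against networks of the same IP version.
    return any(
        addr.version == net.version and addr in net for net in networks
    )


print(is_allowed("192.0.2.10"))    # inside the allowed IPv4 range
print(is_allowed("198.51.100.7"))  # outside every allowed range
```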

Change 599817 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] bastionhost: include TFTP and DHCP profiles in bastion hosts in POPs

https://gerrit.wikimedia.org/r/599817

Change 599883 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add IPs for installservers in POPs

https://gerrit.wikimedia.org/r/599883

Change 599817 abandoned by Dzahn:
bastionhost: include TFTP and DHCP profiles in bastion hosts in POPs

Reason:
creating dedicated ganeti VMs right away instead

https://gerrit.wikimedia.org/r/599817

Change 601342 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add new POP install servers with insetup role

https://gerrit.wikimedia.org/r/601342

Change 599883 merged by Dzahn:
[operations/dns@master] add IPs for installservers in POPs

https://gerrit.wikimedia.org/r/599883

Change 606718 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/DHCP: add install4001.wikimedia.org

https://gerrit.wikimedia.org/r/606718

Change 606718 merged by Dzahn:
[operations/puppet@production] site/DHCP: add install4001.wikimedia.org

https://gerrit.wikimedia.org/r/606718

Change 606720 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: configure install2003 as next-server for install4001

https://gerrit.wikimedia.org/r/606720

Change 606720 merged by Dzahn:
[operations/puppet@production] DHCP: configure install2003 as next-server for install4001

https://gerrit.wikimedia.org/r/606720

Change 601342 merged by Dzahn:
[operations/puppet@production] site: add new POP install servers with insetup role

https://gerrit.wikimedia.org/r/601342

Dzahn changed the task status from Stalled to Open. (Sep 23 2020, 11:46 PM)
Dzahn raised the priority of this task from Medium to High.

Change 629493 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] add dc-ops admin group to new install servers, ensure services stopped

https://gerrit.wikimedia.org/r/629493

Change 629493 merged by Dzahn:
[operations/puppet@production] add dc-ops admin group to new install servers, ensure services stopped

https://gerrit.wikimedia.org/r/629493

Change 629495 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add installserver::light role to new install servers

https://gerrit.wikimedia.org/r/629495

Change 629495 merged by Dzahn:
[operations/puppet@production] site: add installserver::light role to new install servers

https://gerrit.wikimedia.org/r/629495

Change 629496 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] bastionhost::pop: remove tftp from bastions

https://gerrit.wikimedia.org/r/629496

Change 630313 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add testvm5001 to test install5001

https://gerrit.wikimedia.org/r/630313

Change 630313 merged by Dzahn:
[operations/dns@master] add testvm5001 to test install5001

https://gerrit.wikimedia.org/r/630313

Change 629849 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] start DHCP service on install5001, stop it on bast5001

https://gerrit.wikimedia.org/r/629849

Change 630320 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: add testvm5001 MAC address

https://gerrit.wikimedia.org/r/630320

Change 630320 merged by Dzahn:
[operations/puppet@production] DHCP: add testvm5001 MAC address

https://gerrit.wikimedia.org/r/630320

Change 630695 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: let testvm5001 use install5001 as TFTP server

https://gerrit.wikimedia.org/r/630695

Change 630695 merged by Dzahn:
[operations/puppet@production] install_server: let testvm5001 use install5001 as TFTP server

https://gerrit.wikimedia.org/r/630695

Change 630696 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: switch TFTP server for eqsin from bast5001 to install5001

https://gerrit.wikimedia.org/r/630696

Change 630699 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] enable tftp service on install5001

https://gerrit.wikimedia.org/r/630699

Change 630699 merged by Dzahn:
[operations/puppet@production] enable tftp service on install5001

https://gerrit.wikimedia.org/r/630699

Change 630696 merged by Dzahn:
[operations/puppet@production] DHCP: switch TFTP server for eqsin from bast5001 to install5001

https://gerrit.wikimedia.org/r/630696

Change 630702 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] stop tftp service on bast5001

https://gerrit.wikimedia.org/r/630702

Change 630702 merged by Dzahn:
[operations/puppet@production] stop tftp service on bast5001

https://gerrit.wikimedia.org/r/630702

Mentioned in SAL (#wikimedia-operations) [2020-09-29T00:09:56Z] <mutante> TFTP/install server for eqsin switched from bast5001 to install5001 - T252526

Change 629849 merged by Dzahn:
[operations/puppet@production] start DHCP service on install5001, stop it on bast5001

https://gerrit.wikimedia.org/r/629849

Mentioned in SAL (#wikimedia-operations) [2020-09-29T20:51:46Z] <mutante> DHCP server for EQSIN switched from bast5001 to install5001 (T252526)

Change 630964 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: switch TFTP server for ulsfo from bast4002 to install4001

https://gerrit.wikimedia.org/r/630964

Change 630966 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP switch TFTP server for esams from bast3004 to install3001

https://gerrit.wikimedia.org/r/630966

Change 630971 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: set TFTP servers in other DC to bootstrap install servers

https://gerrit.wikimedia.org/r/630971

Change 630964 merged by Dzahn:
[operations/puppet@production] DHCP: switch TFTP server for ulsfo from bast4002 to install4001

https://gerrit.wikimedia.org/r/630964

Change 630966 merged by Dzahn:
[operations/puppet@production] DHCP switch TFTP server for esams from bast3004 to install3001

https://gerrit.wikimedia.org/r/630966

Change 630971 merged by Dzahn:
[operations/puppet@production] DHCP: set TFTP servers in other DC to bootstrap install servers

https://gerrit.wikimedia.org/r/630971

Change 631249 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] bastionhost::pop: stop TFTP service on role level

https://gerrit.wikimedia.org/r/631249

Change 631249 merged by Dzahn:
[operations/puppet@production] bastionhost::pop: stop TFTP service on role level

https://gerrit.wikimedia.org/r/631249

Change 631261 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/homer/public@master] switch DHCP servers in POPs to new local install hosts

https://gerrit.wikimedia.org/r/631261

Change 631262 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: remove hiera host overrides, start DHCP and squid

https://gerrit.wikimedia.org/r/631262

Change 631262 merged by Dzahn:
[operations/puppet@production] installserver: remove hiera host overrides, start DHCP and squid

https://gerrit.wikimedia.org/r/631262

Change 629496 merged by Dzahn:
[operations/puppet@production] bastionhost::pop: remove tftp from bastions

https://gerrit.wikimedia.org/r/629496

Change 631261 merged by Dzahn:
[operations/homer/public@master] switch DHCP servers in POPs to new local install hosts

https://gerrit.wikimedia.org/r/631261

Mentioned in SAL (#wikimedia-operations) [2020-10-06T23:52:25Z] <mutante> 🖧 switched DHCP server for esams from install1003 to install3001 - homer deployed to cr*esams* (T252526) 🖧

Mentioned in SAL (#wikimedia-operations) [2020-10-06T23:53:34Z] <mutante> 🖧 switched DHCP server for ulsfo from install2003 to install4001 - homer deployed to cr*ulsfo* (T252526) 🖧

Mentioned in SAL (#wikimedia-operations) [2020-10-06T23:55:25Z] <mutante> 🖧 switched DHCP server for eqsin from install2003 to install5001 - homer deployed to cr*eqsin* (T252526) 🖧

Mentioned in SAL (#wikimedia-operations) [2020-10-07T00:01:48Z] <mutante> reinstalling testvm[345]001 to confirm OS installs work as normal after switching DHCP servers in POPs (T252526)
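
The end state recorded in the SAL entries above can be summarized as a lookup from a POP host to its local install server. The Python sketch below is only an illustration: the helper `local_install_server` and the table `POP_BY_DIGIT` are hypothetical, and the digit-to-site mapping is inferred from the test VM names in this task (testvm3001 in esams, testvm4001 in ulsfo, testvm5001 in eqsin).

```python
import re

# Site digit -> POP, inferred from the test VM host names in this task.
POP_BY_DIGIT = {"3": "esams", "4": "ulsfo", "5": "eqsin"}


def local_install_server(fqdn: str) -> str:
    """Return the install server expected to serve DHCP/TFTP for a POP host."""
    match = re.match(r"^[a-z-]+(\d)\d{3}\.([a-z]+)\.wmnet$", fqdn)
    if match is None:
        raise ValueError(f"unrecognized host name: {fqdn}")
    digit, site = match.groups()
    # Reject names whose numeric prefix does not match the site label.
    if POP_BY_DIGIT.get(digit) != site:
        raise ValueError(f"site digit {digit} does not match site {site}")
    return f"install{digit}001.wikimedia.org"


for host in (
    "testvm3001.esams.wmnet",
    "testvm4001.ulsfo.wmnet",
    "testvm5001.eqsin.wmnet",
):
    print(host, "->", local_install_server(host))
```

A check like this makes it easy to spot a mismatch between a site and the install server named in a test note.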

☑️ eqsin tested - reinstalled testvm5001 and confirmed in syslog that install5001 is serving both DHCP and TFTP now; not bast5001 (tftp) or install2003 (dhcp) anymore; and OS install is working

☑️ ulsfo tested - reinstalled testvm4001 and confirmed in syslog that install4001 is serving both DHCP and TFTP now; not bast4002 or install2003 (dhcp) anymore; and OS install is working

☑️ esams tested - reinstalled testvm3001 and confirmed in syslog that install3001 is serving both DHCP and TFTP now; not bast3004 or install1003 (dhcp) anymore; and OS install is working

Dzahn updated the task description.
Dzahn removed a project: Patch-For-Review.
Dzahn updated the task description.

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: testvm4001.ulsfo.wmnet

  • testvm4001.ulsfo.wmnet (WARN)
    • Failed downtime host on Icinga (likely already removed)
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.ulsfo.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.ulsfo.wmnet to Netbox

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: testvm3001.esams.wmnet

  • testvm3001.esams.wmnet (WARN)
    • Failed downtime host on Icinga (likely already removed)
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.esams.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.esams.wmnet to Netbox
  • COMMON_STEPS (FAIL)
    • Failed to run the sre.dns.netbox cookbook: Cumin execution failed (exit_code=2)

ERROR: some step on some host failed, check the bolded items above

cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: testvm5001.eqsin.wmnet

  • testvm5001.eqsin.wmnet (WARN)
    • Failed downtime host on Icinga (likely already removed)
    • Found Ganeti VM
    • VM shutdown
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqsin.wmnet to Netbox
    • Removed from DebMonitor
    • Removed from Puppet master and PuppetDB
    • VM removed
    • Started forced sync of VMs in Ganeti cluster ganeti01.svc.eqsin.wmnet to Netbox