
esams,ulsfo,eqsin: one VM request each for install_servers
Closed, Resolved (Public)

Description

Just like in T244390, where a VM was requested to replace the old install servers with a "light" variant (DHCP/TFTP but no APT repo), the same is now requested for the 3 edge sites / POPs that don't have their own install servers so far.

So 3 VMs in total, one in eqsin, one in ulsfo, one in esams. Note we have not had ganeti VMs with public IPs in these before.

T242602 was the ticket for the planning and T252526 is for the implementation.

install3001.wikimedia.org

Labs Project Tested: n/a
Site/Location: ESAMS
Number of systems: 1
Service: install_server
Networking Requirements: public
Processor Requirements: 1
Memory: 1G
Disks: 20G
Other Requirements: net-ops, ACL / dhcp-helper config changes

The VM will be used as an install_server (TFTP, DHCP, ...)

install4001.wikimedia.org

Labs Project Tested: n/a
Site/Location: ULSFO
Number of systems: 1
Service: install_server
Networking Requirements: public
Processor Requirements: 1
Memory: 1G
Disks: 20G
Other Requirements: net-ops, ACL / dhcp-helper config changes

The VM will be used as an install_server (TFTP, DHCP, ...)

install5001.wikimedia.org

Labs Project Tested: n/a
Site/Location: EQSIN
Number of systems: 1
Service: install_server
Networking Requirements: public
Processor Requirements: 1
Memory: 1G
Disks: 20G
Other Requirements: net-ops, ACL / dhcp-helper config changes

The VM will be used as an install_server (TFTP, DHCP, ...)

Event Timeline

Change 599883 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add IPs for installservers in POPs

https://gerrit.wikimedia.org/r/599883

Dzahn triaged this task as Medium priority. (Jun 4 2020, 9:19 AM)
akosiaris added subscribers: ayounsi, akosiaris.

I just had a quick look at the 3 PoP ganeti clusters and it seems they aren't ready to serve VMs with public IPs. /etc/network/interfaces lacks the "public" interface that the main DC clusters have.

A quick look at asw2-ulsfo and asw2-esams also shows that the ports of those servers aren't set up to serve the public vlan in those PoPs either.

We should first fix these if we want to have those VMs in public IP space.

Sure I can do it, but do they need internet access? DHCP/TFTP shouldn't need internet access afaik? Are there other services running on them?

@ayounsi - Yes, we're going to have some outbound recursive DNS needs from some ganeti-hosted services

> Sure I can do it, but do they need internet access? DHCP/TFTP shouldn't need internet access afaik? Are there other services running on them?

The reason for public IPs on the new "light" installservers in eqiad/codfw was that they also run the squid proxies. It isn't clear to me yet whether we will also add these in the POPs.

Mentioned in SAL (#wikimedia-operations) [2020-06-10T06:53:00Z] <XioNoX> trunk public vlan to ulsfo ganeti hosts - T254157

Mentioned in SAL (#wikimedia-operations) [2020-06-10T07:16:20Z] <XioNoX> trunk public vlan to eqsin ganeti hosts - T254157

Mentioned in SAL (#wikimedia-operations) [2020-06-10T07:26:50Z] <XioNoX> trunk public vlan to esams ganeti hosts - T254157

@Dzahn: @akosiaris configured public interfaces on the ganeti hosts, and once the Ganeti clusters are rebooted (which I'm currently handling), you can create VMs with a public IP. I'm already done with the ulsfo Ganeti cluster, so feel free to give install4001.wikimedia.org a shot.

The Ganeti clusters in esams and eqsin have also been rebooted; they should be ready for instances with public IPs now as well.

Change 599883 merged by Dzahn:
[operations/dns@master] add IPs for installservers in POPs

https://gerrit.wikimedia.org/r/599883

added to DNS:

install3001.wikimedia.org has address 91.198.174.63
install3001.wikimedia.org has IPv6 address 2620:0:862:1:91:198:174:63

install4001.wikimedia.org has address 198.35.26.12
install4001.wikimedia.org has IPv6 address 2620:0:863:1:198:35:26:12

install5001.wikimedia.org has address 103.102.166.13
install5001.wikimedia.org has IPv6 address 2001:df2:e500:1:103:102:166:13
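The IPv6 addresses above reuse the decimal octets of each host's IPv4 address as the last four groups. A minimal sketch of that mapping follows; the /64 prefix lengths and the helper name `mapped_ipv6` are assumptions for illustration and are not taken from the DNS change itself.

```python
import ipaddress


def mapped_ipv6(prefix: str, ipv4: str) -> ipaddress.IPv6Address:
    """Write each decimal IPv4 octet verbatim into one of the last four groups."""
    network = ipaddress.IPv6Network(prefix)
    # "174" is reused as the hex group 0x0174, not converted to 0xae.
    groups = "".join(f"{int(octet, 16):04x}" for octet in ipv4.split("."))
    return network[int(groups, 16)]


# Prefixes are assumed to be /64 networks (hypothetical; only the addresses appear above).
HOSTS = {
    "install3001.wikimedia.org": ("2620:0:862:1::/64", "91.198.174.63"),
    "install4001.wikimedia.org": ("2620:0:863:1::/64", "198.35.26.12"),
    "install5001.wikimedia.org": ("2001:df2:e500:1::/64", "103.102.166.13"),
}

for fqdn, (prefix, ipv4) in HOSTS.items():
    print(fqdn, ipv4, mapped_ipv6(prefix, ipv4))
```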

> feel free to give install4001.wikimedia.org a shot.

Thanks! Just did. On the first try I followed the docs, checked Netbox for the row name and saw that in ULSFO it is row "1" (not a letter like in eqiad/codfw), so I tried --network public ulsfo_1. The cookbook told me to just use --network public ulsfo instead (and likewise eqsin and esams without a number).

so:

dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 1 --disk 20 --network public ulsfo install4001.wikimedia.org

Ready to create Ganeti VM install4001.wikimedia.org in the ganeti01.svc.ulsfo.wmnet cluster on row 1 with 1 vCPUs, 1GB of RAM, 20GB of disk in the public network.
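For reference, the same invocation can be generated for all three sites from the specs in the task description. This is only a sketch: it builds and prints the command lines without running them, and it assumes the esams and eqsin calls take the same flags as the ulsfo call shown above. The helper name `makevm_command` is made up for this example.

```python
import shlex

# (site, fqdn) pairs from the task description.
REQUESTS = [
    ("esams", "install3001.wikimedia.org"),
    ("ulsfo", "install4001.wikimedia.org"),
    ("eqsin", "install5001.wikimedia.org"),
]


def makevm_command(site, fqdn, vcpus=1, memory_gb=1, disk_gb=20):
    """Return the argv list for one VM request (1 vCPU, 1G RAM, 20G disk by default)."""
    return [
        "sudo", "cookbook", "sre.ganeti.makevm",
        "--vcpus", str(vcpus),
        "--memory", str(memory_gb),
        "--disk", str(disk_gb),
        # For the POPs the site name is given without a row suffix.
        "--network", "public", site,
        fqdn,
    ]


for site, fqdn in REQUESTS:
    print(shlex.join(makevm_command(site, fqdn)))
```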

Change 606718 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site/DHCP: add install4001.wikimedia.org

https://gerrit.wikimedia.org/r/606718

Change 606718 merged by Dzahn:
[operations/puppet@production] site/DHCP: add install4001.wikimedia.org

https://gerrit.wikimedia.org/r/606718

Change 606720 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: configure install2003 as next-server for install4001

https://gerrit.wikimedia.org/r/606720

Change 606720 merged by Dzahn:
[operations/puppet@production] DHCP: configure install2003 as next-server for install4001

https://gerrit.wikimedia.org/r/606720

Creating the VM worked fine. Installing the OS on install4001 has not worked yet though.

DHCP was working right away, but serving the installer was not. Then I changed the "next-server" for install4001 to install2003 (just like bast4001 has it set in the DHCP config). After that I could see it serving lpxelinux.0, but that's where it stops and the console stays empty.

Jun 19 16:07:16 install2003 dhcpd[22103]: DHCPREQUEST for 198.35.26.12 (208.80.153.51) from aa:00:00:6d:c7:59 via 198.35.26.3
Jun 19 16:07:16 install2003 dhcpd[22103]: DHCPACK on 198.35.26.12 to aa:00:00:6d:c7:59 via 198.35.26.3
Jun 19 16:07:16 install2003 atftpd[19167]: Serving lpxelinux.0 to 198.35.26.12:58560
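To see how far a client gets in the boot sequence, the log lines above can be reduced to a list of stages per client IP. The sketch below is an assumption-laden helper (the regexes are written against just these three lines, and `boot_stages` is a hypothetical name); it only shows that the last recorded stage is the TFTP transfer of lpxelinux.0.

```python
import re

LOG = """\
Jun 19 16:07:16 install2003 dhcpd[22103]: DHCPREQUEST for 198.35.26.12 (208.80.153.51) from aa:00:00:6d:c7:59 via 198.35.26.3
Jun 19 16:07:16 install2003 dhcpd[22103]: DHCPACK on 198.35.26.12 to aa:00:00:6d:c7:59 via 198.35.26.3
Jun 19 16:07:16 install2003 atftpd[19167]: Serving lpxelinux.0 to 198.35.26.12:58560
"""

# Patterns derived from the sample lines only; other log formats are not covered.
ACK_RE = re.compile(r"DHCPACK on (?P<ip>\S+) to (?P<mac>\S+) via (?P<relay>\S+)")
TFTP_RE = re.compile(r"atftpd\[\d+\]: Serving (?P<file>\S+) to (?P<ip>[\d.]+):(?P<port>\d+)")


def boot_stages(log, client_ip):
    """Return the boot stages seen for client_ip, in log order."""
    stages = []
    for line in log.splitlines():
        ack = ACK_RE.search(line)
        if ack and ack["ip"] == client_ip:
            stages.append("dhcp-ack")
        tftp = TFTP_RE.search(line)
        if tftp and tftp["ip"] == client_ip:
            stages.append("tftp:" + tftp["file"])
    return stages


print(boot_stages(LOG, "198.35.26.12"))
```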

on install2003 the ferm rule matching it is there:

ACCEPT     udp  --  198.35.26.0/28       anywhere             udp dpt:bootps
ACCEPT     tcp  --  198.35.26.0/28       anywhere             tcp dpt:http
ACCEPT     tcp  --  198.35.26.0/28       anywhere             tcp dpt:http-alt
ACCEPT     udp  --  198.35.26.0/28       anywhere             udp dpt:tftp
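A quick way to double-check that the new VM's address falls inside the source range of these rules is sketched below. It parses the rule listing exactly as pasted above and is not a general iptables/ferm parser; `allowed_services` is a hypothetical helper name.

```python
import ipaddress

RULES = """\
ACCEPT     udp  --  198.35.26.0/28       anywhere             udp dpt:bootps
ACCEPT     tcp  --  198.35.26.0/28       anywhere             tcp dpt:http
ACCEPT     tcp  --  198.35.26.0/28       anywhere             tcp dpt:http-alt
ACCEPT     udp  --  198.35.26.0/28       anywhere             udp dpt:tftp
"""


def allowed_services(client_ip, rules):
    """List proto/port pairs whose ACCEPT rule covers client_ip as source."""
    addr = ipaddress.ip_address(client_ip)
    services = []
    for line in rules.splitlines():
        fields = line.split()
        target, proto, source = fields[0], fields[1], fields[3]
        port = fields[-1].split(":", 1)[1]  # "dpt:tftp" -> "tftp"
        if target == "ACCEPT" and addr in ipaddress.ip_network(source):
            services.append(f"{proto}/{port}")
    return services


# install4001 (198.35.26.12) is inside 198.35.26.0/28, which spans .0 to .15.
print(allowed_services("198.35.26.12", RULES))
```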

Change 601342 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add new POP install servers with insetup role

https://gerrit.wikimedia.org/r/601342

@Dzahn what's the status of this? It appears that the VM is up but not in puppet at all.

@Volans Yea, that's right. The status is still that creating the VM worked but installing the OS did not (T254157#6241107). I will get back to debugging the reason for that, but probably not today. If it's an issue I can delete the VM and recreate it when I get to it.

Not a specific issue for me; it came up as an inconsistency in some cross checks for Netbox automation. Up to you.

Change 622680 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: add install3001 and install5001 MAC addresses

https://gerrit.wikimedia.org/r/622680

Change 622680 merged by Dzahn:
[operations/puppet@production] DHCP: add install3001 and install5001 MAC addresses

https://gerrit.wikimedia.org/r/622680

Change 622685 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: use install1003 as next-server for install4001

https://gerrit.wikimedia.org/r/622685

Change 622685 merged by Dzahn:
[operations/puppet@production] DHCP: use install1003 as next-server for install4001

https://gerrit.wikimedia.org/r/622685

Change 622687 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: remove next-server settings for new install servers

https://gerrit.wikimedia.org/r/622687

Change 622687 merged by Dzahn:
[operations/puppet@production] DHCP: remove next-server settings for new install servers

https://gerrit.wikimedia.org/r/622687

Change 622696 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: set pxelinux.pathprefix to not use http for install5001

https://gerrit.wikimedia.org/r/622696

Change 622696 merged by Dzahn:
[operations/puppet@production] DHCP: set pxelinux.pathprefix to not use http for install5001

https://gerrit.wikimedia.org/r/622696

Mentioned in SAL (#wikimedia-operations) [2020-08-27T02:03:20Z] <mutante> shutting down install3001,install4001,install5001 VMs (no OS yet, but please also don't delete, debugging in progress, shutting them down until I continue on T254157)

Change 601342 merged by Dzahn:
[operations/puppet@production] site: add new POP install servers with insetup role

https://gerrit.wikimedia.org/r/601342

The VMs have been created and added to site.pp with the insetup role.

There are still some problems with installing the OS, but that is now a separate ticket (T263684): it likely needs an ACL change from netops and is no longer about requesting VMs.