Sort out plan for install* servers in edge sites
Closed, Resolved (Public)

Description

With all the Ganeti clusters now set up in the edge sites and the forthcoming deprecation of jessie on the existing install* servers, we should define the plan for install* servers in the edge sites.

The existing Ganeti instances (install1002/install2002) provide the following services:

  • DHCP
  • TFTP / d-i preseeding configuration
  • Web proxy
  • the apt.wikimedia.org repository (which is synchronised via rsync between eqiad and codfw)

From my PoV when moving to buster and into the edge sites we should split the apt.wikimedia.org repository to a separate role/Ganeti instance (like repo1001.wikimedia.org/repo2001.wikimedia.org) and have install1003/install2003/install3001/install4001/install5001 which only provide DHCP/TFTP/d-i/proxy.

I don't think there's a good case for replicating apt.wikimedia.org repository to the edge sites; it's only a handful of servers and the majority of installation traffic originates from our main Debian mirror (sodium) in eqiad anyway. Plus, all our codfw servers already install from the repository provided by install1002 via the apt.wikimedia.org CNAME.

Event Timeline

Looking at the data usage of install* sans the repository and related files (like firmware images), we'd only need 20G for the install* servers and would still have plenty of headroom (the TFTP environment as pulled from volatile is currently 2G).

Splitting the internal apt repository from the install roles/servers sounds good -- it's more of a historical artifact than anything else. You probably know this already but do note that the install server does not provide just TFTP, but also HTTP (and that is actually favored these days), so we would need to have a webserver running on the install servers.

Apart from that, the other question I have is... do we /need/ install servers on each of the edge sites? What benefit would it provide?

> Splitting the internal apt repository from the install roles/servers sounds good -- it's more of a historical artifact than anything else. You probably know this already but do note that the install server does not provide just TFTP, but also HTTP (and that is actually favored these days), so we would need to have a webserver running on the install servers.

Yeah, these will probably just be refactored to a common profile.

> Apart from that, the other question I have is... do we /need/ install servers on each of the edge sites? What benefit would it provide?

We don't strictly need them, after all we currently don't have any either :-)

Given that after the split of the repositories we'll have a separate role/instance in eqiad/codfw, I think it makes sense to also test this setup on the edges, since these will be simple, stateless services requiring little to no maintenance. There may be other good reasons, but here are some I see off the top of my head:

  • It provides a site-local web proxy which could be particularly interesting for eqsin round-trip wise
  • Installations will be quicker/more reliable; we've had a number of issues reimaging servers remotely over the years (although it's not really clear to me what percentage of those was caused by the outdated hardware in esams)
  • We can fold back the edge bastions to the main bastion setup (but we'll still need to keep a variant of role::bastionhost::pop in Puppet to bootstrap whatever next edge site we build in the future)

One use case I have of the install1002 server is:

  1. Download a Junos software image from Juniper to install1002
  2. Move it to /srv/junos/
  3. Fetch it over https with for example: file copy "https://install1002.wikimedia.org/junos/jinstall-ppc-17.3R3-S6.3-signed.tgz" /var/tmp/jinstall-ppc-17.3R3-S6.3-signed.tgz

Note that browsing that URL with Firefox shows an https error:

Firefox does not trust this site because it uses a certificate that is not valid for install1002.wikimedia.org. The certificate is only valid for apt.wikimedia.org.

But Junos is (so far) smart enough to ignore it. /s
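
The error above comes from hostname verification: the client compares the name in the URL with the names listed in the certificate. A minimal Python sketch of that comparison follows. It is simplified to exact matches and a single leading wildcard label, and the function name `hostname_matches` is made up for illustration; real TLS libraries apply more rules than this.

```python
def hostname_matches(hostname, cert_names):
    """Return True if hostname is covered by any name in cert_names.

    Simplified: exact match, or a "*." wildcard covering exactly one
    leftmost label. Real TLS stacks implement more rules than this.
    """
    host = hostname.lower().rstrip(".")
    for name in cert_names:
        name = name.lower().rstrip(".")
        if name == host:
            return True
        if name.startswith("*."):
            # "*.example.org" covers "a.example.org" but not "a.b.example.org"
            first_label, _, rest = host.partition(".")
            if first_label and rest == name[2:]:
                return True
    return False


# The certificate in the error message is only valid for apt.wikimedia.org.
cert_names = ["apt.wikimedia.org"]
print(hostname_matches("install1002.wikimedia.org", cert_names))
print(hostname_matches("apt.wikimedia.org", cert_names))
```

With these inputs the first check fails and the second succeeds, which is why a strict client rejects the install1002 URL while the same file requested via apt.wikimedia.org would verify.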

The other way of downloading an image on a router is to download it to my laptop and then SCP it over, which is quite time consuming.
So it would be great if:

  1. This use case still works after the migration (or something similar)
  2. Optional, the certificate error is fixed

Thanks!

Change 570969 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: create new role without HTTP/APT, rename existing role

https://gerrit.wikimedia.org/r/570969

Change 570971 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] introduce new role to install nginx and APT repo without DHCP/TFTP

https://gerrit.wikimedia.org/r/570971

Change 570969 merged by Dzahn:
[operations/puppet@production] installserver: create new role without HTTP/APT, rename existing role

https://gerrit.wikimedia.org/r/570969

Change 570971 merged by Dzahn:
[operations/puppet@production] introduce new role to install nginx and APT repo without DHCP/TFTP

https://gerrit.wikimedia.org/r/570971

Change 572311 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add apt1001.wikimedia.org

https://gerrit.wikimedia.org/r/572311

Change 572312 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add apt[12]001.wikimedia.org with role::apt_repo

https://gerrit.wikimedia.org/r/572312

Change 572311 merged by Dzahn:
[operations/dns@master] add apt1001.wikimedia.org

https://gerrit.wikimedia.org/r/572311

Change 574088 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: allow for multiple failover servers at once

https://gerrit.wikimedia.org/r/574088

Change 574088 merged by Dzahn:
[operations/puppet@production] installserver: allow for multiple failover servers at once

https://gerrit.wikimedia.org/r/574088

Change 574106 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] aptrepo/install: move https monitoring to aptrepo profile

https://gerrit.wikimedia.org/r/574106

Change 574106 merged by Dzahn:
[operations/puppet@production] aptrepo/install: move https monitoring to aptrepo profile

https://gerrit.wikimedia.org/r/574106

Change 574537 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] DHCP: add apt1002 and apt2001

https://gerrit.wikimedia.org/r/574537

Change 574537 merged by Dzahn:
[operations/puppet@production] DHCP: add apt1002 and apt2001

https://gerrit.wikimedia.org/r/574537

Change 572312 merged by Dzahn:
[operations/puppet@production] site: add apt[12]001.wikimedia.org with role::apt_repo

https://gerrit.wikimedia.org/r/572312

Mentioned in SAL (#wikimedia-operations) [2020-02-25T20:25:29Z] <mutante> apt.wikimedia.org (current install* and new apt* roles) - going ECDSA-only and removing RSA certificate from nginx config - to support buster without having to maintain patched nginx for duplicate ssl_stapling_file directive - at the cost of slightly reduced back-compat on the public repo (T242602)

Change 572394 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add installserver::light role on new install servers

https://gerrit.wikimedia.org/r/572394

Change 575404 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] switch apt.wikimedia.org from install1002 to apt1001

https://gerrit.wikimedia.org/r/575404

Change 576479 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: add parameter for DHCP interface

https://gerrit.wikimedia.org/r/576479

I guess the plan has been sorted out. We are in the middle of implementing it and the details are on T224576. Instead of adding some changes here and some there, we can probably close this ticket since the plan itself exists.

@ayounsi

> One use case I have of the install1002 server is:
> ..
> Fetch it over https with for example: file copy "https://install1002.wikimedia.org/junos/jinstall-ppc-17.3R3-S6.3-signed.tgz" /var/tmp/jinstall-ppc-17.3R3-S6.3-signed.tgz
> ..
> Firefox does not trust this site because it uses a certificate that is not valid for install1002.wikimedia.org. The certificate is only valid for apt.wikimedia.org.
> ..
> Optional, the certificate error is fixed

I think this shouldn't be an issue if you just use apt.wikimedia.org. So as of right now that would be:

  1. ssh to install1002.wikimedia.org
  2. wget the file from Juniper and move it to /srv/junos/
  3. copy it from, for example, https://apt.wikimedia.org/junos/junos-srxsme-12.1X46-D35.1-domestic.tgz; since apt.wikimedia.org is used, the cert matches

The only upcoming change to that would be that you have to start using apt1001 instead of install1002 as the server to download to. (When we switch apt.wikimedia.org in DNS, it already has the matching cert from acme-chief).
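
As an illustration of step 3, here is a small Python sketch that builds the Junos `file copy` command for an image served under /junos/ on apt.wikimedia.org. The helper name `junos_file_copy_command` and the default destination directory are assumptions for the example; the URL prefix and the command shape are taken from the comments above.

```python
from urllib.parse import quote

# URL prefix taken from the example in this comment.
APT_BASE_URL = "https://apt.wikimedia.org/junos/"


def junos_file_copy_command(image_name, dest_dir="/var/tmp"):
    """Build the Junos 'file copy' command for an image served under /junos/."""
    url = APT_BASE_URL + quote(image_name)
    return f'file copy "{url}" {dest_dir}/{image_name}'


print(junos_file_copy_command("junos-srxsme-12.1X46-D35.1-domestic.tgz"))
```

Because the URL uses the apt.wikimedia.org name rather than the host name of the server behind it, the command should not need to change when the service moves to another host.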

Great, thanks, are you going to take care of moving the files over?

Change 572394 abandoned by Dzahn:
site: add installserver::light role on new install servers

Reason:
rebased to nothing as it was done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/576460

https://gerrit.wikimedia.org/r/572394

> Great, thanks, are you going to take care of moving the files over?

Yes, it's fully puppetized: /srv/ is rsynced from the primary server to the (multiple) failovers listed as install_servers_failover in hieradata/common.yaml.
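
To make the fan-out concrete, here is a rough Python sketch of one rsync command per failover. This is not the Puppet implementation: the host names are placeholders, and pushing over SSH with `-a --delete` is an assumption, since the real setup may pull or use an rsync daemon. Only the key name `install_servers_failover` and the /srv/ path come from the comment above.

```python
import shlex

# Hypothetical stand-in for the relevant key in hieradata/common.yaml;
# the host names below are placeholders, not real servers.
hiera = {
    "install_servers_failover": [
        "failover-a.example.org",
        "failover-b.example.org",
    ],
}


def rsync_commands(failovers, path="/srv/"):
    """Return one rsync command per failover, as run on the primary server."""
    commands = []
    for host in failovers:
        # -a preserves permissions/times; --delete removes files gone from the primary
        argv = ["rsync", "-a", "--delete", path, f"{host}:{path}"]
        commands.append(" ".join(shlex.quote(arg) for arg in argv))
    return commands


for command in rsync_commands(hiera["install_servers_failover"]):
    print(command)
```

Adding another failover then only means extending the list; the per-host command stays the same.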

Change 577640 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] remove install1003/2003 again to recreate with public IPs

https://gerrit.wikimedia.org/r/577640

Change 577641 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: switch install1003/2003 to public IPs

https://gerrit.wikimedia.org/r/577641

Change 577640 merged by Dzahn:
[operations/dns@master] remove install1003/2003 again to recreate with public IPs

https://gerrit.wikimedia.org/r/577640

Change 577641 merged by Dzahn:
[operations/puppet@production] site: switch install1003/2003 to public IPs

https://gerrit.wikimedia.org/r/577641

Change 575404 merged by Dzahn:
[operations/dns@master] switch apt.wikimedia.org from install1002 to apt1001

https://gerrit.wikimedia.org/r/575404

install1002/2002 have now been replaced by install1003/2003 for DHCP/TFTP/webproxy and apt1001/2001 for apt.wikimedia.org / reprepro.

Change 587698 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add install.wikimedia.org CNAME to install1003

https://gerrit.wikimedia.org/r/587698

Change 587699 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] install_server: replace apt.wm.org with install.wm.org in autoinstall URLs

https://gerrit.wikimedia.org/r/587699

Change 587701 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] installserver: include preseed in apt role, not in light role

https://gerrit.wikimedia.org/r/587701

Change 587701 merged by Dzahn:
[operations/puppet@production] installserver: include preseed in apt role, not in light role

https://gerrit.wikimedia.org/r/587701

Change 587699 abandoned by Dzahn:
install_server: replace apt.wm.org with install.wm.org in autoinstall URLs

Reason:
preferring other solution https://gerrit.wikimedia.org/r/c/operations/puppet/+/587701

https://gerrit.wikimedia.org/r/587699

Change 587698 abandoned by Dzahn:
add install.wikimedia.org CNAME to install1003

Reason:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/587701

https://gerrit.wikimedia.org/r/587698

install1002 and install2002 have now been removed. install1003/2003 and apt1001/2001 have replaced them.

Change 599817 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] bastionhost: include TFTP and DHCP profiles in bastion hosts in POPs

https://gerrit.wikimedia.org/r/599817

Change 599883 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] add IPs for installservers in POPs

https://gerrit.wikimedia.org/r/599883

Change 599817 abandoned by Dzahn:
bastionhost: include TFTP and DHCP profiles in bastion hosts in POPs

Reason:
creating dedicated ganeti VMs right away instead

https://gerrit.wikimedia.org/r/599817

Change 601342 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] site: add new POP install servers with insetup role

https://gerrit.wikimedia.org/r/601342

Change 599883 merged by Dzahn:
[operations/dns@master] add IPs for installservers in POPs

https://gerrit.wikimedia.org/r/599883

Quoting from the original task description and replying inline:

> when moving to buster and into the edge sites we should split the apt.wikimedia.org repository to a separate role/Ganeti instance (like repo1001.wikimedia.org/repo2001.wikimedia.org)

This has been done for a while. They are just called apt1001.wikimedia.org and apt2001.wikimedia.org.

It was done on this ticket and in T224576.

> and have install1003/install2003/install3001/install4001/install5001 which only provide DHCP/TFTP/d-i/proxy.

This is done now. TFTP was switched from the local bastion hosts in the POPs to the dedicated install3001/4001/5001 just recently.

Today DHCP was also switched from install1003/2003 to install3001/4001/5001.

This all happened in T252526. I reinstalled testvm3001/4001/5001 to confirm everything is working.

> I don't think there's a good case for replicating apt.wikimedia.org repository to the edge sites;

ACK, there is no apt[345]001 and we are not planning to create them.

One last thing is open to check here... the "squid" proxy part.

Change 632591 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/dns@master] switch webproxy for esams/ulsfo/eqsin to their local install server

https://gerrit.wikimedia.org/r/632591

Change 632591 merged by Dzahn:
[operations/dns@master] switch webproxy for esams/ulsfo/eqsin to their local install server

https://gerrit.wikimedia.org/r/632591

Mentioned in SAL (#wikimedia-operations) [2020-10-07T16:35:39Z] <mutante> switching webproxy service names to the new local install servers in esams/eqsin/ulsfo T242602

Dzahn claimed this task.

I'm declaring this resolved. Everything is done according to the plan. Let me know if you think otherwise.