Setup install server in codfw - tftp done, but not apt and other install services (now: DHCP, TFTP, webproxy done, just not APT)
Closed, Resolved, Public

Description

First thing we need in codfw is an install server. Installs will work from
eqiad, but the PXE part will be INCREDIBLY slow over the cross data center
links due to the 34ms+ latency. PXE sends one packet round trip at a time...
The codfw routers will now forward all DHCP requests to carbon in eqiad. That's
fine, if/when we get a DHCP server set up in codfw we can change that, but
that's not urgent. Subnet definitions need to be added to the dhcpd.conf (Extra
points if we can generate them from network.pp).
Once DHCP is set up for these subnets, bootstrap one suitable server that will
be the install server. This step will likely fail due to various problems;
think of firewall rules that need updating for codfw, Preseed files that need
to be adjusted for the new subnets, Puppet manifests that need to be adjusted
for the new $::site and IP ranges, etc. I've added the new ranges to network.pp
but nothing else yet. Please make that work as you go.
Once the codfw install server is installed, check it very carefully for
deviations from our normal completely automated installs. Network configuration
may have failed for example, and it may not have static config. /etc/hosts may
be weird, etc. We've found such issues on all previous bootstrapped install
servers in new data centers in the past. :)
Then make sure PXE gets set up for other codfw installs. Although we should get
local APT repos etc at some point, with multiple 10G links to eqiad and 34ms
latency it shouldn't be bad at all.
Let me know if I can help, especially from the network side of things.
Good luck. ;)
--
Mark Bergsma <mark at wikimedia>
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation
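
The description suggests generating the dhcpd.conf subnet definitions from network data instead of writing them by hand. A minimal sketch of that idea follows, assuming the ranges are available as a name-to-CIDR mapping. The subnet names, CIDRs, the "first usable address is the gateway" convention and the next-server address are all hypothetical placeholders, not the actual codfw values from network.pp.

```python
import ipaddress

# Hypothetical example data; the real ranges live in network.pp and are
# not reproduced in this ticket.
EXAMPLE_SUBNETS = {
    "example-private-a": "10.0.0.0/22",
    "example-public-a": "192.0.2.0/27",
}


def dhcpd_subnet_stanza(name, cidr, next_server):
    """Render one ISC dhcpd.conf subnet declaration for an IPv4 network."""
    net = ipaddress.ip_network(cidr)
    # Assumption for this sketch: the gateway is the first usable address.
    gateway = net.network_address + 1
    return (
        f"# {name}\n"
        f"subnet {net.network_address} netmask {net.netmask} {{\n"
        f"    option broadcast-address {net.broadcast_address};\n"
        f"    option subnet-mask {net.netmask};\n"
        f"    option routers {gateway};\n"
        f"    next-server {next_server};\n"
        f"}}\n"
    )


if __name__ == "__main__":
    for subnet_name, subnet_cidr in EXAMPLE_SUBNETS.items():
        # 10.0.0.5 stands in for the install server address (hypothetical).
        print(dhcpd_subnet_stanza(subnet_name, subnet_cidr, "10.0.0.5"))
```

Keeping the ranges in one place and rendering the stanzas from them avoids the two files drifting apart when a subnet is added.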

Details

Reference
rt8201
rtimport set Reference to rt8201.
mark created this task.Aug 25 2014, 12:01 PM
mark added a comment.Aug 25 2014, 12:14 PM

Dependency by ticket #8185 added by mark

mark added a comment.Aug 25 2014, 12:32 PM

On Mon Aug 25 12:01:05 2014, mark wrote:

First thing we need in codfw is an install server. Installs will work from
eqiad, but the PXE part will be INCREDIBLY slow over the cross data center
links due to the 34ms+ latency. PXE sends one packet round trip at a time...

Doh. s/PXE/TFTP/ in the above. :)
--
Mark Bergsma <mark at wikimedia>
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation
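
To put a number on the latency concern: classic TFTP (RFC 1350) sends one 512-byte data block and waits for its acknowledgement before sending the next, so each block costs one round trip. The sketch below is a rough estimate only. The 20 MiB file size and the 0.5 ms local round-trip time are assumptions for illustration, and the model ignores retransmissions and the blksize option.

```python
def tftp_blocks(file_bytes, block_bytes=512):
    """Number of DATA packets needed for a file.

    A transfer ends with a block shorter than block_bytes, so a file that
    is an exact multiple of the block size needs one extra, empty block.
    """
    return file_bytes // block_bytes + 1


def tftp_transfer_seconds(file_bytes, rtt_seconds, block_bytes=512):
    """Lower-bound transfer time when every block costs one round trip."""
    return tftp_blocks(file_bytes, block_bytes) * rtt_seconds


if __name__ == "__main__":
    image = 20 * 1024 * 1024  # hypothetical 20 MiB kernel + initrd
    cross_dc = tftp_transfer_seconds(image, 0.034)   # 34 ms RTT from the ticket
    local = tftp_transfer_seconds(image, 0.0005)     # assumed 0.5 ms local RTT
    print(f"cross data center: {cross_dc / 60:.1f} min")  # roughly 23 minutes
    print(f"local:             {local:.1f} s")            # about 20 seconds
```

Under these assumptions the ceiling across the link is about 15 kB/s regardless of available bandwidth, which is why a local TFTP server matters more than local APT.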

RobH added a comment.Aug 25 2014, 8:31 PM

Getting the new install server online is the master ticket linked to this one;
this ticket is just tracking:
install OS and get new install server online
--
Rob Halsell
Operations Engineer
Wikimedia Foundation, Inc.
E-Mail: <rhalsell at wikimedia>

RobH added a comment.Aug 25 2014, 8:31 PM

Dependency by ticket #8184 added by robh

RobH added a comment.Aug 25 2014, 8:31 PM

Dependency on ticket #8184 added by robh

RobH added a comment.Aug 25 2014, 8:42 PM

Dependency on ticket #8202 added by robh

RobH added a comment.Aug 26 2014, 12:07 AM

Issue taken by robh

RobH added a comment.Aug 26 2014, 12:08 AM

mgmt and production (IPv4/6) allocated
updated install server module with https://gerrit.wikimedia.org/r/#/c/156210/
need to confirm the network switch port for install2001 before proceeding.
--
Rob Halsell
Operations Engineer
Wikimedia Foundation, Inc.
E-Mail: <rhalsell at wikimedia>

RobH added a comment.Aug 26 2014, 12:08 AM

Dependency on ticket #8205 added by robh

RobH added a comment.Aug 26 2014, 12:28 AM

Subject changed from 'install os and get new install server online' to 'install2001: get os deployed' by robh

RobH added a comment.Aug 27 2014, 10:34 PM

Merged into ticket #8184 by robh

RobH added a comment.Aug 27 2014, 10:35 PM

So the TFTP stuff is up and running, but not the actual apt items.
I'll be testing this tomorrow when I install bast2001, so stalling this until
then.
--
Rob Halsell
Operations Engineer
Wikimedia Foundation, Inc.
E-Mail: <rhalsell at wikimedia>

RobH added a comment.Aug 27 2014, 10:35 PM

Status changed from 'new' to 'stalled' by robh

RobH added a comment.Aug 27 2014, 10:35 PM

Issue taken by robh

RobH added a comment.Aug 28 2014, 9:46 PM

Status changed from 'stalled' to 'open' by robh

RobH added a comment.Aug 28 2014, 9:47 PM

The basic tftp install server is running.
We'll need to add in more disks/replace disks for running the apt repo there,
but that can wait (since tftp locally was the vital part of starting this
process.)
--
Rob Halsell
Operations Engineer
Wikimedia Foundation, Inc.
E-Mail: <rhalsell at wikimedia>

RobH added a comment.Aug 28 2014, 9:47 PM

Status changed from 'open' to 'resolved' by robh

RobH added a comment.Aug 28 2014, 9:48 PM

meant to keep stalled, not resolve.
--
Rob Halsell
Operations Engineer
Wikimedia Foundation, Inc.
E-Mail: <rhalsell at wikimedia>

Status changed from 'resolved' to 'open' by RT_System

RobH added a comment.Aug 28 2014, 9:48 PM

Status changed from 'open' to 'stalled' by robh

RobH added a comment.Aug 28 2014, 9:48 PM

Subject changed from 'Setup install server in codfw' to 'Setup install server in codfw - tftp done, but not apt and other install services' by robh

RobH added a comment.Aug 28 2014, 9:48 PM

Untaken by robh

Dzahn added a comment.Sep 2 2014, 6:46 PM

Membership in ticket #8183 deleted by dzahn

Dzahn added a comment.Sep 2 2014, 6:46 PM

Dependency by ticket #8183 added by dzahn

Dependency by ticket #8185 deleted by bblack

Dzahn changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)".Aug 5 2015, 5:56 AM
Dzahn changed the edit policy from "WMF-NDA (Project)" to "All Users".
Dzahn set Security to None.
Restricted Application added a subscriber: Matanya.Aug 5 2015, 5:56 AM
faidon merged a task: Restricted Task.Sep 10 2015, 8:01 PM
faidon added a subscriber: RobH.
Dzahn claimed this task.Dec 22 2016, 12:36 AM

Meanwhile install2001 uses the same Puppet roles as install1001 and carbon, and DHCP is running now (done as part of T132757).

Routers have been configured to send eqiad DHCP requests to install1001 now; codfw has not been touched yet.

Change 333676 had a related patch set uploaded (by Dzahn):
aptrepo: setup rsync between 2 APT servers

https://gerrit.wikimedia.org/r/333676

Change 333676 merged by Dzahn:
aptrepo: setup rsync between 2 APT servers

https://gerrit.wikimedia.org/r/333676

Mentioned in SAL (#wikimedia-operations) [2017-01-26T03:50:12Z] <mutante> rsyncing apt.wikimedia.org data from carbon to install2001 (T84380)

Change 334241 had a related patch set uploaded (by Dzahn):
aptrepo: add cron to rsync APT data automatically

https://gerrit.wikimedia.org/r/334241

Change 334241 merged by Dzahn:
aptrepo: add cron to rsync APT data automatically

https://gerrit.wikimedia.org/r/334241

Change 335585 had a related patch set uploaded (by Dzahn):
installserver: add firewall hole for rsync also for IPv6

https://gerrit.wikimedia.org/r/335585

Change 335585 merged by Dzahn:
installserver: add firewall hole for rsync also for IPv6

https://gerrit.wikimedia.org/r/335585

Mentioned in SAL (#wikimedia-operations) [2017-02-02T02:02:09Z] <mutante> carbon - remove unmapped IPv6 address making ferm rules fail, use only the _mapped_ IP (ip addr del 2620:0:861:1:7a2b:cbff:fe09:ea0/64 dev eth0) (T84380 T132757)

Change 335594 had a related patch set uploaded (by Dzahn):
aptrepo: disable autoconfigured EUI64 addresses

https://gerrit.wikimedia.org/r/335594

Change 335594 abandoned by Dzahn:
aptrepo: disable autoconfigured EUI64 addresses

Reason:
It only affects precise, so we can ignore the issue on carbon until it is shut down (for example, use rsync -4 to work around it).

https://gerrit.wikimedia.org/r/335594
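
For context on the "autoconfigured EUI64 addresses" in this change: the address removed from carbon in the SAL entry above has the ff:fe pattern of a SLAAC address whose interface identifier is derived from the NIC's MAC address. The sketch below shows that derivation (modified EUI-64, RFC 4291 Appendix A). The MAC address used is not stated anywhere in the ticket; it was obtained by reversing the procedure on the logged address and should be read as an assumption.

```python
import ipaddress


def eui64_address(prefix, mac):
    """Build the SLAAC address for a /64 prefix from a 48-bit MAC address."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC such as 00:11:22:33:44:55")
    # Flip the universal/local bit of the first octet.
    octets[0] ^= 0x02
    # Insert ff:fe between the vendor half and the device half of the MAC.
    iid_bytes = octets[:3] + [0xFF, 0xFE] + octets[3:]
    iid = int.from_bytes(bytes(iid_bytes), "big")
    net = ipaddress.IPv6Network(prefix)
    return ipaddress.IPv6Address(int(net.network_address) | iid)


if __name__ == "__main__":
    # Assumed MAC, inferred from the address in the log entry.
    print(eui64_address("2620:0:861:1::/64", "78:2b:cb:09:0e:a0"))
```

Because such an address appears on the interface without being declared anywhere, firewall rules generated from the statically configured address do not cover it, which is consistent with the ferm failure described in the log.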

Change 336363 had a related patch set uploaded (by Dzahn):
netboot/partman: remove install2001

https://gerrit.wikimedia.org/r/336363

Change 336363 merged by Dzahn:
netboot/partman: remove install2001

https://gerrit.wikimedia.org/r/336363

Change 336959 had a related patch set uploaded (by Dzahn):
install/TFTP: use install1002 and install2002 as next-servers

https://gerrit.wikimedia.org/r/336959

Change 337084 had a related patch set uploaded (by Dzahn):
remove install1001/install2001 from site.pp

https://gerrit.wikimedia.org/r/337084

Change 336959 merged by Dzahn:
install/DHCP/TFTP: use install1002 and install2002 as next-servers

https://gerrit.wikimedia.org/r/336959

Change 337084 merged by Dzahn:
remove install1001/install2001 from site.pp

https://gerrit.wikimedia.org/r/337084

Change 337093 had a related patch set uploaded (by Dzahn):
remove install1001 and install2001, keep 2001 mgmt

https://gerrit.wikimedia.org/r/337093

Mentioned in SAL (#wikimedia-operations) [2017-02-10T21:18:27Z] <mutante> install1001, install2001 - revoke puppet certs, puppet node deactivate, delete salt keys (T84380, T132757)

Mentioned in SAL (#wikimedia-operations) [2017-02-10T21:27:47Z] <mutante> install1001, install2001 - removed from Icinga, shutting down (T84380, T132757)

Change 337093 merged by Dzahn:
remove install1001 and install2001, keep 2001 mgmt

https://gerrit.wikimedia.org/r/337093

this is done for:

TFTP

DHCP

webproxy

Only APT will keep pointing to eqiad alone for the moment.

Dzahn changed the task status from "Stalled" to "Open".Feb 10 2017, 10:24 PM
Dzahn changed the title from "Setup install server in codfw - tftp done, but not apt and other install services" to "Setup install server in codfw - tftp done, but not apt and other install services (now: DHCP, TFTP, webproxy done, just not APT)".
Dzahn added a comment.Mon, Feb 27, 5:41 PM

install2002 is up and running. APT data is synced over from install1002 by rsync/cron. Making APT HA is tracked in T158022.

therefore closing this as resolved now.

Dzahn closed this task as "Resolved".Mon, Feb 27, 5:42 PM
Dzahn removed a project: Patch-For-Review.