First thing we need in codfw is an install server. Installs will work from
eqiad, but the PXE part will be INCREDIBLY slow over the cross data center
links due to the 34ms+ latency. PXE sends one packet round trip at a time...
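To put a rough number on the latency concern, here is a minimal sketch of the arithmetic for a lock-step transfer (one data block per round trip, as in classic TFTP). It assumes the default 512-byte block size with no block-size negotiation or windowing; the 30 MiB image size and the 0.5 ms local round-trip time are made-up illustration values, not figures from this ticket.

```python
def lockstep_transfer_seconds(file_bytes, rtt_seconds, block_bytes=512):
    """Lower bound on transfer time when each block waits for its ACK."""
    # A lock-step TFTP transfer ends with a block shorter than block_bytes,
    # so a file that is an exact multiple still needs one final short block.
    blocks = file_bytes // block_bytes + 1
    return blocks * rtt_seconds


if __name__ == "__main__":
    image_bytes = 30 * 1024 * 1024  # hypothetical 30 MiB installer image
    for label, rtt in (("local (0.5 ms, assumed)", 0.0005),
                       ("cross-DC (34 ms)", 0.034)):
        minutes = lockstep_transfer_seconds(image_bytes, rtt) / 60
        print(f"{label}: {minutes:.1f} min")
```

Under these assumptions the 34 ms path needs about 61,000 round trips, on the order of half an hour for a single image, however much bandwidth the links have.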
The codfw routers will now forward all DHCP requests to carbon in eqiad. That's
fine; if/when we get a DHCP server set up in codfw we can change that, but
that's not urgent. Subnet definitions need to be added to dhcpd.conf (extra
points if we can generate them from network.pp).
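As an illustration of generating the subnet definitions rather than writing them by hand, the following sketch renders ISC dhcpd `subnet` stanzas from a name-to-CIDR mapping. The subnet name, the 192.0.2.0/24 documentation range, the next-server address, and the choice of the first usable host as the router are all assumptions for the example; the real values would have to come from network.pp.

```python
import ipaddress

# Hypothetical input; in practice this would be derived from network.pp.
EXAMPLE_SUBNETS = {
    "example-subnet-codfw": "192.0.2.0/24",
}


def dhcpd_subnet_stanza(name, cidr, next_server):
    """Render one dhcpd.conf subnet block for an IPv4 network."""
    net = ipaddress.ip_network(cidr)
    # Assumption: the gateway is the first usable address in the subnet.
    gateway = next(iter(net.hosts()))
    lines = [
        f"# {name}",
        f"subnet {net.network_address} netmask {net.netmask} {{",
        f"    option subnet-mask {net.netmask};",
        f"    option broadcast-address {net.broadcast_address};",
        f"    option routers {gateway};",
        f"    next-server {next_server};",
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    for subnet_name, subnet_cidr in EXAMPLE_SUBNETS.items():
        print(dhcpd_subnet_stanza(subnet_name, subnet_cidr, "192.0.2.10"))
```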
Once DHCP is set up for these subnets, bootstrap one suitable server that will
be the install server. This step will likely fail due to various problems;
think of firewall rules that need updating for codfw, Preseed files that need
to be adjusted for the new subnets, Puppet manifests that need to be adjusted
for the new $::site and IP ranges, etc. I've added the new ranges to network.pp
but nothing else yet. Please make that work as you go.
Once the codfw install server is installed, check it very carefully for
deviations from our normal, completely automated installs. Network configuration
may have failed, for example, leaving it without static config. /etc/hosts may
be weird, etc. We've found such issues on every install server we have
bootstrapped in a new data center so far. :)
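The two checks mentioned above could be scripted along these lines. This is only a sketch: it assumes a Debian-style /etc/network/interfaces, an interface called eth0, and that "weird" in /etc/hosts means the host's own FQDN being mapped to a loopback address. The functions take file contents as strings so they can be tried without touching a real host, and the hostname in the example is hypothetical.

```python
def has_static_config(interfaces_text, iface="eth0"):
    """True if the interfaces text declares `iface` as 'inet static'."""
    for line in interfaces_text.splitlines():
        words = line.split("#", 1)[0].split()  # drop comments, tokenize
        if words[:4] == ["iface", iface, "inet", "static"]:
            return True
    return False


def loopback_hosts_lines(hosts_text, fqdn):
    """Return /etc/hosts lines that map `fqdn` to a 127.x.x.x address."""
    suspicious = []
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0].startswith("127.") and fqdn in fields[1:]:
            suspicious.append(line)
    return suspicious


if __name__ == "__main__":
    with open("/etc/network/interfaces") as f:
        print("static config:", has_static_config(f.read()))
    with open("/etc/hosts") as f:
        # Hypothetical FQDN; substitute the host being checked.
        print(loopback_hosts_lines(f.read(), "install2001.example.org"))
```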
Then make sure PXE gets set up for other codfw installs. Although we should get
local APT repos etc. at some point, with multiple 10G links to eqiad and 34ms
latency, fetching packages from eqiad in the meantime shouldn't be bad at all.
Let me know if I can help, especially from the network side of things.
Good luck. ;)
--
Mark Bergsma <mark at wikimedia>
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation
Details
- Reference: rt8201
Status | Assigned | Task
---|---|---
Resolved | Joe | T154658 Prepare and improve the datacenter switchover procedure
Open | None | T156937 Provide cross-dc redundancy (active-active or active-passive) to all important misc services
Open | None | T84350 Setup basic infrastructure services in codfw
Resolved | Dzahn | T84380 Setup install server in codfw - tftp done, but not apt and other install services (now: DHCP, TFTP, webproxy done, just not APT)
Event Timeline
On Mon Aug 25 12:01:05 2014, mark wrote:
First thing we need in codfw is an install server. Installs will work from
eqiad, but the PXE part will be INCREDIBLY slow over the cross data center
links due to the 34ms+ latency. PXE sends one packet round trip at a time...
Doh. s/PXE/TFTP/ in the above. :)
--
Mark Bergsma <mark at wikimedia>
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation
Getting the new install server online is the master ticket linked to this one;
this is just tracking:
install os and get new install server online
--
Rob Halsell
Operations Engineer
Wikimedia Foundation, Inc.
E-Mail: <rhalsell at wikimedia>
mgmt and production (IPv4/6) allocated
updated install server module with https://gerrit.wikimedia.org/r/#/c/156210/
need to confirm the network switch port for install2001 before proceeding.
--
Rob Halsell
Operations Engineer
Wikimedia Foundation, Inc.
E-Mail: <rhalsell at wikimedia>
Subject changed from 'install os and get new install server online' to 'install2001: get os deployed' by robh
So the TFTP stuff is up and running, but not the actual apt items.
I'll be testing this tomorrow when I install bast2001, so stalling this until
then.
--
Rob Halsell
Operations Engineer
Wikimedia Foundation, Inc.
E-Mail: <rhalsell at wikimedia>
The basic tftp install server is running.
We'll need to add in more disks/replace disks for running the apt repo there,
but that can wait (since tftp locally was the vital part of starting this
process.)
--
Rob Halsell
Operations Engineer
Wikimedia Foundation, Inc.
E-Mail: <rhalsell at wikimedia>
meant to keep stalled, not resolve.
--
Rob Halsell
Operations Engineer
Wikimedia Foundation, Inc.
E-Mail: <rhalsell at wikimedia>
Subject changed from 'Setup install server in codfw' to 'Setup install server in codfw - tftp done, but not apt and other install services' by robh
Meanwhile, install2001 uses the same Puppet roles as install1001 and carbon, and DHCP is running now (done as part of T132757).
The routers have been configured to send eqiad DHCP requests to install1001 now; codfw has not been touched yet.
Change 333676 had a related patch set uploaded (by Dzahn):
aptrepo: setup rsync between 2 APT servers
Mentioned in SAL (#wikimedia-operations) [2017-01-26T03:50:12Z] <mutante> rsyncing apt.wikimedia.org data from carbon to install2001 (T84380)
Change 334241 had a related patch set uploaded (by Dzahn):
aptrepo: add cron to rsync APT data automatically
Change 335585 had a related patch set uploaded (by Dzahn):
installserver: add firewall hole for rsync also for IPv6
Change 335585 merged by Dzahn:
installserver: add firewall hole for rsync also for IPv6
Mentioned in SAL (#wikimedia-operations) [2017-02-02T02:02:09Z] <mutante> carbon - remove unmapped IPv6 address making ferm rules fail, use only the _mapped_ IP (ip addr del 2620:0:861:1:7a2b:cbff:fe09:ea0/64 dev eth0) (T84380 T132757)
Change 335594 had a related patch set uploaded (by Dzahn):
aptrepo: disable autoconfigured EUI64 addresses
Change 335594 abandoned by Dzahn:
aptrepo: disable autoconfigured EUI64 addresses
Reason:
it just affects precise so we can just ignore the issue on carbon until it's down. (rsync -4 to work around it for example)
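For reference, the autoconfigured address removed from carbon above can be recognized by the ff:fe marker that modified EUI-64 places in the middle of the interface identifier. The sketch below checks for that marker and recovers the MAC address it was derived from. It is a heuristic only: a manually assigned address could contain the same bytes by coincidence.

```python
import ipaddress


def is_eui64_address(addr):
    """True if the interface identifier carries the EUI-64 ff:fe marker."""
    packed = ipaddress.IPv6Address(addr).packed
    # Bytes 8..15 are the interface identifier; an identifier built from a
    # 48-bit MAC has 0xff 0xfe inserted at bytes 11 and 12.
    return packed[11] == 0xFF and packed[12] == 0xFE


def mac_from_eui64(addr):
    """Recover the MAC address from a modified EUI-64 IPv6 address."""
    p = ipaddress.IPv6Address(addr).packed
    # Undo the universal/local bit flip on the first byte, skip ff:fe.
    mac = bytes([p[8] ^ 0x02, p[9], p[10], p[13], p[14], p[15]])
    return ":".join(f"{b:02x}" for b in mac)


if __name__ == "__main__":
    address = "2620:0:861:1:7a2b:cbff:fe09:ea0"  # address from the SAL entry
    print(is_eui64_address(address), mac_from_eui64(address))
```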
Change 336363 had a related patch set uploaded (by Dzahn):
netboot/partman: remove install2001
Change 336959 had a related patch set uploaded (by Dzahn):
install/TFTP: use install1002 and install2002 as next-servers
Change 337084 had a related patch set uploaded (by Dzahn):
remove install1001/install2001 from site.pp
Change 336959 merged by Dzahn:
install/DHCP/TFTP: use install1002 and install2002 as next-servers
Change 337093 had a related patch set uploaded (by Dzahn):
remove install1001 and install2001, keep 2001 mgmt
Mentioned in SAL (#wikimedia-operations) [2017-02-10T21:18:27Z] <mutante> install1001, install2001 - revoke puppet certs, puppet node deactivate, delete salt keys (T84380, T132757)
Mentioned in SAL (#wikimedia-operations) [2017-02-10T21:27:47Z] <mutante> install1001, install2001 - removed from Icinga, shutting down (T84380, T132757)
This is done for:
- TFTP
- DHCP
- webproxy
Only APT will keep pointing to eqiad for the moment.
install2002 is up and running. APT data is synced over from install1002 by rsync/cron. Making APT HA is tracked in T158022.
therefore closing this as resolved now.
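A sync job like the rsync/cron arrangement described above could be put together roughly as follows. This sketch only builds the command line and a user-crontab entry; it does not run anything. The source module, destination path, and schedule are hypothetical, and the actual Puppet-managed job may look different. The `--ipv4` flag mirrors the `rsync -4` workaround mentioned earlier in this ticket.

```python
import shlex


def rsync_argv(source, dest, ipv4_only=True, delete=False):
    """Build an rsync command line as an argument list."""
    argv = ["rsync", "--archive"]
    if delete:
        # Assumption: mirroring deletions is wanted; off by default here.
        argv.append("--delete")
    if ipv4_only:
        argv.append("--ipv4")
    return argv + [source, dest]


def crontab_line(minute, hour, argv):
    """Format a daily user-crontab entry for the given command."""
    return f"{minute} {hour} * * * {shlex.join(argv)}"


if __name__ == "__main__":
    # Hypothetical rsync module and local path.
    cmd = rsync_argv("install1002.example.org::aptrepo/", "/srv/aptrepo/")
    print(crontab_line(15, 3, cmd))
```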