
Setup basic infrastructure services in codfw
Closed, Resolved · Public

Description

This ticket serves as a parent ticket for setting up some basic infrastructure (install server, recursive DNS, monitoring, etc.) we'll need in codfw.

Details

Reference
rt8183

Related Objects

Status | Assigned
Resolved | Joe
Resolved | LSobanski
Resolved | fgiunchedi
Resolved | Dzahn
Resolved | fgiunchedi
Resolved | RobH
Resolved | fgiunchedi
Resolved | RobH

Event Timeline

rtimport raised the priority of this task to Medium. Dec 18 2014, 2:10 AM
rtimport added a project: ops-core.
rtimport set Reference to rt8183.

Member ticket #8184 added by mark

Member ticket #8185 added by mark

Dependency by ticket #8186 added by mark

Dependency by ticket #8187 added by mark

Dependency by ticket #8188 added by mark

TL;DR: The codfw network is now mostly configured, and should be stable enough to do server installs. Please start deploying basic infrastructure services, and get data backed up & replicated in codfw (and out of Tampa), but DO NOT use codfw for user-facing traffic for some weeks to come. Also, do not use row D for anything.
I’ve configured all access switches in 4 Juniper “Virtual Chassis Fabric” stacks, one stack of 8 switches per row (much like eqiad, but with a different switch topology[1] and different switch models). Switches 2 and 7 of each row are 10G switches, and should only be used for high-traffic servers with 10G interfaces. These QFX5100 switches are the “spines”, and also act as the master & backup routing engines of the stack. The other switches are EX4300 GigE switches, much like the EX4200s we have in eqiad. Each stack is accessible using our standard hostnames, e.g. asw-b-codfw.mgmt.codfw.wmnet.
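To make the hostname convention concrete, here is a minimal Python sketch; the helper name and the row list are illustrative assumptions, not anything defined in this task:

```
# Hypothetical helper: build the standard management hostname
# for each access-switch stack, one stack per row (a-d).
ROWS = ("a", "b", "c", "d")

def stack_mgmt_hostname(row: str, site: str = "codfw") -> str:
    """Management hostname of the access-switch stack for a row."""
    return f"asw-{row}-{site}.mgmt.{site}.wmnet"

for row in ROWS:
    print(stack_mgmt_hostname(row))  # e.g. asw-b-codfw.mgmt.codfw.wmnet
```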
I’ve set up our standard subnets/VLANs, separated per row, like eqiad. The following exist now (name, VLAN ID, IPv4 subnet, IPv6 prefix):
- public1-a-codfw 2001 208.80.153.0/27 2620:0:860:1::/64
- public1-b-codfw 2002 208.80.153.32/27 2620:0:860:2::/64
- public1-c-codfw 2003 208.80.153.64/27 2620:0:860:3::/64
- public1-d-codfw 2004 208.80.153.96/27 2620:0:860:4::/64
- private1-a-codfw 2017 10.192.0.0/22 2620:0:860:101::/64
- private1-b-codfw 2018 10.192.16.0/22 2620:0:860:102::/64
- private1-c-codfw 2019 10.192.32.0/22 2620:0:860:103::/64
- private1-d-codfw 2020 10.192.48.0/22 2620:0:860:104::/64
(the gateway to use is always the first usable IP, i.e. the network address + 1)
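To illustrate that convention, a minimal Python sketch using only the standard-library ipaddress module; the private subnets are copied from the table above, and the same arithmetic applies to the public and IPv6 prefixes:

```
import ipaddress

# Subnets from the table above; the gateway convention is
# "network address + 1", i.e. the first usable host address.
SUBNETS = {
    "private1-a-codfw": "10.192.0.0/22",
    "private1-b-codfw": "10.192.16.0/22",
    "private1-c-codfw": "10.192.32.0/22",
    "private1-d-codfw": "10.192.48.0/22",
}

for name, prefix in SUBNETS.items():
    net = ipaddress.ip_network(prefix)
    gateway = net.network_address + 1  # e.g. 10.192.0.1 for row A
    print(f"{name}: network {net}, gateway {gateway}")
```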
The special-purpose VLANs (analytics, labs, sandbox, etc.) don’t exist yet; we’ll create those as we go / need them. They often need special configuration as well.
You’ll note that the public IPv4 ranges above were part of the old Tampa labs IPs, and the IPv6 prefixes part of the old Tampa public IPv6 IPs. I’ve removed the few public IPv6 addresses on the remaining Tampa servers to be able to reuse the same naming scheme as eqiad.
These VLANs have been defined on the switches as well, but none of the corresponding “interface ranges” exist yet, as there are no port members. Feel free to configure these yourself, or ping me if you need help.
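For anyone picking this up, a hedged sketch of what configuring such a range could look like, expressed as a small Python helper that emits candidate commands; the exact JunOS statement forms and port names are assumptions, not taken from this task, so verify them against the deployed JunOS version before committing anything:

```
# Hypothetical generator for the per-VLAN interface-range statements
# mentioned above. The JunOS "set" command forms are an assumption;
# check them on the switch before use.
def interface_range_cmds(vlan: str, first: str, last: str) -> list[str]:
    return [
        f"set interfaces interface-range {vlan} "
        f"member-range {first} to {last}",
        f"set interfaces interface-range {vlan} unit 0 "
        f"family ethernet-switching vlan members {vlan}",
    ]

for cmd in interface_range_cmds("private1-a-codfw", "ge-1/0/0", "ge-1/0/10"):
    print(cmd)
```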
I’ve also added these subnets to DNS and created a Gerrit patch set for network.pp in Puppet[2] - please review & merge the latter as you need it.
Some of the things that have not been configured on the codfw network yet:
- Loopback interface ACL
- Anything to do with external (IP Transit/peering) links, including BGP configuration, policy, ACLs. codfw right now still gets its transit via eqiad.
- Any configuration related to PyBal & LVS, including BGP & policy. We’ll do that soon though, as LVS servers will be needed fast.
- ACLs related to our more insecure subnets, such as those for Labs & sandbox. Once we add those subnets, we need to add the ACLs as well.
- Some generic configuration related to management & monitoring. I’ve been playing with JunOS SLAX scripts over the weekend to automate this more.
The management network is still in the same temporary state I set it up in last week; basically it’s currently routed in-band via cr1-codfw. You can access it from iron. See my previous mail for more details.
You’ll notice that the entire setup above is very similar to eqiad’s. One reason is that it has proven to work well; another is that we don’t have much time to experiment with alternatives. I would have liked to take a month to test the new equipment with alternative configurations, but given the time we’ve lost waiting for our first connectivity, the need (and cost) to get rid of Tampa, and the fact that I won’t have much time this week, I won’t do that.

HOWEVER, since we don’t have any equipment in codfw row D yet and we shouldn’t need it for some time, I would like to use row D for testing alternative configurations. Therefore, please don’t put anything in row D until we decide to use it.

I will also still be doing various changes/upgrades/failover tests etc. on the entire network in the next few weeks. That’s much easier with actual servers & traffic & stuff visibly breaking on the network anyway. :) But I’ll keep in mind that it’s somewhat in use, and will restrict any interruptions to a minimum; it’s a pretty redundant setup anyway. It should be fine to install things on, set up backups & replication etc., but codfw should NOT be used for anything user-impacting for the upcoming weeks, until we declare it fully ready for that.
We should now start setting up some servers for basic infrastructure in the new data center. First we should set up an install server, or at least the TFTP server part of it. (TFTP will work from eqiad, but will be INCREDIBLY slow.) I’ve created an RT ticket for that[3].
I’ve created a parent ticket in RT for these immediate tasks:
https://rt.wikimedia.org/Ticket/Display.html?id=8183
Please start working on them now. The way I think we’ll manage it: if you’re interested in working on a ticket and can do so in the upcoming week (i.e. without blocking others from getting this done), please Take it. Others can join in and help if they’re interested. Tickets that don’t get taken/worked on in the next few days, I’ll assign explicitly to people. Post updates on the RT ticket and, if appropriate, on the Ops list. I’ll keep adding more tickets in the next few days, so check back for more tasks. We may do the next round of tasks in Phabricator if we can make that work. :)
BTW, I’m not sure what the status of server DRACs & other LOM interfaces is. Some servers came from Tampa, some are new. We may need to put in some priority tickets for Papaul to get those configured correctly first; please ask Rob and Chris for help/details.
After we’ve got some basic infrastructure up, our next step should be to get all data backed up & replicated to codfw. We already have 10 new databases waiting for that, for example. Sean should hopefully be able to start that work roughly next week.
Thanks,
[1] https://office.wikimedia.org/wiki/File:Codfw_clos_stack.png
[2] https://gerrit.wikimedia.org/r/#/c/156090/
[3] https://rt.wikimedia.org/Ticket/Display.html?id=8184

Mark Bergsma <mark at wikimedia>
Lead Operations Architect
Director of Technical Operations
Wikimedia Foundation

Dependency on ticket #8225 added by akosiaris

Added codfw networks to the Icinga ferm rules on neon, which allowed SNMP traps to be received from there and made Puppet freshness monitoring work:
https://gerrit.wikimedia.org/r/#/c/156985/
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=install2001&service=Puppet+freshness
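Conceptually, that change amounts to a source-network allow rule for the monitoring traffic. A minimal Python sketch of the same decision, assuming the standard SNMP trap port (162/udp) and reusing the codfw private subnets listed earlier in this task; the real rule lives in the ferm config linked above:

```
import ipaddress

# codfw private subnets from the table earlier in this task; SNMP
# traps arrive on UDP port 162, which the ferm change permits.
CODFW_NETWORKS = [
    ipaddress.ip_network(n)
    for n in ("10.192.0.0/22", "10.192.16.0/22",
              "10.192.32.0/22", "10.192.48.0/22")
]

def accept_trap(source_ip: str) -> bool:
    """Mirror the firewall decision: accept only codfw sources."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in CODFW_NETWORKS)

print(accept_trap("10.192.16.5"))  # True: inside a codfw subnet
print(accept_trap("10.64.0.5"))    # False: not a codfw address
```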

Status changed from 'new' to 'open' by RT_System

Dependency by ticket #8187 deleted by dzahn

Dependency by ticket #8188 deleted by dzahn

Dependency by ticket #8186 deleted by dzahn

Dependency on ticket #8186 added by dzahn

Dependency on ticket #8187 added by dzahn

Dependency on ticket #8188 added by dzahn

Member ticket #8184 deleted by dzahn

Member ticket #8185 deleted by dzahn

Dependency on ticket #8184 added by dzahn

Dependency on ticket #8185 added by dzahn

faidon removed a subtask: Restricted Task. Sep 10 2015, 8:01 PM
faidon changed the visibility from "WMF-NDA (Project)" to "Public (No Login Required)".
faidon changed the edit policy from "WMF-NDA (Project)" to "All Users".
faidon set Security to None.
fgiunchedi claimed this task.
fgiunchedi subscribed.

Resolving as completed, since codfw has been up and running for years now.