cloud: introduce new edge network architecture for eqiad1 and codfw1dev
Closed, Resolved, Public

Description

We basically completed the work on T261724: cloudgw: evaluate / validate setup in codfw1dev, which means we are happy with the new edge network architecture.

The new edge network architecture is described in wikitech:

Now we need to put all the missing pieces in place to actually roll out the new model.

This is the parent task to track all this work.

Related Objects

Status    Subtype  Assigned          Task
Resolved           None
Resolved           aborrero
Resolved           Papaul
Resolved           RobH
Resolved           aborrero
Resolved           aborrero
Resolved           aborrero
Resolved           fnegri
Resolved           fnegri
Resolved           fnegri
Resolved           aborrero
Resolved           Andrew
Invalid            aborrero
Resolved  Request  VRiley-WMF
Resolved           fnegri
Resolved           aborrero
Resolved           aborrero
Resolved           MoritzMuehlenhoff
Resolved           ayounsi
Resolved           aborrero
Resolved           aborrero

Event Timeline

aborrero added a subtask: Unknown Object (Task). Dec 22 2020, 2:26 PM
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.
aborrero added a subtask: Unknown Object (Task). Dec 22 2020, 2:53 PM
aborrero added a subtask: Unknown Object (Task). Jan 19 2021, 10:30 AM
faidon changed the status of subtask Unknown Object (Task) from Open to Stalled. Jan 19 2021, 11:00 AM
Papaul closed subtask Unknown Object (Task) as Resolved. Jan 21 2021, 5:28 PM
Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Mar 9 2021, 9:39 PM

Change 675556 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudgw: introduce eqiad1 service implementation

https://gerrit.wikimedia.org/r/675556

Change 675760 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: neutron: disable conntrackd

https://gerrit.wikimedia.org/r/675760

Change 675760 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: neutron: disable conntrackd

https://gerrit.wikimedia.org/r/675760

Change 675556 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: introduce eqiad1 service implementation

https://gerrit.wikimedia.org/r/675556

Change 681028 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: clodugw: conntrackd: resolve peer names

https://gerrit.wikimedia.org/r/681028

Change 681028 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: clodugw: conntrackd: resolve peer names

https://gerrit.wikimedia.org/r/681028

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

['cloudgw1001.eqiad.wmnet', 'cloudgw1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104200923_aborrero_22725.log.

Completed auto-reimage of hosts:

['cloudgw1001.eqiad.wmnet', 'cloudgw1002.eqiad.wmnet']

and were ALL successful.

Change 681322 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimediacloud.org: prepare DNS records for cloudgw @ eqiad

https://gerrit.wikimedia.org/r/681322

We scheduled the migration for 6th May 11:30 UTC.

Change 681322 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wikimediacloud.org: prepare DNS records for cloudgw @ eqiad

https://gerrit.wikimedia.org/r/681322

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

['cloudgw1001.eqiad.wmnet', 'cloudgw1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104281047_aborrero_10759.log.

Completed auto-reimage of hosts:

['cloudgw1001.eqiad.wmnet', 'cloudgw1002.eqiad.wmnet']

and were ALL successful.

Change 683268 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: neutron: topology changes for cloudgw

https://gerrit.wikimedia.org/r/683268

I started preparing an operation screenplay:

  • icinga downtime labs* cloud* etc
  • review routing for 185.15.56.236/30 (cloud-gw-transport-eqiad -- vlan 1107) [cloudgw <-> neutron]
    • core router should route this to cloudsw as next hop
    • cloudsw should route this to cloudgw VIP (185.15.56.244/29)
  • review routing for 185.15.56.240/29 (cloud-instance-transport1-b-eqiad -- vlan 1120) [cloudsw <-> cloudgw]
    • core router should route this to cloudsw as next hop
    • cloudsw has addresses in this subnet:
      • 185.15.56.241/32 (cloudsw1-d5-eqiad) vrrp-gw-1120.eqiad1.wikimediacloud.org
      • 185.15.56.241/32 (cloudsw1-c8-eqiad) vrrp-gw-1120.eqiad1.wikimediacloud.org
      • 185.15.56.242/29 (cloudsw1-c8-eqiad) irb-1120.cloudsw1-c8-eqiad.eqiad1.wikimediacloud.org
      • 185.15.56.243/29 (cloudsw1-d5-eqiad) irb-1120.cloudsw1-d5-eqiad.eqiad1.wikimediacloud.org
  • review cloudnet vlan trunk
    • enable vlans 1105 (existing), 1107 (new) and 1120 (being dropped; leave it for later cleanup)
  • neutron ops:
root@cloudcontrol1005:~# openstack router show cloudinstances2b-gw -f shell | grep external_gateway_info
external_gateway_info="{'network_id': '5c9ee953-3a19-4e84-be0f-069b5da75123', 'external_fixed_ips': [{'subnet_id': '7c6bcc12-212f-44c2-9954-5c55002ee371', 'ip_address': '185.15.56.244'}], 'enable_snat': True}"

root@cloudcontrol1005:~# openstack subnet create --network wan-transport-eqiad --gateway 185.15.56.237 --no-dhcp --subnet-range 185.15.56.236/30 cloud-gw-transport-eqiad

root@cloudcontrol1005:~# openstack router set --external-gateway wan-transport-eqiad --fixed-ip subnet=cloud-gw-transport-eqiad,ip-address=185.15.56.238 cloudinstances2b-gw

root@cloudcontrol1005:~# openstack subnet delete cloud-instances-transport1-b-eqiad

root@cloudcontrol1005:~# openstack router set --disable-snat cloudinstances2b-gw --external-gateway wan-transport-eqiad

root@cloudcontrol1005:~# openstack router show cloudinstances2b-gw -f shell | grep external_gateway_info
[... the output should now mention 185.15.56.238 and show 'enable_snat': False ...]
  • run puppet on cloudnet servers, verify bridges, interfaces, routing and iptables ruleset:
    • brctl show
    • ip -br a
    • ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a ip -br a
    • ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a ip r
    • ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a iptables-save -c | less
  • in case of rollback, undo the changes in reverse order
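
The address plan in the screenplay above can be sanity-checked before touching any router. The following is only a sketch using Python's standard `ipaddress` module: the subnets and addresses are copied from this task, while the `PLAN` layout and the `misplaced_addresses` helper are hypothetical names introduced for illustration.

```python
import ipaddress

# Subnets taken from the screenplay in this task.
NEUTRON_TRANSPORT = ipaddress.ip_network("185.15.56.236/30")  # vlan 1107, cloudgw <-> neutron
CLOUDSW_TRANSPORT = ipaddress.ip_network("185.15.56.240/29")  # vlan 1120, cloudsw <-> cloudgw

# Addresses mentioned in this task, grouped by the subnet they should live in.
# The grouping itself is an assumption made for this sketch.
PLAN = {
    NEUTRON_TRANSPORT: [
        "185.15.56.237",  # subnet gateway given to `openstack subnet create`
        "185.15.56.238",  # neutron router fixed IP
    ],
    CLOUDSW_TRANSPORT: [
        "185.15.56.241",  # cloudsw VRRP address
        "185.15.56.242",  # irb-1120 on cloudsw1-c8-eqiad
        "185.15.56.243",  # irb-1120 on cloudsw1-d5-eqiad
        "185.15.56.244",  # cloudgw VIP
    ],
}


def misplaced_addresses(plan):
    """Return (address, subnet) pairs where the address is not a usable host of the subnet."""
    problems = []
    for subnet, addresses in plan.items():
        usable = set(subnet.hosts())  # excludes network and broadcast addresses
        for addr in addresses:
            if ipaddress.ip_address(addr) not in usable:
                problems.append((addr, str(subnet)))
    return problems


if __name__ == "__main__":
    print("overlapping subnets:", NEUTRON_TRANSPORT.overlaps(CLOUDSW_TRANSPORT))
    print("misplaced addresses:", misplaced_addresses(PLAN))
```

A check like this only validates the arithmetic of the plan; it says nothing about what is actually configured on the core router, cloudsw or cloudgw.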

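The last verification step of the neutron ops (checking `external_gateway_info`) could also be automated. This is a minimal sketch that assumes the `-f shell` output keeps the shape shown in the screenplay, that is, a double-quoted Python-literal dict; other client versions may print the value differently. The function names are hypothetical.

```python
import ast


def parse_gateway_info(line):
    """Parse one `external_gateway_info="{...}"` line into a dict."""
    key, _, value = line.strip().partition("=")
    if key != "external_gateway_info":
        raise ValueError(f"unexpected key: {key!r}")
    # Assumption: the value is a Python-literal dict wrapped in double quotes.
    return ast.literal_eval(value.strip('"'))


def gateway_is_migrated(info, expected_ip="185.15.56.238"):
    """True if the router uses the new transport address and has SNAT disabled."""
    ips = [fixed["ip_address"] for fixed in info.get("external_fixed_ips", [])]
    return ips == [expected_ip] and info.get("enable_snat") is False


if __name__ == "__main__":
    # Pre-migration line as shown in the screenplay.
    before = (
        "external_gateway_info=\"{'network_id': '5c9ee953-3a19-4e84-be0f-069b5da75123', "
        "'external_fixed_ips': [{'subnet_id': '7c6bcc12-212f-44c2-9954-5c55002ee371', "
        "'ip_address': '185.15.56.244'}], 'enable_snat': True}\""
    )
    print(gateway_is_migrated(parse_gateway_info(before)))
```

Run against the pre-migration output, the check is expected to fail; it should only pass once the router has been moved to 185.15.56.238 with SNAT disabled.
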
I'm working on this checklist:

---
- envvars:
  - FLOATING_IP_VM: "dev.toolforge.org"
    TOOLFORGE_BASTION: "login.toolforge.org"
    NO_FLOATING_VM: "tools-k8s-worker-30.tools.eqiad1.wikimedia.cloud"
    TOOLS_PUPPETMASTER: "tools-puppetmaster-02.tools.eqiad1.wikimedia.cloud"
    TOOLSBETA_PUPPETMASTER: "toolsbeta-puppetmaster-04.toolsbeta.eqiad1.wikimedia.cloud"
---
# cloudgw after-migration checklist!
- name: basic ping to cloudgw addresses (raw addresses)
  tests:
    # this is cloudgw1001.eqiad1.wikimediacloud.org
    - cmd: timeout -k5s 10s ping -c1 185.15.56.245 >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
    # this is cloudgw1002.eqiad1.wikimediacloud.org
    - cmd: timeout -k5s 10s ping -c1 185.15.56.246 >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
    # this is virt.cloudgw.eqiad1.wikimediacloud.org
    - cmd: timeout -k5s 10s ping -c1 185.15.56.237 >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
    # this wan.cloudgw.eqiad1.wikimediacloud.org, before that, it is neutron
    - cmd: timeout -k5s 10s ping -c1 185.15.56.244 >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""

- name: basic ping to cloudgw addresses (DNS names)
  tests:
    - cmd: timeout -k5s 10s ping -c1 cloudgw1001.eqiad1.wikimediacloud.org >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
    - cmd: timeout -k5s 10s ping -c1 cloudgw1002.eqiad1.wikimediacloud.org >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
    - cmd: timeout -k5s 10s ping -c1 virt.cloudgw.eqiad1.wikimediacloud.org >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
    # this one wont be available until the migration completes:
    - cmd: timeout -k5s 10s ping -c1 wan.cloudgw.eqiad1.wikimediacloud.org >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""

- name: basic ping to neutron addresses (DNS name)
  tests:
    - cmd: timeout -k5s 10s ping -c1 cloudinstances2b-gw.openstack.eqiad1.wikimediacloud.org >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""

- name: basic ping to neutron addresses (raw address)
  tests:
    - cmd: timeout -k5s 10s ping -c1 185.15.56.238 >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""

- name: VM (no floating IP) contacting the internet gets NAT'd using routing_source_ip
  tests:
    - cmd: ssh $NO_FLOATING_VM "curl -s ifconfig.me ; echo "
      # this is routing_source_ip
      stdout: "185.15.56.1"
      retcode: 0
      stderr: ""

- name: VM (no floating IP) contacting an address covered by dmz_cidr doesn't get NAT'd
  tests:
    - cmd: ssh $NO_FLOATING_VM "curl -Is https://es.wikipedia.org | grep x-client-ip"
      # this is the internal VM address
      stdout: "x-client-ip: 172.16.0.241"
      retcode: 0
      stderr: ""

- name: VM (using floating IP) isn't affected by either routing_source_ip or dmz_cidr
  tests:
    - cmd: ssh $FLOATING_IP_VM "curl -s ifconfig.me ; echo"
      # this is the VM floating IP address
      stdout: "185.15.56.50"
      retcode: 0
      stderr: ""
    - cmd: ssh $FLOATING_IP_VM "curl -Is https://es.wikipedia.org | grep x-client-ip"
      # this is the VM private address, after the migration, it should be the floating IP
      stdout: "x-client-ip: 185.15.56.50"
      retcode: 0
      stderr: ""

- name: VM (no floating IP) can contact auth DNS server
  tests:
    - cmd: ssh $NO_FLOATING_VM "dig +short toolforge.org @208.80.154.11"
      # this the A apex record in the toolforge.org DNS domain zone
      stdout: "185.15.56.11"
      retcode: 0
      stderr: ""

- name: VM (no floating IP) can contact recursor DNS server
  tests:
    - cmd: ssh $NO_FLOATING_VM "dig +short www.basket.com @208.80.154.143 | wc -l"
      # this a somewhat random IPv4 on the internet, so only check that we get "something"
      stdout: "1"
      retcode: 0
      stderr: ""

- name: VM (using floating IP) can contact auth DNS server
  tests:
    - cmd: ssh $FLOATING_IP_VM "dig +short toolforge.org @208.80.154.11"
      # this the A apex record in the toolforge.org DNS domain zone
      stdout: "185.15.56.11"
      retcode: 0
      stderr: ""

- name: VM (using floating IP) can contact recursor DNS server
  tests:
    - cmd: ssh $FLOATING_IP_VM "dig +short www.basket.com @208.80.154.143 | wc -l"
      # this a somewhat random IPv4 on the internet, so only check that we get "something"
      stdout: "1"
      retcode: 0
      stderr: ""

- name: VM (using floating IP) can contact LDAP server
  tests:
    - cmd: ssh $FLOATING_IP_VM 'ldapsearch -x whatever | grep -q ^"# numResponses"'
      # grep is happy, we are too
      stdout: ""
      retcode: 0
      stderr: ""

- name: VM (not using floating IP) can contact LDAP server
  tests:
    - cmd: ssh $NO_FLOATING_VM 'ldapsearch -x whatever | grep -q ^"# numResponses"'
      # grep is happy, we are too
      stdout: ""
      retcode: 0
      stderr: ""

- name: VM (using floating IP) can connect to wikireplicas
  tests:
    - cmd: ssh $FLOATING_IP_VM 'sudo -iu tools.arturo-test-tool sql enwiki "select * from page limit 2;" | grep page_id | wc -l'
      stdout: "1"
      retcode: 0
      stderr: ""

- name: Toolforge webservice can be accessed from the internet
  tests:
    - cmd: curl -f --no-progress-meter https://network-tests.toolforge.org/files/1MB.bin --output - | file -
      stdout: "/dev/stdin: data"
      retcode: 0
      stderr: ""

- name: Toolforge bastions see herald file on project NFS
  tests:
    - cmd: timeout -k5s 60s ssh $FLOATING_IP_VM "file /mnt/nfs/labstore-secondary-tools-project/herald"
      stdout: "/mnt/nfs/labstore-secondary-tools-project/herald: ASCII text"
      retcode: 0
      stderr: ""
    - cmd: timeout -k5s 60s ssh $TOOLFORGE_BASTION "file /mnt/nfs/labstore-secondary-tools-project/herald"
      stdout: "/mnt/nfs/labstore-secondary-tools-project/herald: ASCII text"
      retcode: 0
      stderr: ""

- name: VM (using floating IP) can contact openstack API
  tests:
    - cmd: ssh $FLOATING_IP_VM 'curl -s http://openstack.eqiad1.wikimediacloud.org:5000/v3 | grep -qo identity'
      # grep is happy, we are too
      stdout: ""
      retcode: 0
      stderr: ""

- name: VM (no floating IP) can contact openstack API
  tests:
    - cmd: ssh $NO_FLOATING_VM 'curl -s http://openstack.eqiad1.wikimediacloud.org:5000/v3 | grep -qo identity'
      # grep is happy, we are too
      stdout: ""
      retcode: 0
      stderr: ""

- name: puppetmasters can sync git tree
  tests:
    - cmd: ssh $TOOLS_PUPPETMASTER 'sudo git-sync-upstream 2>&1 | grep -q Up-to-date'
      # grep is happy, we are too
      stdout: ""
      retcode: 0
      stderr: ""
    - cmd: ssh $TOOLSBETA_PUPPETMASTER 'sudo git-sync-upstream 2>&1 | grep -q Up-to-date'
      # grep is happy, we are too
      stdout: ""
      retcode: 0
      stderr: ""

- name: VM (using floating IP) can read dumps NFS
  tests:
    - cmd: ssh $FLOATING_IP_VM 'file /mnt/nfs/dumps-labstore1006.wikimedia.org/index.html | grep -q HTML'
      stdout: ""
      retcode: 0
      stderr: ""

- name: VM (no floating IP) can read dumps NFS
  tests:
    - cmd: ssh $NO_FLOATING_VM 'file /mnt/nfs/dumps-labstore1006.wikimedia.org/index.html | grep -q HTML'
      stdout: ""
      retcode: 0
      stderr: ""
The checklist is meant to be run from one's laptop by the Python script at https://github.com/aborrero/sys-avenger/blob/master/src/cmd-checklist-runner.py.
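
To make the semantics of a checklist entry concrete, here is a minimal sketch of how a single test could be evaluated. It is not the actual logic of cmd-checklist-runner.py: the `run_test` function, the whitespace-stripping comparison and the way `envvars` are injected into the environment are all assumptions made for illustration.

```python
import os
import subprocess


def run_test(test, envvars=None):
    """Run one checklist test and return a list of human-readable failures.

    `test` is a dict with a `cmd` key and optional `stdout`, `stderr` and
    `retcode` expectations, mirroring the entries in the checklist.
    """
    # Assumption: envvars are exported so the shell can expand e.g. $NO_FLOATING_VM.
    env = dict(os.environ, **(envvars or {}))
    proc = subprocess.run(
        test["cmd"], shell=True, env=env, capture_output=True, text=True
    )

    failures = []
    expected_retcode = test.get("retcode", 0)
    if proc.returncode != expected_retcode:
        failures.append(f"retcode: expected {expected_retcode}, got {proc.returncode}")
    for stream in ("stdout", "stderr"):
        if stream not in test:
            continue
        # Assumption: surrounding whitespace (trailing newlines) is ignored.
        actual = getattr(proc, stream).strip()
        if actual != test[stream]:
            failures.append(f"{stream}: expected {test[stream]!r}, got {actual!r}")
    return failures


if __name__ == "__main__":
    # Harmless local stand-in for a checklist entry; GREETING is a made-up variable.
    demo = {"cmd": "echo $GREETING", "stdout": "hello", "retcode": 0, "stderr": ""}
    print(run_test(demo, {"GREETING": "hello"}))
```

With this shape, a test passes only when the exit code, stdout and stderr all match, which is why most checks above pipe into `grep -q` and expect empty output.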

I plan to keep adding more tests in the next few days: NFS, wiki replicas, simple ICMP tests, openstack API, etc. Will probably collect some ideas from @Bstorm, @Andrew, @dcaro and @ayounsi

Mentioned in SAL (#wikimedia-cloud) [2021-05-03T10:24:01Z] <arturo> created PTR records for cloudgw100{1,2}.eqiad1.wikimediacloud.org. (T270704)

Both checklists are mostly ready:

  • pre-migration (neutron as edge router, with hacks enabled): P15709
  • post-migration (cloudgw as edge router, neutron hacks disabled): P15659

Change 684353 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimediacloud.org: add cloudsw addresses in vlan 1120

https://gerrit.wikimedia.org/r/684353

Change 684353 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wikimediacloud.org: add cloudsw addresses in vlan 1120

https://gerrit.wikimedia.org/r/684353

Change 684864 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimediacloud.org: update names for cloudgw migration

https://gerrit.wikimedia.org/r/684864

Change 685379 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: introduce icinga checks

https://gerrit.wikimedia.org/r/685379

Change 685405 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: enable notifications

https://gerrit.wikimedia.org/r/685405

Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:06:11Z] <aborrero@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 105 hosts with reason: T270704

Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:06:48Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 105 hosts with reason: T270704

Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:06:55Z] <aborrero@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: T270704

Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:07:03Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: T270704

Mentioned in SAL (#wikimedia-cloud) [2021-05-06T15:31:34Z] <arturo> about to migrating CloudVPS network to the cloudgw architecture T270704

Change 683268 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: neutron: topology changes for cloudgw

https://gerrit.wikimedia.org/r/683268

Change 684864 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wikimediacloud.org: update names for cloudgw migration

https://gerrit.wikimedia.org/r/684864

Change 685405 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: enable notifications

https://gerrit.wikimedia.org/r/685405

Change 685379 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: introduce icinga checks

https://gerrit.wikimedia.org/r/685379

Change 686457 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: cleanup neutron hacks

https://gerrit.wikimedia.org/r/686457

Change 686457 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: cleanup neutron hacks

https://gerrit.wikimedia.org/r/686457

Change 688359 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] hieradata: drop unused neutron configuration for dmz_cidr

https://gerrit.wikimedia.org/r/688359

Change 688359 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] hieradata: drop unused neutron configuration for dmz_cidr

https://gerrit.wikimedia.org/r/688359

Change 688365 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: factorize NAT template file into base profile

https://gerrit.wikimedia.org/r/688365

Change 688366 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: cleanup unused all_phy_nics parameter

https://gerrit.wikimedia.org/r/688366

Change 688367 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: don't use concatenation with CIDR

https://gerrit.wikimedia.org/r/688367

Change 688365 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: factorize NAT template file into base profile

https://gerrit.wikimedia.org/r/688365

Change 688366 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: cleanup unused all_phy_nics parameter

https://gerrit.wikimedia.org/r/688366

Change 688367 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: don't use concatenation with CIDR

https://gerrit.wikimedia.org/r/688367

Change 689831 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: neutron: more cloudgw cleanups

https://gerrit.wikimedia.org/r/689831

Change 689831 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: neutron: more cloudgw cleanups

https://gerrit.wikimedia.org/r/689831

RobH changed the status of subtask Unknown Object (Task) from Stalled to Open. Oct 25 2021, 10:31 PM
RobH closed subtask Unknown Object (Task) as Declined. Jul 26 2022, 11:15 PM