cloud: introduce new edge network architecture for eqiad1 and codfw1dev
Closed, Resolved (Public)

Description

We basically completed the work on T261724: cloudgw: evaluate / validate setup in codfw1dev, which means we are happy with the new edge network architecture.

The new edge network architecture is described in wikitech:

Now we need to put all the missing pieces in place to actually deploy the new model.

This is the parent task to track all this work.

Event Timeline

aborrero added a subtask: Unknown Object (Task). Dec 22 2020, 2:26 PM
aborrero triaged this task as Medium priority. Dec 22 2020, 2:28 PM
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.
aborrero added a subtask: Unknown Object (Task). Dec 22 2020, 2:53 PM
aborrero added a subtask: Unknown Object (Task). Jan 19 2021, 10:30 AM
faidon changed the status of subtask Unknown Object (Task) from Open to Stalled. Jan 19 2021, 11:00 AM
Papaul closed subtask Unknown Object (Task) as Resolved. Jan 21 2021, 5:28 PM
Jclark-ctr closed subtask Unknown Object (Task) as Resolved. Mar 9 2021, 9:39 PM

Change 675556 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] cloudgw: introduce eqiad1 service implementation

https://gerrit.wikimedia.org/r/675556

Change 675760 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):
[operations/puppet@production] openstack: neutron: disable conntrackd

https://gerrit.wikimedia.org/r/675760

Change 675760 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] openstack: neutron: disable conntrackd

https://gerrit.wikimedia.org/r/675760

Change 675556 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: introduce eqiad1 service implementation

https://gerrit.wikimedia.org/r/675556

Change 681028 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: clodugw: conntrackd: resolve peer names

https://gerrit.wikimedia.org/r/681028

Change 681028 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: clodugw: conntrackd: resolve peer names

https://gerrit.wikimedia.org/r/681028

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

['cloudgw1001.eqiad.wmnet', 'cloudgw1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104200923_aborrero_22725.log.

Completed auto-reimage of hosts:

['cloudgw1001.eqiad.wmnet', 'cloudgw1002.eqiad.wmnet']

and were ALL successful.

Change 681322 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimediacloud.org: prepare DNS records for cloudgw @ eqiad

https://gerrit.wikimedia.org/r/681322

We scheduled the migration for 6th May 11:30 UTC.

Change 681322 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wikimediacloud.org: prepare DNS records for cloudgw @ eqiad

https://gerrit.wikimedia.org/r/681322

Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts:

['cloudgw1001.eqiad.wmnet', 'cloudgw1002.eqiad.wmnet']

The log can be found in /var/log/wmf-auto-reimage/202104281047_aborrero_10759.log.

Completed auto-reimage of hosts:

['cloudgw1001.eqiad.wmnet', 'cloudgw1002.eqiad.wmnet']

and were ALL successful.

Change 683268 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: neutron: topology changes for cloudgw

https://gerrit.wikimedia.org/r/683268

I started preparing an operation screenplay:

  • icinga downtime labs* cloud* etc
  • review routing for 185.15.56.236/30 (cloud-gw-transport-eqiad -- vlan 1107) [cloudgw <-> neutron]
    • core router should route this to cloudsw as next hop
    • cloudsw should route this to cloudgw VIP (185.15.56.244/29)
  • review routing for 185.15.56.240/29 (cloud-instance-transport1-b-eqiad -- vlan 1120) [cloudsw <-> cloudgw]
    • core router should route this to cloudsw as next hop
    • cloudsw has addresses in this subnet:
      • 185.15.56.241/32 (cloudsw1-d5-eqiad) vrrp-gw-1120.eqiad1.wikimediacloud.org
      • 185.15.56.241/32 (cloudsw1-c8-eqiad) vrrp-gw-1120.eqiad1.wikimediacloud.org
      • 185.15.56.242/29 (cloudsw1-c8-eqiad) irb-1120.cloudsw1-c8-eqiad.eqiad1.wikimediacloud.org
      • 185.15.56.243/29 (cloudsw1-d5-eqiad) irb-1120.cloudsw1-d5-eqiad.eqiad1.wikimediacloud.org
  • review cloudnet vlan trunk
    • enable vlans 1105 (existing), 1107 (new), and 1120 (being dropped; leave it for later cleanup)
  • neutron ops:
root@cloudcontrol1005:~# openstack router show cloudinstances2b-gw -f shell | grep external_gateway_info
external_gateway_info="{'network_id': '5c9ee953-3a19-4e84-be0f-069b5da75123', 'external_fixed_ips': [{'subnet_id': '7c6bcc12-212f-44c2-9954-5c55002ee371', 'ip_address': '185.15.56.244'}], 'enable_snat': True}"

root@cloudcontrol1005:~# openstack subnet create --network wan-transport-eqiad --gateway 185.15.56.237 --no-dhcp --subnet-range 185.15.56.236/30 cloud-gw-transport-eqiad

root@cloudcontrol1005:~# openstack router set --external-gateway wan-transport-eqiad --fixed-ip subnet=cloud-gw-transport-eqiad,ip-address=185.15.56.238 cloudinstances2b-gw

root@cloudcontrol1005:~# openstack subnet delete cloud-instances-transport1-b-eqiad

root@cloudcontrol1005:~# openstack router set --disable-snat cloudinstances2b-gw --external-gateway wan-transport-eqiad

root@cloudcontrol1005:~# openstack router show cloudinstances2b-gw -f shell | grep external_gateway_info
[... the output should now mention 185.15.56.238 and show 'enable_snat': False ...]
  • run puppet on cloudnet servers, verify bridges, interfaces, routing and iptables ruleset:
    • brctl show
    • ip -br a
    • ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a ip -br a
    • ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a ip r
    • ip netns exec qrouter-d93771ba-2711-4f88-804a-8df6fd03978a iptables-save -c | less
  • in case of rollback, undo the changes in reverse order
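
Two of the review steps above are mechanical enough to sketch in code: every address in the screenplay should be a usable host in its subnet, and the final `openstack router show` output should list 185.15.56.238 with SNAT disabled. The following is only an illustration under assumptions: the function names `unusable_addresses` and `gateway_problems` are hypothetical, and the parser assumes the `-f shell` output keeps the Python-literal form shown above.

```python
import ast
import ipaddress

# Addresses named in the screenplay, keyed by the subnet they should live in.
ADDRESS_PLAN = {
    "185.15.56.236/30": ["185.15.56.237", "185.15.56.238"],
    "185.15.56.240/29": [
        "185.15.56.241",
        "185.15.56.242",
        "185.15.56.243",
        "185.15.56.244",
    ],
}


def unusable_addresses(plan):
    """Return (subnet, address) pairs where the address is not a usable host."""
    problems = []
    for cidr, addresses in plan.items():
        net = ipaddress.ip_network(cidr)
        reserved = (net.network_address, net.broadcast_address)
        for text in addresses:
            addr = ipaddress.ip_address(text)
            if addr not in net or addr in reserved:
                problems.append((cidr, text))
    return problems


def gateway_problems(shell_line, expected_ip="185.15.56.238"):
    """Check one `external_gateway_info="..."` line (hypothetical helper).

    Assumes the value is a Python-literal dict wrapped in double quotes,
    as in the output pasted in the screenplay.
    """
    _, _, quoted = shell_line.partition("=")
    info = ast.literal_eval(quoted.strip().strip('"'))
    ips = [fixed["ip_address"] for fixed in info["external_fixed_ips"]]
    problems = []
    if ips != [expected_ip]:
        problems.append(f"unexpected gateway addresses: {ips}")
    if info["enable_snat"]:
        problems.append("SNAT is still enabled")
    return problems
```

Both functions return an empty list when everything matches, so they could be run before and after the neutron operations to confirm the router moved from 185.15.56.244 to 185.15.56.238.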

I'm working on this checklist

---
- envvars:
    - FLOATING_IP_VM: "dev.toolforge.org"
      TOOLFORGE_BASTION: "login.toolforge.org"
      NO_FLOATING_VM: "tools-k8s-worker-30.tools.eqiad1.wikimedia.cloud"
      TOOLS_PUPPETMASTER: "tools-puppetmaster-02.tools.eqiad1.wikimedia.cloud"
      TOOLSBETA_PUPPETMASTER: "toolsbeta-puppetmaster-04.toolsbeta.eqiad1.wikimedia.cloud"
---
# cloudgw after-migration checklist!
- name: basic ping to cloudgw addresses (raw addresses)
  tests:
    # this is cloudgw1001.eqiad1.wikimediacloud.org
    - cmd: timeout -k5s 10s ping -c1 185.15.56.245 >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
    # this is cloudgw1002.eqiad1.wikimediacloud.org
    - cmd: timeout -k5s 10s ping -c1 185.15.56.246 >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
    # this is virt.cloudgw.eqiad1.wikimediacloud.org
    - cmd: timeout -k5s 10s ping -c1 185.15.56.237 >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
    # this is wan.cloudgw.eqiad1.wikimediacloud.org; before the migration, this address belongs to neutron
    - cmd: timeout -k5s 10s ping -c1 185.15.56.244 >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""

- name: basic ping to cloudgw addresses (DNS names)
  tests:
    - cmd: timeout -k5s 10s ping -c1 cloudgw1001.eqiad1.wikimediacloud.org >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
    - cmd: timeout -k5s 10s ping -c1 cloudgw1002.eqiad1.wikimediacloud.org >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
    - cmd: timeout -k5s 10s ping -c1 virt.cloudgw.eqiad1.wikimediacloud.org >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
    # this one won't be available until the migration completes:
    - cmd: timeout -k5s 10s ping -c1 wan.cloudgw.eqiad1.wikimediacloud.org >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""

- name: basic ping to neutron addresses (DNS name)
  tests:
    - cmd: timeout -k5s 10s ping -c1 cloudinstances2b-gw.openstack.eqiad1.wikimediacloud.org >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""

- name: basic ping to neutron addresses (raw address)
  tests:
    - cmd: timeout -k5s 10s ping -c1 185.15.56.238 >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""

- name: VM (no floating IP) contacting the internet gets NAT'd using routing_source_ip
  tests:
    - cmd: ssh $NO_FLOATING_VM "curl -s ifconfig.me ; echo "
      # this is routing_source_ip
      stdout: "185.15.56.1"
      retcode: 0
      stderr: ""

- name: VM (no floating IP) contacting an address covered by dmz_cidr doesn't get NAT'd
  tests:
    - cmd: ssh $NO_FLOATING_VM "curl -Is https://es.wikipedia.org | grep x-client-ip"
      # this is the internal VM address
      stdout: "x-client-ip: 172.16.0.241"
      retcode: 0
      stderr: ""

- name: VM (using floating IP) isn't affected by either routing_source_ip or dmz_cidr
  tests:
    - cmd: ssh $FLOATING_IP_VM "curl -s ifconfig.me ; echo"
      # this is the VM floating IP address
      stdout: "185.15.56.50"
      retcode: 0
      stderr: ""
    - cmd: ssh $FLOATING_IP_VM "curl -Is https://es.wikipedia.org | grep x-client-ip"
      # before the migration this is the VM private address; after the migration, it should be the floating IP
      stdout: "x-client-ip: 185.15.56.50"
      retcode: 0
      stderr: ""

- name: VM (no floating IP) can contact auth DNS server
  tests:
    - cmd: ssh $NO_FLOATING_VM "dig +short toolforge.org @208.80.154.11"
      # this is the apex A record of the toolforge.org DNS zone
      stdout: "185.15.56.11"
      retcode: 0
      stderr: ""

- name: VM (no floating IP) can contact recursor DNS server
  tests:
    - cmd: ssh $NO_FLOATING_VM "dig +short www.basket.com @208.80.154.143 | wc -l"
      # this is a somewhat random IPv4 on the internet, so only check that we get "something"
      stdout: "1"
      retcode: 0
      stderr: ""

- name: VM (using floating IP) can contact auth DNS server
  tests:
    - cmd: ssh $FLOATING_IP_VM "dig +short toolforge.org @208.80.154.11"
      # this is the apex A record of the toolforge.org DNS zone
      stdout: "185.15.56.11"
      retcode: 0
      stderr: ""

- name: VM (using floating IP) can contact recursor DNS server
  tests:
    - cmd: ssh $FLOATING_IP_VM "dig +short www.basket.com @208.80.154.143 | wc -l"
      # this is a somewhat random IPv4 on the internet, so only check that we get "something"
      stdout: "1"
      retcode: 0
      stderr: ""

- name: VM (using floating IP) can contact LDAP server
  tests:
    - cmd: ssh $FLOATING_IP_VM 'ldapsearch -x whatever | grep -q ^"# numResponses"'
      # grep is happy, we are too
      stdout: ""
      retcode: 0
      stderr: ""

- name: VM (not using floating IP) can contact LDAP server
  tests:
    - cmd: ssh $NO_FLOATING_VM 'ldapsearch -x whatever | grep -q ^"# numResponses"'
      # grep is happy, we are too
      stdout: ""
      retcode: 0
      stderr: ""

- name: VM (using floating IP) can connect to wikireplicas
  tests:
    - cmd: ssh $FLOATING_IP_VM 'sudo -iu tools.arturo-test-tool sql enwiki "select * from page limit 2;" | grep page_id | wc -l'
      stdout: "1"
      retcode: 0
      stderr: ""

- name: Toolforge webservice can be accessed from the internet
  tests:
    - cmd: curl -f --no-progress-meter https://network-tests.toolforge.org/files/1MB.bin --output - | file -
      stdout: "/dev/stdin: data"
      retcode: 0
      stderr: ""

- name: Toolforge bastions see herald file on project NFS
  tests:
    - cmd: timeout -k5s 60s ssh $FLOATING_IP_VM "file /mnt/nfs/labstore-secondary-tools-project/herald"
      stdout: "/mnt/nfs/labstore-secondary-tools-project/herald: ASCII text"
      retcode: 0
      stderr: ""
    - cmd: timeout -k5s 60s ssh $TOOLFORGE_BASTION "file /mnt/nfs/labstore-secondary-tools-project/herald"
      stdout: "/mnt/nfs/labstore-secondary-tools-project/herald: ASCII text"
      retcode: 0
      stderr: ""

- name: VM (using floating IP) can contact openstack API
  tests:
    - cmd: ssh $FLOATING_IP_VM 'curl -s http://openstack.eqiad1.wikimediacloud.org:5000/v3 | grep -qo identity'
      # grep is happy, we are too
      stdout: ""
      retcode: 0
      stderr: ""

- name: VM (no floating IP) can contact openstack API
  tests:
    - cmd: ssh $NO_FLOATING_VM 'curl -s http://openstack.eqiad1.wikimediacloud.org:5000/v3 | grep -qo identity'
      # grep is happy, we are too
      stdout: ""
      retcode: 0
      stderr: ""

- name: puppetmasters can sync git tree
  tests:
    - cmd: ssh $TOOLS_PUPPETMASTER 'sudo git-sync-upstream 2>&1 | grep -q Up-to-date'
      # grep is happy, we are too
      stdout: ""
      retcode: 0
      stderr: ""
    - cmd: ssh $TOOLSBETA_PUPPETMASTER 'sudo git-sync-upstream 2>&1 | grep -q Up-to-date'
      # grep is happy, we are too
      stdout: ""
      retcode: 0
      stderr: ""

- name: VM (using floating IP) can read dumps NFS
  tests:
    - cmd: ssh $FLOATING_IP_VM 'file /mnt/nfs/dumps-labstore1006.wikimedia.org/index.html | grep -q HTML'
      stdout: ""
      retcode: 0
      stderr: ""

- name: VM (no floating IP) can read dumps NFS
  tests:
    - cmd: ssh $NO_FLOATING_VM 'file /mnt/nfs/dumps-labstore1006.wikimedia.org/index.html | grep -q HTML'
      stdout: ""
      retcode: 0
      stderr: ""

The checklist is meant to be executed by the Python script at https://github.com/aborrero/sys-avenger/blob/master/src/cmd-checklist-runner.py, run from one's laptop.

I plan to keep adding more tests in the next few days: NFS, wiki replicas, simple ICMP tests, openstack API, etc. Will probably collect some ideas from @Bstorm, @Andrew, @dcaro and @ayounsi
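
Each checklist entry pairs a shell command with the stdout, stderr, and return code it is expected to produce. As a rough sketch of how one entry could be evaluated (this is not the code of cmd-checklist-runner.py; `run_test` is a hypothetical name, and stripping surrounding whitespace before comparing is an assumption):

```python
import os
import subprocess


def run_test(test, envvars=None):
    """Run one checklist entry and return a list of mismatch descriptions."""
    # Expose the checklist's envvars (e.g. NO_FLOATING_VM) to the command.
    env = dict(os.environ, **(envvars or {}))
    result = subprocess.run(
        test["cmd"],
        shell=True,  # entries rely on redirections, pipes, and $VARIABLES
        env=env,
        capture_output=True,
        text=True,
    )
    # Assumption: output is compared after trimming surrounding whitespace.
    observed = {
        "stdout": result.stdout.strip(),
        "stderr": result.stderr.strip(),
        "retcode": result.returncode,
    }
    return [
        f"{field}: expected {test[field]!r}, got {observed[field]!r}"
        for field in ("stdout", "stderr", "retcode")
        if field in test and test[field] != observed[field]
    ]
```

An empty list means the entry passed. For example, `run_test({"cmd": "echo $GREETING", "stdout": "hi", "retcode": 0, "stderr": ""}, {"GREETING": "hi"})` exercises the same variable substitution the entries above depend on.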

Mentioned in SAL (#wikimedia-cloud) [2021-05-03T10:24:01Z] <arturo> created PTR records for cloudgw100{1,2}.eqiad1.wikimediacloud.org. (T270704)

Both checklists are mostly ready:

  • pre-migration (neutron as edge router, with hacks enabled): P15709
  • post-migration (cloudgw as edge router, neutron hacks disabled): P15659

Change 684353 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimediacloud.org: add cloudsw addresses in vlan 1120

https://gerrit.wikimedia.org/r/684353

Change 684353 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wikimediacloud.org: add cloudsw addresses in vlan 1120

https://gerrit.wikimedia.org/r/684353

Change 684864 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimediacloud.org: update names for cloudgw migration

https://gerrit.wikimedia.org/r/684864

Change 685379 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: introduce icinga checks

https://gerrit.wikimedia.org/r/685379

Change 685405 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: enable notifications

https://gerrit.wikimedia.org/r/685405

Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:06:11Z] <aborrero@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 105 hosts with reason: T270704

Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:06:48Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 105 hosts with reason: T270704

Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:06:55Z] <aborrero@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: T270704

Mentioned in SAL (#wikimedia-operations) [2021-05-06T15:07:03Z] <aborrero@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: T270704

Mentioned in SAL (#wikimedia-cloud) [2021-05-06T15:31:34Z] <arturo> about to migrating CloudVPS network to the cloudgw architecture T270704

Change 683268 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: neutron: topology changes for cloudgw

https://gerrit.wikimedia.org/r/683268

Change 684864 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wikimediacloud.org: update names for cloudgw migration

https://gerrit.wikimedia.org/r/684864

Change 685405 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: enable notifications

https://gerrit.wikimedia.org/r/685405

Change 685379 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: introduce icinga checks

https://gerrit.wikimedia.org/r/685379

Change 686457 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: cleanup neutron hacks

https://gerrit.wikimedia.org/r/686457

Change 686457 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: cleanup neutron hacks

https://gerrit.wikimedia.org/r/686457

Change 688359 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] hieradata: drop unused neutron configuration for dmz_cidr

https://gerrit.wikimedia.org/r/688359

Change 688359 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] hieradata: drop unused neutron configuration for dmz_cidr

https://gerrit.wikimedia.org/r/688359

Change 688365 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: factorize NAT template file into base profile

https://gerrit.wikimedia.org/r/688365

Change 688366 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: cleanup unused all_phy_nics parameter

https://gerrit.wikimedia.org/r/688366

Change 688367 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudgw: don't use concatenation with CIDR

https://gerrit.wikimedia.org/r/688367

Change 688365 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: factorize NAT template file into base profile

https://gerrit.wikimedia.org/r/688365

Change 688366 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: cleanup unused all_phy_nics parameter

https://gerrit.wikimedia.org/r/688366

Change 688367 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudgw: don't use concatenation with CIDR

https://gerrit.wikimedia.org/r/688367

Change 689831 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: neutron: more cloudgw cleanups

https://gerrit.wikimedia.org/r/689831

Change 689831 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: neutron: more cloudgw cleanups

https://gerrit.wikimedia.org/r/689831