Inconsistent connectivity between cloudservices200[45]-dev and codfw1dev cloudcontrols
Closed, ResolvedPublic

Description

To access Designate/Zone/DNS endpoints, an API call needs to contact Keystone on a cloudcontrol for discovery, then Designate on a cloudservices node. Designate on the cloudservices node will, in turn, validate the token via Keystone back on the cloudcontrol nodes.

Something in that journey is a bit broken. Anytime I try a designate call on cloudcontrol2001-dev (behind cloudlb) I get:

root@cloudcontrol2001-dev:/var/log# openstack zone list --os-cloud novaadmin
Failed to contact the endpoint at https://openstack.codfw1dev.wikimediacloud.org:29001 for discovery. Fallback to using that endpoint as the base url.
Unknown

When I run the same command on cloudcontrol200[45]-dev, it works sometimes and times out sometimes:

root@cloudcontrol2004-dev:~# openstack zone list --os-cloud novaadmin

root@cloudcontrol2005-dev:~# openstack zone list --os-cloud novaadmin
timeout

Event Timeline

telnet cloudservices2005-dev.wikimedia.org 9001

Works from cloudcontrol200[45]-dev but not from cloudcontrol2001-dev
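
The telnet check above can be repeated from each host with a small script. This is only a sketch of the same TCP probe: the function name `can_connect` and the 3-second default timeout are assumptions for illustration, not part of any existing tooling.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper: it covers roughly what `telnet host port` tells us,
    i.e. whether the three-way handshake completes. Resolution failures,
    refused connections and timeouts all count as "not reachable".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # socket.gaierror, ConnectionRefusedError and timeouts are all OSError subclasses.
        return False
```

Running `can_connect("cloudservices2005-dev.wikimedia.org", 9001)` on each cloudcontrol would be expected to match the telnet results reported here, but it says nothing about why a given host is blocked.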

"wget https://openstack.codfw1dev.wikimediacloud.org:29001" returns 503 no matter whether haproxy is or isn't running on cloudcontrol2005-dev. This surprises me since openstack.codfw1dev.wikimediacloud.org is a CNAME for cloudcontrol2005-dev.wikimedia.org. Some routing thing is happening that I don't understand.

Removing '185.15.57.24 openstack.codfw1dev.wikimediacloud.org' from /etc/hosts on cloudcontrol2001-dev fixed the 503 problem. The intermittent timeouts are still happening.
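
Since a leftover /etc/hosts entry was what masked the real DNS answer here, a quick way to spot such overrides is to scan the hosts file for the name in question. The following is a minimal sketch; the function name `hosts_overrides` is hypothetical, and the sample text only reuses the entry quoted above.

```python
def hosts_overrides(hosts_text: str, name: str) -> list:
    """Return the addresses that a hosts file maps `name` to, in file order.

    Hypothetical helper: comments (everything after '#') and blank lines are
    ignored, and hostnames are compared case-insensitively.
    """
    wanted = name.lower()
    matches = []
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        address, names = fields[0], fields[1:]
        if wanted in (n.lower() for n in names):
            matches.append(address)
    return matches


# Sample content: only the second entry is taken from this task.
SAMPLE_HOSTS = """\
127.0.0.1 localhost
185.15.57.24 openstack.codfw1dev.wikimediacloud.org  # manual override
"""

print(hosts_overrides(SAMPLE_HOSTS, "openstack.codfw1dev.wikimediacloud.org"))
# prints ['185.15.57.24']
```

On a real host the text would come from reading /etc/hosts; an empty result means the name is resolved through DNS rather than a local override.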

cloudlb2001-dev seems unable to reach designate. "telnet 208.80.153.43 9001" and "telnet 208.80.153.44 9001" both fail from cloudlb2001-dev. That likely means that haproxy is not pooling designate on cloudlb2001.

I'm confused by the firewall rules I'm seeing... cloudservices hosts allow cloudlb2001-dev.private.codfw.wikimedia.cloud (172.20.5.2) but cloudlb2001-dev's ip address seems to be 10.192.20.8. If I add 'cloudlb2001-dev.codfw.wmnet' to the ferm rule on cloudservices1004 then the telnet works.

I don't know how this is meant to work so don't know what the right solution is; I'm guessing that adding that address is not the final answer.

To make things more confusing, I had made manual overrides in /etc/hosts on cloudcontrol2001-dev to force the primary FQDN openstack.codfw1dev.wikimediacloud.org to go through the cloudlb VIP.

Change 920795 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] cloudlb2001: use new cloud-private vlan addresses for designate

https://gerrit.wikimedia.org/r/920795

Change 920795 merged by Andrew Bogott:

[operations/puppet@production] cloudlb2001: use new cloud-private vlan addresses for designate

https://gerrit.wikimedia.org/r/920795

To clarify the current situation here from a network perspective:

iptables on the cloudservices2xxx nodes is currently allowing traffic on TCP 9001 as follows:

cmooney@cloudservices2004-dev:~$ sudo iptables -L -v --line -n | grep 9001
22    3131  188K ACCEPT     tcp  --  *      *       208.80.153.116       0.0.0.0/0            tcp dpt:9001
23    3323  199K ACCEPT     tcp  --  *      *       208.80.153.40        0.0.0.0/0            tcp dpt:9001
24   16204  972K ACCEPT     tcp  --  *      *       172.20.5.2           0.0.0.0/0            tcp dpt:9001
25       0     0 ACCEPT     tcp  --  *      *       172.20.5.3           0.0.0.0/0            tcp dpt:9001
26       0     0 ACCEPT     tcp  --  *      *       172.20.5.4           0.0.0.0/0            tcp dpt:9001

Effectively this means the cloudservices nodes accept connections from:

  • cloudcontrol2004 and cloudcontrol2005 directly over the wmf public vlan as before (208.80.153.x)
  • cloudlb2001, cloudlb2002 and cloudlb2003 directly over the cloud-private vlan (172.20.5.x)

That means cloudcontrol2001 is not permitted to connect directly from its own IP on cloud-private vlan (172.20.5.5). That may be the desired state if the intention is for it to use the load-balancers instead.
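
The reasoning above can be checked mechanically against the iptables listing. This is a rough sketch, assuming the column layout of `iptables -L -v --line -n` shown earlier (rule number, packets, bytes, target, protocol, opt, in, out, source, destination, match options); the function names are hypothetical.

```python
import ipaddress


def allowed_sources(iptables_output: str, port: int) -> list:
    """Collect the source networks of ACCEPT rules that match TCP `port`.

    Assumes one rule per line in the `iptables -L -v --line -n` layout:
    num, pkts, bytes, target, prot, opt, in, out, source, destination, options.
    """
    networks = []
    for line in iptables_output.splitlines():
        fields = line.split()
        if len(fields) < 11 or fields[3] != "ACCEPT" or fields[4] != "tcp":
            continue
        if f"dpt:{port}" not in fields[10:]:
            continue
        networks.append(ipaddress.ip_network(fields[8], strict=False))
    return networks


def is_allowed(source_ip: str, networks: list) -> bool:
    """Return True if `source_ip` falls inside any of the accepted networks."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in networks)


# Rules copied from the cloudservices2004-dev listing in this task.
RULES = """\
22    3131  188K ACCEPT     tcp  --  *      *       208.80.153.116       0.0.0.0/0            tcp dpt:9001
23    3323  199K ACCEPT     tcp  --  *      *       208.80.153.40        0.0.0.0/0            tcp dpt:9001
24   16204  972K ACCEPT     tcp  --  *      *       172.20.5.2           0.0.0.0/0            tcp dpt:9001
25       0     0 ACCEPT     tcp  --  *      *       172.20.5.3           0.0.0.0/0            tcp dpt:9001
26       0     0 ACCEPT     tcp  --  *      *       172.20.5.4           0.0.0.0/0            tcp dpt:9001
"""

nets = allowed_sources(RULES, 9001)
print(is_allowed("172.20.5.2", nets))  # prints True (first cloud-private address in the list)
print(is_allowed("172.20.5.5", nets))  # prints False (cloudcontrol2001 on cloud-private)
```

It only looks at explicit ACCEPT rules for the port, so it ignores chain policies and any broader rules elsewhere in the ruleset.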

DNS is of course part of the equation here too. I'm not sure how that should be set up with the current connectivity.

I think the last missing piece here is the rabbitmq servers:

aborrero@cloudservices2004-dev:~ $ sudo tail -f /var/log/designate/designate-agent.log
2023-06-01 12:03:37.422 2492427 ERROR oslo.messaging._drivers.impl_rabbit [None req-a174d555-1417-4981-a972-4a213fc00303 - - - all - -] [4a3405f2-5ef4-4730-a859-7745f7d80400] AMQP server on rabbitmq02.codfw1dev.wikimediacloud.org:5671 is unreachable: failed to resolve broker hostname. Trying again in 0 seconds.: OSError: failed to resolve broker hostname
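
The "failed to resolve broker hostname" error can be reproduced outside of designate with a plain resolver lookup. This is a minimal sketch using the system resolver; the function name `resolve_addresses` is an assumption for illustration.

```python
import socket


def resolve_addresses(hostname: str) -> list:
    """Return the sorted, de-duplicated addresses `hostname` resolves to.

    Hypothetical helper: an empty list means the system resolver could not
    resolve the name, which is the condition oslo.messaging reports above.
    """
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve_addresses("rabbitmq02.codfw1dev.wikimediacloud.org")` on cloudservices2004-dev would be expected to return an empty list while the record is missing, and one or more addresses once DNS is fixed.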

Change 925761 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/dns@master] wikimediacloud.org: codfw1dev: rework rabbitmq CNAMEs

https://gerrit.wikimedia.org/r/925761

Change 925762 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: codfw1dev: refresh rabbitmq nodes

https://gerrit.wikimedia.org/r/925762

Change 925761 merged by Arturo Borrero Gonzalez:

[operations/dns@master] wikimediacloud.org: codfw1dev: rework rabbitmq CNAMEs

https://gerrit.wikimedia.org/r/925761

Change 925762 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: codfw1dev: refresh rabbitmq nodes

https://gerrit.wikimedia.org/r/925762

Change 926035 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] rabbitmq: Change node names to use the cname service name

https://gerrit.wikimedia.org/r/926035

Change 926035 merged by Andrew Bogott:

[operations/puppet@production] rabbitmq: Change node names to use the cname service name

https://gerrit.wikimedia.org/r/926035

Change 926456 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] openstack: codfw1dev: let all designate traffic happen using cloud-private

https://gerrit.wikimedia.org/r/926456

Hey @Andrew, the first iteration of this patch resulted in a PCC that was too invasive for my liking, check here:

This is mostly because the hiera key profile::openstack::codfw1dev::designate_hosts is used with different semantics (ACL but also DNS settings etc).

For now, I'll override the key only on cloudcontrol nodes so DB/rabbitmq access is enabled from cloudservices.

Change 926456 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] openstack: codfw1dev: let all designate traffic happen using cloud-private

https://gerrit.wikimedia.org/r/926456

Change 926474 had a related patch set uploaded (by Arturo Borrero Gonzalez; author: Arturo Borrero Gonzalez):

[operations/puppet@production] cloudlb: haproxy: mysql: expose tcp port to all internal networks

https://gerrit.wikimedia.org/r/926474

Change 926474 merged by Arturo Borrero Gonzalez:

[operations/puppet@production] cloudlb: haproxy: mysql: expose tcp port to all internal networks

https://gerrit.wikimedia.org/r/926474

This should work now.

aborrero@cloudcontrol2001-dev:~ $ sudo wmcs-openstack zone list --all-projects 
+--------------------------------------+------------------------+------------------------------------------------------+---------+------------+--------+--------+
| id                                   | project_id             | name                                                 | type    |     serial | status | action |
+--------------------------------------+------------------------+------------------------------------------------------+---------+------------+--------+--------+
| 187fdc06-c5d2-46e1-ab71-97d34dd067ce | cloudinfra-codfw1dev   | 16.172.in-addr.arpa.                                 | PRIMARY | 1685009888 | ACTIVE | NONE   |
| 4c754100-1790-4858-a583-9de93c9e8b3d | cloudinfra-codfw1dev   | codfw1dev.wikimedia.cloud.                           | PRIMARY | 1685009888 | ACTIVE | NONE   |
| 5748595a-11c6-4099-bfc2-b95b6ae67c21 | cloudinfra-codfw1dev   | codfw1dev.wmcloud.org.                               | PRIMARY | 1684083543 | ACTIVE | NONE   |
[...]

Please reopen if required.