
CloudVPS: a VM is unable to contact floating IPs of other VMs
Closed, Resolved (Public)

Description

We set up the new paws entrypoint using keepalived instead of the usual static, manually moved floating IP. This is good. However, we never taught labsaliaser about this new setup, so from inside cloud you cannot reach https://hub.paws.wmcloud.org/hub/metrics (for example). Either it should be able to map to the internal IP (which I believe is what k8s.svc.paws.eqiad1.wikimedia.cloud uses), or perhaps we should fix the bug that requires labsaliaser?

Either way, it seems to be a thing to fix if we start making auto-failover more common.

Update: this seems to be an issue when routing from a VM to a neutron floating IP that belongs to a different VM, such as the one we use in this setup. The packet flow is asymmetric; moreover, the reply packet doesn't go through NAT at all.

This diagram should help understand the issue.

image.png (386×596 px, 36 KB)

Event Timeline

I believe this is purely about routing inside neutron and not related to DNS at all:

aborrero@tools-prometheus-04:~$ telnet 172.16.1.171 443
Trying 172.16.1.171...
Connected to 172.16.1.171.
Escape character is '^]'.
^CConnection closed by foreign host.
aborrero@tools-prometheus-04:~$ telnet 185.15.56.57 443
Trying 185.15.56.57...
^C
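The two telnet attempts above can be turned into a repeatable check. This is only a minimal sketch: the helper name can_connect and the 3 second default timeout are assumptions made for illustration, not part of any existing tooling.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to (host, port) completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, unreachable networks and timeouts all end up here.
        return False
```

Run from a VM, can_connect("172.16.1.171", 443) and can_connect("185.15.56.57", 443) would correspond to the internal and floating address tests shown in the telnet session.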

When investigating this I found T257552 which is more urgent.

aborrero renamed this task from Find a way to teach labsaliaser about neutron port floating IPs to CloudVPS: issues when routing to static internal IP. Jul 14 2020, 2:09 PM
aborrero claimed this task.
aborrero triaged this task as Medium priority.
aborrero updated the task description.
aborrero moved this task from Inbox to Doing on the cloud-services-team (Kanban) board.

Mentioned in SAL (#wikimedia-cloud) [2020-07-14T15:19:37Z] <arturo> briefly set root@cloudnet1003:~ # sysctl net.ipv4.conf.all.accept_local=1 (in neutron qrouter netns) (T257534)

aborrero renamed this task from CloudVPS: issues when routing to static internal IP to CloudVPS: a VM is unable to contact floating IPs of other VMs. Jul 15 2020, 11:03 AM
aborrero updated the task description.

I'm also investigating codfw1dev, because at a quick glance it may behave differently and I don't know why yet.

I confirm the behavior is different in codfw1dev. Upon research, it turns out the dmz_cidr setting differs between the two deployments.

Specifically, in T206261: Routing RFC1918 private IP addresses to/from WMCS floating IPs, I introduced puppet change https://gerrit.wikimedia.org/r/c/operations/puppet/+/468546, which might be causing the issues described in this task.

So, after a bit more investigation I'm confident I understand what's happening here.

Look at this diagram:

floating ip(2).png (902×2 px, 151 KB)

We are currently in scenario 1. When a VM in CloudVPS contacts a floating IP, we don't apply the general SNAT (the traffic is excluded by dmz_cidr). The destination VM therefore sees the real address of the source VM.
But precisely because of that, the destination VM replies directly to the source VM, bypassing the NAT. The result is an invalid TCP connection, which is dropped by the basic firewalling that lives in each cloudvirt hypervisor server.

Scenario 1 was introduced by me in T206261, but I'm pretty sure it doesn't work as expected. We should revert to scenario 2, which applies the general SNAT to connections between VMs. The tradeoff is that a VM behind a floating IP won't know the original source address, but that's not a big deal.
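To make the difference between the two scenarios concrete, here is a toy model of the packet flow. It is a sketch under stated assumptions, not a description of the real neutron rules: the client address and the SNAT address are hypothetical, 172.16.1.171 is only assumed to be the internal address behind 185.15.56.57, and the hypervisor firewall plus TCP state matching are collapsed into a single "does the reply come from the address the client contacted" check.

```python
from dataclasses import dataclass, replace
from ipaddress import IPv4Address, ip_address, ip_network

CLIENT = ip_address("172.16.0.10")     # source VM (hypothetical address)
SERVER = ip_address("172.16.1.171")    # destination VM internal address (assumed)
FLOATING = ip_address("185.15.56.57")  # floating IP in front of SERVER
SNAT_ADDR = ip_address("192.0.2.1")    # general SNAT address (hypothetical)
VM_NET = ip_network("172.16.0.0/21")   # range named in the dmz_cidr exclusion


@dataclass(frozen=True)
class Packet:
    src: IPv4Address
    dst: IPv4Address


def router_forward(pkt, dmz_cidr):
    """DNAT the floating IP, then SNAT unless (src, dst) is excluded by dmz_cidr."""
    if pkt.dst == FLOATING:
        pkt = replace(pkt, dst=SERVER)
    excluded = any(pkt.src in s and pkt.dst in d for s, d in dmz_cidr)
    if not excluded:
        pkt = replace(pkt, src=SNAT_ADDR)
    return pkt


def reply_seen_by_client(dmz_cidr):
    """Return the source address of the reply as the client sees it."""
    delivered = router_forward(Packet(src=CLIENT, dst=FLOATING), dmz_cidr)
    reply = Packet(src=delivered.dst, dst=delivered.src)
    if reply.dst == SNAT_ADDR:
        # The reply returns through the router, which reverses both NATs.
        return FLOATING
    # The reply goes straight to the client, so no NAT is reversed.
    return reply.src


def connection_works(dmz_cidr):
    # Simplification: the client only accepts replies from the address it contacted.
    return reply_seen_by_client(dmz_cidr) == FLOATING


scenario_1 = [(VM_NET, VM_NET)]  # VM-to-VM traffic excluded from SNAT
scenario_2 = []                  # exclusion dropped, general SNAT applies

print("scenario 1 works:", connection_works(scenario_1))  # False
print("scenario 2 works:", connection_works(scenario_2))  # True
```

In scenario 1 the reply reaches the client with the server's internal address as its source, which does not match the connection the client opened towards the floating IP. In scenario 2 the reply is forced back through the router, at the cost of the server only ever seeing the SNAT address.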

The dmz_cidr mechanism is not intended for this anyway. We are actually trying to get rid of it, and will most likely do so when we introduce the cloudgw project.

Change 613123 had a related patch set uploaded (by Arturo Borrero Gonzalez; owner: Arturo Borrero Gonzalez):
[operations/puppet@production] cloud: eqiad1: drop dmz_cidr exclussion 172.16.0.0/21 : 172.16.0.0/21

https://gerrit.wikimedia.org/r/613123

Change 613123 merged by Arturo Borrero Gonzalez:
[operations/puppet@production] cloud: eqiad1: drop dmz_cidr exclussion 172.16.0.0/21 : 172.16.0.0/21

https://gerrit.wikimedia.org/r/613123

hey @Bstorm could you please confirm the prometheus thing you were trying works now?

Yes, I think. It's broken for another reason now 😁 The certificate is now a problem:

bstorm@tools-prometheus-03:~$ curl https://hub.paws.wmcloud.org/hub/metrics
curl: (60) SSL certificate problem: unable to get local issuer certificate
More details here: https://curl.haxx.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.

Outside:

577) [0] bstormWMF2026:wmf bstorm$ curl https://hub.paws.wmcloud.org/hub/metrics
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 4203.0
python_gc_objects_collected_total{generation="1"} 1875.0
python_gc_objects_collected_total{generation="2"} 54.0
# HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 423.0
python_gc_collections_total{generation="1"} 38.0
python_gc_collections_total{generation="2"} 3.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="6",patchlevel="9",version="3.6.9"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.113591808e+09
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.03047168e+08
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.59598504863e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 6.54
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 17.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP request_duration_seconds request duration for all HTTP requests
# TYPE request_duration_seconds histogram
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",le="0.005",method="GET"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",le="0.01",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",le="0.025",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",le="0.05",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",le="0.075",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",le="0.1",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",le="0.25",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",le="0.5",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",le="0.75",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",le="1.0",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",le="2.5",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",le="5.0",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",le="7.5",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",le="10.0",method="GET"} 2.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",le="+Inf",method="GET"} 2.0
request_duration_seconds_count{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",method="GET"} 2.0
request_duration_seconds_sum{code="200",handler="jupyterhub.apihandlers.users.UserListAPIHandler",method="GET"} 0.02556443214416504
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",le="0.005",method="GET"} 91.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",le="0.01",method="GET"} 91.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",le="0.025",method="GET"} 91.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",le="0.05",method="GET"} 91.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",le="0.075",method="GET"} 91.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",le="0.1",method="GET"} 91.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",le="0.25",method="GET"} 91.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",le="0.5",method="GET"} 91.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",le="0.75",method="GET"} 91.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",le="1.0",method="GET"} 91.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",le="2.5",method="GET"} 91.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",le="5.0",method="GET"} 91.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",le="7.5",method="GET"} 91.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",le="10.0",method="GET"} 91.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",le="+Inf",method="GET"} 91.0
request_duration_seconds_count{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",method="GET"} 91.0
request_duration_seconds_sum{code="200",handler="jupyterhub.handlers.pages.HealthCheckHandler",method="GET"} 0.15820932388305664
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.005",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.01",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.025",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.05",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.075",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.1",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.25",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.5",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="0.75",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="1.0",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="2.5",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="5.0",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="7.5",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="10.0",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",le="+Inf",method="GET"} 1.0
request_duration_seconds_count{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",method="GET"} 1.0
request_duration_seconds_sum{code="302",handler="jupyterhub.handlers.base.PrefixRedirectHandler",method="GET"} 0.0024704933166503906
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.005",method="GET"} 0.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.01",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.025",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.05",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.075",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.1",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.25",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.5",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="0.75",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="1.0",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="2.5",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="5.0",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="7.5",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="10.0",method="GET"} 1.0
request_duration_seconds_bucket{code="302",handler="jupyterhub.handlers.pages.RootHandler",le="+Inf",method="GET"} 1.0
request_duration_seconds_count{code="302",handler="jupyterhub.handlers.pages.RootHandler",method="GET"} 1.0
request_duration_seconds_sum{code="302",handler="jupyterhub.handlers.pages.RootHandler",method="GET"} 0.006300687789916992
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.005",method="GET"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.01",method="GET"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.025",method="GET"} 0.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.05",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.075",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.1",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.25",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.5",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="0.75",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="1.0",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="2.5",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="5.0",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="7.5",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="10.0",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.login.LoginHandler",le="+Inf",method="GET"} 1.0
request_duration_seconds_count{code="200",handler="jupyterhub.handlers.login.LoginHandler",method="GET"} 1.0
request_duration_seconds_sum{code="200",handler="jupyterhub.handlers.login.LoginHandler",method="GET"} 0.03591203689575195
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="0.005",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="0.01",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="0.025",method="GET"} 1.0
request_duration_seconds_bucket{code="200",handler="jupyterhub.handlers.static.CacheControlStaticFilesHandler",le="0.05",method="GET"} 1.0
<snip>
proxy_delete_duration_seconds_created{status="success"} 1.5959850497812405e+09
proxy_delete_duration_seconds_created{status="failure"} 1.5959850497812855e+09

I recall some issue where we don't trust LE certs for some reason. However, I didn't think that affected VMs? I guess I was wrong.

The fix is proposed! https://gerrit.wikimedia.org/r/c/operations/software/acme-chief/+/617177

Apparently, we need a combined cert that includes the chain of authority (which acme-chief currently doesn't offer).
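For illustration, a combined cert of this kind is simply the leaf certificate followed by its intermediate(s) in one PEM file. The sketch below shows that concatenation; the function names are hypothetical and it does not reflect how acme-chief implements or will implement this.

```python
import re
from typing import List

# Matches one PEM-encoded certificate block.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----", re.DOTALL
)


def split_certs(pem_text: str) -> List[str]:
    """Return every PEM certificate block found in pem_text, in order."""
    return PEM_CERT.findall(pem_text)


def build_fullchain(leaf_pem: str, chain_pem: str) -> str:
    """Concatenate the leaf certificate followed by its intermediate(s)."""
    leaf = split_certs(leaf_pem)
    if len(leaf) != 1:
        raise ValueError("expected exactly one leaf certificate")
    chain = split_certs(chain_pem)
    if not chain:
        raise ValueError("expected at least one intermediate certificate")
    # Leaf first, then intermediates, which is the order TLS servers send them.
    return "\n".join(leaf + chain) + "\n"
```

A server presenting only the leaf leaves clients without the intermediate to build a path to a trusted root, which is consistent with the "unable to get local issuer certificate" error from curl above.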

All that said, this task is done.

This is back. I discovered it while trying to scrape jupyterhub metrics (again). This can be done from the internet, but it is blocked as if by a firewall in our cloud.

bstorm@tools-prometheus-03:~$ wget https://hub.paws.wmcloud.org:443/hub/metrics
--2021-07-21 16:45:41--  https://hub.paws.wmcloud.org/hub/metrics
Resolving hub.paws.wmcloud.org (hub.paws.wmcloud.org)... 185.15.56.57
Connecting to hub.paws.wmcloud.org (hub.paws.wmcloud.org)|185.15.56.57|:443... failed: Connection timed out.
Retrying.

So I think this has come back since we last looked. It may have returned after the cloudgw introduction, since the metrics I need are completely missing (thus past retention).

Actually, I'm going to make a new one as a subtask. This is different (which is not surprising).