Page MenuHomePhabricator

VM metadata service slow response
Open, LowPublic

Description

I got curious on why provisioning a new VM takes so long, looking at the fullstackd graphs: https://grafana.wikimedia.org/goto/Jrb386iDR?orgId=1

Asking cloud-init first on a newly-created VM:

# cloud-init analyze blame | head -10
-- Boot Record 01 --
     29.77100s (init-local/search-OpenStackLocal)
     25.00000s (init-network/config-bootcmd)
     01.75600s (modules-config/config-apt-configure)
     00.67000s (init-network/config-ssh)
     00.46500s (modules-final/config-package-update-upgrade-install)

OpenStackLocal taking 30s was surprising to me since that is when metadata about the cloud is gathered.

Digging further, the metadata service takes ~1s to answer, and cloud-init does a bunch of calls, which explains the ~30s total:

$ curl -w '\nTotal: %{time_total}s\nDNS: %{time_namelookup}s\nConnect: %{time_connect}s\nStart Transfer: %{time_starttransfer}s\n' \   -o /dev/null -s http://169.254.169.254/ 
Total: 0.000000s
DNS: 0.000000s
Connect: 0.000000s
Start Transfer: 0.000000s
1.0
2007-01-19
2007-03-01
2007-08-29
2007-10-10
2007-12-15
2008-02-01
2008-09-01
2009-04-04
latest
Total: 1.026512s
DNS: 0.000045s
Connect: 0.001972s
Start Transfer: 1.022772s

Digging a little further, neutron-metadata-agent is the process answering those calls and then talking to nova-api-metadata. The service is leaking file descriptors as per T395742: Neutron metadata service failing for all VMs and has kept doing so, becoming quite slow

root@cloudnet1006:~# lsof -p $(systemctl show -P MainPID neutron-metadata-agent) | grep -c eventpoll
193376

Even though after a restart of neutron-metadata-agent responses are still slow:

curl -w '\nTotal: %{time_total}s\nDNS: %{time_namelookup}s\nConne
ct: %{time_connect}s\nStart Transfer: %{time_starttransfer}s\n' \   -o /dev/null -s http://169.254.169.254/

And communication with nova metadata doesn't seem slow, so it seems the issue is somewhere between neutron-metadata-agent and the VM

cloudnet1006:~# tshark -i ens3f0np0 -td  'port 8775'
Running as user "root" and group "root". This could be dangerous.
Capturing on 'ens3f0np0'
 ** (tshark:871501) 08:20:11.747456 [Main MESSAGE] -- Capture started.
 ** (tshark:871501) 08:20:11.747575 [Main MESSAGE] -- File: "/tmp/wireshark_ens3f0np0XRKWG3.pcapng"
    1 0.000000000 2a02:ec80:a000:202::3 → 2a02:ec80:a000:4000::1 TCP 98 46584 → 8775 [SYN] Seq=0 Win=35760 Len=0 MSS=8940 SACK_PERM TSval=2106012432 TSecr=0 WS=512
    2 0.000063029 2a02:ec80:a000:4000::1 → 2a02:ec80:a000:202::3 TCP 98 8775 → 46584 [SYN, ACK] Seq=0 Ack=1 Win=35712 Len=0 MSS=8940 SACK_PERM TSval=1319812654 TSecr=2106012432 WS=512
    3 0.000036772 2a02:ec80:a000:202::3 → 2a02:ec80:a000:4000::1 TCP 90 46584 → 8775 [ACK] Seq=1 Ack=1 Win=35840 Len=0 TSval=2106012432 TSecr=1319812654
    4 0.000162656 2a02:ec80:a000:202::3 → 2a02:ec80:a000:4000::1 HTTP 466 GET / HTTP/1.1
    5 0.001770702 2a02:ec80:a000:4000::1 → 2a02:ec80:a000:202::3 TCP 267 HTTP/1.1 200 OK  [TCP segment of a reassembled PDU]
    6 0.000021594 2a02:ec80:a000:202::3 → 2a02:ec80:a000:4000::1 TCP 90 46584 → 8775 [ACK] Seq=377 Ack=178 Win=35840 Len=0 TSval=2106012434 TSecr=1319812656
    7 0.000014612 2a02:ec80:a000:4000::1 → 2a02:ec80:a000:202::3 HTTP 90 HTTP/1.1 200 OK  (text/plain)
    8 0.001011807 2a02:ec80:a000:202::3 → 2a02:ec80:a000:4000::1 TCP 90 46584 → 8775 [FIN, ACK] Seq=377 Ack=179 Win=35840 Len=0 TSval=2106012435 TSecr=1319812656
    9 0.000040456 2a02:ec80:a000:4000::1 → 2a02:ec80:a000:202::3 TCP 90 8775 → 46584 [ACK] Seq=179 Ack=378 Win=35840 Len=0 TSval=1319812657 TSecr=2106012435