Page MenuHomePhabricator

Icinga/Check for VMs leaked by the nova-fullstack test
Closed, ResolvedPublic

Description

From a page.

summary: 10 instances in the admin-monitoring project
4 minutes ago
instance: cloudcontrol1003
team: wmcs
@cluster: wikimedia.org

Had a quick look as there was a known puppet issue, and wanted to verify, this time it seems to be something related to DNS cleanup:

dcaro@cloudcontrol1003:~$ echo "$(sudo journalctl -n 5000 -u  nova-fullstack.service | grep '"log.level": "ERROR"' | tail -n 10 | sed -e 's/^.*@cee: //')" | jq '."error.stack"'
"Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 748, in main\n    vdns = verify_dns_cleanup(host, dnsd, timeout=60.0)\n  File \"/usr/local/sbin/nova-fullstack\", line 272, in verify_dns_cleanup\n    raise Exception(\"Failed to clean up dns for {}\".format(hostname))\nException: Failed to clean up dns for fullstackd-20211121050502.admin-monitoring.eqiad1.wikimedia.cloud"
"Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 697, in main\n    vdns = verify_dns(host, addr, dnsd, timeout=30)\n  File \"/usr/local/sbin/nova-fullstack\", line 199, in verify_dns\n    \"Timed out waiting for A record for {}\".format(hostname)\nException: Timed out waiting for A record for fullstackd-20211121051528.admin-monitoring.eqiad1.wikimedia.cloud"
"Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 748, in main\n    vdns = verify_dns_cleanup(host, dnsd, timeout=60.0)\n  File \"/usr/local/sbin/nova-fullstack\", line 272, in verify_dns_cleanup\n    raise Exception(\"Failed to clean up dns for {}\".format(hostname))\nException: Failed to clean up dns for fullstackd-20211121052122.admin-monitoring.eqiad1.wikimedia.cloud"
"Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 748, in main\n    vdns = verify_dns_cleanup(host, dnsd, timeout=60.0)\n  File \"/usr/local/sbin/nova-fullstack\", line 272, in verify_dns_cleanup\n    raise Exception(\"Failed to clean up dns for {}\".format(hostname))\nException: Failed to clean up dns for fullstackd-20211121053200.admin-monitoring.eqiad1.wikimedia.cloud"
"Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 697, in main\n    vdns = verify_dns(host, addr, dnsd, timeout=30)\n  File \"/usr/local/sbin/nova-fullstack\", line 199, in verify_dns\n    \"Timed out waiting for A record for {}\".format(hostname)\nException: Timed out waiting for A record for fullstackd-20211121054238.admin-monitoring.eqiad1.wikimedia.cloud"
"Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 697, in main\n    vdns = verify_dns(host, addr, dnsd, timeout=30)\n  File \"/usr/local/sbin/nova-fullstack\", line 199, in verify_dns\n    \"Timed out waiting for A record for {}\".format(hostname)\nException: Timed out waiting for A record for fullstackd-20211121054832.admin-monitoring.eqiad1.wikimedia.cloud"
"Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 697, in main\n    vdns = verify_dns(host, addr, dnsd, timeout=30)\n  File \"/usr/local/sbin/nova-fullstack\", line 199, in verify_dns\n    \"Timed out waiting for A record for {}\".format(hostname)\nException: Timed out waiting for A record for fullstackd-20211121055426.admin-monitoring.eqiad1.wikimedia.cloud"
"Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 748, in main\n    vdns = verify_dns_cleanup(host, dnsd, timeout=60.0)\n  File \"/usr/local/sbin/nova-fullstack\", line 272, in verify_dns_cleanup\n    raise Exception(\"Failed to clean up dns for {}\".format(hostname))\nException: Failed to clean up dns for fullstackd-20211121060020.admin-monitoring.eqiad1.wikimedia.cloud"
"Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 748, in main\n    vdns = verify_dns_cleanup(host, dnsd, timeout=60.0)\n  File \"/usr/local/sbin/nova-fullstack\", line 272, in verify_dns_cleanup\n    raise Exception(\"Failed to clean up dns for {}\".format(hostname))\nException: Failed to clean up dns for fullstackd-20211121061059.admin-monitoring.eqiad1.wikimedia.cloud"
"Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 748, in main\n    vdns = verify_dns_cleanup(host, dnsd, timeout=60.0)\n  File \"/usr/local/sbin/nova-fullstack\", line 272, in verify_dns_cleanup\n    raise Exception(\"Failed to clean up dns for {}\".format(hostname))\nException: Failed to clean up dns for fullstackd-20211121062150.admin-monitoring.eqiad1.wikimedia.cloud"

# Just one of then with full info:
{
  "ecs.version": "1.7.0",
  "log.level": "ERROR",
  "log.origin.file.line": 758,
  "log.origin.file.name": "nova-fullstack",
  "log.origin.file.path": "/usr/local/sbin/nova-fullstack",
  "log.origin.function": "main",
  "message": "fullstackd-20211121061059 failed, leaking",
  "process.name": "MainProcess",
  "process.thread.id": 21839,
  "process.thread.name": "MainThread",
  "timestamp": "2021-11-21T06:16:50.302484",
  "error.stack": "Traceback (most recent call last):\n  File \"/usr/local/sbin/nova-fullstack\", line 748, in main\n    vdns = verify_dns_cleanup(host, dnsd, timeout=60.0)\n  File \"/usr/local/sbin/nova-fullstack\", line 272, in verify_dns_cleanup\n    raise Exception(\"Failed to clean up dns for {}\".format(hostname))\nException: Failed to clean up dns for fullstackd-20211121061059.admin-monitoring.eqiad1.wikimedia.cloud"
}

Looking

Event Timeline

dcaro triaged this task as High priority.Nov 21 2021, 6:30 AM
dcaro created this task.

Mentioned in SAL (#wikimedia-cloud) [2021-11-21T07:19:49Z] <dcaro_away> restarting designate-sink with some extra logs in it (T296144)

Change 740319 had a related patch set uploaded (by David Caro; author: David Caro):

[operations/puppet@production] Revert \"dynamicproxy: add keystone token verification\"

https://gerrit.wikimedia.org/r/740319

Change 740319 merged by David Caro:

[operations/puppet@production] Revert \"dynamicproxy: add keystone token verification\"

https://gerrit.wikimedia.org/r/740319

Reverted the change with https://gerrit.wikimedia.org/r/c/operations/puppet/+/740319
there's a proposed fix here https://gerrit.wikimedia.org/r/c/operations/puppet/+/740306
Will cleanup the VMs and leave the proper fix for monday.
We might have to cleanup the DNS entries for the leaked VMs:

root@cloudcontrol1003:~# wmcs-openstack --os-project-id=admin-monitoring server list
+--------------------------------------+---------------------------+--------+----------------------------------------+--------------------+-----------------------+
| ID                                   | Name                      | Status | Networks                               | Image              | Flavor                |
+--------------------------------------+---------------------------+--------+----------------------------------------+--------------------+-----------------------+
| 891c6e6f-ae93-4ac8-affb-47e4dcb12603 | fullstackd-20211121075558 | ACTIVE | lan-flat-cloudinstances2b=172.16.6.231 | debian-10.0-buster | g3.cores1.ram2.disk20 |
| e3f5f6ed-6bf9-4d22-b7b0-0308d9e1415a | fullstackd-20211121055426 | ACTIVE | lan-flat-cloudinstances2b=172.16.2.211 | debian-10.0-buster | g3.cores1.ram2.disk20 |
| 9338f4c5-e4d8-41d5-96c3-821f8f9d3f06 | fullstackd-20211121054832 | ACTIVE | lan-flat-cloudinstances2b=172.16.0.167 | debian-10.0-buster | g3.cores1.ram2.disk20 |
| a51cb60e-84a8-4025-b898-45c24ef9ed0e | fullstackd-20211121054238 | ACTIVE | lan-flat-cloudinstances2b=172.16.4.217 | debian-10.0-buster | g3.cores1.ram2.disk20 |
| cc9cba46-fb60-4490-bedd-784b91d9c982 | fullstackd-20211121051528 | ACTIVE | lan-flat-cloudinstances2b=172.16.0.29  | debian-10.0-buster | g3.cores1.ram2.disk20 |
| 09df0758-b71a-447f-9630-0e9a30127fa5 | fullstackd-20211121030313 | ACTIVE | lan-flat-cloudinstances2b=172.16.5.27  | debian-10.0-buster | g3.cores1.ram2.disk20 |
| 12b42fde-8e36-4e8b-a9f7-03275c7ed92c | fullstackd-20211121024642 | ACTIVE | lan-flat-cloudinstances2b=172.16.2.100 | debian-10.0-buster | g3.cores1.ram2.disk20 |
| 37f3c247-5582-44e1-ba4b-9567e618095e | fullstackd-20211121024048 | ACTIVE | lan-flat-cloudinstances2b=172.16.1.207 | debian-10.0-buster | g3.cores1.ram2.disk20 |
| 5f6c2c26-0a12-495c-aee9-c5887a34bc80 | fullstackd-20211121022417 | ACTIVE | lan-flat-cloudinstances2b=172.16.6.214 | debian-10.0-buster | g3.cores1.ram2.disk20 |
| 92192822-8a88-43d4-840b-029dbbfb1ad3 | fullstackd-20211121014639 | ACTIVE | lan-flat-cloudinstances2b=172.16.0.221 | debian-10.0-buster | g3.cores1.ram2.disk20 |
+--------------------------------------+---------------------------+--------+----------------------------------------+--------------------+-----------------------+

When removing the leaked VMs this error showed up on designate-sink.log (cloudservices1003):

2021-11-21 08:01:05.899 12517 WARNING wmf_sink.base [req-758a3508-4781-4764-9ea6-45740fba2005 - - - - -] Remote command ['/usr/bin/ssh', '-o', 'StrictHostKeyChecking=no', '-lcertmanager', '185.15.56.64', 'sudo puppet cert clean fullstackd-20211121022417.admin-monitoring.eqiad.wmflabs'] failed with output b'' and err b"\x1b[1;33mWarning: `puppet cert` is deprecated and will be removed in a future release.\n   (location: /usr/lib/ruby/vendor_ruby/puppet/application.rb:370:in `run')\x1b[0m\n\x1b[1;31mError: Could not find a serial number for fullstackd-20211121022417.admin-monitoring.eqiad.wmflabs\x1b[0m\n"
2021-11-21 08:01:07.146 12517 ERROR oslo_messaging.notify.dispatcher [req-f1a5d67e-035f-47f2-b323-b1fa65d39980 - - - - -] Callback raised an exception.: IndexError: list index out of range
2021-11-21 08:01:07.146 12517 ERROR oslo_messaging.notify.dispatcher Traceback (most recent call last):
2021-11-21 08:01:07.146 12517 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/oslo_messaging/notify/dispatcher.py", line 82, in _exec_callback
2021-11-21 08:01:07.146 12517 ERROR oslo_messaging.notify.dispatcher     message["metadata"])
2021-11-21 08:01:07.146 12517 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/designate/sink/service.py", line 124, in info
2021-11-21 08:01:07.146 12517 ERROR oslo_messaging.notify.dispatcher     handler.process_notification(context, event_type, payload)
2021-11-21 08:01:07.146 12517 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/wmf_sink/wmfsink.py", line 92, in process_notification
2021-11-21 08:01:07.146 12517 ERROR oslo_messaging.notify.dispatcher     resource_type='instance')
2021-11-21 08:01:07.146 12517 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/wmf_sink/base.py", line 133, in _delete
2021-11-21 08:01:07.146 12517 ERROR oslo_messaging.notify.dispatcher     ip = records[0].data
2021-11-21 08:01:07.146 12517 ERROR oslo_messaging.notify.dispatcher   File "/usr/lib/python3/dist-packages/designate/objects/base.py", line 381, in __getitem__
2021-11-21 08:01:07.146 12517 ERROR oslo_messaging.notify.dispatcher     return self.objects[index]
2021-11-21 08:01:07.146 12517 ERROR oslo_messaging.notify.dispatcher IndexError: list index out of range
2021-11-21 08:01:07.146 12517 ERROR oslo_messaging.notify.dispatcher

@Andrew please write down your findings and if resolved feel free to close.

This is still happening intermittently. We are also leaking DNS records on VM deletion; I suspect that is the primary issue and the fullstack failures are resulting from DNS collisions.

Change 740592 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wmf-sink (designate): fix a crash when we fail to find a DNS record for a VM

https://gerrit.wikimedia.org/r/740592

Change 740592 merged by Andrew Bogott:

[operations/puppet@production] wmf-sink (designate): fix a crash when we fail to find a DNS record for a VM

https://gerrit.wikimedia.org/r/740592

Change 740659 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] nova_fixed_multi: improved logging for unexpected exceptions

https://gerrit.wikimedia.org/r/740659

Change 740659 merged by Andrew Bogott:

[operations/puppet@production] nova_fixed_multi: improved logging for unexpected exceptions

https://gerrit.wikimedia.org/r/740659

We leaked one more VM overnight; this was again due to DNS collision, this time a ptr record.

Found 2 ptr recordsets for the same VM: ['250.0.16.172.in-addr.arpa.', '51.1.16.172.in-addr.arpa.']
Found 3 ptr recordsets for the same VM: ['250.0.16.172.in-addr.arpa.', '51.1.16.172.in-addr.arpa.', '169.1.16.172.in-addr.arpa.']
Found 2 ptr recordsets for the same VM: ['107.1.16.172.in-addr.arpa.', '166.1.16.172.in-addr.arpa.']
Found 2 ptr recordsets for the same VM: ['44.2.16.172.in-addr.arpa.', '251.4.16.172.in-addr.arpa.']
Found 2 ptr recordsets for the same VM: ['51.2.16.172.in-addr.arpa.', '201.2.16.172.in-addr.arpa.']

I've expanded the cleanup script to detect these and deleted the excess records. Patch incoming.

Change 740850 had a related patch set uploaded (by Andrew Bogott; author: Andrew Bogott):

[operations/puppet@production] wmcs-novastats-dnsleaks: Detect duplicate ptr records

https://gerrit.wikimedia.org/r/740850

Change 740850 merged by Andrew Bogott:

[operations/puppet@production] wmcs-novastats-dnsleaks: Detect duplicate ptr records

https://gerrit.wikimedia.org/r/740850