
Unable to create networking for new VMs in codfw1-dev
Closed, ResolvedPublic

Description

New VMs in codfw1-dev fail to start up properly. I'm unclear whether this is a result of the Rocky upgrade, Jason's port work last week, or something else.

I'm not getting a lot from the log files, but I suspect that this (from ovs-vswitchd.log) is a clue:

2020-04-05T16:23:19.230Z|00121|rconn|WARN|br-tun<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2020-04-05T16:23:19.230Z|00122|rconn|WARN|br-int<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2020-04-05T16:23:19.230Z|00123|rconn|WARN|br-provider<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2020-04-05T16:23:27.231Z|00124|rconn|WARN|br-tun<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2020-04-05T16:23:27.231Z|00125|rconn|WARN|br-int<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2020-04-05T16:23:27.231Z|00126|rconn|WARN|br-provider<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2020-04-05T16:23:35.230Z|00127|rconn|WARN|br-tun<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2020-04-05T16:23:35.231Z|00128|rconn|WARN|br-int<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2020-04-05T16:23:35.231Z|00129|rconn|WARN|br-provider<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2020-04-05T16:23:43.231Z|00130|rconn|WARN|br-tun<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2020-04-05T16:23:43.231Z|00131|rconn|WARN|br-int<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2020-04-05T16:23:43.231Z|00132|rconn|WARN|br-provider<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2020-04-05T16:23:51.231Z|00133|rconn|WARN|br-tun<->tcp:127.0.0.1:6633: connection failed (Connection refused)
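The repeated "Connection refused" messages mean each OVS bridge is trying to reach an OpenFlow controller on tcp:127.0.0.1:6633 and nothing is listening there; on a Neutron deployment that controller is normally provided by the neutron-openvswitch-agent. A minimal diagnostic sketch on the affected cloudvirt host (standard OVS/systemd tooling; the bridge names are taken from the log above):

```shell
# Is anything listening on the OpenFlow controller port?
ss -tlnp | grep 6633 || echo "nothing listening on 6633"

# Which controller is each bridge configured to dial?
ovs-vsctl get-controller br-int
ovs-vsctl get-controller br-tun
ovs-vsctl get-controller br-provider

# If the neutron agent should be supplying the controller, check it:
systemctl status neutron-openvswitch-agent
```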

Event Timeline

This is not related to the port work I did last week; OVS (Open vSwitch) is left over from the VXLAN work @aborrero is working on.

Which host is this from?

Looks like there are database connection issues too:

root@cloudcontrol2003-dev:/var/log/nova# tail nova-conductor.log
2020-04-05 16:46:46.488 11527 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1203, in _request_authentication
2020-04-05 16:46:46.488 11527 ERROR oslo_messaging.rpc.server     auth_packet = self._read_packet()
2020-04-05 16:46:46.488 11527 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1059, in _read_packet
2020-04-05 16:46:46.488 11527 ERROR oslo_messaging.rpc.server     packet.check_error()
2020-04-05 16:46:46.488 11527 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 384, in check_error
2020-04-05 16:46:46.488 11527 ERROR oslo_messaging.rpc.server     err.raise_mysql_exception(self._data)
2020-04-05 16:46:46.488 11527 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3/dist-packages/pymysql/err.py", line 109, in raise_mysql_exception
2020-04-05 16:46:46.488 11527 ERROR oslo_messaging.rpc.server     raise errorclass(errno, errval)
2020-04-05 16:46:46.488 11527 ERROR oslo_messaging.rpc.server oslo_db.exception.DBError: (pymysql.err.InternalError) (1226, "User 'nova' has exceeded the 'max_user_connections' resource (current value: 100)") (Background on this error at: http://sqlalche.me/e/2j85)
2020-04-05 16:46:46.488 11527 ERROR oslo_messaging.rpc.server

I restarted nova-conductor and was able to create a new VM, but the database connection limit error came back right away.
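The error above shows the 'nova' database user hitting its per-user connection cap (max_user_connections = 100). A quick way to confirm the cap and see who is holding connections, run on the codfw1-dev database server (standard MariaDB/MySQL client; no deployment-specific names assumed):

```shell
# Current per-user connection cap.
mysql -e "SELECT @@max_user_connections;"

# Connection count per user, to confirm 'nova' is the one at the limit.
mysql -e "SELECT user, COUNT(*) AS conns
          FROM information_schema.processlist
          GROUP BY user ORDER BY conns DESC;"
```

Restarting nova-conductor drops its pooled connections, which explains why VM creation briefly works before the limit is hit again.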

| a1fb3d7c-ffc6-4e89-bc76-0653439bccb8 | jeh-test2 | ACTIVE | lan-flat-cloudinstances2b=172.16.128.12 | debian-10.0-buster | m1.small |

Change 586118 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] openstack: update nova-placement healthcheck in codfwdev1

https://gerrit.wikimedia.org/r/586118

The haproxy check for nova-placement on /healthcheck is generating keystone errors:

INFO keystonemiddleware.auth_token [req-74e64099-5d1e-4b5b-bdc2-c2aa950f1a8f novaadmin admin - default default] Rejecting request

Patch for this at https://gerrit.wikimedia.org/r/c/operations/puppet/+/586118
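For context, an haproxy HTTP health check has the general shape below; the keystone "Rejecting request" errors happen when the probed path requires an authenticated token. This is only an illustrative config fragment with hypothetical names — the actual change made is in the Gerrit patch above:

```
# Hypothetical backend stanza; path, port, and server name are placeholders.
backend nova-placement
    option httpchk GET /healthcheck
    server placement1 127.0.0.1:8778 check
```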


It also looks like there might be some schema errors:

nova-conductor.log:2020-04-05 17:43:45.992 187920 ERROR oslo_messaging.rpc.server oslo_db.exception.DBError: (pymysql.err.InternalError) (1054, "Unknown column 'trusted_certs' in 'field list'") [SQL: 'INSERT INTO instance_extra (created_at, updated_at, deleted_at, deleted, instance_uuid, device_metadata, numa_topology, pci_requests, flavor, vcpu_model, migration_context, keypairs, trusted_certs) VALUES (%(created_at)s, %(updated_at)s, %(deleted_at)s, %(deleted)s, %(instance_uuid)s, %(device_metadata)s, %(numa_topology)s, %(pci_requests)s, %(flavor)s, %(vcpu_model)s, %(migration_context)s, %(keypairs)s, %(trusted_certs)s)'] [parameters: {'trusted_certs': None, 'updated_at': None, 'numa_topology': None, 'flavor': '{"new": null, "cur": {"nova_object.namespace": "nova", "nova_object.data": {"root_gb": 20,
"rxtx_factor": 1.0, "updated_at": null, "vcpus": 1, "extra ... (275 characters truncated) ... vcpu_weight": 0, "deleted": false}, "nova_object.version": "1.2", "nova_object.changes": ["extra_specs"], "nova_object.name": "Flavor"}, "old": null}', 'device_metadata': None,
'migration_context': None, 'deleted': 0, 'keypairs': '{"nova_object.namespace": "nova", "nova_object.data": {"objects": []}, "nova_object.version": "1.3", "nova_object.name": "KeyPairList"}', 'pci_requests': '[]', 'created_at': datetime.datetime(2020, 4, 5, 17, 43, 45, 865672), 'vcpu_model': None, 'deleted_at': None, 'instance_uuid': '5ace8e66-3c7c-4ab2-800a-61e2289b0c39'}] (Background on this error at: http://sqlalche.me/e/2j85)
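The "Unknown column 'trusted_certs'" error suggests the nova database schema was not migrated after the version upgrade (trusted_certs was added to instance_extra in a later Nova release). A hedged sketch of checking and applying the migrations with the standard nova-manage tooling, run on a cloudcontrol host:

```shell
# Report the current schema version of the main nova database.
nova-manage db version

# Apply any pending migrations to the API and main databases.
nova-manage api_db sync
nova-manage db sync
```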

The only changes I made last week in codfw were on cloudvirt2001-dev.codfw.wmnet, which should have no impact on any of the upgrades.

Thanks @jeh! There are clearly a few different things going on; I'll start chipping away.

Change 586135 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] codf1dev db server: increase max connections by a lot

https://gerrit.wikimedia.org/r/586135

Change 586135 merged by Andrew Bogott:
[operations/puppet@production] codf1dev db server: increase max connections by a lot

https://gerrit.wikimedia.org/r/586135

Change 586118 merged by Jhedden:
[operations/puppet@production] openstack: update nova-placement healthcheck in codfwdev1

https://gerrit.wikimedia.org/r/586118

Change 586424 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] openstack: update haproxy healthchecks for openstack services

https://gerrit.wikimedia.org/r/586424

Change 586424 merged by Jhedden:
[operations/puppet@production] openstack: update haproxy healthchecks for openstack services

https://gerrit.wikimedia.org/r/586424

Change 586437 had a related patch set uploaded (by Jhedden; owner: Jhedden):
[operations/puppet@production] openstack: codfw1dev update neutron haproxy config

https://gerrit.wikimedia.org/r/586437

Change 586437 merged by Jhedden:
[operations/puppet@production] openstack: codfw1dev update neutron haproxy config

https://gerrit.wikimedia.org/r/586437

JHedden claimed this task.

Mentioned in SAL (#wikimedia-operations) [2020-04-07T14:52:24Z] <jeh> cloudvirt2003-dev: downtime in icinga and reboot to enable BIOS virtualization support T249453