
tf-infra-test fails creating dbs and k8s cluster
Closed, Resolved (Public)

Description

tf-infra-test has been failing for 5 days now. I cleaned up a bunch of Trove instances in ERROR status, but it's still failing with a timeout on 4 resources:

Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │ Error: Error waiting for openstack_containerinfra_cluster_v1 7f2c14eb-8e7b-4280-af7a-567bc0383516 to become read>
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │   with openstack_containerinfra_cluster_v1.k8s_127,
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │   on magnum.tf line 44, in resource "openstack_containerinfra_cluster_v1" "k8s_127":
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │   44: resource "openstack_containerinfra_cluster_v1" "k8s_127" {
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: ╵
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: ╷
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │ Error: Error waiting for openstack_db_instance_v1 8f90e8f3-93c6-446b-b2e5-1333ce69cd9a to become ready: context >
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │   with openstack_db_instance_v1.postgresql,
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │   on trove.tf line 51, in resource "openstack_db_instance_v1" "postgresql":
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │   51: resource "openstack_db_instance_v1" "postgresql" {
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: ╵
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: ╷
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │ Error: Error waiting for openstack_db_instance_v1 9c3afb74-2e31-4201-9477-9272231dd7ae to become ready: context >
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │   with openstack_db_instance_v1.mysql,
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │   on trove.tf line 78, in resource "openstack_db_instance_v1" "mysql":
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │   78: resource "openstack_db_instance_v1" "mysql" {
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: ╵
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: ╷
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │ Error: Error waiting for openstack_db_instance_v1 d98a94b5-2e06-4214-947c-d5e64622cafb to become ready: context >
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │   with openstack_db_instance_v1.mariadb,
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │   on trove.tf line 105, in resource "openstack_db_instance_v1" "mariadb":
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │  105: resource "openstack_db_instance_v1" "mariadb" {
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: │
Oct 09 13:00:06 tf-bastion tf-infra-test[389875]: ╵
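
The journal lines above are truncated, so it is easier to read the failing resources from the "with <address>," lines. A minimal sketch, assuming output shaped like the log above: `failed_resources` is a hypothetical helper name (not part of tf-infra-test), and the `journalctl -t tf-infra-test` usage assumes the syslog identifier shown in the log.

```shell
# Print the unique Terraform resource addresses found in "with <address>," lines.
# Reads journal output on stdin; repeated entries collapse via sort -u.
failed_resources() {
  sed -n 's/.* with \([A-Za-z0-9_.-]*\),[[:space:]]*$/\1/p' | sort -u
}

# Example usage (assumes the messages are logged under this identifier):
#   journalctl -t tf-infra-test --since today | failed_resources
```

For the log above this should list the one Magnum cluster and the three Trove instances, which matches the "4 resources" in the description.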

Event Timeline

fnegri triaged this task as Medium priority.

I fixed the db errors by resetting the trove quotas for the project as described in https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Trove#Reserved_quota_does_not_go_down but Trove instances are still getting stuck at the "Building" phase.

Screenshot 2024-10-09 at 17.14.44.png (604×2 px, 162 KB)

fnegri raised the priority of this task from Medium to High. Oct 9 2024, 3:33 PM

I can replicate the issue if I try manually creating a new Trove instance from Horizon, in a different project.

fnegri changed the task status from Open to In Progress. Oct 9 2024, 3:51 PM

Mentioned in SAL (#wikimedia-cloud) [2024-10-10T10:40:52Z] <dhinus> cumin 'cloudrabbit*' 'systemctl restart rabbitmq-server' T376802

Restarting RabbitMQ did not fix the issue, but I discovered something: adding the "ssh-from-anywhere" SG to a broken instance makes it move from BUILD to ACTIVE.

I was adding that SG for debugging, but it looks like just adding it, without even SSHing to the instance, is enough for Trove to complete the provisioning.
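
For reference, a sketch of how that workaround could be applied from the CLI rather than Horizon. This assumes the security group is attached to the Nova server backing the Trove instance and that `wmcs-openstack` passes the standard `server add security group` subcommand through to the OpenStack client. `add_sg` and the `OPENSTACK` variable are hypothetical names introduced only for this example.

```shell
# Command used to reach the OpenStack CLI; override OPENSTACK for other environments.
OPENSTACK="${OPENSTACK:-sudo wmcs-openstack}"

# Attach a security group to a server (hypothetical helper).
#   $1: server name or ID (assumed: the VM backing the stuck Trove instance)
#   $2: security group name or ID
add_sg() {
  $OPENSTACK server add security group "$1" "$2"
}

# Example usage, with a placeholder server ID:
#   add_sg <server-id> ssh-from-anywhere
```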

Did any network setting change about 1 week ago? tf-infra-test started failing on October 4th.

cc @aborrero @dcaro

fnegri closed this task as Resolved. Oct 10 2024, 3:35 PM
fnegri moved this task from In progress to Done on the cloud-services-team (FY2024/2025-Q1-Q2) board.

This issue was caused by the removal of default security group rules during the work on T375111: openstack: clarify default security group semantics.

I reinstated the default egress rules as follows, on both eqiad1 and codfw1dev:

sudo wmcs-openstack default security group rule create --egress --ethertype IPv4
sudo wmcs-openstack default security group rule create --egress --ethertype IPv6

After this, I re-ran tf-infra-test and it completed successfully.
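
To confirm the rules are in place on a deployment, the default rule templates can be listed. A minimal sketch, assuming `wmcs-openstack` also passes through the `default security group rule list` subcommand (the counterpart of the `create` calls above). `list_default_sg_rules` and the `OPENSTACK` variable are hypothetical names used only here.

```shell
# Command used to reach the OpenStack CLI; override OPENSTACK for other environments.
OPENSTACK="${OPENSTACK:-sudo wmcs-openstack}"

# List the default security group rule templates (hypothetical helper).
# The output should include one egress rule for IPv4 and one for IPv6.
list_default_sg_rules() {
  $OPENSTACK default security group rule list
}
```

Run it once per deployment (eqiad1 and codfw1dev), since the rules were reinstated on both.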