
[cloudvps] TfInfraTestApplyFailed
Closed, Resolved, Public

Description

From alertmanager:

TfInfraTestApplyFailed
7 hours ago
instance: tf-bastion
job: node
crit
team: wmcs
@cluster: wmcloud.org
@receiver: metricsinfra_cloud-feed

Event Timeline

dcaro triaged this task as High priority.Jul 13 2023, 7:20 AM
dcaro created this task.

I can't ssh into the tf-bastion.tf-infra-test.eqiad1.wikimedia.cloud VM, looking into it.

Weird, it seems to accept my public key, but then kicks me out:

Jul 13 07:31:31 tf-bastion sshd[1243075]: Accepted key ED25519 SHA256:CatnOygIhBgspb69r+Hd0dxS1bSeX81WWzy/c1SlA6Y found at /usr/sbin/ssh-key-ldap-lookup:1
Jul 13 07:31:31 tf-bastion sshd[1243075]: pam_access(sshd:account): access denied for user `dcaro' from `172.16.5.8'
Jul 13 07:31:31 tf-bastion audit[3236]: AVC apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1243075/cmdline" pid=3236 comm="sssd_pam" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Jul 13 07:31:31 tf-bastion kernel: audit: type=1400 audit(1689233491.693:2867): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1243075/cmdline" pid=3236 comm="sssd_pam" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
Jul 13 07:31:31 tf-bastion sshd[1243075]: Failed publickey for dcaro from 172.16.5.8 port 34850 ssh2: ED25519 SHA256:CatnOygIhBgspb69r+Hd0dxS1bSeX81WWzy/c1SlA6Y
Jul 13 07:31:31 tf-bastion sshd[1243075]: fatal: Access denied for user dcaro by PAM account configuration [preauth]

That's the key I'm offering:

debug1: Server accepts key: /home/dcaro/.ssh/cloud.id_ed25519 ED25519 SHA256:CatnOygIhBgspb69r+Hd0dxS1bSeX81WWzy/c1SlA6Y explicit agent
debug3: sign_and_send_pubkey: using publickey-hostbound-v00@openssh.com with ED25519 SHA256:CatnOygIhBgspb69r+Hd0dxS1bSeX81WWzy/c1SlA6Y
debug3: sign_and_send_pubkey: signing using ssh-ed25519 SHA256:CatnOygIhBgspb69r+Hd0dxS1bSeX81WWzy/c1SlA6Y
debug3: send packet: type 50
debug3: receive packet: type 96
debug2: channel 0: rcvd eof
debug2: channel 0: output open -> drain
debug2: channel 0: obuf empty
debug2: chan_shutdown_write: channel 0: (i0 o1 sock -1 wfd 5 efd -1 [closed])
debug2: channel 0: output drain -> closed
debug1: channel 0: FORCE input drain
debug2: channel 0: ibuf empty
debug2: channel 0: send eof
debug3: send packet: type 96
debug2: channel 0: input drain -> closed
Connection closed by UNKNOWN port 65535
debug3: send packet: type 1
debug1: channel 0: free: direct-tcpip: listening port 0 for tf-bastion.tf-infra-test.eqiad1.wikimedia.cloud port 22, connect from 127.0.0.1 port 65535 to UNKNOWN port 65536, nchannels 1
debug3: channel 0: status: The following connections are open:
#0 direct-tcpip: listening port 0 for tf-bastion.tf-infra-test.eqiad1.wikimedia.cloud port 22, connect from 127.0.0.1 port 65535 to UNKNOWN port 65536 (t4 r0 i3/0 o3/0 e[closed]/0 fd 4/-1/-1 sock -1 cc -1 io 0x00/0x00)
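
The pam_access denial above usually points at an access-list/membership problem rather than a key problem. A couple of checks that could narrow it down (just a sketch; it assumes admin access on a cloudcontrol, and that pam_access uses its default config path on the VM):

# on a cloudcontrol: is the user actually a member of the project?
openstack role assignment list --project tf-infra-test --user dcaro --names
# on the VM (via console): what does pam_access actually allow?
grep -v '^#' /etc/security/access.conf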

I was able to log in using the console, will continue with that.

The runbook is not in the runbooks folder and does not have the usual sections/headers, so I'll move and reformat it.

In the meantime, the manual apply is taking its time.

I wonder if there are any logs from the previous run anywhere :/
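
A few places worth checking for output from the previous run (a sketch; none of these are guaranteed to hold anything, and the cron/log locations are guesses):

# anything terraform-related in the journal from the last day?
journalctl --since "24 hours ago" --no-pager | grep -i terraform
# is the test driven from cron, and does it redirect its output anywhere?
grep -ri terraform /etc/cron* /var/spool/cron 2>/dev/null
ls -la /var/log/ | grep -i -e terraform -e tf-infra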

Ok, so it took a bit, but the error is:

╷
│ Error: Error creating openstack_db_instance_v1: Expected HTTP response code [200] when accessing [POST https://openstack.eqiad1.wikimediacloud.org:28779/v1.0/tf-infra-test/instances], but got 413 instead
│ {"overLimit": {"code": 413, "message": "Quota exceeded for resources: ['volumes']."}}
│
│   with openstack_db_instance_v1.postgresql,
│   on trove.tf line 30, in resource "openstack_db_instance_v1" "postgresql":
│   30: resource "openstack_db_instance_v1" "postgresql" {
│
╵

Looking; it seems to be a quota issue for Trove somehow.

Trove seems OK, though in the tf-infra-test project there are a couple of instances with attached volumes that seem to be left over somehow:

root@cloudcontrol1005:~# openstack server list --os-project-id tf-infra-test
+--------------------------------------+-----------------------------------------+--------+------------------------------------------------------+-------------------------+-----------------------+
| ID                                   | Name                                    | Status | Networks                                             | Image                   | Flavor                |
+--------------------------------------+-----------------------------------------+--------+------------------------------------------------------+-------------------------+-----------------------+
| 123d04f6-d8c5-46a3-9072-8f741c07d11f | tf-infra-test-123-ku7alouvzp22-node-0   | ACTIVE | lan-flat-cloudinstances2b=172.16.1.234               | magnum-fedora-coreos-34 | g3.cores1.ram2.disk20 |
| 26b74f81-f207-4e11-ba2b-4be764e528a8 | tf-infra-test-123-ku7alouvzp22-master-0 | ACTIVE | lan-flat-cloudinstances2b=172.16.0.26                | magnum-fedora-coreos-34 | g3.cores2.ram4.disk20 |
| 38207521-3869-41c8-9e8d-51377404e2f6 | tf-infra-test                           | ACTIVE | lan-flat-cloudinstances2b=172.16.6.135, 185.15.56.93 | debian-11.0-bullseye    | g3.cores2.ram4.disk20 |
| 40560d4a-6b06-49be-bfcd-2565666ef95d | tf-bastion                              | ACTIVE | lan-flat-cloudinstances2b=172.16.5.68                | debian-12.0-bookworm    | g3.cores2.ram4.disk20 |
+--------------------------------------+-----------------------------------------+--------+------------------------------------------------------+-------------------------+-----------------------+
root@cloudcontrol1005:~# openstack server show --os-project-id tf-infra-test 26b74f81-f207-4e11-ba2b-4be764e528a8
+-------------------------------------+-------------------------------------------------------------------------+
| Field                               | Value                                                                   |
+-------------------------------------+-------------------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                                  |
| OS-EXT-AZ:availability_zone         | nova                                                                    |
| OS-SRV-USG:launched_at              | 2023-07-13T07:39:50.000000                                              |
| created                             | 2023-07-13T07:39:34Z                                                    |
| flavor                              | g3.cores2.ram4.disk20 (bb8bee7e-d8f9-460b-8344-74f745c139b9)            |
| id                                  | 26b74f81-f207-4e11-ba2b-4be764e528a8                                    |
| image                               | magnum-fedora-coreos-34 (89f6c455-36bd-4d41-8d63-f48cfb50a3fb)          |
| name                                | tf-infra-test-123-ku7alouvzp22-master-0                                 |
| project_id                          | tf-infra-test                                                           |
| security_groups                     | name='tf-infra-test-123-ku7alouvzp22-secgroup_kube_master-gs24dbnjgj5j' |
| status                              | ACTIVE                                                                  |
| updated                             | 2023-07-13T07:39:50Z                                                    |
| user_id                             | rook                                                                    |
| volumes_attached                    | id='519917d7-0349-4e6d-93fa-e629c3479520'                               |
+-------------------------------------+-------------------------------------------------------------------------+
root@cloudcontrol1005:~# openstack server show --os-project-id tf-infra-test 123d04f6-d8c5-46a3-9072-8f741c07d11f
+-------------------------------------+-------------------------------------------------------------------------+
| Field                               | Value                                                                   |
+-------------------------------------+-------------------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                                  |
| OS-EXT-AZ:availability_zone         | nova                                                                    |
| OS-SRV-USG:launched_at              | 2023-07-13T07:49:02.000000                                              |
| OS-SRV-USG:terminated_at            | None                                                                    |
| flavor                              | g3.cores1.ram2.disk20 (55d5d90f-c5c6-44ff-bb8a-be7b077481cf)            |
| id                                  | 123d04f6-d8c5-46a3-9072-8f741c07d11f                                    |
| image                               | magnum-fedora-coreos-34 (89f6c455-36bd-4d41-8d63-f48cfb50a3fb)          |
| name                                | tf-infra-test-123-ku7alouvzp22-node-0                                   |
| security_groups                     | name='tf-infra-test-123-ku7alouvzp22-secgroup_kube_minion-owcyslwvnnjb' |
| status                              | ACTIVE                                                                  |
| updated                             | 2023-07-13T07:49:02Z                                                    |
| user_id                             | rook                                                                    |
| volumes_attached                    | id='a87d98d8-8466-4da0-a74c-c1ca8e58cd45'                               |
+-------------------------------------+-------------------------------------------------------------------------+

Maybe some leftovers from a previous failed run? I'll try running a destroy first and see if that helps in any way :/

And start looking for logs from previous runs.
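
The plan is roughly the following (a sketch; the working directory matches the prompt shown further down):

cd ~/tf-infra-test
terraform destroy   # clear whatever the failed run left behind, confirm with 'yes'
terraform apply     # then re-run the full apply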

Running a destroy cleared those instances:

root@cloudcontrol1005:~# openstack server list --os-project-id tf-infra-test
+--------------------------------------+------------+--------+---------------------------------------+----------------------+-----------------------+
| ID                                   | Name       | Status | Networks                              | Image                | Flavor                |
+--------------------------------------+------------+--------+---------------------------------------+----------------------+-----------------------+
| 40560d4a-6b06-49be-bfcd-2565666ef95d | tf-bastion | ACTIVE | lan-flat-cloudinstances2b=172.16.5.68 | debian-12.0-bookworm | g3.cores2.ram4.disk20 |
+--------------------------------------+------------+--------+---------------------------------------+----------------------+-----------------------+

Will try an apply again and see if that helped.

There are no previous logs :/, so I created T341769: [cloudvps] use a systemd timer for the terraform tests to get logs to follow up on that.

Running apply again to see if this time it works or fails in the same place.

Now I got a double error:

╷
│ Error: Error creating openstack_db_instance_v1: Expected HTTP response code [200] when accessing [POST https://openstack.eqiad1.wikimediacloud.org:28779/v1.0/tf-infra-test/instances], but got 413 instead
│ {"overLimit": {"code": 413, "message": "Quota exceeded for resources: ['volumes']."}}
│
│   with openstack_db_instance_v1.postgresql,
│   on trove.tf line 30, in resource "openstack_db_instance_v1" "postgresql":
│   30: resource "openstack_db_instance_v1" "postgresql" {
│
╵
╷
│ Error: Error creating openstack_db_instance_v1: Expected HTTP response code [200] when accessing [POST https://openstack.eqiad1.wikimediacloud.org:28779/v1.0/tf-infra-test/instances], but got 413 instead
│ {"overLimit": {"code": 413, "message": "Quota exceeded for resources: ['volumes']."}}
│
│   with openstack_db_instance_v1.mariadb,
│   on trove.tf line 84, in resource "openstack_db_instance_v1" "mariadb":
│   84: resource "openstack_db_instance_v1" "mariadb" {
│
╵

Hmm, there's definitely something going on with volume quotas somewhere.
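
A couple of places to check where the limit is actually being hit (a sketch; quota-show is the same command used further down, the other two assume admin credentials on a cloudcontrol):

# Trove's own per-project quota accounting
trove quota-show tf-infra-test
# project-level quotas/usage on the Cinder side, in case the limit is there instead
openstack quota show tf-infra-test
openstack volume list --os-project-id tf-infra-test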

Running with terraform apply --parallelism=1 did not help:

╷
│ Error: Error creating openstack_db_instance_v1: Expected HTTP response code [200] when accessing [POST https://openstack.eqiad1.wikimediacloud.org:28779/v1.0/tf-infra-test/instances], but got 413 instead
│ {"overLimit": {"code": 413, "message": "Quota exceeded for resources: ['volumes']."}}
│
│   with openstack_db_instance_v1.postgresql,
│   on trove.tf line 30, in resource "openstack_db_instance_v1" "postgresql":
│   30: resource "openstack_db_instance_v1" "postgresql" {
│
╵
╷
│ Error: Error creating openstack_db_instance_v1: Expected HTTP response code [200] when accessing [POST https://openstack.eqiad1.wikimediacloud.org:28779/v1.0/tf-infra-test/instances], but got 413 instead
│ {"overLimit": {"code": 413, "message": "Quota exceeded for resources: ['volumes']."}}
│
│   with openstack_db_instance_v1.mariadb,
│   on trove.tf line 84, in resource "openstack_db_instance_v1" "mariadb":
│   84: resource "openstack_db_instance_v1" "mariadb" {
│

But I found a way to trigger just the failing resource on its own:

root@tf-bastion:~/tf-infra-test# terraform apply --target=openstack_db_instance_v1.postgresql

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # openstack_db_instance_v1.postgresql will be created
  + resource "openstack_db_instance_v1" "postgresql" {
      + addresses = (known after apply)
      + flavor_id = "55d5d90f-c5c6-44ff-bb8a-be7b077481cf"
      + id        = (known after apply)
      + name      = "tf-postgresql"
      + region    = "eqiad1-r"
      + size      = 2

      + database {
          + name = "tf-postgresql"
        }

      + datastore {
          + type    = "postgresql"
          + version = "12.7"
        }

      + network {
          + uuid = "7425e328-560c-4f00-8e99-706f3fb90bb4"
        }

      + user {
          + databases = [
              + "tf-postgresql",
            ]
          + host      = "%"
          + name      = "infra"
          + password  = (sensitive value)
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.
╷
│ Warning: Resource targeting is in effect
│
│ You are creating a plan with the -target option, which means that the result of this plan may not represent all of the changes requested by the current configuration.
│
│ The -target option is not for routine use, and is provided only for exceptional situations such as recovering from errors or mistakes, or when Terraform specifically suggests to use it as part of an error message.
╵

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

openstack_db_instance_v1.postgresql: Creating...
╷
│ Warning: Applied changes may be incomplete
│
│ The plan was created with the -target option in effect, so some changes requested in the configuration may have been ignored and the output values may not be fully updated. Run the following command to verify that no other changes are pending:
│     terraform plan
│
│ Note that the -target option is not suitable for routine use, and is provided only for exceptional situations such as recovering from errors or mistakes, or when Terraform specifically suggests to use it as part of an error message.
╵
╷
│ Error: Error creating openstack_db_instance_v1: Expected HTTP response code [200] when accessing [POST https://openstack.eqiad1.wikimediacloud.org:28779/v1.0/tf-infra-test/instances], but got 413 instead
│ {"overLimit": {"code": 413, "message": "Quota exceeded for resources: ['volumes']."}}
│
│   with openstack_db_instance_v1.postgresql,
│   on trove.tf line 30, in resource "openstack_db_instance_v1" "postgresql":
│   30: resource "openstack_db_instance_v1" "postgresql" {
│

So, it seems that the problem is leftover volume reservations in the database quota for the tf-infra-test project.

From the database quota:

root@cloudcontrol1005:~# trove quota-show tf-infra-test
+--------------+--------+----------+-------+
| Resource     | In Use | Reserved | Limit |
+--------------+--------+----------+-------+
| b'backups'   |      0 |        0 |     2 |
| b'instances' |      1 |       -4 |    10 |
| b'ram'       |   2048 |    -4096 |    -1 |
| b'volumes'   |      2 |        8 |    10 |
+--------------+--------+----------+-------+

That shows up in the database like this:

mysql:galera_backup@localhost [trove_eqiad1]> select * from quota_usages where tenant_id='tf-infra-test';
+--------------------------------------+---------------------+---------------------+---------------+--------+----------+-----------+
| id                                   | created             | updated             | tenant_id     | in_use | reserved | resource  |
+--------------------------------------+---------------------+---------------------+---------------+--------+----------+-----------+
| b3f7b371-ac14-4d02-afbd-f94f1b992c2a | 2023-06-26 19:39:38 | 2023-06-26 19:39:38 | tf-infra-test |      0 |        0 | backups   |
| ad1bcb7e-8b6b-4d36-8382-b24812d1a0f5 | 2023-06-26 17:11:12 | 2023-07-13 09:22:42 | tf-infra-test |      1 |       -4 | instances |
| cc8e7f0b-b102-4c64-aee5-c19924742e11 | 2023-06-26 17:11:12 | 2023-07-13 09:22:42 | tf-infra-test |   2048 |    -4096 | ram       |
| 22714fc9-75a5-4c11-82bf-a3a0bc900424 | 2023-06-26 17:11:12 | 2023-07-13 09:22:42 | tf-infra-test |      2 |        8 | volumes   |
+--------------------------------------+---------------------+---------------------+---------------+--------+----------+-----------+
4 rows in set (0.001 sec)

But looking at the reservations, we find none that is not committed or rolled back:

mysql:galera_backup@localhost [trove_eqiad1]> select * from reservations where status not in ('Committed', 'Rolled Back');
Empty set (0.003 sec)
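
As a sanity check that nothing is stuck in some other state either (a sketch, run from the same host; connection options omitted):

mysql trove_eqiad1 -e "select status, count(*) from reservations group by status;"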

The guess is that the current slowness of the APIs might have made some processes misbehave and not clean up properly, or similar, so we'll try restarting all the OpenStack controller services and then cleaning up manually.
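
Roughly, on each cloudcontrol (a sketch; the exact unit names are an assumption, adjust to whatever is actually deployed there):

# restart the trove control-plane services and check they came back
systemctl restart trove-api trove-taskmanager trove-conductor
systemctl --no-pager status trove-api trove-taskmanager trove-conductor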

OK, restarted all the services and manually updated the reserved value:

mysql:root@localhost [trove_eqiad1]> update quota_usages set reserved=0 where tenant_id='tf-infra-test' and resource='volumes';
Query OK, 1 row affected (0.002 sec)
Rows matched: 1  Changed: 1  Warnings: 0
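
A quick check that the fix took effect (same command as above; Reserved for volumes should be back to 0):

trove quota-show tf-infra-test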

Let's try again (btw, thanks @fnegri for the hints about the reserved value, and @aborrero for the support!)

After the restart and cleanup, the Trove DB creation now fails on DNS registration:

    Error creating DNS entry for instance:
    9fef8e89-e86d-410b-af43-c255d46fda1f
Unknown

Looking
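
A couple of things to look at (a sketch; the zone id is a placeholder, and the database/zone/recordset subcommands assume the trove and designate client plugins are installed):

# which trove instances exist, possibly half-created with stale DNS state?
openstack database instance list --os-project-id tf-infra-test
# what does designate hold for the project's zone?
openstack zone list --os-project-id tf-infra-test
openstack recordset list <zone-id>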

Manually removing all the database instances seemed to do the trick; the tests are back up and running \o/
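
For reference, the removal is basically (a sketch; instance IDs come from the listing above, and either client should work):

openstack database instance delete <instance-id>
# or, with the legacy client already used above:
trove delete <instance-id>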

dcaro moved this task from To refine to Done on the User-dcaro board.