cumin and cloud-vps instances not working
Open, Stalled, High, Public

Description

I am trying to fix a fairly simple issue today using cumin, and I can't make cumin work.

Some examples:

andrew@cloudcumin1001:~$ sudo cumin --force O{project:deployment-prep} 'true'

fails on 64/66 VMs.

andrew@cloudcumin1001:~$ sudo cumin --force O{*} 'true' 
Caught Unauthorized exception: The request you have made requires authentication. (HTTP 401) (Request-ID: req-08388682-1396-4143-898e-aea417d122a9)
andrew@cloud-cumin-03:~$ sudo cumin --force A:all 'true'
Caught Unauthorized exception: The request you have made requires authentication. (HTTP 401) (Request-ID: req-4e9ba2db-8806-4de1-a355-4796a8c019e0)

(and 'keyholder arm' doesn't seem to help with this)

These are things we need to be able to do. As it stands, there are any number of mishaps that we have no way to recover from. Is there a right/new way to do things like this that I just don't know about?

Event Timeline

Andrew triaged this task as Unbreak Now! priority. Sep 26 2023, 8:01 PM

HTTP 401, is that the openstack API?

This looks similar to, and may be a duplicate of, T346453: [cumin] [openstack] Openstack backend fails when project is not set, but I thought that one only failed when a project was not specified?

It's now definitely failing when a project is set, but I think they are two separate issues:

  • when running cumin with O{project:some-project} it fails without an error message (but seemingly only if the project contains many hosts)
  • when running cumin with O{*} or A:all it fails with HTTP 401 because of T346453, which should be fixable with the workaround listed there

The failure when a project is specified is a concurrency issue: running cumin with -b 10 works fine. This is probably related to T340241: cloudcumin is slow when targeting many cloud VMs.
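
For reference, the -b 10 workaround can be sketched as a small helper that assembles the command line. This is only an illustration: the helper name build_cumin_command is hypothetical, and the only cumin options it uses are the ones already shown in this task (--force and -b). It prints the command rather than running it.

```python
import shlex


def build_cumin_command(query, command, batch_size=None):
    """Return the argv list for a cumin run, optionally capped with -b.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    argv = ["sudo", "cumin", "--force"]
    if batch_size is not None:
        # -b caps how many hosts are targeted at once (the workaround above).
        argv += ["-b", str(batch_size)]
    argv += [query, command]
    return argv


argv = build_cumin_command("O{project:deployment-prep}", "true", batch_size=10)
# shlex.join quotes the query so the shell passes the braces through unchanged.
print(shlex.join(argv))
```

Quoting the query matters mostly for selectors such as O{*}, where an unquoted * could be expanded by the shell before cumin sees it.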

fnegri changed the task status from Open to In Progress. Sep 27 2023, 10:05 AM
fnegri claimed this task.

Increasing connect_timeout from 20 to 40 in /etc/cumin/config.yaml does indeed fix the issue. I have no idea why 20 was enough when I set it in T323484; I'm sure I tested it in large projects with hundreds of hosts and it worked fine.

Instead of increasing the value to 40, I will first try to assign more CPUs to the restricted bastion, as discussed in T340241.

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-27T10:52:38Z] <fnegri@cloudcumin1001> START - Cookbook wmcs.vps.create_instance_with_prefix with prefix 'bastion-restricted-eqiad1' (T347428)

Mentioned in SAL (#wikimedia-cloud-feed) [2023-09-27T10:55:10Z] <fnegri@cloudcumin1001> END (PASS) - Cookbook wmcs.vps.create_instance_with_prefix (exit_code=0) with prefix 'bastion-restricted-eqiad1' (T347428)

fnegri lowered the priority of this task from Unbreak Now! to High. Sep 27 2023, 11:55 AM

Lowering the priority to "High" as we have a workaround for both issues:

  • using -b 10 to fix the concurrency issue
  • manually applying this patch to fix the HTTP 401 issue (I have just applied it to cloudcumin1001)

After replacing the bastion in T340241, it's no longer necessary to add -b 10, though Cumin is still not as fast as I'd like (more details in that task).

Fixing the 401 issue is tracked in T346453, so I'll mark this one as "Stalled" until that one is resolved.

fnegri changed the task status from In Progress to Stalled. Sep 28 2023, 1:18 PM