
EXPLORE removing Openstack API calls from the user critical path
Closed, Declined · Public

Description

This is followup work from T348549: EXPLORE provisioning one small k8s VM per each test environment USING a single Cloud VPS project.

Current status:

  • users can use catalyst to trigger the Openstack API to start a VM on Cloud VPS with a small k8s installed

Problems:

  • during development, the Openstack API would at times time out or return 500 errors

Options:

  • would a persistent small k8s cluster to handle API requests be possible and maintainable?
  • would a pool of warm VMs instantiated outside the user critical path lessen this issue? Or only move it around?

The output of this exploration story should be a proposal broken down into story size steps.

Event Timeline

thcipriani added a subscriber: Slst2020.

@Slst2020 assigning you based on the Catalyst workboard discussion today. Please reach out if you have questions or worries about the scope of this task.

would a pool of warm VMs instantiated outside the user critical path lessen this issue? Or only move it around?

This is something we discussed a few times but never got around to trying out. I think we should explore this option before trying to create a custom clone of toolforge ;) And, WMCS now supports unmanaged instances, which gives us full control over VM provisioning and access control through dynamic SSH key injection.

A PoC would need to have at least these components:

  1. An initial pool of pre-provisioned VMs. Could be just manually created to begin with. These VMs would be accessible to all admins, i.e. members of the catalyst Cloud VPS project.
  2. VM Pool Manager: Monitors the number of available VMs and provisions new ones if the count drops below X. Interacts with OpenStack APIs to manage VM lifecycle (creation, deletion), and with the Web Proxy API for networking. Handles requests from the Catalyst API when a user needs a new instance or wants to delete an existing one.
  3. Database/State Store: Stores the state of each VM (e.g., in use, available) and user allocations.
  4. Automation Tools: E.g. Ansible for automating VM provisioning and configuration.

On receiving a request from a user via the Catalyst API, the VM Pool Manager allocates an available VM to the user, injects their SSH keys, marks it as in-use in the database, and returns the connection details. It then adds a new VM to the warm pool. Upon release, the VM Pool Manager destroys the instance.
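
To make that flow a bit more concrete, here is a rough, illustrative sketch of what the allocate/release/replenish logic could look like. Everything in it (the PoolStore, the VMPoolManager, and the injected provision/inject_key/destroy callables) is a hypothetical placeholder rather than an existing Catalyst or OpenStack SDK interface:

```python
import threading
from dataclasses import dataclass, field

# Hypothetical sketch of the flow described above. All names here are
# placeholders, not existing Catalyst or OpenStack SDK APIs.

@dataclass
class VM:
    name: str
    ip: str
    in_use_by: str | None = None

@dataclass
class PoolStore:
    """Component 3: the database/state store (in-memory here for illustration)."""
    available: list[VM] = field(default_factory=list)
    in_use: dict[str, VM] = field(default_factory=dict)

class VMPoolManager:
    """Component 2: keeps the warm pool topped up and handles allocate/release."""

    def __init__(self, store, provision, inject_key, destroy, min_warm=3):
        self.store = store
        self.provision = provision      # e.g. an OpenStack "create server" wrapper
        self.inject_key = inject_key    # unmanaged-instance SSH key injection
        self.destroy = destroy          # e.g. an OpenStack "delete server" wrapper
        self.min_warm = min_warm
        self.lock = threading.Lock()

    def allocate(self, user: str, ssh_public_key: str) -> VM:
        """Hand a warm VM to the user, then replenish outside the critical path."""
        with self.lock:
            vm = self.store.available.pop() if self.store.available else None
        if vm is None:
            vm = self.provision()            # cold path: pool was empty
        self.inject_key(vm, ssh_public_key)
        vm.in_use_by = user
        self.store.in_use[vm.name] = vm
        # Top the pool back up in the background so the user never waits for it.
        threading.Thread(target=self.replenish, daemon=True).start()
        return vm

    def release(self, vm: VM) -> None:
        """On release, destroy the instance rather than returning it to the pool."""
        self.destroy(vm)
        self.store.in_use.pop(vm.name, None)

    def replenish(self) -> None:
        with self.lock:
            missing = self.min_warm - len(self.store.available)
        for _ in range(missing):
            vm = self.provision()            # the slow ~10 min step happens here
            with self.lock:
                self.store.available.append(vm)
```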

I'm a bit confused by the scope of this task and the current proposal.

I thought there was no work remaining in the backend/infra for the PoC. And for the MVP my understanding was that a K8s cluster was the preferred solution over a pool of VMs. Mentioned in section "Looking towards v1.0" in Stef's wrap-up doc

EDIT: If our intention is to already offer the PoC to testers, we could for now deal with Openstack's flakiness by implementing a retry mechanism in the backend
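
For illustration, such a retry could be as simple as a generic exponential-backoff wrapper around the OpenStack calls; OpenStackAPIError and client.create_server below are placeholders for whatever client library and exception type the backend actually uses:

```python
import random
import time
from functools import wraps

class OpenStackAPIError(Exception):
    """Placeholder for whatever exception the OpenStack client actually raises."""

def retry_openstack(max_attempts=5, base_delay=1.0, max_delay=30.0):
    """Retry flaky OpenStack calls with capped exponential backoff and jitter."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except OpenStackAPIError:
                    if attempt == max_attempts:
                        raise  # give up and surface the error to the caller
                    # back off 1s, 2s, 4s, ... (capped), plus jitter
                    delay = min(base_delay * 2 ** (attempt - 1), max_delay)
                    time.sleep(delay + random.uniform(0, delay / 2))
        return wrapper
    return decorator

@retry_openstack()
def create_server(client, name, flavor, image):
    # Placeholder for the actual "start a VM on Cloud VPS" call in the backend.
    return client.create_server(name=name, flavor=flavor, image=image)
```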

I'm a bit confused by the scope of this task and the current proposal.

I thought there was no work remaining in the backend/infra for the PoC.

Yes I agree – this is out of scope as far as the PoC is concerned.

And for the MVP my understanding was that a K8s cluster was the preferred solution over a pool of VMs. Mentioned in section "Looking towards v1.0" in Stef's wrap-up doc

True, although I don't think any firm decision has been taken yet. We never actually had the planned wrap-up meeting. I think exploring a warm pool of VMs should at least be considered before going all-in on a toolforge-style cluster, especially as it's now been made easier with unmanaged VMs.

  1. An initial pool of pre-provisioned VMs. Could be just manually created to begin with. These VMs would be accessible to all admins, i.e. members of the catalyst Cloud VPS project.
  2. VM Pool Manager: Monitors the number of available VMs and provisions new ones if the count drops below X. Interacts with OpenStack APIs to manage VM lifecycle (creation, deletion), and with the Web Proxy API for networking. Handles requests from the Catalyst API when a user needs a new instance or wants to delete an existing one.

This sounds a lot like recreating https://wikitech.wikimedia.org/wiki/Obsolete:Nodepool. When Jenkins was using Nodepool for job state isolation it was a major cause of load for the OpenStack backplane and toil for the WMCS SREs.

during development, the Openstack API would at times time out or return 500 errors

Why is there no task or other discussion that I have heard about for fixing those 5xx errors? That seems like a more stable solution than trying to invent a hack to work around them.

during development, the Openstack API would at times time out or return 500 errors

Why is there no task or other discussion that I have heard about for fixing those 5xx errors? That seems like a more stable solution than trying to invent a hack to work around them.

Regardless, starting up and provisioning a VM takes ~10 min. Keeping a warm pool is not an unusual pattern in this context; it's not a "hack" or a workaround.

@jnuche / @jeena, would you mind filing a task with details about any OpenStack API flakiness you've encountered while developing the catalyst backend?

Sorry to cause confusion and worry in this task, my fault for wording the task like I did.

I should clarify:

  • For the Catalyst proof of concept (PoC), users click a button, we call out to the Openstack API, and it magics us a VM. And it's all working.
  • Spawning VMs via the API is super cool and it works—I'm grateful for WMCS' support.
  • I'm unaware of any bugs or problems that need fixing for Openstack. My target in filing this task (which I missed) was to flag the user experience from Catalyst as something to explore improving.
  • I'm not signing anyone up for nodepool 2.0

So, the catalyst user experience is less than ideal since (1) it unavoidably takes time to spawn VMs, and (2) we hit some snags (I don't have the details; @jnuche may remember the challenges he encountered. Please add details, but let's treat anything actionable there as a separate task).

But I'm still spinning up on the project, and the discussion here makes me think I've gotten ahead of the PoC phase: problems like these are fine to leave unresolved for now, but they should be part of the planning for the MVP. @Slst2020 and @jnuche, what are your thoughts?

During development of the PoC back in November, calls to the OpenStack API from the Catalyst backend would sometimes time out or return 500. IIRC this would happen in bouts, i.e. the OS API would stay unstable for a short period (5-10 mins?) during which some calls would succeed and others would fail. Then the API would stay stable for longer stretches, sometimes days at a time. Maybe the problem was caused by maintenance windows, but I don't remember seeing any announcements at the time.

Last week I spent a couple of days trying to reproduce the problem from the Catalyst backend and get a stacktrace to show an example of the issue, but the API stayed stable throughout. Then I got sidetracked and forgot to reply here; really sorry about that, @Slst2020.

I do have a screenshot from Horizon I captured at some point. This one happened while trying to refresh the proxies page:

500-proxy-ui.png (245×322 px, 24 KB)

@thcipriani it also makes sense to me to tackle this during the planning of the MVP

@thcipriani it also makes sense to me to tackle this during the planning of the MVP

Yep, makes sense, and post our sync with WMCS I think we've collectively found a path forward there. I'm going to decline this one. Thanks all!