
Request creation of 'sre-sandbox' VPS project
Closed, ResolvedPublic

Description

Project Name: sre-sandbox

Wikitech Usernames of requestors: I think everyone in cn=ops,ou=groups,dc=wikimedia,dc=org

Purpose: General R&D for SRE projects

Brief description: It's quite common when investigating new software and comparing different platforms to require short-lived machines (1-2 months) to build prototypes and compare features before picking the project which will ultimately be developed for production. It would be useful if we could have a general-purpose project to perform this type of prototyping, instead of requesting short-lived projects for each quarter. The SSO project is an example of such a project which should have been short-lived and removed after we settled on using CAS, however I've now started to reuse it to evaluate PKI solutions.

How soon you are hoping this can be fulfilled: this quarter

Event Timeline

jbond triaged this task as Medium priority. Mar 12 2020, 2:23 PM
jbond added a project: SRE.

I'm going to start my response with an annoying quoting of the guidance on project scope:

Project scope

Cloud VPS projects should be scoped based around concrete products or software projects, rather than the team working on them. The three main problems that we (the Cloud Services team) have seen in the past with team ownership/scope for Cloud VPS projects are:

  • Team gets disbanded/reorganized but its project needs to live on due to hosting of important VMs
  • Difficulty establishing who is the primary point of contact for a given VM when trying to reclaim quota or fix a broken instance
  • Tendency to close membership/participation to only team members rather than inviting participation by other volunteers

There are things that can be done to mitigate these problems, but the easiest thing to do is to create more targeted projects that are scoped to a product/project rather than a team. This can become a burden in other ways if a common group of developers is active on a large number of such projects, so we are willing to be flexible if good cause can be shown for project consolidation.

For more guidance, see https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS_project

Just quoting documents without providing a more complete picture of the 'why' seems dismissive, so I would like to elaborate a bit on this topic, in part because it comes up on an almost predictable schedule, and also because the WMCS team has not revisited this decision for long enough that it is worth spending a bit of time thinking about it more deeply.

Way back in the olden days (FY2013/2014) when I was first introduced to Cloud VPS (né Labs) it was relatively common practice to have team oriented projects. I actually was the 'founder' of such a project for the MediaWiki Core Team in that era. It was also common practice for all SREs and several other Foundation staff to have 'cloudadmin' rights on Wikitech. At that time all OpenStack project creation was handled using MediaWiki-extensions-OpenStackManager on Wikitech and users with the 'cloudadmin' role could create a new project with a few mouse clicks. This was a great time to be a trusted Cloud VPS user. It was also a horrible time to be responsible for cross-project maintenance or capacity planning for the Cloud VPS environment. Ad hoc project creation without any oversight or central tracking led to several unwelcome surprises for the team primarily related to growth in usage without any real tracking of who and why.

A process of using Phabricator tasks to track project requests was started in FY2014/2015 by @yuvipanda with T76375: [DO NOT USE] New Labs project requests (tracking) [superseded by #cloud-vps-project-requests]. This was a bit better for tracking than the prior system of "ask someone on irc"/"fill out this Semantic MediaWiki form". Processing these requests was still done ad hoc until November 2016, when @chasemp and the rest of what is now the WMCS team implemented a set weekly review of the queue and asked other cloudadmins to stop creating ad hoc projects without them going through that review & approval phase. At the time capacity planning was a real struggle, and the team was hoping that this central review change would give them better insight into where and why use was growing. I think that has actually turned out to be a success, at least from the point of view of the WMCS team.

Somewhere in the 2015-2016 period, we started trying to discourage 'team' projects as well. I hope the reasons for this are explained reasonably well in the "project scope" guidance that I quoted above. This came out of going through many cycles of dead project purging (for example Cloud VPS 2016 Purge), Operating System deprecation (for example Trusty deprecation), and the more 'normal' search for instances whose broken Puppet manifests keep them from tracking changes to our shared infrastructure. Projects with uncertain "ownership" of each instance make all of these normal periodic tasks more challenging than more focused projects do. As I mentioned, I started the mediawiki-core-team project in 2014. That team was disbanded in April 2015, but it took until January 2017 for all the instances to be removed from the project and for it to be shut down.

</history>

That historical rambling is all well and good, and I hope it helps folks understand some of the reasons that the current processes and guidance exist. I think a more interesting thing to explore though is what the pain points are with "requesting short-lived projects for each quarter" as mentioned in this request.

  • Is the ~1 week waiting period too long?
  • Would a shorter turn around time for requests make it seem less burdensome?
  • What other pain do y'all experience in bootstrapping a new Cloud VPS project for the type of work you do there?

@bd808 thanks for the detailed history, this does help, and I think the reasons given and how we arrived here all seem logical. It sounds like historically team projects were used for infrastructure which ended up serving production requests or providing useful services to the Foundation. With this in mind it seems reasonable for services to be scoped correctly and have their own individual projects. However, what I'm requesting here is simply a playground; I don't envisage any service in the proposed SRE project ever serving live traffic or providing useful services to the wider Foundation (famous last words).

The project would be used solely for prototyping and trialling new software/services. Once the prototyping phase is over I would envisage a request for physical, Ganeti or cloud resources to actually deploy the service. Speaking from my own personal experience, I would normally do this type of prototyping on my own laptop or with some cloud provider; however there are definitely times when the resources required to run certain tests are not available.

To answer your specific questions:

Is the ~1 week waiting period too long?

Yes, in many cases some of these prototypes wouldn't even exist for a week, i.e. I may spin up a server, perform some tests over a 24-hour period, conclude the software is not a good fit, destroy the VM and move on.

Would a shorter turn around time for requests make it seem less burdensome?

For my use case I think the turnaround time would need to be < 1 hour, otherwise I would just work around the issue in GCloud, Amazon or VirtualBox.

What other pain do y'all experience in bootstrapping a new Cloud VPS project for the type of work you do there?

I think for well-scoped projects the current process is good. However, I guess what I'm asking for is a space for projects which are not well scoped, or even just a space to trial whims of inspiration, i.e. "I just saw this new piece of software on HN, gonna spin it up in the sre project and trial it." Play with the software for a few hours and either decide it's no good and destroy the VM, or decide it is good and use the VM as a demo to decide if we want to progress with the project. If we do decide to move forward, we properly scope the project and ask for the appropriate resources in cloud, Ganeti or physical hardware.

The project would be used solely for prototyping and trialling new software/services. Once the prototyping phase is over I would envisage a request for physical, Ganeti or cloud resources to actually deploy the service. Speaking from my own personal experience, I would normally do this type of prototyping on my own laptop or with some cloud provider; however there are definitely times when the resources required to run certain tests are not available.

I am pretty sure I understand your use case and desire, but I currently don't see how this will end differently than other loosely scoped, multi-owner projects unless we do something more proactive than social contracts. Any ideas how we could enforce instances being short lived in this "sre-sandbox" project?

One quick idea: pick a mean instance lifetime (for example 10 days) and set up a script somewhere that will email a whole bunch of people when an instance is found in the project that is older than that. Basically this is harnessing our collective dislike of cron spam to force folks to respect the idea of this being a quick proving ground rather than a long term work location.

In T247517#5995043, @bd808 wrote:

I am pretty sure I understand your use case and desire, but I currently don't see how this will end differently than other loosely scoped, multi-owner projects unless we do something more proactive than social contracts. Any ideas how we could enforce instances being short lived in this "sre-sandbox" project?

I completely understand your fear.

One quick idea: pick a mean instance lifetime (for example 10 days) and set up a script somewhere that will email a whole bunch of people when an instance is found in the project that is older than that. Basically this is harnessing our collective dislike of cron spam to force folks to respect the idea of this being a quick proving ground rather than a long term work location.

Perhaps we could be even more aggressive. I'm not sure how much we could script things or what the capabilities of OpenStack are, but I wonder if it's possible to have logic similar to the following:

  • all machines must be associated with an owner (do this via hiera?)
  • after a machine has been online for 10 work days, the owner is emailed stating they have 5 work days to delete or migrate their host
    • If the host has no owner it is deleted automatically at this point
  • The owner of the host now has 5 work days to request a correctly scoped project to migrate their host to, or to delete it
  • Any machines that have been on for more than 15 work days are automatically deleted

Thanks john

Perhaps we could be even more aggressive. I'm not sure how much we could script things or what the capabilities of OpenStack are, but I wonder if it's possible to have logic similar to the following:

  • all machines must be associated with an owner (do this via hiera?)

If we consider the user that created the instance from Horizon the owner, then OpenStack already tracks this for us. That metadata does get messed up when we have to migrate an instance from one hypervisor host to another today, but that shouldn't happen often in a time limited scenario.
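As a minimal, hypothetical sketch (assuming openstacksdk and a clouds.yaml entry named 'sre-sandbox', neither of which exists yet), reading that ownership metadata could look something like this:

  import openstack

  # Connect using a (hypothetical) clouds.yaml entry for the project.
  conn = openstack.connect(cloud='sre-sandbox')
  for server in conn.compute.servers():
      # Nova records the Keystone ID of the user who created each instance,
      # along with its creation timestamp.
      print(server.name, server.user_id, server.created_at)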

  • after a machine has been online for 10 work days, the owner is emailed stating they have 5 work days to delete or migrate their host
    • If the host has no owner it is deleted automatically at this point
  • The owner of the host now has 5 work days to request a correctly scoped project to migrate their host to, or to delete it
  • Any machines that have been on for more than 15 work days are automatically deleted

This should all be possible to script from the control plane servers. I think it would be easier to drop the idea of deleting unowned instances at day 10 and just have the 15 day max lifetime with a warning message at day 10.
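A rough sketch of what such a control plane script might look like, assuming openstacksdk credentials for the project and a local MTA (the clouds.yaml entry name and the notification address are placeholders, and calendar days stand in for work days):

  import datetime
  import smtplib
  from email.message import EmailMessage

  import openstack

  WARN_AFTER = datetime.timedelta(days=10)
  DELETE_AFTER = datetime.timedelta(days=15)
  NOTIFY = 'sre-sandbox-admins@example.org'  # placeholder address

  def notify(subject, body):
      # Send a plain-text warning/notice through the local MTA.
      msg = EmailMessage()
      msg['Subject'] = subject
      msg['From'] = NOTIFY
      msg['To'] = NOTIFY
      msg.set_content(body)
      with smtplib.SMTP('localhost') as smtp:
          smtp.send_message(msg)

  conn = openstack.connect(cloud='sre-sandbox')
  now = datetime.datetime.now(datetime.timezone.utc)
  for server in conn.compute.servers():
      created = datetime.datetime.fromisoformat(
          server.created_at.replace('Z', '+00:00'))
      age = now - created
      if age > DELETE_AFTER:
          # Past the 15 day maximum lifetime: delete and say why.
          notify(f'Deleting instance {server.name}',
                 f'{server.name} is {age.days} days old, past the 15 day limit.')
          conn.compute.delete_server(server)
      elif age > WARN_AFTER:
          # Between day 10 and day 15: warn only.
          notify(f'Instance {server.name} will be deleted soon',
                 f'{server.name} is {age.days} days old; deletion happens at 15 days.')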

bd808 renamed this task from Request creation of SRE VPS project to Request creation of 'sre-sandbox' VPS project. Mar 25 2020, 3:12 PM
bd808 updated the task description.

Ack, thanks. I'll pass this on in the sre-foundations meeting in 5 mins.

bd808 moved this task from Feedback needed to Approved on the Cloud-VPS (Project-requests) board.

Discussed and approved in the 2020-03-25 WMCS team meeting. We want to build the instance reaper script before we turn the project over to the SRE folks. That means that the project will probably not be created by the requested end of quarter date (2020-03-31), but it seems likely that it will happen within the next 2-3 weeks.

The idea sounds good to me and will surely help to simplify one main use case. One possible addition could be a single long-lived instance with a standalone puppetmaster for testing Puppet changes, but that can be postponed until we see how the project is used in real life.

Change 592971 had a related patch set uploaded (by Andrew Bogott; owner: Andrew Bogott):
[operations/puppet@production] sre-sandbox hiera: added a link to the project request

https://gerrit.wikimedia.org/r/592971

Change 592971 merged by Andrew Bogott:
[operations/puppet@production] sre-sandbox hiera: added a link to the project request

https://gerrit.wikimedia.org/r/592971

Andrew claimed this task.
Andrew added a subscriber: Andrew.

This project has now been created. @jbond, you are the initial projectadmin and can add other users/projectadmins as you see fit.

This project is subject to the attention of

https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Maintenance#wmcs-instancepurge

which means that VMs will be deleted automatically after 15 days. Please note that yours is the first project to use this script, so you should expect misfires during the first few weeks (where 'misfire' means 'annoying random emails and/or unexpected VM deletion').

Is it possible to get more quota in this project? I just tried to create a machine and we have 1 x m1.xlarge which seems to have taken all the quota.

Is it possible to get more quota in this project? I just tried to create a machine and we have 1 x m1.xlarge which seems to have taken all the quota.

See Cloud-VPS (Quota-requests) for the process that needs to be followed. And yes, the default quota is equivalent to a single xlarge instance.
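If it helps when filing that request, current limits can be inspected with a short openstacksdk snippet (hypothetical, assuming suitable credentials and the same 'sre-sandbox' clouds.yaml entry as above):

  import openstack

  conn = openstack.connect(cloud='sre-sandbox')
  # Nova quota limits for the project: CPU cores, RAM (MB), instance count.
  quotas = conn.get_compute_quotas('sre-sandbox')
  print('cores:', quotas.cores, 'ram MB:', quotas.ram, 'instances:', quotas.instances)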