
Wikimedia Technical Conference 2019 Session: Quo Vadis Beta Cluster? Towards better testing and staging environments
Closed, Resolved · Public

Description

Session

  • Track: Deploying and Hosting
  • Topic: Quo Vadis Beta Cluster? Towards better testing and staging environments

Description

So-called "beta" sites have commonly been used as a testing and staging environment for code changes before these are shipped to production. There are several issues with using these sites as a pre-production environment:

  • They are hosted on infrastructure that is, de facto, not maintained.
  • There is no easy, automated way to set the configuration of the testing/staging site (in the sense of extensions and services enabled, feature flags set, and other config options), not to mention being able to launch a pre-production environment with a particular configuration and particular software versions, e.g. when intending to test the compatibility of a new feature with different versions of extensions, skins, etc.
  • They are effectively "permanent" sites, i.e. they are not suited for use as staging environments that live only for a short or medium term.

In this session we will look into what requirements we have for the staging environments used in our work, and also explore possible solutions that could replace the "beta" sites as testing and staging environments (e.g. using the new possibilities opened up by the adoption of container-based solutions).

Questions to answer and discuss

Question: Why and when is parity with production important?
Significance: Understanding and level-setting on the "why" will help tease out requirements.

Question: What are the most impactful problems wrt testing environments today?
Significance: Understanding the challenges with past efforts will help inform future approaches.

Question: What are some high level ideas to improve testing environments?
Significance: As this set of problems will not be addressed in a one-hour session, having some seed ideas for future discussions would be helpful. We are all engineers who like to solve problems; it's not like we'll avoid diving into solutions anyway :-)

Related Issues

  • Testing Pyramid - would you be ready to use these environments today?
  • ...

Pre-reading for all Participants

  • [add links here]

Notes document(s)

https://etherpad.wikimedia.org/p/WMTC19-T234643

Notes and Facilitation guidance

https://www.mediawiki.org/wiki/Wikimedia_Technical_Conference/2019/NotesandFacilitation


Session Leader(s)

Session Scribes

Session Facilitator

Session Style / Format

  • [what type of format will this session be?]

Post-event summary:

  • Test Environments are hard!
    • When discussing "production-like", production data is often a strong desire. This can mean scale of data as well as, most likely, diversity of data.
    • Having multiple environments is likely part of the solution.
    • There are many use cases/requirements from many different personas
      • Software Engineering, Test Engineers, SREs, Product Managers
      • Test activities are varied and have separate and sometimes conflicting requirements.
  • Further requirements analysis is needed.

Post-event action items:

  • Formation of the Test Environments WG
  • Collaboration with Local Dev Environments work

Event Timeline

debt created this task. Oct 4 2019, 3:33 PM

I'm very interested in this topic. Over the past couple of years, I've been involved in a number of discussions on the topic, especially that of making the beta cluster more "production-like". However, I don't think we've defined what "production-like" means. I'm of the belief that we may need to approach the question differently. Right now the discussions seem to be focused on a single production-like staging environment. Perhaps viewing it as several testing environments that have characteristics of the production would yield some alternatives that are more achievable. I think coupling that discussion with a discussion about testing approaches would also be beneficial.

kostajh added a subscriber: kostajh. Oct 9 2019, 9:49 AM

Along the same lines as what @Jrbranaa said, it would also be interesting to think about "local-development-environment-like" in addition to "production-like": is it possible to have a future where we can share configuration and tooling among local/QA/CI/production environments?

Perhaps viewing it as several testing environments that have characteristics of the production would yield some alternatives that are more achievable

Yes, agreed, and maybe also broadening this out a little bit to talk about Netlify-style on-demand QA environments that can be created on a per-patch basis, with predefined recipes for content / users / configuration / extensions in these environments.

Thanks @Jrbranaa and @kostajh for the input. The topic is indeed pretty broad, and, to actually make the work at the conference productive, its scope should probably be narrowed down, or split into multiple sessions.

I find the following ideas of yours very interesting:

Right now the discussions seem to be focused on a single production-like staging environment. Perhaps viewing it as several testing environments that have characteristics of the production would yield some alternatives that are more achievable.

Yes, agreed, and maybe also broadening this out a little bit to talk about Netlify-style on-demand QA environments that can be created on a per-patch basis, with predefined recipes for content / users / configuration / extensions in these environments.

I've recently discussed with a colleague of mine that what we miss from the existing staging environment (the fact that it is a single one is also relevant/problematic!) is the ability to define and launch, in an automated way, an environment with a particular production-like configuration (extensions, services, etc.) but, what seems often overlooked, also the given set of feature flags. This would make development, testing and releasing new features to the complex systems we deal with significantly easier, less error-prone and simply nicer.

Does this blurry idea somehow correspond, at least partly, to what you had in mind, @Jrbranaa @kostajh? Do you think it would make sense to try to specify this session further in this direction?

Does this blurry idea somehow correspond, at least partly, to what you had in mind, @Jrbranaa @kostajh? Do you think it would make sense to try to specify this session further in this direction?

Yes, it does, especially the part about "also the given set of feature flags. This would make development, testing and releasing new features to the complex systems we deal with significantly easier, less error-prone and simply nicer."

It doesn't seem like we are that far away. I think what is missing is the ability to easily define CI settings (which could be done per patch/branch), for example the ability to have this in extension.json:

"SomeVar": {
  "description": "Some feature flag.",
  "value": false,
  "ci": true
}

That would help get us part of the way. I haven't seen this done, but I imagine we could also standardize on writing maintenance scripts that run on install and populate content or make further changes after checking whether the wgWikimediaJenkinsCI variable is true.
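To make the maintenance-script half of that idea more concrete, here is a minimal sketch, under several assumptions: the script name (seedCiContent.php), the page title and the content it creates are hypothetical, and it is assumed to live in an extension's maintenance/ directory; Maintenance, Title, WikiPage and ContentHandler are the usual MediaWiki classes of this era.

<?php
// Hypothetical seedCiContent.php: seed sample content only when running under CI.
require_once __DIR__ . '/../../maintenance/Maintenance.php';

class SeedCiContent extends Maintenance {
    public function __construct() {
        parent::__construct();
        $this->addDescription( 'Populate sample content when running under CI.' );
    }

    public function execute() {
        global $wgWikimediaJenkinsCI;
        if ( empty( $wgWikimediaJenkinsCI ) ) {
            $this->output( "Not a CI environment, nothing to do.\n" );
            return;
        }
        // Create (or overwrite) a simple page; a real script could import
        // templates, users or whole dumps instead.
        $title = Title::newFromText( 'CI sample page' );
        $page = WikiPage::factory( $title );
        $content = ContentHandler::makeContent( 'Seed content for CI runs.', $title );
        $page->doEditContent( $content, 'Seeding CI content' );
        $this->output( "Seeded CI content.\n" );
    }
}

$maintClass = SeedCiContent::class;
require_once RUN_MAINTENANCE_IF_MAIN;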

These things would both be useful for CI, but they'd apply just as well to creating a throwaway QA environment.

Maybe we could use Quibble for this, since it already knows how to clone dependencies and such -- on Toolforge, we could have an app that listens for particular comments coming from Gerrit ("make environment"), uses Quibble to run a container with a publicly exposed port for a certain period of time, and reports that URL back to Gerrit.
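To sketch what that could look like, here is a very rough listener, with heavy assumptions: the "make environment" trigger, the quibble-env image name, the port mapping and the missing teardown are all made up for illustration; only Gerrit's stream-events SSH command and its comment-added event format are standard Gerrit features.

<?php
// Rough sketch of a listener for "make environment" comments on Gerrit.
// Error handling, authentication, URL reporting and container teardown omitted.
$stream = popen( 'ssh -p 29418 gerrit.wikimedia.org gerrit stream-events', 'r' );

while ( ( $line = fgets( $stream ) ) !== false ) {
    $event = json_decode( $line, true );
    if ( !is_array( $event ) || ( $event['type'] ?? '' ) !== 'comment-added' ) {
        continue;
    }
    if ( strpos( $event['comment'] ?? '', 'make environment' ) === false ) {
        continue;
    }
    // The ref of the patch set that was commented on, e.g. refs/changes/.../2.
    $ref = $event['patchSet']['ref'];
    // Hypothetical: start a Quibble-based container serving the patched wiki on
    // a public port. A real tool would post the resulting URL back to Gerrit
    // and destroy the container after a fixed lifetime.
    shell_exec( 'docker run --rm -d -p 8080:80 quibble-env ' . escapeshellarg( $ref ) );
}
pclose( $stream );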

Do you think it would make sense to try to specify this session further in this direction?

@WMDE-leszek Yes, I think this direction makes sense. As @kostajh mentioned, I think there are things in place that we can build on already from a tech perspective. I do think that it will require some change in the way people see/plan/execute tests though. In my experience, some of the most difficult discussions around this have been about what "production-like" actually means and the testing context in which those production-like attributes matter (i.e., you don't need production-like scale if you're primarily interested in compatibility with other extensions).

WMDE-leszek renamed this task from Wikimedia Technical Conference 2019 Session: Quo Vadis Beta Cluster? Towards production-like testing and staging environments to Wikimedia Technical Conference 2019 Session: Quo Vadis Beta Cluster? Towards better testing and staging environments. Oct 11 2019, 10:34 AM
WMDE-leszek updated the task description.
TheDJ added a subscriber: TheDJ. Oct 12 2019, 2:20 PM

Good points being raised. Personally, I see beta as actually two things:

  • a shadow service (using the latest software against existing data for verification), which it is not, since it uses separate data structures
  • a pre-production staging area for significant software configuration changes and features, which it is also not, since it doesn't really reproduce production

I think this is what causes the "-like" descriptions that people have given. Ideally, both cases would be more separated and more true to their intended purpose. This was not really possible before due to resource limitations, but maybe now we can get closer?

bd808 awarded a token. Oct 15 2019, 2:49 PM
WMDE-leszek updated the task description. Oct 16 2019, 9:13 PM
debt triaged this task as Medium priority. Oct 22 2019, 6:58 PM
greg added a comment. Oct 23 2019, 9:37 PM

(Programming note)

This session was accepted and will be scheduled.

Notes to the session leader

  • Please continue to scope this session and post the session's goals and main questions into the task description.
    • If your topic is too big for one session, work with your Program Committee contact to break it down even further.
    • Session descriptions need to be completely finalized by November 1, 2019.
  • Please build your session collaboratively!
    • You should consider breakout groups with report-backs, using posters / post-its to visualize thoughts and themes, or any other collaborative meeting method you like.
    • If you need to have any large group discussions they must be planned out, specific, and focused.
    • A brief summary of your session format will need to go in the associated Phabricator task.
    • Some ideas from the old WMF Team Practices Group.
  • If you have any pre-session suggested reading or any specific ideas that you would like your attendees to think about in advance of your session, please state that explicitly in your session’s task.
    • Please put this at the top of your Phabricator task under the label “Pre-reading for all Participants.”

Notes to those interested in attending this session

(or those wanting to engage before the event because they are not attending)

  • If the session leader is asking for feedback, please engage!
  • Please do any pre-session reading that the leader would like you to do.
debt updated the task description. Oct 25 2019, 9:22 PM
Jrbranaa updated the task description. Oct 30 2019, 5:48 PM
Jrbranaa updated the task description.
Jdforrester-WMF added a subscriber: Jdforrester-WMF.
Nikerabbit updated the task description. Nov 4 2019, 2:01 PM
WMDE-leszek updated the task description. Nov 11 2019, 5:18 PM
WMDE-leszek added a subscriber: Quiddity.

Wikimedia Technical Conference
Atlanta, GA USA
November 12 - 15, 2019

Session Name / Topic
Quo Vadis Beta Cluster? Towards production-like testing and staging environments
Session Leader: Jean-Rene + Leszek; Facilitator: James; Scribe: Nick + Leszek

Session Attendees
greg, andre, bryan, giuseppe, thedj, tobi, daniel, musikanimal, niharika, daniel Z, reedy, maté, joaquin, petr, tpt, kosta

Notes:

  • About multiple possible solutions, depending on requirements
  • Desired outcome: a set of high level requirements / use-cases
    • these will become the seed elements for a Working Group
  • Seed questions:
    • Why and when is parity with production important? -- not necessarily required for all types of testing
    • What are the most impactful problems wrt testing environments today?
    • What are some of the high level ideas to improve the testing environments? You can ask "why" to help focus
  • Break them down into "Must have" and "Wants"

[Group discussions]

session overview comments

  • 1 BD
    • our group ended up with too many people with too many scars, so kept rabbit-holing
    • spent most of the time talking about "when is parity important" - for service integration, when a piece interacts with the job queue and Kafka and so on, you need a more complete stack than one might normally use locally
      • Also need content complexity and variety for different types of testing - e.g. templates can be hard to test for, because they are often not present in the testing environments we have
      • Full hardware and network parity for integration testing
      • At one person's last job, they had 3 separate identical testing environments, like our staging environment but a lot smaller (and thus affordable)
    • performance regression testing - not going to be identical, but *consistency* (and testing over time) is enough for most use-cases
  • 2 GL
    • Overcharged[?] what beta cluster is in many ways
    • not the place where users a
    • data parity also came up
    • caching should be consistent - not the 3 layers of madness in prod, but purging, normalization, etc
    • performance testing is hard, but idea: branch hosting. Could show to Design people that way
    • Lack of production data
    • global side effects from multiple people testing things at the same time - e.g. someone testing compression. Could resolve by [?]. Isolated areas - expensive but impactful.
    • TLDR: Cannot have 1 environment that does all things.
  • 3. DK
    • 3 areas of parity
      • scale of data, scale of traffic (perhaps unrealistic)
      • having similar set up (extensions, services)
      • community-defined 'stuff' - e.g. gadgets, templates
      • hardware parity, at least the same behaviour and scalability
      • security patches that are live
      • want to be able to test the deployment methods - what scap does (10 mins, except when it breaks...)
    • Problems
      • can use beta cluster to test that the code does what is expected, but it is difficult to catch unwanted side effects (there is no comprehensive test suite that could be used for this)
  • 4. AK
    • group with many SREs...
    • Really important to have parity
      • end-to-end testing
      • integration tests, between services etc
      • deployment tooling (scap, helm, etc) and deploy testing
    • Need some parity
      • dev environments
      • schema changes (subset of the above)
      • data parity
      • performance with hardware, difficult one
      • QA
      • config management (e.g. puppet)
    • Not at all
      • unit tests
      • showcasing to PMs
      • branch testing
  • GL: 3 different envs we're talking about. They all came up when talking about what beta cluster is [?].
      • end-to-end / integration testing
      • user-testing: branch hosting, experiments, etc. user-environment testing. Should be easy to spin up.
    • AK: Personal comment on staging, as there has been quite some history there. I think we should drop the idea of the staging environment, as everyone would like to have their own one (it would be expensive to maintain all of them)
    • AK: We have so much data that we can't create a staging environment. We should focus on testing everything that is possible to test.
      • GL: end-to-end integration testing. performance testing.
    • Summary of the "let's ditch the term" discussion: the ambiguity of the term "staging environment" is the issue that is causing some emotional response
    • PG: Would it help to draw a pipeline to illustrate the stages and boundaries that we need to serve?
      • That could be the step to do with the outcome of this session, to start to bucket these requirements
    • We're always focused on a monolithic single env to solve all our needs, but that's not really the desire
  • [everyone place post-its on posters clustered into Musts and Wants]
  • TODO for doc sprint - transcribe Must and Want lists from posters to the notes
    • Now done?
  • Musts
    • Deployment Tooling
    • Deployment of Services
    • Integration tests
    • edge to edge tests
    • browser tests
    • content complexity & variety are needed to get full coverage
    • actively monitored & maintained platform to create trust in results
    • mean time to test result needs to be low enough to maintain flow
    • isolation of tests running in parallel to reduce false positive failures
    • service integration needs config/complexity parity
    • Branch hosting
    • no significant downtime
    • Parity
      • services parity
      • data: pages, langlinks, templates...
      • partial on-demand pull (of content)
    • services testing
    • caching
      • purging behaviour
      • URL normalization
    • spin up test servers with different config, number of wikis, extensions, etc.
    • full user story automated tests (that aren't slow)
    • more rogue admins
    • impactful problems
      • lack of production data
      • global side effects from testing
  • Wants
    • Dev env Schema changes
    • branch testing
    • QA showcase
    • Performance
    • Config Management
    • performance regression detection
    • full hardware and network parity for complete system integration testing
    • per patch QA environment with config, content, etc
    • data full import
    • load balancing
Jrbranaa updated the task description. Nov 15 2019, 7:19 PM
greg closed this task as Resolved. Dec 17 2019, 10:53 PM

Thanks for making this a good session at TechConf this year. Follow-up actions are recorded in a central planning spreadsheet (owned by me) and I'll begin farming them out to responsible parties in January 2020.