
2019 Tech Conf Unconference: New CI/Argo
Closed, Resolved · Public


Let's talk about New CI/Argo: a hosted k8s cluster for Wikimedia's new CI, for building and promoting images to Wikimedia production.

Session: 2019 Tech Conf Unconference: New CI/Argo


Leads: Tyler, Giuseppe

Scribe(s): Brennen, Will, (James)

  • T: Background. Reading material is on the ticket.
    • We're running on a piece of Python software called Zuul, which takes your patches from Gerrit and schedules them. Since the version we're running, OpenStack has redone Zuul completely. Zuul communicates with Jenkins through its plugin system. Jenkins has unmaintained plugins; Jenkins is scary software. We need to do something about CI; it's going to be a lot of work no matter what.
    • In February 2019 we started a CI Working Group - Lars announced it; we called for a ton of suggestions, blog posts, etc. We got roughly 10 tools that we chose to evaluate more deeply.
    • March 2019 - report lists all criteria.
    • June 2019 - shortlisted to GitLab CI, Zuul v3, and Argo.
    • July 2019 - We did proofs-of-concept for each.
    • CI Architecture document - we're on v3 now. In October we finished the evaluations: is this implementable at all given the constraints?
      • We did a weighted evaluation in a spreadsheet.
  • That brings us to now. Argo came out on top based on the criteria.
  • Knative solution - Kubernetes.
    • Responds to events happening, waits for pods to do stuff.
    • Seakeeper proposal.
      • Modelling of how many builds we do in a day.
      • Known unknowns.
      • Project namespaces
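As background for the discussion above: Argo models CI jobs as Kubernetes custom resources. A minimal Workflow might look like the sketch below (the image, repo variable, and commands are illustrative only, not our actual configuration):

```yaml
# Hypothetical Argo Workflow: run a repo's test suite in a pod.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ci-test-        # Argo appends a random suffix per run
spec:
  entrypoint: run-tests
  templates:
    - name: run-tests
      container:
        image: docker-registry.wikimedia.org/releng/ci-image:latest  # illustrative
        command: [sh, -c]
        args: ["git clone $REPO_URL app && cd app && make test"]
```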
  • Gating
    • Currently Zuul offers project gating - builds speculative future states of repos. Patches get tested as if things have been merged to master.
    • QUESTION: How necessary is that?
    • GG: Saves time in the best case scenario, loses it in the worst case.
    • GL: It's debatable that gating is the right thing to do. You have 2 patches - you test your patch on top of something that fails.
    • TC: It'll still be correct, just takes more computation.
    • Timo: That isn't "wasted time", just "time we didn't win". Without a dependent speculative pipeline, only one patch could be merged at a time... Do we currently have examples of where it is actually slower than doing one (patch) at a time?
    • TC: I asked Dan to look at this - an interesting metric he came away with is that it's something ridiculous - days of computation time per month that get thrown away, that we don't use the results of. That could be indicative that this is a very useful feature, or it could be indicative...
    • GL: I would say the contrary. If we're throwing away days of computation...
    • Tyler: It's indicative that this feature is saving production to some degree. If these patches can't merge together...
    • GL: Your +2 will have to wait
    • JH: I understand that serializing the merge will save computation...
    • AS/Timo: It's slowing things down to serialize it. It would take longer, less compute.
    • Joe: We need better numbers about this
    • Timo: There is also the case where time wasn't thrown away, on busy days we will gain because the patches can merge together
    • Adam: In the best-case scenario, for 10 things at 10 mins each, it either takes 10 minutes, or 10-100 mins in the worst case
    • Less of an issue if merging a patch doesn't take a long time.
    • INFO: James: just as a point of info, we currently land ~500 patches a week in code that goes to prod, so a patch every 20 mins averaged over the week, please don't let merging take longer than 20 mins or we'll never ship
    • Tyler: There is a forecast for build concurrency
    • DECISION: Agreed that we need it.
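The best/worst-case arithmetic from the discussion can be sketched as a toy model (an assumption-laden sketch: it treats each failure in a dependent queue as costing roughly one extra full test cycle, which simplifies real Zuul behaviour):

```python
def gated_wall_time(minutes_per_test, failures=0):
    """Rough wall-clock model of a Zuul-style dependent (speculative) queue.

    Best case (no failures): all queued patches test in parallel, each
    stacked on the ones ahead of it, so the queue drains in one cycle.
    Each failing patch is evicted and the patches behind it restart,
    costing roughly one extra cycle.
    """
    return minutes_per_test * (1 + failures)

def serial_wall_time(n_patches, minutes_per_test):
    """Testing and merging one patch at a time: n full test cycles."""
    return n_patches * minutes_per_test

# Adam's example: 10 patches at 10 minutes each.
print(gated_wall_time(10))              # best case: 10 minutes
print(gated_wall_time(10, failures=9))  # worst case: 100 minutes
print(serial_wall_time(10, 10))         # serialized: always 100 minutes

# James's throughput point: ~500 patches/week is one every ~20 minutes.
print(round(7 * 24 * 60 / 500))         # 20
```

The model makes the tradeoff concrete: speculative gating spends extra compute on throwaway test runs, but in the common no-failure case the wall-clock win over serialized merging is large.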
  • Security workflows and artifact building / storage
    • Joe: For those trying to create docker images, you need a secure environment where you can store secrets – it can't be the same environment as CI, as CI is very security-exposed, accepting more or less arbitrary input from anyone
    • You don't want Travis to also build your keys / do signing ???
    • This connects to a problem with Argo: it's a large, well-done system with about 20 docker images, built individually. Maintaining all of them for security patches is pretty hard and not justified. Moving the CI part out of the secure perimeter of prod probably makes sense. We would need a separate way to build - does anyone have any ideas about whether we need an artifact building system?
    • Riccardo: It came up recently; some use cases have to build in CI some compiled JS that gets committed back to the repo. JS minification is a build step that generates artifacts
    • GL: So minification of JS - build stack that produces binary artifacts
    • Florian: If you want a production release you need some way to silo it
    • GL: For someone who has spent time building for MW: does it need to circle back to Gerrit? So you do the build, then you push back to Gerrit - or is it an artifact that is just published?
    • LIW: From my point of view we don't ever push anything back to Gerrit from CI. It's an antipattern.
      • [agreement from several parties]
    • AS: But if the question is are we doing it...
    • JF: We're not doing it right now; people are generating build artifacts on their local machines.
      • INFO: Currently deploying from (copies of) git, so in MW land it has to be there
    • GL: We don't deploy from git
      • [we very slightly mangle things out of git checkouts, and sync the files, both done using scap]
    • Timo: We check out from repos; scap has effectively build steps, localisation etc. It is potentially a more appropriate place for transformations than CI. It is a known problem that artifacts are generated in a non-secure, non-deterministic way
    • GL: In the build step for scap you know what you need at a given step, no one sends a binary blob to code review so it makes no sense
    • KH: In CI we do need to do the builds so that browser tests for example can run - but in terms of the artifacts that we use in production we'd do that in deploy
    • AK: Tangentially related, where are we gonna push the built artifacts?
    • Antoine: For artifact storage, we currently delete or compress them to reduce the number. We also delete most after just 5 or 15 days. We probably need a scalable system - maybe Swift - about 1 terabyte per day. That is a problem we are struggling with. An S3-style store would be very helpful as somewhere we can just put things
    • Pablo: Maybe we can also just differentiate between the types of things we are storing, persistent or not
    • Brooke: One of the things I've encountered before - we're talking about this as a system we don't currently have, which is shaping the discussion... Where I worked before we had an artifact store, and we had a rule that you couldn't deploy unless it's in the artifact store [did I get this right? - BB]
    • Riccardo: One thing we can do to reduce space needs is deleting. Clear out logs and old artifacts when tests pass.
    • GL: INFO I think we are confusing CI artifacts and intentionally built artifacts, but that's OK, we're just gathering questions
    • TC: These artifacts do get pushed to different places:
      • When it's a docker image that gets pushed to prod, it needs to go to the docker registry.
      • Another use case could be jar builds, like Gerrit's.
      • Artifacts that aren't taken care of: compiled JS, Go binaries (we now put them in docker).
      • Are logs part of that discussion?
    • GL: For many kinds of artifacts we already have a repository, what we need is a way to build them securely
    • Timo: What is the generic outcome we want to work towards in this session?
    • GL: Trying to better understand the things we're not completely aware of...
    • General unknowns ???
      • Timo: Things ??? [generated artifacts?] sit in gerrit unreviewed
    • Timo: There is no trust model in npm; the one element you can use is the publisher. So we assume that one lib is safe, but it has deps of deps of deps of deps... which may or may not be secure
      • There's an actual problem here of things being a black box.
    • GL: Using an artifact repository would require pushing to it regularly, how many would put up with that?
    • Antoine: I would, yes - to freeze the state
    • Pablo: Do we want to use an existing one or build our own?
    • Lars: You want secure envs for building, can you give a list of requirements as to what qualifies as a secure env?
    • GL: I can talk about what SRE mostly considers secure:
      • Mainly an area without execution of arbitrary/untrusted code, unlike CI
      • Should be able to store credentials, securely separated from CI
        • e.g. Private keys, gpg keys, API keys, to be able to use them
        • e.g. Docker registry password
        • e.g. Security patches for MW deployments before public release
      • Trusted as opposed to CI where anyone can submit to those systems
      • A separate security perimeter that is different, with separate log rules
    • LIW: Would a container be good enough? If not that VM?
    • Alex: Every year multiple container escapes are published, so no. The system needs to be almost air-gapped, with laser sharks; minimal access to prod, and that's a basic level
    • GL: Supposing Argo handles both paths, it would need to at the very least be a separate SRE-controlled cluster
    • LIW: Regarding access to secrets...
      • ...anchor? [didn't catch this]
    • Alex: Hashicorp's Vault? That would work
    • Riccardo: A little bit of an elephant in the room - another use case: the number of people that can see and access those. Generated artifacts should not be public, while we have everything public by definition
    • GL: If you're making a release you're making your private sec patch public
    • Riccardo: Embargoed patches and stuff.
    • Riccardo: It would be nice if, in a new system, security patches and embargoed things didn't break in prod. A specific use case, and hard to solve.
    • GG: You just need a second system.
      • ...or n systems.
    • Timo: Edge case: A second CI cluster / etc. means you can trust the people who run it...
      • [missed this]
    • JF: mediawiki/vendor.git, package-lock.json, composer.lock - locking stuff from remote sources is an already-solved problem. Yes, it's a pain to have both a public and a private CI & docker registry, but that's the cost of doing things correctly.
    • GG: These are important conversations, hopefully notes are going ok - they are
      • These are a good starting point for having explicitly written-down requirements for security needs for SRE requirements, etc.
      • This will help us choose the direction we go software wise but also where we put the software
      • If we now depend on SRE for n more clusters, or if this is pointing us towards a throw-money-at-the-problem thing and hosted k8s clusters.... But we need this list of requirements to make this choice. I want that to be a clear explicit outcome of this.
      • Have been looking for hosted ArgoCI / CD - Intuit (who make Argo) posted a DigitalOcean suite of images for Argo
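Antoine's storage estimate earlier in the discussion (~1 TB/day of artifacts, most deleted after 5 or 15 days) implies a simple steady-state bound. A sketch of the arithmetic, using those figures as assumptions:

```python
def steady_state_storage_tb(tb_per_day, retention_days):
    """Steady-state footprint of a fixed-retention artifact store:
    each day adds tb_per_day and expires the batch from retention_days ago."""
    return tb_per_day * retention_days

# ~1 TB/day of CI artifacts, expired after 5 or 15 days:
print(steady_state_storage_tb(1, 5))   # 5 TB resident
print(steady_state_storage_tb(1, 15))  # 15 TB resident
```

In other words, a longer retention window scales the resident footprint linearly, which is why Pablo's suggestion of separating persistent from ephemeral artifacts matters for sizing.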
  • TC: Covered a lot of the known unknowns as well as unknown unknowns.
  • TC: Not super comfortable about hosting on 3rd party
    • QUESTION: GL: Why?
    • TC: Trying to figure out why I felt this way at lunch the other day. The hosted providers are mainly Google or Amazon.
      • To some degree, corporations are capricious - it depends on what we want to trust them with
    • LIW: Idea: One reason we want to be wary about hosting CI elsewhere - it's a window of attack that we don't control. If we host CI somewhere and that creates something we deploy to production, that's an avenue of attack.
    • GL: Apart from the fact that we can't do CI right until our core code bases improve: the idea would be that the CI part could be outside our systems, while the CD part would need to be handled internally in some way. There are useful reasons to do CI externally:
      • Google was losing in Cloud, which is why they provide k8s solutions. Most of the hosted solutions out there will likely be supported by Argo
      • Hopefully wouldn't give you vendor lock-in (because of standardization of k8s)
      • I feel we are one Jenkins 0-day away from someone wandering into prod
      • CI systems are complex and flexible, a huge attack surface.
    • LIW: With an externally hosted CI system, if a job builds correctly there, we run the same job again in our systems - since most jobs fail, this requires less compute; i.e. we generate the deployed artifacts inside our systems
    • Timo: Being independent comes up a lot in these conversations.
      • An important principle is that the software runs on free software, that it does not compromise prod,
      • A 3rd component: Where is it hosted - maybe bigger concerns if it's Google or Amazon
      • Maybe some host that's more connected to the free software world
    • GL: one of the advantages of K8 is there is lower effort in switching platforms
    • Timo: It sends a public message if our servers go to Amazon
    • GL: We don't have enough people in the room for this discussion
    • LIW: How about we use all the k8s hosting?
    • Timo: Diversify
  • QUESTION: KH: Tangentially, do you all have any concerns about the relative youth of this project?
    • GL: Dan asked them, and they are in the process of becoming a subproject under the Linux Foundation. It should become the default for cloud native. That is good reassurance regarding the life expectancy of the project
    • TC: There are two components we aren't recommending: Argo CI and Argo CD. But we would be tying together the Argo workflow core and Argo Events.
      • KH: So one person has done most of the commits?
    • GG: Do you know the median contributor number to any open source project? One.
    • LIW: Kosta is right - this is a concern - we're taking a calculated risk. All software sucks. Jenkins is a disaster and you don't want to look at the Linux code.
  • GL: Given what Timo has just said regarding independence etc. - how would you feel if we decided to go another way and just use Travis, GitLab, etc.?
    • [note taker quietly sputters]
    • Antoine: I'm in favour of this, but concerned it won't match our needs or won't be cost-effective - but we should evaluate it
    • GL: So your answer is yes? If it fit requirements you'd be open to it.
    • Antoine: of course, because it is the best solution
    • Pablo: From the technical standpoint, we would want to use them all and avoid getting locked in
    • Timo: I think Lars's suggestion was that we have it federated. I too would be in favour of the suggestion of using GitLab/Travis or another similar group. We should talk to Travis or GitLab
    • GL: This seems like 4 jumps ahead...
    • QUESTION: Riccardo: Any concerns around PII or privacy?
      • GG: Probably not an issue but should be added as a point in the list of sec concerns
      • Pablo: Already public
    • GL: any other feelings about never using a commercial solution?
    • GL: Counterpoint here, I think Free Software is Free Knowledge.
      • Why don't we use Zend?
      • … or Varnish commercial
      • ...where do we put the point that we stop?
  • Next steps
    • Lars: If anyone has any more input for CI, please feed it to me
    • Work together and make that final list of requirements

Event Timeline


Our current CI system has reached end of life on several fronts:

Zuul 2.5

  • Zuul is a server system that listens to changes from Gerrit and queues them to run in Jenkins
  • The version of Zuul we are running in production is beyond end-of-life and is no longer receiving updates from upstream
  • We have discovered at least one bug (albeit a minor inconvenience) that is fixed in the latest Zuul version that cannot be trivially backported to our version


Jenkins

  • Jenkins and Zuul communicate through Jenkins’ plugin system
  • The Jenkins Gearman plugin required for Jenkins and Zuul to communicate was maintained by Zuul’s upstream, which has end-of-lifed the version of Zuul that requires Jenkins and, hence, these plugins are no longer being updated
  • Gearman does not support Jenkins Pipelines -- nor will it ever. Jenkins Pipelines are the mechanism by which we are able to run the Deployment Pipeline, so this involves strange workarounds
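For background, Zuul v2 is driven by a layout configuration that maps Gerrit events to pipelines, whose jobs are dispatched to Jenkins over Gearman. A simplified, illustrative sketch (the job and project names below are examples, not our actual layout):

```yaml
# Sketch of a Zuul v2 layout.yaml: Gerrit events trigger Jenkins jobs.
pipelines:
  - name: test
    manager: IndependentPipelineManager
    trigger:
      gerrit:
        - event: patchset-created   # new patchset uploaded to Gerrit
    success:
      gerrit:
        verified: 1                  # vote Verified +1 on success
    failure:
      gerrit:
        verified: -1

projects:
  - name: mediawiki/core
    test:
      - mediawiki-core-phpunit       # a Jenkins job, run via Gearman
```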


