
Work with RelEng to add PipelineBot to Vector
Open, Needs Triage, Public

Description

This task tracks working with the Release Engineering team to enable PipelineBot in Vector. The workflows needed are as follows.

Workflows

Code review workflow

  1. A patch is submitted.
  2. NPM and system dependencies are installed or already downloaded within the image. For example, npm ci.
  3. An NPM build script is executed. For example, npm run build.
  4. Test scripts are executed on / with the build products.
Solution

We currently run npm test. Add a pipeline config that will execute npm test in a container instead.
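
As a rough illustration only (assuming the current PipelineLib schema; the file, variant, and stage names below are placeholders, not final), the pipeline config might look something like:

.pipeline/config.yaml
pipelines:
  test:
    blubberfile: blubber.yaml   # Blubber variants would be defined alongside this file
    stages:
      - name: run-tests         # build the "test" variant and run its entrypoint (npm test)
        build: test
        run: true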

References

Deployment workflow

  1. A Docker image is booted and a patch is mounted.
  2. NPM and system dependencies are installed or already downloaded within the image. For example, npm install.
  3. An NPM build script is executed. For example, npm run build.
  4. The artifacts are copied out of the Docker image to a remote server or repo. For example, rsync build-box:/mnt/vector/patch101/dist 192.168.somewhere/vector/patch101/
Solution

????

Backport workflow

Largely the same as deployment, perhaps identically:

  1. Backport the source code fix as a Gerrit / GitLab patch on an old branch.
  2. The patch goes through the normal pipeline as per usual.
Solution

????

Acceptance criteria

  • Vector master is built automatically and rides the train like all other code.
  • Build products are available in the tarball _or_ a prioritized task to add them is made.
  • Build failures block code merges in Gerrit and deployments.
  • Any changes to the build script require changes only to Vector itself, not to other repos. It's important that frontend developers don't have to become release engineers to make changes.
  • The process for adding or removing dependencies is minimized. If it's more than npm install and a security review, it should be documented. It's important that frontend developers don't have to become politicians or process experts to make changes.
  • Backport workflows are as similar to standard deployments as possible. Ideally, just a cherry-pick and going through the same process.
  • A Docker configuration that contains all dependencies is made.
  • A PipelineLib configuration is made.
  • The above changes have little or no impact on frontend developer workflows.
  • Documentation for anything a non-RelEng developer would want to know about the above is included in the repo.

Other references

Event Timeline

Restricted Application added a subscriber: Aklapper. · Jul 9 2020, 2:42 PM

Moving the conversation from email. Thanks for bearing with my catching up, @Niedzielski.

Code review workflow

  1. A patch is submitted.
  2. NPM and system dependencies are installed or already downloaded within the image. For example, npm install.
  3. An NPM build script is executed. For example, npm run build.
  4. Test scripts are executed on / with the build products.

This workflow seems really straightforward and PipelineLib should be able to drive it. You can look to our "How to define a node test pipeline" guide for an example on what that might look like, and I or someone else in RelEng can assist in writing/testing a specific .pipeline/config.yaml and getting the Zuul pieces hooked up.

Deployment workflow

  1. A Docker image is booted and a patch is mounted.
  2. NPM and system dependencies are installed or already downloaded within the image. For example, npm install.
  3. An NPM build script is executed. For example, npm run build.
  4. The artifacts are copied out of the Docker image to a remote server or repo. For example, rsync build-box:/mnt/vector/patch101/dist 192.168.somewhere/vector/patch101/

I'm confused about this part.

  • Is this what you want to happen during gate-and-submit (after someone +2s but before the merge) or post merge?
  • Which branches do you need this to happen for? (master, wmf/ releases, REL_ releases?)
  • What is the purpose of copying artifacts to a remote server? What is consuming those artifacts?
  • Do the artifacts need to be integrated at some point into the directory tree of the deployed MediaWiki and pushed to frontend mw servers? (scap deployed)

Backport workflow

Largely the same as deployment, perhaps identically:

  1. Backport the source code fix as a Gerrit / GitLab patch on an old branch.
  2. The patch goes through the normal pipeline as per usual.

It makes sense to me that this would be more or less identical to the previous workflow, which seems to already work from the latest merged patch, but I have the same questions here too.

Thank you, @dduvall!

Is this what you want to happen during gate-and-submit (after someone +2s but before the merge) or post merge?

I'm worried there's a subtlety to your understanding of Gerrit post-merge and my coarse understanding of Git post-merge that I am missing so I will try to be explicit. After all Git operations are performed, that is the best time to build. That is, the final source inputs should be used to generate the final build outputs. If for some reason the build fails, however, that should block a patch from being merged.

Which branches do you need this to happen for? (master, wmf/ releases, REL_ releases?)

feat/search now, master soon, and presumably all release / deployment branches.

What is the purpose of copying artifacts to a remote server? What is consuming those artifacts?

As we know, the process of building generates build products and they're less useful if they're not actually published somewhere. For Vector, the build products generated by continuous integration infrastructure (via Gerrit, PipelineLib, and whatever else is needed) must be made actually available in production. For example, these are some of the build artifacts generated by (and versioned in) the Popups extension:

These correspond to files actually versioned in Git:

So, we have to get those artifacts out of CI and into production somehow. In Popups, we version them in the repo because there is no CI-powered build step. In Vector, as an initial step towards progress, we could do that too, with the difference that the build step (npm run build) would be invoked by CI instead of by the developer on their local machine.

Do the artifacts need to be integrated at some point into the directory tree of the deployed MediaWiki and pushed to frontend mw servers? (scap deployed)

It's my understanding that the answer depends on how build products are stored. Maybe one of these options:

  1. The build products are included in the release some time before the train runs:
    1. Add a commit at deploy time to a release branch of the repo being built.
    2. Add a commit at deploy time to the master branch of the repo being built.
    3. Add a new repository for deploys only and add the commit to it. This could be a giant repo for all deploys or one deploy repo per source repo.
  2. The build products are _not_ included in the release:
    1. "rsync" the build products to the production server only.
  3. Something else?

1.a or 1.b sounds best to me as a starting point since it doesn't require new repo dependencies and consumers can choose whether they wish to build from source or not. MobileFrontend and Popups currently use something like 1.b but for _every commit_. This has unwanted side-effects and is not ideal, but one thing that works well is that the existing processes for deploying code work without changing.

My only strong opinion is that I would rather take little steps now rather than wait for big steps later.

I'm worried there's a subtlety to your understanding of Gerrit post-merge and my coarse understanding of Git post-merge that I am missing so I will try to be explicit. After all Git operations are performed, that is the best time to build. That is, the final source inputs should be used to generate the final build outputs. If for some reason the build fails, however, that should block a patch from being merged.

Got it! That sounds like our gate-and-submit Zuul pipeline—operate on the target branch with currently pending patches applied in the order they will be merged, but block a patch from merging on failure.

For Vector, the build products generated by continuous integration infrastructure (via Gerrit, PipelineLib, and whatever else is needed) must be made actually available in production.

Based on your description here, would it make sense to generalize this to "publish built artifacts for production deployment" and let RelEng figure out where that would be, since we're performing/organizing deploys? I think the specificity of "copy to remote server" was leading me to think there might be some other consumers of the artifacts.

Which branches do you need this to happen for? (master, wmf/ releases, REL_ releases?)

feat/search now, master soon, and presumably all release / deployment branches.

If we're scoping this to builds for production deployment, I think we should limit the target branches to just wmf/* for now.

Do the artifacts need to be integrated at some point into the directory tree of the deployed MediaWiki and pushed to frontend mw servers? (scap deployed)

It's my understanding that the answer depends on how build products are stored. Maybe one of these options:

  1. The build products are included in the release some time before the train runs:
    1. Add a commit at deploy time to a release branch of the repo being built.
    2. Add a commit at deploy time to the master branch of the repo being built.
    3. Add a new repository for deploys only and add the commit to it. This could be a giant repo for all deploys or one deploy repo per source repo.

This is still tripping me up a bit. When I read "included in the release" I thought perhaps this applied to tarball releases as well. If we're limiting our scope to production deployment, I think it's more accurate to say that we want build products prepared (or more specifically prepared automatically via scap prep) on the deployment server for subsequent syncs to MW production servers. But then again, I'm still assuming we're talking only about production bound build artifacts here!

  2. The build products are _not_ included in the release:
    1. "rsync" the build products to the production server only.
  3. Something else?

1.a or 1.b sounds best to me as a starting point since it doesn't require new repo dependencies and consumers can choose whether they wish to build from source or not. MobileFrontend and Popups currently use something like 1.b but for _every commit_. This has unwanted side-effects and is not ideal, but one thing that works well is that the existing processes for deploying code work without changing.

Git does seem like the shortest path to implementation, since that's what scap prep speaks and that's how people are used to pulling patches into /srv/mediawiki-staging on the deployment server. However, the workflow I'm currently imagining—where artifacts are built from the tip of a release branch before the patch is actually merged, producing another commit that contains the artifacts, which is then followed by the patch itself after the merge—seems like it would introduce problems.

I'm wondering if instead, we can publish build artifacts to an existing repository that's more suited to binaries and such and modify scap prep to pull from that, applying build artifacts to the /srv/mediawiki-staging tree prior to train deployments and perhaps for backport deploys as well—which incidentally used to be called SWAT but are now in some circles being referred to as BACON deploys meaning BAckport and CONfig change deploys. :)

One option for a binary repository might be our existing Archiva install but we'd have to check in with other folks before committing to something like that.

@thcipriani, @20after4, @LarsWirzenius do you have thoughts on the approach I'm describing that would use a separate artifact store and require changes to scap prep? I realize we don't currently use scap prep for BACON deploys but we could start.

My only strong opinion is that I would rather take little steps now rather than wait for big steps later.

I 99% agree, but I also want to ensure we're not creating obstacles for ourselves two steps down the road, and that we're mindful of folks taking steps in adjacent directions—those working on containerization and on deploy tooling. I'm confident we can find a good middle ground!

Niedzielski updated the task description. · Aug 7 2020, 7:03 PM
Niedzielski updated the task description. · Aug 7 2020, 7:15 PM

Based on your description here, would it make sense to generalize this to "publish built artifacts for production deployment" and let RelEng figure out where that would be, since we're performing/organizing deploys?

I think that would be fine. If it's not we'll change it!

If we're scoping this to builds for production deployment, I think we should limit the target branches to just wmf/* for now.

I'm not sure what the implications of this are, as most of my work is on master, but I trust your judgement and that most problems can be solved iteratively.

This is still tripping me up a bit. When I read "included in the release" I thought perhaps this applied to tarball releases as well. If we're limiting our scope to production deployment, I think it's more accurate to say that we want build products prepared (or more specifically prepared automatically via scap prep) on the deployment server for subsequent syncs to MW production servers. But then again, I'm still assuming we're talking only about production bound build artifacts here!

From my perspective, it would be very useful to include the build products in the tarballs so that consumers can decide whether they want to use the officially built assets or build them themselves, since both the source and the outputs would be included. For instance, some redistributions may require building everything from source.

There are two versions of Vector:

  1. Latest mode: this version is actively developed and will require build products to function.
  2. Legacy mode: this version is feature frozen and all changes are avoided. This is the current default.

If we can't ship the build products in the tarballs, we will not be able to set "Latest" Vector as the default; only the unbuilt Legacy version would be supported. That may be OK for the near term, but it would be very useful to resolve this soon.

One option for a binary repository might be our existing Archiva install

This is a bit outside my expertise, but I can tell you that the build products we create are not special. They're just files.


I've tried to add some definition to the task description. If you see any mistakes or improvements, please edit! In a nutshell, I believe what we know so far is:

  • We need a Docker configuration that contains all dependencies.
  • We need a PipelineLib configuration.
  • We need to document anything a non-RelEng developer would want to know about the above.
Niedzielski updated the task description. · Aug 7 2020, 7:33 PM

@dduvall In looking at the jobs currently configured to run in Jenkins for Vector, I see, for example, that Vector's tests are run through the skin-quibble template, which runs the mwgate-node10-docker job, which eventually executes npm run test.
To get Vector using PipelineLib, I'm adding a blubber.yaml with a test variant. But would you suggest removing the skin-quibble template and replacing it with another generic template, so that Jenkins doesn't run npm run test twice? Since skin-quibble runs other jobs as well, I'm not sure how to modify the config to avoid that duplication. I'm not too familiar with the different job templates available or their usages, so I'm hoping for ideas if you have any.

dduvall added a subscriber: hashar. · Oct 9 2020, 9:09 PM

To get Vector using PipelineLib, I'm adding a blubber.yaml with a test variant. But would you suggest removing the skin-quibble template and replacing it with another generic template, so that Jenkins doesn't run npm run test twice? Since skin-quibble runs other jobs as well, I'm not sure how to modify the config to avoid that duplication. I'm not too familiar with the different job templates available or their usages, so I'm hoping for ideas if you have any.

Not wanting to duplicate the npm test execution in multiple places makes complete sense, but yes, as you mentioned, removing the skin-quibble template will remove a number of jobs for all kinds of PHP runtime variants and such. The extension-gate template is also applied to Vector, and that seems to include the mwgate-node10-docker job as well. That zuul/layout.yaml file is a bit of a nightmare.

One option I can think of would be to define duplicates of skin-quibble and extension-gate that include the array of PHP jobs but omit the node job. Something like skin-quibble-without-node or whatever. Then again, I'm a little worried someone might miss that when they go to remove/add PHP variants from the standard set of templates. @hashar, any opinion/thoughts on this?
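
For illustration only, such a duplicate might look roughly like this in zuul/layout.yaml, assuming the Zuul v2.5 project-template syntax (the template name and the job listed are placeholders, not the real quibble job set):

zuul/layout.yaml (hypothetical excerpt)
project-templates:
  - name: skin-quibble-without-node          # hypothetical duplicate of skin-quibble minus the node job
    test:
      # same quibble/PHP jobs as skin-quibble, with mwgate-node10-docker omitted
      - quibble-vendor-mysql-php72-docker    # placeholder job name, for illustration only
    gate-and-submit:
      - quibble-vendor-mysql-php72-docker    # placeholder job name, for illustration only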

Change 633803 had a related patch set uploaded (by Nikki Nikkhoui; owner: Nikki Nikkhoui):
[mediawiki/skins/Vector@master] Blubber config

https://gerrit.wikimedia.org/r/633803

nnikkhoui added a comment (edited). · Oct 13 2020, 6:20 PM

@dduvall Thanks for the response, Dan! From my limited knowledge of how the jobs are configured, that sounds like it would work for this use case. I was looking at the Zuul docs and the parent attribute looked really appealing to use, but it looks like it won't combine properties, only inherit those that are undefined. Is there something similar that could be used?

In reply to the few technical questions asked above:

The Zuul project template skin-quibble is applied to every MediaWiki skin and configures jobs to run on various branches or after a change is merged. One of its duties is to ensure we run npm install && npm run test for every skin, and that logic is implemented in the mwgate-node10-docker job.

npm test is an entry point for CI ( https://www.mediawiki.org/wiki/Continuous_integration/Entry_points ). For MediaWiki core, extensions, and skins, its sole intent is to run linters against a single repository; the execution environment has only Node.js v10 and nothing else (no PHP or Chromium, for example).

As such, npm run test should not be reused for any other purpose and should only be run from the mwgate-node10-docker job. That being said, it can be invoked from another script to short-circuit a long-running command, which is quite handy when running locally, since one would want ESLint to pass in the first place. So one can end up with something such as:

package.json
{
  "scripts": {
    "test": "grunt lint",
    "build": "some long test suite",
    "validate": "npm run test && npm run build"
  }
}

Locally one can then run npm run validate and have both executed, with the short-circuit behavior.

On CI we would have mwgate-node10-docker running the linter, and a second job running npm run build, which I guess is what the PipelineLib job should be invoking.

@dduvall Thanks for the response Dan! From my limited knowledge of how the jobs are configured that sounds like it would work for this use case. I was looking at the Zuul docs and the parent attribute looked really appealing to use, but looks like it won't combined properties but only inherit those that are undefined, wondering if there's something similar that could be used?

The parent configuration setting is for a later version of Zuul (v3) than the one we are using (2.5). In Zuul v3 it is used to wrap jobs with a parent job. It is not relevant to our Jenkins / Zuul v2.5 system.

I'm worried there's a subtlety to your understanding of Gerrit post-merge and my coarse understanding of Git post-merge that I am missing so I will try to be explicit. After all Git operations are performed, that is the best time to build. That is, the final source inputs should be used to generate the final build outputs. If for some reason the build fails, however, that should block a patch from being merged.

Got it! That sounds like our gate-and-submit Zuul pipeline—operate on the target branch with currently pending patches applied in the order they will be merged, but block a patch from merging on failure.

gate-and-submit might well retrigger the build if a change ahead has failed, and if another job failed, the build would still have happened. Even if all changes / jobs passed, there is no guarantee of the execution order. If one +2s A and B, all jobs pass, and they get merged, the "npm run build" job for A might have been executed AFTER the one for B. So if the published assets are identified by timestamp, they are in the wrong order (when looking at the Git topology order).

As such, publishing of artifacts must be done either in the postmerge pipeline (after a change has been merged, and is thus known to be valid) or in the publish pipeline (after a tag has been published). The latter avoids having to build everything every time a change gets merged; if we were to automatically create a MediaWiki tarball, we would probably only need builds for tags.

To prevent a faulty change from being merged, the build must happen in gate-and-submit, probably with some end-to-end tests to verify it works fine. But the build result from gate-and-submit must not be published.
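
As a hedged sketch of that split (the job names below are placeholders, not existing jobs), the project wiring in zuul/layout.yaml could look roughly like:

zuul/layout.yaml (hypothetical excerpt)
projects:
  - name: mediawiki/skins/Vector
    gate-and-submit:
      - vector-build-check          # placeholder: run the build (and tests) to gate the merge, discard the artifacts
    postmerge:
      - vector-build-and-publish    # placeholder: rebuild from the merged commit and publish the artifacts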

Change 633803 abandoned by Nikki Nikkhoui:
[mediawiki/skins/Vector@master] Blubber config

Reason:
Branched off incorrect branch

https://gerrit.wikimedia.org/r/633803

npm test is an entry point for CI ( https://www.mediawiki.org/wiki/Continuous_integration/Entry_points ). For MediaWiki core, extensions, and skins, its sole intent is to run linters against a single repository; the execution environment has only Node.js v10 and nothing else (no PHP or Chromium, for example).

As such, npm run test should not be reused for any other purpose and should only be run from the mwgate-node10-docker job.

@hashar I appreciate all the background info, really helpful! It's good to know that npm test should only be used for linting; if that's the case, then the current package.json should be changed as well. So you are suggesting keeping the skin-quibble template as is, but adding another template or job that would run other scripts (such as npm run build). I have 2 additional questions:

  1. Since the Blubber docs mention a test variant is required, would that test variant just have no entrypoint and be more like a build variant in that case?
  2. I would need to add another job for Vector in /zuul/layout.yaml in that case; would something like the service-pipeline-test template be appropriate to use?

Change 634288 had a related patch set uploaded (by Nikki Nikkhoui; owner: Nikki Nikkhoui):
[mediawiki/skins/Vector@feat/search] Blubber config

https://gerrit.wikimedia.org/r/634288

@hashar I appreciate all the background info, really helpful! It's good to know that npm test should only be used for linting; if that's the case, then the current package.json should be changed as well. So you are suggesting keeping the skin-quibble template as is, but adding another template or job that would run other scripts (such as npm run build).

Exactly, because skin-quibble adds a lot of other jobs for all the Zuul pipelines we have. One can see it as enforcing a specific CI config on all MediaWiki extensions and skins. Adding npm run build is indeed a different job that would come on top of the existing ones.

I have 2 additional questions:

  1. Since the Blubber docs mention a test variant is required, would that test variant just have no entrypoint and be more like a build variant in that case?

In the Pipeline / Blubber, test refers to an image and it could use npm run build && npm run integration for example. Probably something such as:

version: v4
base: docker-registry.wikimedia.org/nodejs10-slim:0.0.4
variants:
    build:
        base: docker-registry.wikimedia.org/nodejs10-devel:0.0.3
        copies: [local]
        builder:
            command: [npm, run, build]
    test:
        includes: [build]
        entrypoint: [npm, run, integration]
    production:
        copies: [build]
        node:
            env: production

If I understand it properly, the Pipeline would create a build image using npm run build and reuse that as an ancestor for the test image, in which it will run npm run integration. The latter is supposed to be a test suite that verifies the build works properly.

  2. I would need to add another job for Vector in /zuul/layout.yaml in that case; would something like the service-pipeline-test template be appropriate to use?

If the above is right, yes, that is all you would need.

And if you want the production image to be published to docker-registry.wikimedia.org, you would instead use the Zuul template service-pipeline-test-and-publish. The outcome will be an image containing whatever got cloned (due to copies: [local]) plus the output of npm run build, without any development dependencies installed.


As a side note, the Pipeline is intended to generate Docker images that run on the Kubernetes cluster. We usually commit the artifacts to a deployment repository (mediawiki/vendor.git, wikimedia/portals/deploy), in some cases using git-fat to offload big files to Archiva (examples: operations/software/gerrit, analytics/refinery).

I am not sure using the Docker registry for storing and deploying build artifacts is the intended use. At a minimum that should be checked with SRE, and on the Release Engineering side it would mean adding some Docker support to Scap in order to fetch the image containing the artifacts. But that discussion might be better suited to another task, a design doc, or some meetings :]

@hashar I appreciate all the background info, really helpful!

Yes! Thanks for explaining all that, @hashar.

Exactly, because skin-quibble adds a lot of other jobs for all the Zuul pipelines we have. One can see it as enforcing a specific CI config on all MediaWiki extensions and skins. Adding npm run build is indeed a different job that would come on top of the existing ones.

I guess we'll have to address this somehow in the future if we're going to migrate away from static JJB-defined CI jobs and towards PipelineLib (or other) repo-defined jobs. Eventually we won't want this duplication.

I have 2 additional questions:

  1. Since the Blubber docs mention a test variant is required, would that test variant just have no entrypoint and be more like a build variant in that case?
  2. I would need to add another job for Vector in /zuul/layout.yaml in that case; would something like the service-pipeline-test template be appropriate to use?

There are two more or less mutually exclusive ways to go with this. You can use the service-pipeline-test job, which relies on just a .pipeline/blubber.yaml file and hardcodes a certain workflow (build a single test variant, run that single test variant), or you can define a new job in jjb/project-pipelines.yaml for Vector that will delegate control over the workflow to a .pipeline/config.yaml file in the repo.

For historical context, the latter PipelineLib-based approach was developed after the former job was in use but was found to be too inflexible in some cases. This explains why the Blubber/Pipeline docs you linked to are sort of incomplete. We should mention the option of using PipelineLib there (will do).

As a side note, the Pipeline is intended to generate Docker images that run on the Kubernetes cluster. We usually commit the artifacts to a deployment repository (mediawiki/vendor.git, wikimedia/portals/deploy), in some cases using git-fat to offload big files to Archiva (examples: operations/software/gerrit, analytics/refinery).

It's important to note that this is what we're moving away from (at least for services), with Kubernetes replacing scap3-deployed services that rely on deploy repos.

I am not sure using the Docker registry for storing and deploying build artifacts is the intended use. At a minimum that should be checked with SRE, and on the Release Engineering side it would mean adding some Docker support to Scap in order to fetch the image containing the artifacts. But that discussion might be better suited to another task, a design doc, or some meetings :]

This is indeed a huge discussion. See T265183: In a k8s world: where does MediaWiki code live? in which I've proposed that very thing: have scap unpack Docker images as a transitional measure while we migrate to MW-on-K8s. Feedback welcome! :)

It's important to note that this is what we're moving away from (at least for services), with Kubernetes replacing scap3-deployed services that rely on deploy repos.

Thanks @hashar and @dduvall! In that case I will definitely be watching T265183; I didn't realize that option was actively being discussed in another capacity. I'll be really interested to see where that lands! In the meantime, I will make the necessary changes to blubber.yaml based on both of your suggestions. I don't think Vector will need anything more than a test variant if we make sure the package.json is set up appropriately :)

Demian updated the task description. · Oct 20 2020, 2:29 AM
Demian added a subscriber: Demian (edited). · Oct 20 2020, 5:06 PM

@hashar IMO, ideally npm test would run the linter, unit tests, and integration tests (the latter in core only). npm test not running actual (unit) tests is counterintuitive.
What's the reason not to run npm run lint in mwgate-node10-docker directly? (Assuming it's defined in all skins before such a transition.)

@hashar IMO, ideally npm test would run the linter, unit tests, and integration tests (the latter in core only). npm test not running actual (unit) tests is counterintuitive.
What's the reason not to run npm run lint in mwgate-node10-docker directly? (Assuming it's defined in all skins before such a transition.)

That is technical debt! Originally we had specific linters hardcoded in the central CI configuration, such as a specific version of JSHint being enforced for all repositories. Whenever we upgraded JSHint, we had to go through all repositories and branches to fix up issues reported by the newer version.

We then went to have:

And the CI jobs have been very dumb: they merely invoke npm ci && npm run test and composer install && composer test (the latter having variants for different PHP versions).

The setup is pretty much the same for every extension and skin, which eases sharing the maintenance among the several people involved. We, for example, take care of updating the libraries via LibUp, introduce new linters, or fix up issues that might arise. Developers are still free to move at a faster cadence and adopt a newer linter by just bumping the version.

See also https://www.mediawiki.org/wiki/Continuous_integration/Entry_points

Essentially, npm run test is what we settled on a few years ago. There is nothing preventing changing it to a different name such as lint, but more than a handful of repositories actually run tests there.

Anyway, for the scope of this task, just adopt another entry point such as build or package and have the Pipeline test variant invoke it instead of npm run test.
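
As a rough sketch only, adapting the Blubber example above so the test variant invokes the new entry point (the node requirements key below is an assumption about how dependencies get installed at image build time):

.pipeline/blubber.yaml (sketch)
version: v4
base: docker-registry.wikimedia.org/nodejs10-slim:0.0.4
variants:
    test:
        base: docker-registry.wikimedia.org/nodejs10-devel:0.0.3
        copies: [local]
        node:
            requirements: [package.json, package-lock.json]  # assumption: have Blubber install npm dependencies at image build time
        entrypoint: [npm, run, build]  # invoke the new "build" (or "package") entry point instead of npm run test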

@hashar Thanks for the details!