Page MenuHomePhabricator

Remove the "check" pipeline and Zuul's user-filter
Closed, ResolvedPublic

Description

The differentiation in check/test pipeline proves to be another barrier new developers are facing. It is/was necessary because we run/ran tests on permanent instances and didn't want anyone to be able to execute just any code there. This isn't a concern for one-off instances that are destroyed after each test was finished. As I understand it getting rid of it this differentiation has been a goal for quite a while and there's no doubt we want to get there.

This goal was described and tracked in T47499, which got closed because the way chosen to reach this goal was Nodepool. The job of tracking the effort to get everything into one-off instances (Nodepool) was then taken over by Continuous-Integration-Scaling and it's tasks (e.g. T119138, T134381, T119140) . These were declined and the project was archived, because nodepool is now considered legacy and the plan is to have docker for everything. There is T190097: Migrate all CI jobs from Nodepool, deprecate its use, but I fail to find a project or task "Migrate all CI jobs from permanent instances".

I wonder, but cannot easily answer the questions which jobs (or whether any at all) we still run on permanent instances, if the check/test differentiation is still needed, and if it is, if we're seeing progress in a 'permanent instances -> docker' transition, and how long the whitelist will continue to be necessary.

See also:

Event Timeline

EddieGP updated the task description. (Show Details)
Krinkle renamed this task from Get rid of the check pipeline to Remove the "check" pipeline and Zuul's user-filter.Aug 30 2018, 11:33 PM

The check pipeline is used for php -l, composer validate and the commit message validator. I think we can drop support for them.

  • a Code-Review +1 by a whitelisted person is enough to run the tests
  • one less Zuul pipeline to manage, the check one has a list of email that duplicate the one from the test pipeline
  • let us drop jobs that are still running on permanent slaves

Overall that will reduce the complexity of the CI configuration not much being lost beside the few trivial checks currently done.

From the Grafana board Ihave just created: https://grafana.wikimedia.org/dashboard/db/zuul-pipeline The check pipeline had 55 builds over the last 7 days so that is hardly a loss.

Some notes from IRC (lightly edited):

[12:59:41] <James_F> How far away are we from running the CI jobs without a whitelist?
[13:00:20] <James_F> I noted the nodepool task got resolved, yay. ;-) I vaguely recall moving to docker being a necessary but not sufficient condition/
[13:01:21] <greg-g> nodepool was actually the first attempt at removing the whitelist....
[13:01:28] <thcipriani> James_F: hash<tab> mentioned this yesterday. With nodepool gone we still have to phase out a couple physical hosts that are still lingering, but I don't know the steps past that one.
[13:01:50] <James_F> greg-g: Indeed.
[13:02:02] -*- James_F nods.
[13:02:31] <James_F> Is there a task for "run CI for all"?
[13:03:01] <legoktm> I think someone needs to look at the threat models, and what exactly does "arbitrary code access in a docker instance runing on a cloud vps instance" let you escalate to
[13:03:34] <James_F> Ah, yes, T192217.
[13:03:34] <stashbot> T192217: Remove the "check" pipeline and Zuul's user-filter - https://phabricator.wikimedia.org/T192217
[13:03:37] <legoktm> and then realize that it doesn't matter since we already let basically everyone do so 
[13:04:46] <legoktm> maybe a bit of preventative stuff like closing outbound mail ports and other network firewalling

[13:03:01] <legoktm> I think someone needs to look at the threat models, and what exactly does "arbitrary code access in a docker instance runing on a cloud vps instance" let you escalate to

FWIW, I'm not particularly worried about this one - I think we can drop the whitelist easily once everything runs in docker. The threat is with those jobs that don't run in docker (yet).

Re-asking the question from the task description: Which/How many jobs (any at all?) do we still run on permanent instances?

We still have some jobs running on permanent slaves, they will be moved to Docker containers and I have created a subproject for that migration: Continuous-Integration-Infrastructure (Slipway) I wrote a test to report them on https://integration.wikimedia.org/ci/job/integration-config-qa/lastCompletedBuild/testReport/test_integration/IntegrationTests/test_jenkins_jobs_assigned_nodes/

The jobs in check pipeline are mostly for php lint which is superseded by parallel-lint and composer-validate which is already run by the composer test job. The usage per pipeline is reported on https://grafana.wikimedia.org/dashboard/db/zuul-pipeline , over 30 days:

PipelineJobs
check280
test12447
gate-and-submit9700

The whitelist must be kept for now. I am not comfortable in running code in Docker on WMCS, we would have to harness them a little bit more. Specially reduce possible network access.

Change 493188 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] zuul: remove "check" pipeline

https://gerrit.wikimedia.org/r/493188

Change 493188 merged by jenkins-bot:
[integration/config@master] zuul: remove "check" pipeline

https://gerrit.wikimedia.org/r/493188

hashar claimed this task.

The check pipeline is gone!

Nitpicky comment: the description "Jobs restricted to trusted users. Will vote +2." for the test pipeline should be changed.

Nitpicky comment: the description "Jobs restricted to trusted users. Will vote +2." for the test pipeline should be changed.

I have idea for new description: "Jobs triggered per default after upload of new patch set. Will vote +2."

The check pipeline is gone!

Thanks! It was my mistake to name this task "Get rid of the check pipeline", which was then renamed to the current title. The goal actually is to remove the user filter and let everyone run all tests. I expected both "remove whitelist" and "remove check pipeline" to happen at the same time (when test would've been moved over from permanent slaves to docker), but it turns out these are two distinct things. We should either re-open this task or create another one, as the goal to get completely rid of the whitelist is still relevant (probably "someday", though).

hashar added a parent task: Restricted Task.Mar 19 2019, 7:13 PM

@EddieGP I wanted to remove the check pipeline, notably because most of those jobs were legacy anyway and have since been replaced by better alternative. An example is untrusted developers being able to run the linters/style checker via composer test and/or npm test.

As for running tests for anyone, yes that is planned, there are multiple discussions / project going on inside Release-Engineering-Team . But in short the idea is to overhaul the CI infrastructure, switch to Docker containers running in Kubernetes and have the run isolated from the rest of the infra/CI jobs. Then eventually, maybe we will open CI run to any patch regardless of the author.

So that is definitely in our minds, even though it might not be captured in a proper task :]

It is expected that in the next iteration of the CI infrastructure ("Seakeaper"), that at least by default any Gerrit account will be permitted to have its code execute and tests run against it. More info at T243436 and related tasks and documents.