[QB] [WDQS-GUI] Move build scripts from CI to the repository
Closed, Resolved · Public · 5 Estimated Story Points

Description

Both the Wikidata Query Builder and the Wikidata Query Service UI require an extra CI build job before they can be deployed (initially created in T160943). However, that job currently runs several shell commands, mostly related to git, directly as the Jenkins user, which is prone to causing problems. See for example T328441.

Ideally, these shell commands would move to a file in the respective repositories and then would be invoked from a docker container.

For more context, see the comments on this patch: Allow our own repo as safe to git (Ife520c20), and see also the IRC logs of channel #wikimedia-releng for 2023-02-01 starting at 09:43:07: https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-releng/20230201.txt.

Event Timeline

Restricted Application added a subscriber: Aklapper.

Task Review Notes:

  • This ticket is a result of a workaround applied while addressing an incident where we could not make new deployment builds for QSGUI and QB.
  • The followup details a more robust solution to the problem at hand.

Task Prio Notes:

  • Does not affect end users / production
  • Does not affect development efforts
  • Does not affect onboarding efforts
  • Affects additional stakeholders
ItamarWMDE renamed this task from Move build scripts from CI to the repository to [SW] Move build scripts from CI to the repository.Feb 15 2023, 8:57 AM
ItamarWMDE renamed this task from [SW] Move build scripts from CI to the repository to Move build scripts from CI to the repository.Feb 15 2023, 10:20 AM
ItamarWMDE renamed this task from Move build scripts from CI to the repository to [QB] [WDQS-GUI] Move build scripts from CI to the repository.Mar 14 2023, 1:35 PM

It has been a long while since I last looked into this, and I'm not fully sure what @hashar had in mind, but my understanding is that the rough process would look like:

  1. look at the shell commands and such in the jobs edited in Allow our own repo as safe to git (Ife520c20), and extract them into scripts (node scripts? bash?) in the repository
  2. then edit these integration config jobs to run those scripts with a container, for example that docker-run-with-log-cache-src
  3. make sure that the core issue of different users is fixed
    • (or is that not actually something that we can fix like this, and the major benefit is that we can deal with this issue in our own code base?)
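
Step 2 above could look roughly like the sketch below. It is only an illustration: the image name, the scripts/build.sh path and the run_build_in_container helper are placeholders made up for this example, and the actual jobs would go through something like the docker-run-with-log-cache-src mentioned above rather than a hand-written docker run. By default it only prints the command (DRY_RUN=1), so it can be tried on a machine without docker.

```shell
#!/bin/sh
# Hypothetical wrapper: every name below is a placeholder, not the real job config.
set -eu

IMAGE=${IMAGE:-example.invalid/ci/node-test:latest}  # placeholder image name
SCRIPT=${SCRIPT:-scripts/build.sh}                   # placeholder script path in the repo
DRY_RUN=${DRY_RUN:-1}                                # 1 = print the command, 0 = run it

# run_build_in_container SRC_DIR
# Mounts SRC_DIR at /src inside the container and runs the repository script there.
run_build_in_container() {
    src_dir=$1
    # Reuse the positional parameters to hold the command, one argument each.
    set -- docker run --rm \
        --volume "${src_dir}:/src" \
        --workdir /src \
        "$IMAGE" sh "$SCRIPT"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

run_build_in_container "$PWD"
```

The point of this shape is that the Jenkins job only knows "run this script in a container", and everything else lives in the repository where it can be reviewed and tested like any other code.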

Change 949833 had a related patch set uploaded (by Noa wmde; author: Noa wmde):

[wikidata/query/gui@master] Move build scripts from jenkins job into the repository

https://gerrit.wikimedia.org/r/949833

Change 951066 had a related patch set uploaded (by Noa wmde; author: Noa wmde):

[integration/config@master] wikidata-query-gui-build: run shell scripts in a container

https://gerrit.wikimedia.org/r/951066

Change 952317 had a related patch set uploaded (by Noa wmde; author: Noa wmde):

[wikidata/query/gui@master] Adding an empty commit to test build job

https://gerrit.wikimedia.org/r/952317

Change 952317 merged by jenkins-bot:

[wikidata/query/gui@master] Adding an empty commit to test build job

https://gerrit.wikimedia.org/r/952317

Change 951066 merged by jenkins-bot:

[integration/config@master] wikidata-query-gui-build: run shell scripts in a container

https://gerrit.wikimedia.org/r/951066

Change 952393 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] Dockerfiles: add openssh-client to node test images

https://gerrit.wikimedia.org/r/952393

Change 952393 merged by jenkins-bot:

[integration/config@master] Dockerfiles: add openssh-client to node test images

https://gerrit.wikimedia.org/r/952393

Change 952395 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: use image with ssh client for wikidata-query-gui-build

https://gerrit.wikimedia.org/r/952395

Change 952395 merged by jenkins-bot:

[integration/config@master] jjb: use image with ssh client for wikidata-query-gui-build

https://gerrit.wikimedia.org/r/952395

Change 952451 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] jjb: wikidata-query-gui-build push to git from Jenkins

https://gerrit.wikimedia.org/r/952451

Change 952463 had a related patch set uploaded (by WDQSGuiBuilder; author: WDQSGuiBuilder):

[wikidata/query/gui-deploy@production] Merging from 472449b750ff7e26994a172d693a4f0fd6724286

https://gerrit.wikimedia.org/r/952463

Change 952463 abandoned by Hashar:

[wikidata/query/gui-deploy@production] Merging from 472449b750ff7e26994a172d693a4f0fd6724286

Reason:

https://gerrit.wikimedia.org/r/952463

Change 952465 had a related patch set uploaded (by WDQSGuiBuilder; author: WDQSGuiBuilder):

[wikidata/query/gui-deploy@production] Merging from 43dcb29b6d5f57c5297188a743a6e7406174f344

https://gerrit.wikimedia.org/r/952465

Change 952451 merged by jenkins-bot:

[integration/config@master] jjb: wikidata-query-gui-build push to git from Jenkins

https://gerrit.wikimedia.org/r/952451

There were a few gotchas here and there due to how we run the CI jobs, plus some trouble passing the ssh agent into the container (as a workaround, I ended up pushing from the Jenkins job itself).

@noarave thank you so much for doing this work! I am a huge fan of moving most of the logic out of the Jenkins job toward scripts in the source repository \o/

Change 949833 merged by jenkins-bot:

[wikidata/query/gui@master] Move build scripts from Jenkins job into the repository

https://gerrit.wikimedia.org/r/949833

@hashar thank you for all the help!
Do the config patches regarding wikidata-query-gui-build require deployment in a backport window? If so, should I schedule one, or does Release Engineering schedule deployments for integration/config?

Changes made to integration/config are deployed immediately after they get merged. The deployment is done manually by running the ./fab script at the root of the repository; it does not go through scap and happens outside of the MediaWiki backport & config windows.

Essentially the changes are already deployed (the wikidata-query-gui-build Jenkins job has been updated).

I see, for some reason I thought a manual deployment is necessary. I'm starting to work on the wikidata-query-builder-build part of this task so a similar patch should reach your way soon.

Change 953558 had a related patch set uploaded (by Noa wmde; author: Noa wmde):

[wikidata/query-builder@master] Query Builder: move scripts from jenkins into repository

https://gerrit.wikimedia.org/r/953558

Change 953559 had a related patch set uploaded (by Noa wmde; author: Noa wmde):

[integration/config@master] wikidata-query-builder-build: run shell scripts in a container

https://gerrit.wikimedia.org/r/953559

@hashar we would appreciate any feedback on the open patches. thank you!

ItamarWMDE changed the task status from Open to Stalled.Sep 13 2023, 8:02 AM

@hashar sorry to ping you, but I think we need you for the review of this PR.

I will mark this ticket as stalled until we have a response.

Change 960620 had a related patch set uploaded (by WDQSGuiBuilder; author: WDQSGuiBuilder):

[wikidata/query/gui-deploy@production] Merging from 43dcb29b6d5f57c5297188a743a6e7406174f344

https://gerrit.wikimedia.org/r/960620

Change 960620 abandoned by Lucas Werkmeister (WMDE):

[wikidata/query/gui-deploy@production] Merging from 43dcb29b6d5f57c5297188a743a6e7406174f344

Reason:

Duplicate of Icdd62c98ac – the query-gui-build job is currently fetching PS14 of change 949833, rather than master.

https://gerrit.wikimedia.org/r/960620

Duplicate of Icdd62c98ac – the query-gui-build job is currently fetching PS14 of change 949833, rather than master.

Correction: that’s just because I used “rebuild last”, and the last build had a custom ZUUL_REF. Rebuilding with ZUUL_REF set to master fetches the right commit, but then failed in #89:

[wikidata-query-gui-build] $ /bin/bash -xe /tmp/jenkins11582515933093200314.sh
+ git -C ./src/build push origin HEAD:refs/for/production%ready
Host key verification failed.
fatal: Could not read from remote repository.

The Jenkins job clones wikidata/query/gui and shells out to scripts/set-up-git.sh, which runs:

git clone "https://gerrit.wikimedia.org/r/wikidata/query/gui-deploy" ./build
git -C ./build \
    remote set-url --push origin \
    ssh://wdqsguibuilder@gerrit.wikimedia.org:29418/wikidata/query/gui-deploy
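
To see what the split between fetch URL and push URL does, here is a small offline sketch. It assumes a throwaway repository created with git init instead of the real clone, so nothing is fetched or pushed; the URLs are the ones from the script above, and the variable names are only for illustration.

```shell
#!/bin/sh
# Offline demo: only the remote configuration is touched, no network access.
set -eu

demo_dir=$(mktemp -d)
trap 'rm -rf "$demo_dir"' EXIT

git init --quiet "$demo_dir"

# Fetches go over anonymous https ...
git -C "$demo_dir" remote add origin \
    "https://gerrit.wikimedia.org/r/wikidata/query/gui-deploy"

# ... while pushes go over ssh as the wdqsguibuilder user.
git -C "$demo_dir" remote set-url --push origin \
    "ssh://wdqsguibuilder@gerrit.wikimedia.org:29418/wikidata/query/gui-deploy"

fetch_url=$(git -C "$demo_dir" remote get-url origin)
push_url=$(git -C "$demo_dir" remote get-url --push origin)

echo "fetch: $fetch_url"
echo "push:  $push_url"
```

Because only the push URL uses ssh, the clone itself never needs the ssh agent or the Gerrit host key; those only matter at git push time.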

The git push happens from the Jenkins host rather than from the container, which was done to take advantage of the ssh agent being run in the job and which is not reachable from within the container.

I don't know why the host key verification would sometimes fail and sometimes not. My guess is that the Gerrit ssh fingerprint is not in the host's known_hosts, but then I don't see why the command would work on some Jenkins agents and fail on others :-\

On an agent that executed a successful build:

hashar@integration-agent-docker-1055:~$ sudo su - jenkins-deploy
jenkins-deploy@integration-agent-docker-1055:~$ ssh -p 29418 gerrit.wikimedia.org
wdqsguibuilder@gerrit.wikimedia.org: Permission denied (publickey).

On an agent that failed the build due to the host key verification:

hashar@integration-agent-docker-1053:~$ sudo su - jenkins-deploy
jenkins-deploy@integration-agent-docker-1053:~$ ssh -p 29418 gerrit.wikimedia.org
The authenticity of host '[gerrit.wikimedia.org]:29418 ([208.80.154.151]:29418)' can't be established.
RSA key fingerprint is SHA256:j7HQoQ6fIuEgDHjONjI2CZ+2Iwxqgo2Ur5LbPqBgxOU.
Are you sure you want to continue connecting (yes/no/[fingerprint])? no
Host key verification failed.

If I run ssh in verbose mode on the successful agent (integration-agent-docker-1055), I get:

debug1: Reading configuration data /mnt/home/jenkins-deploy/.ssh/config
debug1: /mnt/home/jenkins-deploy/.ssh/config line 1: Applying options for gerrit.wikimedia.org
debug1: /mnt/home/jenkins-deploy/.ssh/config line 4: Applying options for gerrit.wikimedia.org
debug1: /mnt/home/jenkins-deploy/.ssh/config line 6: Applying options for gerrit.wikimedia.org

Which is:

/mnt/home/jenkins-deploy/.ssh/config
Host gerrit.wikimedia.org
  StrictHostKeyChecking no
  User wdqsguibuilder
Host gerrit.wikimedia.org
  StrictHostKeyChecking no
Host gerrit.wikimedia.org
  StrictHostKeyChecking no

The file was created on Sep 11 09:23 and, interestingly, it hardcodes the user to wdqsguibuilder.

There is no such thing on the failing agent (integration-agent-docker-1053).

That instance was created in June and, based on last, nobody has logged in besides me. I clearly don't remember having crafted such a file, and I most certainly would not have used StrictHostKeyChecking no (I'd rather provision the Gerrit fingerprint to the known hosts). Maybe it is Jenkins adding it :(

Mentioned in SAL (#wikimedia-releng) [2023-09-26T07:39:44Z] <hashar> integration: sudo cumin --force 'name:docker' 'rm -f /mnt/home/jenkins-deploy/.ssh/config' # T328543#9198204

Change 961025 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] ci: add Gerrit ssh key to ssh_known_hosts

https://gerrit.wikimedia.org/r/961025

I have cherry picked https://gerrit.wikimedia.org/r/c/operations/puppet/+/961025 on the integration Puppet master:

Notice: /Stage[main]/Profile::Ci::Slave::Labs::Common/Sshkey[gerrit]/ensure: created

Which made the key available globally:

cat /etc/ssh/ssh_known_hosts 
# HEADER: This file was autogenerated at 2023-09-26 08:10:44 +0000
# HEADER: by puppet.  While it can still be managed manually, it
# HEADER: is definitely not recommended.
gerrit.wikimedia.org ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgQCF8pwFLehzCXhbF1jfHWtd9d1LFq2NirplEBQYs7AOrGwQ/6ZZI0gvZFYiEiaw1o+F1CMfoHdny1VfWOJF3mJ1y9QMKAacc8/Z3tG39jBKRQCuxmYLO1SWymv7/Uvx9WQlkNRoTdTTa9OJFy6UqvLQEXKYaokfMIUHZ+oVFf1CgQ==

And the ssh host key is recognized:

$ sudo -u jenkins-deploy ssh -p 29418 gerrit.wikimedia.org
jenkins-deploy@gerrit.wikimedia.org: Permission denied (publickey).
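
For checking an agent without attempting a connection, something like the sketch below should work. It is an assumption-laden example: it uses a temporary file standing in for /etc/ssh/ssh_known_hosts, the key line is copied from the output above, and has_host_key is a made-up helper name. It relies on ssh-keygen -F, which searches a known_hosts file for a host name, and it checks the printed output rather than the exit status.

```shell
#!/bin/sh
# Offline check: looks up a host in a known_hosts file, no connection is made.
set -eu

known_hosts=$(mktemp)   # stand-in for /etc/ssh/ssh_known_hosts
trap 'rm -f "$known_hosts"' EXIT

# Key line as shown in the Puppet-managed file earlier in this comment.
cat > "$known_hosts" <<'EOF'
gerrit.wikimedia.org ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAAAgQCF8pwFLehzCXhbF1jfHWtd9d1LFq2NirplEBQYs7AOrGwQ/6ZZI0gvZFYiEiaw1o+F1CMfoHdny1VfWOJF3mJ1y9QMKAacc8/Z3tG39jBKRQCuxmYLO1SWymv7/Uvx9WQlkNRoTdTTa9OJFy6UqvLQEXKYaokfMIUHZ+oVFf1CgQ==
EOF

# has_host_key FILE HOST
# Succeeds when ssh-keygen prints at least one matching entry for HOST in FILE.
has_host_key() {
    [ -n "$(ssh-keygen -F "$2" -f "$1" 2>/dev/null)" ]
}

if has_host_key "$known_hosts" gerrit.wikimedia.org; then
    echo "gerrit.wikimedia.org: host key present"
else
    echo "gerrit.wikimedia.org: host key missing"
fi
```

On a real agent the first argument would be /etc/ssh/ssh_known_hosts (or the jenkins-deploy user's own known_hosts), which would have shown the difference between 1053 and 1055 if the key had been provisioned that way instead of through the ssh config file.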

I forced a Puppet run on all agents from integration-cumin:

sudo cumin --ignore-exit-code --batch-size 4 --force 'name:docker' 'puppet agent -t'

And the file is present everywhere:

$ sudo cumin --force 'name:docker' 'ls -1 /etc/ssh/ssh_known_hosts'
19 hosts will be targeted:
integration-agent-docker-[1040-1057].integration.eqiad1.wikimedia.cloud,integration-agent-puppet-docker-1003.integration.eqiad1.wikimedia.cloud
FORCE mode enabled, continuing without confirmation
===== NODE GROUP =====
(19) integration-agent-docker-[1040-1057].integration.eqiad1.wikimedia.cloud,integration-agent-puppet-docker-1003.integration.eqiad1.wikimedia.cloud
----- OUTPUT of 'ls -1 /etc/ssh/ssh_known_hosts' -----
/etc/ssh/ssh_known_hosts
================
PASS |████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 100% (19/19) [00:00<00:00, 27.14hosts/s]
FAIL |                                                                                                                                             |   0% (0/19) [00:00<?, ?hosts/s]
100.0% (19/19) success ratio (>= 100.0% threshold) for command: 'ls -1 /etc/ssh/ssh_known_hosts'.
100.0% (19/19) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

The build should work now!

Change 953559 merged by jenkins-bot:

[integration/config@master] wikidata-query-builder-build: run shell scripts in a container

https://gerrit.wikimedia.org/r/953559

Change 961056 had a related patch set uploaded (by Hashar; author: Hashar):

[integration/config@master] wikidata-query-builder-builder: use wdqsguibuilder

https://gerrit.wikimedia.org/r/961056

Change 961056 merged by jenkins-bot:

[integration/config@master] wikidata-query-builder-builder: use wdqsguibuilder

https://gerrit.wikimedia.org/r/961056

noarave changed the task status from Stalled to Open.Sep 26 2023, 10:27 AM
noarave moved this task from Stalled to Doing on the Wikidata Dev Team (Quality Tools "Sprint") board.

Change 961075 had a related patch set uploaded (by Noa wmde; author: Noa wmde):

[integration/config@master] Fix wrong deploy dir for wikidata-query-builder-build

https://gerrit.wikimedia.org/r/961075

Change 961075 merged by jenkins-bot:

[integration/config@master] Fix wrong deploy dir for wikidata-query-builder-build

https://gerrit.wikimedia.org/r/961075

Change 961991 had a related patch set uploaded (by WDQSGuiBuilder; author: WDQSGuiBuilder):

[wikidata/query-builder/deploy@production] Merging from 176678c2db83f44d868e808d3da83df11d19c1a2

https://gerrit.wikimedia.org/r/961991

Change 961991 abandoned by Noa wmde:

[wikidata/query-builder/deploy@production] Merging from 176678c2db83f44d868e808d3da83df11d19c1a2

Reason:

This was a side effect of a successful Jenkins job test run

https://gerrit.wikimedia.org/r/961991

This is the last open patch of the task and I believe it is now finally ready to be reviewed
https://gerrit.wikimedia.org/r/c/wikidata/query-builder/+/953558/

Change 963196 had a related patch set uploaded (by WDQSGuiBuilder; author: WDQSGuiBuilder):

[wikidata/query-builder/deploy@production] Merging from 472864b6e92537d079ad719b070ba95f205a81f9

https://gerrit.wikimedia.org/r/963196

Change 953558 merged by jenkins-bot:

[wikidata/query-builder@master] Query Builder: move scripts from jenkins into repository

https://gerrit.wikimedia.org/r/953558

Change 965122 had a related patch set uploaded (by Jbond; author: jbond):

[operations/puppet@production] gerrit: make gerrit ssh key more DRY

https://gerrit.wikimedia.org/r/965122

Build scripts are now indeed in the repository and not in operations. Thank you @noarave and @hashar !

Change 961025 abandoned by Hashar:

[operations/puppet@production] ci: add Gerrit ssh key to ssh_known_hosts

Reason:

In favor of shorter/cleaner AND working https://gerrit.wikimedia.org/r/c/operations/puppet/+/965122

https://gerrit.wikimedia.org/r/961025

Change 965122 merged by Jbond:

[operations/puppet@production] gerrit: make gerrit ssh key more DRY

https://gerrit.wikimedia.org/r/965122

One of the issues we encountered was that the git push to Gerrit failed because the ssh host key was not present on the Jenkins instances. That got solved by https://gerrit.wikimedia.org/r/c/operations/puppet/+/965122