
CI job beta-mediawiki-config-update-eqiad has stopped running
Closed, Resolved (Public)

Description

It looks like the last run of this job was on March 15th at 09:33am UTC.

https://integration.wikimedia.org/ci/view/Beta/job/beta-mediawiki-config-update-eqiad/14422/

Triggered by change: 496742,1
Branch: master
Pipeline: postmerge

This is blocking the testing of T216631 on betalabs.

Event Timeline

SBisson created this task. Mar 22 2019, 5:49 PM
Restricted Application added a subscriber: Aklapper. Mar 22 2019, 5:49 PM
hashar triaged this task as Unbreak Now! priority. Mar 22 2019, 5:51 PM
hashar updated the task description.
Restricted Application added subscribers: Liuxinyu970226, TerraCodes. Mar 22 2019, 5:51 PM
hashar updated the task description. Mar 22 2019, 5:56 PM
hashar added a subscriber: Krinkle. Mar 22 2019, 6:01 PM

The job is triggered when a change is merged in operations/mediawiki-config. Recent commits to that repository:

f21763295 Mon Mar 18 07:58:34 2019 +0100
ccecdc857 Mon Mar 18 06:27:31 2019 +0000
5dcbc5ce9 Mon Mar 18 06:59:27 2019 +0100
f88f50dd0 Fri Mar 15 09:29:06 2019 +0000
c64a3a935 Fri Mar 15 09:53:57 2019 +0100
879d8ae72 Fri Mar 15 08:49:10 2019 +0000

The regression most probably comes from the Zuul configuration, which lives in integration/config:

$ git shortlog --format='%h %s' --since='Fri Mar 15 09:29:06 2019 +0000' --until='Mon Mar 18 06:59:27 2019 +0100'
James D. Forrester (2):
      64922a34 Provide extension-javascript-documentation template
      8c5d2563 Replace *-jsduck-* jobs with *-node10-docs-* ones

Timo Tijhof (5):
      9466a107 Update mwext-EventLogging postmerge from jsduck to generic node10
      9868f23f Disable post-merge doc/coverage publish for l10n commits
      9954d4d1 Set low priority on job scheduling for postmerge pipeline
      9e71f2c9 zuul: Fix doc-publish postmerge of EventLogging
      85eccb74 Create generic and mwext variants of node10 doc-related jobs

Umherirrender (5):
      8efa14f0 [Disambiguator] Add phan
      90cbd632 [Josa] Add phan
      36456b6a [GeoCrumbs] Add phan
      cf3d396f [Insider] Add phan
      12e5f990 [Listings] Add phan

Phan changes by Umherirrender are harmless.

At first glance, I want to blame:
9868f23f Disable post-merge doc/coverage publish for l10n commits by @Krinkle

Zuul debug logs for https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/498416/:

Zuul receives the change-merged event:

2019-03-22 16:22:36,855 DEBUG zuul.Scheduler: Adding trigger event: <TriggerEvent change-merged operations/mediawiki-config master 498416,1>
2019-03-22 16:22:36,855 DEBUG zuul.Scheduler: Run handler awake
2019-03-22 16:22:36,856 DEBUG zuul.Scheduler: Done adding trigger event: <TriggerEvent change-merged operations/mediawiki-config master 498416,1>
2019-03-22 16:22:36,856 DEBUG zuul.Scheduler: Fetching trigger event
2019-03-22 16:22:36,856 DEBUG zuul.Scheduler: Processing trigger event <TriggerEvent change-merged operations/mediawiki-config master 498416,1>

But the change is simply never added to the postmerge pipeline:

2019-03-22 16:22:36,872 DEBUG zuul.IndependentPipelineManager: Starting queue processor: postmerge
2019-03-22 16:22:36,872 DEBUG zuul.IndependentPipelineManager: Finished queue processor: postmerge (changed: False)

Something is seriously broken, though. I see changes for operations/puppet entering the postmerge pipeline and triggering no job:

2019-03-22 18:04:53,134 DEBUG zuul.IndependentPipelineManager: Starting queue processor: postmerge
2019-03-22 18:04:53,134 DEBUG zuul.IndependentPipelineManager: Checking for changes needed by <Change 0x7f467a78b2d0 498439,1>:

2019-03-22 18:04:53,136 DEBUG zuul.IndependentPipelineManager: No jobs for change <Change 0x7f467a78b2d0 498439,1>
2019-03-22 18:04:53,136 DEBUG zuul.IndependentPipelineManager: Removing change <Change 0x7f467a78b2d0 498439,1> from queue
2019-03-22 18:04:53,137 DEBUG zuul.IndependentPipelineManager: Finished queue processor: postmerge (changed: True)

But operations/puppet has no postmerge pipeline configured:

zuul/layout.yaml
- name: operations/puppet
  test-prio:
    - operations-puppet-tests-stretch-docker
  experimental:
    - operations-puppet-catalog-compiler-test

I have absolutely no idea what is going on. Meanwhile, I have manually enqueued the latest change to mediawiki-config. On contint1001.wikimedia.org I ran:

zuul enqueue --trigger gerrit --pipeline postmerge --project operations/mediawiki-config --change 498416,1
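
Should more than one merged change need the same treatment, the command above can be generated per change. This is a minimal sketch: `print_postmerge_enqueue` is a hypothetical helper, not part of Zuul, and it only prints the commands so they can be reviewed before being run on the Zuul host.

```shell
# Hypothetical helper: print one `zuul enqueue` command per change id.
# The flags are exactly those of the manual command used in this task.
# Nothing is executed; paste the output on the Zuul host after review.
print_postmerge_enqueue() {
    project=$1
    shift
    for change in "$@"; do
        printf 'zuul enqueue --trigger gerrit --pipeline postmerge --project %s --change %s\n' \
            "$project" "$change"
    done
}

# Usage:
#   print_postmerge_enqueue operations/mediawiki-config 498416,1
```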

Still from the Zuul debug logs: postmerge has indeed not matched any operations/mediawiki-config event since Mar 15th :(

$ grep -c 'Event change-merged operations/mediawiki-config.*matched' /var/log/zuul/debug.log.2019-03-*
debug.log.2019-03-01:5
debug.log.2019-03-02:0
debug.log.2019-03-03:0
debug.log.2019-03-04:30
debug.log.2019-03-05:22
debug.log.2019-03-06:23
debug.log.2019-03-07:24
debug.log.2019-03-08:16
debug.log.2019-03-09:1
debug.log.2019-03-10:0
debug.log.2019-03-11:14
debug.log.2019-03-12:8
debug.log.2019-03-13:15
debug.log.2019-03-14:22
debug.log.2019-03-15:10
debug.log.2019-03-16:0
debug.log.2019-03-17:0
debug.log.2019-03-18:0
debug.log.2019-03-19:0
debug.log.2019-03-20:0
debug.log.2019-03-21:0
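
To pick out the last day with a match without reading the whole list, the `grep -c` output can be piped through a small filter. This is a sketch under the assumption that the input is in chronological order, which holds here because the date-stamped file names sort lexically. `last_nonzero_count` is a hypothetical name.

```shell
# Hypothetical helper: read `grep -c` output ("file:count" per line) on stdin
# and print the last line whose count is greater than zero.
# Assumes the lines arrive in chronological order.
last_nonzero_count() {
    awk -F: '$NF > 0 { last = $0 } END { if (last != "") print last }'
}

# Usage:
#   grep -c 'Event change-merged operations/mediawiki-config.*matched' \
#       /var/log/zuul/debug.log.2019-03-* | last_nonzero_count
```

Fed the counts listed above, it would print the debug.log.2019-03-15 line, the last day on which an event matched.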

Change 498453 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] test postmerge rejects l10n-bot

https://gerrit.wikimedia.org/r/498453

I am looking at projects that have a postmerge pipeline, trying to find the last time an event got accepted. mediawiki/core has slightly more traffic.

The last postmerge event that matched for mediawiki/core was 496563,4 on Mar 15th at 21:46. After that, 496773,1 at 22:03 did not match. https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/496773/ is not from l10n-bot and is for the master branch.

2019-03-15 21:46:16,415 DEBUG zuul.Scheduler: Processing trigger event <TriggerEvent change-merged mediawiki/core master 496563,4>
2019-03-15 21:46:16,416 DEBUG zuul.IndependentPipelineManager: Event <TriggerEvent change-merged mediawiki/core master 496563,4> for change <Change 0x7fd12ef0ff90 496563,4> matched <EventFilter types: change-merged branches: (?!^refs/meta/config) ignore_deletes: True> in pipeline <IndependentPipelineManager postmerge>
2019-03-15 22:03:38,726 DEBUG zuul.Scheduler: Adding trigger event: <TriggerEvent change-merged mediawiki/core master 496773,1>
2019-03-15 22:03:38,726 DEBUG zuul.Scheduler: Done adding trigger event: <TriggerEvent change-merged mediawiki/core master 496773,1>

That is a short time window!

Change 498466 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Revert "Disable post-merge doc/coverage publish for l10n commits"

https://gerrit.wikimedia.org/r/498466

Change 498466 merged by jenkins-bot:
[integration/config@master] Revert "Disable post-merge doc/coverage publish for l10n commits"

https://gerrit.wikimedia.org/r/498466

hashar added a subscriber: Gehel.

@Gehel noticed the same issue on search/glent a few days ago, and I did some investigation at the time (T218550#5031902).

NOTE: I still have to confirm whether this is actually solved.
hashar closed this task as Resolved. Mar 22 2019, 9:06 PM
hashar claimed this task.

It is working again, although l10n-bot changes once more enqueue jobs in the postmerge pipeline.

Mentioned in SAL (#wikimedia-releng) [2019-04-01T10:50:18Z] <hashar> Manually triggering postmerge step of citoid due to T219017 for mvolz. On contint1001: zuul enqueue --trigger gerrit --pipeline postmerge --project mediawiki/services/citoid --change 497315,1