Advanced search daily test runs failing on publish to prometheus
Closed, Resolved (Public)

Description

The Advanced Search daily betacluster runs have been failing consistently on the "Publish-to-prometheus" step:

https://integration.wikimedia.org/ci/job/selenium-daily-beta-AdvancedSearch/

The failing builds in this list are the ones triggered by Advanced Search:
https://integration.wikimedia.org/ci/job/publish-to-prometheus/

07:57:14 Started by upstream project "selenium-daily-beta-AdvancedSearch" build number 1843
07:57:14 originally caused by:
07:57:14  Started by timer
07:57:14 Running as SYSTEM
07:57:14 Building remotely on contint1002 (pipelinelib-publish pipelinelib productionAgents pipelinelib-promote pipelinelib-build train) in workspace /srv/jenkins-agent/workspace/publish-to-prometheus
07:57:14 [ssh-agent] Looking for ssh-agent implementation...
07:57:14 $ ssh-agent
07:57:14 SSH_AUTH_SOCK=/tmp/ssh-OVHTP9n3gKlw/agent.3463583
07:57:14 SSH_AGENT_PID=3463586
07:57:14 [ssh-agent] Started.
07:57:14 Running ssh-add (command line suppressed)
07:57:14 Identity added: /srv/jenkins-agent/workspace/publish-to-prometheus@tmp/private_key_9145659960726428945.key (/srv/jenkins-agent/workspace/publish-to-prometheus@tmp/private_key_9145659960726428945.key)
07:57:14 [ssh-agent] Using credentials jenkins-deploy (key to connect to labs instances set up with role::ci::slave::labs::common)
07:57:14 [WS-CLEANUP] Deleting project workspace...
07:57:14 [WS-CLEANUP] Deferred wipeout is used...
07:57:14 [WS-CLEANUP] Done
07:57:14 [publish-to-prometheus] $ /bin/bash -xe /tmp/jenkins3495561176294208522.sh
07:57:14 + set -u
07:57:14 + set +x
07:57:14 Fetching from:
07:57:14 - Instance...: 172.16.19.79
07:57:14 - Workspace..: /srv/jenkins/workspace/selenium-daily-beta-AdvancedSearch
07:57:14 - Subdir.....: log/
07:57:14 - Pattern....: *.prom
07:57:14 + rsync --archive --stats --compress --ignore-missing-args '--rsh=/usr/bin/ssh -a -T -o ConnectTimeout=6 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no' 'jenkins-deploy@172.16.19.79:/srv/jenkins/workspace/selenium-daily-beta-AdvancedSearch/log//*.prom' .
07:57:15 Warning: Permanently added '172.16.19.79' (ECDSA) to the list of known hosts.
07:57:16 
07:57:16 Number of files: 1 (reg: 1)
07:57:16 Number of created files: 1 (reg: 1)
07:57:16 Number of deleted files: 0
07:57:16 Number of regular files transferred: 1
07:57:16 Total file size: 11,210 bytes
07:57:16 Total transferred file size: 11,210 bytes
07:57:16 Literal data: 11,210 bytes
07:57:16 Matched data: 0 bytes
07:57:16 File list size: 116
07:57:16 File list generation time: 0.015 seconds
07:57:16 File list transfer time: 0.000 seconds
07:57:16 Total bytes sent: 43
07:57:16 Total bytes received: 1,380
07:57:16 
07:57:16 sent 43 bytes  received 1,380 bytes  569.20 bytes/sec
07:57:16 total size is 11,210  speedup is 7.88
07:57:16 [publish-to-prometheus] $ /usr/bin/env bash /tmp/jenkins8283439741254553506.sh
07:57:16 + PROMETHEUS_GATEWAY=http://prometheus-pushgateway.discovery.wmnet/metrics/job/browsertests
07:57:16 + shopt -s nullglob
07:57:16 + PROM_FILES=(*.prom)
07:57:16 + '[' 1 -eq 0 ']'
07:57:16 + grep -P '^(wdio_|cypress_)' -- advancedsearch-project-metrics-2026-04-15T12-57-14-385Z-233784a0.prom
07:57:16 ++ curl --silent --show-error --connect-timeout 2 --max-time 6 --fail --output /dev/stderr --write-out '%{http_code}' --data-binary @- http://prometheus-pushgateway.discovery.wmnet/metrics/job/browsertests
07:57:16 curl: (22) The requested URL returned error: 400 Bad Request
07:57:16 + '[' 200 -eq 400 ']'
07:57:16 Build step 'Execute shell' marked build as failure
07:57:16 $ ssh-agent -k
07:57:16 unset SSH_AUTH_SOCK;
07:57:16 unset SSH_AGENT_PID;
07:57:16 echo Agent pid 3463586 killed;
07:57:16 [ssh-agent] Stopped.
07:57:16 Finished: FAILURE
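
The log above shows the job filtering the `.prom` file with `grep -P '^(wdio_|cypress_)'` and posting the result to the Pushgateway, which answers 400 Bad Request. One possible cause is the same series (metric name plus label set) appearing more than once in the pushed body. Below is a minimal sketch of a local check for that, assuming the file uses the standard Prometheus text exposition format. The function names and the metric names in the usage note are hypothetical and not taken from the job or from the reporter.

```python
import re
from collections import Counter

# Same prefix filter the publish job applies with grep -P '^(wdio_|cypress_)'.
# Comment lines (# HELP, # TYPE) start with '#', so they are dropped as well.
SAMPLE_PREFIX = re.compile(r"^(wdio_|cypress_)")


def series_key(line: str) -> str:
    """Return the metric name plus label set, i.e. everything before the value."""
    if "}" in line:
        # Labelled sample: keep everything up to the closing brace.
        return line[: line.rindex("}") + 1]
    # Unlabelled sample: the first whitespace-separated token is the name.
    return line.split()[0]


def find_duplicate_series(text: str) -> list[str]:
    """List series that occur more than once among the filtered sample lines.

    Simplification (assumption): label sets are compared as raw strings, so
    the same labels written in a different order count as different series.
    """
    samples = [ln for ln in text.splitlines() if SAMPLE_PREFIX.match(ln)]
    counts = Counter(series_key(ln) for ln in samples)
    return sorted(key for key, n in counts.items() if n > 1)
```

Usage, with made-up metric names: `find_duplicate_series(open("some-file.prom").read())` returns an empty list when every series is unique, and the offending series strings otherwise.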

Event Timeline

I removed the "publish-to-prometheus" step in Jenkins for that job for now and re-ran it, but got a failing test (sent it to you on Slack, @vaughnwalters).

I think the best approach for now is to stop producing the Prometheus file in CI. I can do that with a new release. We want to move away from the push gateway and add more tags (for example, we want to see the branch for each test).

Peter changed the task status from Open to In Progress. Apr 15 2026, 9:15 PM
Peter edited projects, added: Test Platform (Plovdiv 25); removed: Test Platform.

Heya Peter, I am getting that same failure again in the daily run at https://integration.wikimedia.org/ci/job/publish-to-prometheus/3685/console

-data-binary @- http://prometheus-pushgateway.discovery.wmnet/metrics/job/browsertests
07:57:05 curl: (22) The requested URL returned error: 400 Bad Request

Sorry, I haven't rolled out the change to disable it yet. I'll do that ASAP.

Change #1276277 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[mediawiki/core@master] selenium: Fix duplicate metrics in PrometheusFileReporter

https://gerrit.wikimedia.org/r/1276277

Change #1276278 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[mediawiki/core@master] selenium: Fix negative retry numbers for Prometheus

https://gerrit.wikimedia.org/r/1276278
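
The patch title above refers to negative retry numbers. A rough sketch of how negative sample values could be spotted in a `.prom` file before it is published is given below, assuming the standard Prometheus text exposition format. Whether a negative value on its own triggers the 400 response is not established in this task; the function name and the metric names in the usage note are hypothetical.

```python
import re

# One sample line: metric name, optional {labels}, whitespace, value.
# Comment lines (# HELP, # TYPE) do not match because they start with '#'.
SAMPLE = re.compile(
    r"^(?P<series>[A-Za-z_:][A-Za-z0-9_:]*(\{.*\})?)\s+(?P<value>\S+)"
)


def find_negative_samples(text: str) -> list[tuple[str, float]]:
    """Return (series, value) pairs for every sample with a value below zero."""
    negatives = []
    for line in text.splitlines():
        match = SAMPLE.match(line)
        if not match:
            continue
        try:
            value = float(match.group("value"))
        except ValueError:
            # Not a parseable number; skipped here rather than reported.
            continue
        if value < 0:
            negatives.append((match.group("series"), value))
    return negatives
```

Usage, with a made-up metric: for the line `wdio_retries{test="a"} -1` the function reports the series `wdio_retries{test="a"}` together with the value `-1.0`.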

Change #1276280 had a related patch set uploaded (by Phedenskog; author: Phedenskog):

[mediawiki/core@master] selenium: Update per-test skipped count when test was already started

https://gerrit.wikimedia.org/r/1276280

Change #1276277 merged by jenkins-bot:

[mediawiki/core@master] selenium: Fix duplicate metrics in PrometheusFileReporter

https://gerrit.wikimedia.org/r/1276277

Heya, I think this should not be marked as resolved. This test has been failing on the same step for the past six days:

https://integration.wikimedia.org/ci/job/publish-to-prometheus/

Daily selenium test is now passing again after upgrade to 6.5.2.