ABran-WMF (arnaudb)
SRE

User Details

User Since
Aug 29 2023, 8:30 AM (137 w, 4 d)
Availability
Available
IRC Nick
arnaudb
LDAP User
Arnaudb
MediaWiki User
ABran-WMF [ Global Accounts ]

Recent Activity

Fri, Apr 17

ABran-WMF updated the task description for T423601: Update and improve operation runbooks and documentation for Gerrit.
Fri, Apr 17, 1:47 PM · Documentation, Sustainability (Incident Followup), Gerrit, collaboration-services
ABran-WMF added a comment to T423601: Update and improve operation runbooks and documentation for Gerrit.

I've also updated the "restart gerrit" part: https://wikitech.wikimedia.org/wiki/Gerrit/Operations#Restarting

Fri, Apr 17, 1:46 PM · Documentation, Sustainability (Incident Followup), Gerrit, collaboration-services
ABran-WMF added a comment to T423601: Update and improve operation runbooks and documentation for Gerrit.

I've added a bit of documentation on gerrit ssh commands: https://wikitech.wikimedia.org/wiki/Gerrit/Operations#Use_gerrit_ssh_commands

Fri, Apr 17, 1:43 PM · Documentation, Sustainability (Incident Followup), Gerrit, collaboration-services
ABran-WMF added a comment to T423683: SystemdUnitFailed - gitlab1004 - sync-gitlab-group-with-ldap.

We should probably expose a Prometheus metric for 500 errors returned by LDAP and/or for failed syncs. The error looks transient because it recovers quickly (as it did this morning with T423674: SystemdUnitFailed - sync-gitlab-group-with-ldap.service on gitlab1004:9100), so it should probably be caught and retried, with an alert raised only if it persists for more than a few minutes or hours.
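A minimal sketch of the catch-and-retry idea, assuming the sync can be invoked as a single command. `retry_transient` and the command in the usage comment are hypothetical names, not part of the existing script:

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command that may fail transiently and only
# report a failure once every attempt has been used up.
# Usage: retry_transient MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_transient() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt=1
    while true; do
        # Run the wrapped command; stop as soon as it succeeds.
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (hypothetical command name): five attempts, one minute apart.
# retry_transient 5 60 sync-gitlab-group-with-ldap
```

With a wrapper like this the systemd unit would only fail, and therefore only alert, when the error outlasts all attempts.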

Fri, Apr 17, 1:13 PM · collaboration-services
ABran-WMF updated the task description for T333143: Move Gerrit data out of root partition.
Fri, Apr 17, 9:05 AM · Patch-For-Review, Sustainability (Incident Followup), Release-Engineering-Team, collaboration-services, Gerrit
ABran-WMF added a subtask for T423027: 2026-04-12 Gerrit Outage (was: DiskSpace): T333143: Move Gerrit data out of root partition.
Fri, Apr 17, 9:03 AM · Patch-For-Review, Wikimedia-Incident, Gerrit, collaboration-services
ABran-WMF added a parent task for T333143: Move Gerrit data out of root partition: T423027: 2026-04-12 Gerrit Outage (was: DiskSpace).
Fri, Apr 17, 9:03 AM · Patch-For-Review, Sustainability (Incident Followup), Release-Engineering-Team, collaboration-services, Gerrit
ABran-WMF added a parent task for T423035: Gerrit outage didn't page until 4.5 hours after the first alert: T423027: 2026-04-12 Gerrit Outage (was: DiskSpace).
Fri, Apr 17, 9:02 AM · Sustainability (Incident Followup), observability, collaboration-services
ABran-WMF added subtasks for T423027: 2026-04-12 Gerrit Outage (was: DiskSpace): T423035: Gerrit outage didn't page until 4.5 hours after the first alert, T423123: Alert when Gerrit CI (Zuul, Jenkins, Gearman) is down/stuck, T423601: Update and improve operation runbooks and documentation for Gerrit.
Fri, Apr 17, 9:02 AM · Patch-For-Review, Wikimedia-Incident, Gerrit, collaboration-services
ABran-WMF added a parent task for T423123: Alert when Gerrit CI (Zuul, Jenkins, Gearman) is down/stuck: T423027: 2026-04-12 Gerrit Outage (was: DiskSpace).
Fri, Apr 17, 9:02 AM · Sustainability (Incident Followup), Gerrit, Release-Engineering-Team, collaboration-services
ABran-WMF added a parent task for T423601: Update and improve operation runbooks and documentation for Gerrit: T423027: 2026-04-12 Gerrit Outage (was: DiskSpace).
Fri, Apr 17, 9:02 AM · Documentation, Sustainability (Incident Followup), Gerrit, collaboration-services
ABran-WMF added a comment to T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.

I have tried limiting max_concurrent_streams to 50; the results are still inconclusive for the connection interruptions in CI.

Fri, Apr 17, 8:18 AM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF closed T423674: SystemdUnitFailed - sync-gitlab-group-with-ldap.service on gitlab1004:9100 as Resolved.

A 500 error was thrown and the script crashed. It was a transient error.

Fri, Apr 17, 7:02 AM · collaboration-services
ABran-WMF renamed T423674: SystemdUnitFailed - sync-gitlab-group-with-ldap.service on gitlab1004:9100 from SystemdUnitFailed to SystemdUnitFailed - sync-gitlab-group-with-ldap.service on gitlab1004:9100.
Fri, Apr 17, 7:00 AM · collaboration-services

Thu, Apr 16

ABran-WMF updated the task description for T333143: Move Gerrit data out of root partition.
Thu, Apr 16, 2:15 PM · Patch-For-Review, Sustainability (Incident Followup), Release-Engineering-Team, collaboration-services, Gerrit
ABran-WMF added a comment to T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.

I have tried to increase per_connection_buffer_limit_bytes first to 4MB, then to 16MB, to see if the stream watermark messages were a symptom of the 502s. The message rate has dropped, but 502s are still happening (https://integration.wikimedia.org/ci/view/All/job/quibble-vendor-mysql-php83/lastFailedBuild/console). I'll keep on debugging.

Thu, Apr 16, 1:08 PM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit

Wed, Apr 15

ABran-WMF added a comment to T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.

quibble-with-gated-extensions-vendor-mysql-php83/27335 also failed during the clone phase with the same error. The failure happens in prepareRepo() / repo.prune() / git remote prune --dry-run origin:

Wed, Apr 15, 2:08 PM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF added a comment to T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.

Verbose logging has been enabled on Envoy on gerrit2003:

curl -i -X POST 'http://localhost:9631/logging?paths=http:debug,router:debug,connection:debug'
Wed, Apr 15, 8:30 AM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit

Tue, Apr 14

ABran-WMF closed T423254: PuppetFailure as Resolved.

linked to T423156: setup phab1006

Tue, Apr 14, 12:09 PM · collaboration-services
ABran-WMF added a comment to T333143: Move Gerrit data out of root partition.

We might want to stop puppet first and have a merged change that will update httpd's config

Suggested run book to move remaining data:

prep:

  • mkdir /srv/gerrit/site_path (we need a new directory, and this is it; /var/lib/gerrit (previously /var/lib/gerrit2) is the default Gerrit Site Path)
  • pre-rsync data from /var/lib/gerrit/ to /srv/gerrit/site_path/

migrate:

  • stop gerrit
  • rsync data from /var/lib/gerrit/ to /srv/gerrit/site_path/ one more time
  • mv /var/lib/gerrit /srv/gerrit/var-lib-gerrit-backup (just in case, but don't leave it there forever; /var/lib/gerrit needs to move out of the way either way. Or do we simply "mv /var/lib/gerrit /srv/gerrit/site_path" and skip the rsync?)
  • mount --bind /srv/gerrit/site_path /var/lib/gerrit (alternative: ln -s /srv/gerrit/site_path /var/lib/gerrit)
  • start gerrit

?
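The steps above could be scripted roughly as follows. This is a dry-run sketch, not a tested procedure: the `gerrit` systemd unit name, the rsync flags and the `run` helper are assumptions. The extra `mkdir` is there because `mount --bind` needs the mount point to exist once /var/lib/gerrit has been moved away.

```shell
#!/usr/bin/env bash
# Sketch of the prep/migrate steps. With DRY_RUN=1 (the default) commands are
# only printed, so the plan can be reviewed before anything is touched.
set -eu

SRC=/var/lib/gerrit
DST=/srv/gerrit/site_path
BACKUP=/srv/gerrit/var-lib-gerrit-backup
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

prep() {
    run mkdir -p "$DST"
    run rsync -a "$SRC/" "$DST/"            # pre-copy while Gerrit still runs
}

migrate() {
    run systemctl stop gerrit               # unit name is an assumption
    run rsync -a --delete "$SRC/" "$DST/"   # final sync with Gerrit stopped
    run mv "$SRC" "$BACKUP"                 # keep the old copy, just in case
    run mkdir "$SRC"                        # empty mount point for the bind mount
    run mount --bind "$DST" "$SRC"
    run systemctl start gerrit
}
```

Sourcing the file and calling `prep` and then `migrate` prints the eight commands in order; setting `DRY_RUN=0` would execute them.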

Tue, Apr 14, 7:09 AM · Patch-For-Review, Sustainability (Incident Followup), Release-Engineering-Team, collaboration-services, Gerrit

Thu, Apr 9

ABran-WMF moved T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs from Awaiting Input to Work in Progress on the collaboration-services board.

No impact from that change over the past few hours:

root@contint1002:/srv/jenkins/builds# for i in $(fdfind --type file --changed-within 2h '^log$') ; do rg "GnuTLS recv error" -zli "${i}" -q && echo ${i} ; done|wc -l
7

I will keep on digging

Thu, Apr 9, 1:28 PM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF added a comment to T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.

Envoy now stops reusing connections to httpd on our Gerrit primary instance. I'll monitor the impact on the GnuTLS recv error rate.

Thu, Apr 9, 9:52 AM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF added a comment to T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.

A first manual test on gerrit-spare shows no breakage, I will now try to apply that change manually on the primary Gerrit instance before merging.

Transactions:		      408    hits
Availability:		      100.00 %
Thu, Apr 9, 9:27 AM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF added a comment to T420623: netbox report error for puppetdb serial versus netbox serial for backup1012.

CC @ABran-WMF @Jelto that backups from gerrit & gitlab (and attempted recoveries) will be unavailable for that 2 hour window.

Thu, Apr 9, 9:14 AM · collaboration-services, SRE, ops-eqiad, DC-Ops
ABran-WMF added a comment to T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.

Thanks @DLynch @ArthurTaylor for reporting these builds.

Thu, Apr 9, 9:01 AM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit

Wed, Apr 8

ABran-WMF added a comment to T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.
Wed, Apr 8, 12:11 PM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF added a comment to T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.

Connection reuse has been disabled again. Please let me know if the 502 errors are still happening.

Wed, Apr 8, 7:37 AM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF added a comment to T422070: GitlabPackagePullerFailedOnRun - apt-staging2001:9100.

I agree, we should align the permission and make it either private or public, like the other projects. I can reach out to Nat.

Wed, Apr 8, 7:35 AM · collaboration-services
ABran-WMF moved T422559: @wikimedia.org email addresses don't seem to be receiving emails sent by the test Phabricator instance from Incoming to Consultation on the collaboration-services board.
Wed, Apr 8, 6:51 AM · Infrastructure-Foundations, Mail, collaboration-services, VPS-project-Phabricator

Tue, Apr 7

ABran-WMF added a comment to T422070: GitlabPackagePullerFailedOnRun - apt-staging2001:9100.

I don't think this script should fail hard for an improperly configured project. I'll add proper error handling to the script, which should fix the broken package pulling.
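The intended behaviour can be sketched as a loop that logs and skips a failing project instead of aborting the whole run. This is shown in shell purely for illustration; the puller script itself is not shown here, and `pull_all` and the fetch command it receives are hypothetical names:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run a fetch command for each project and keep going
# when one project fails (for example because of a permission error).
# Usage: pull_all FETCH_COMMAND PROJECT [PROJECT...]
pull_all() {
    local fetch_cmd=$1
    shift
    local project skipped=0
    for project in "$@"; do
        if ! "$fetch_cmd" "$project"; then
            # Log the failure and continue with the next project.
            echo "ERROR: skipping project ${project} after fetch error" >&2
            skipped=$((skipped + 1))
        fi
    done
    # Report how many projects were skipped so a caller can still alert.
    echo "skipped=${skipped}"
}
```

Reporting the skipped count keeps a misconfigured project visible without blocking packages from every other project.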

Tue, Apr 7, 5:41 PM · collaboration-services
ABran-WMF closed T422493: GerritHAProxyServiceUnavailable - Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in esams as Resolved.

alerts were stemming from esams maintenance.

Tue, Apr 7, 1:51 PM · collaboration-services
ABran-WMF merged task T422492: GerritHAProxyBackendUnavailable - tcp-proxy3001:9422 into T422493: GerritHAProxyServiceUnavailable - Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in esams.
Tue, Apr 7, 1:51 PM · collaboration-services
ABran-WMF merged T422492: GerritHAProxyBackendUnavailable - tcp-proxy3001:9422 into T422493: GerritHAProxyServiceUnavailable - Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in esams.
Tue, Apr 7, 1:51 PM · collaboration-services
ABran-WMF renamed T422493: GerritHAProxyServiceUnavailable - Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in esams from GerritHAProxyServiceUnavailable to GerritHAProxyServiceUnavailable - Gerrit tcp-proxy (HAProxy) service gerrit_ssh is DOWN in esams.
Tue, Apr 7, 1:50 PM · collaboration-services
ABran-WMF renamed T422492: GerritHAProxyBackendUnavailable - tcp-proxy3001:9422 from GerritHAProxyBackendUnavailable to GerritHAProxyBackendUnavailable - tcp-proxy3001:9422.
Tue, Apr 7, 1:50 PM · collaboration-services
ABran-WMF moved T246763: Jenkins job failing intermittently due to Gerrit HTTP 502 errors when interacting with repos from Awaiting Input to Work in Progress (Tracking tasks) on the collaboration-services board.
Tue, Apr 7, 1:44 PM · collaboration-services, Release-Engineering-Team, Patch-For-Review, ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Gerrit
ABran-WMF closed T422468: Gerrit load balancer services still in lvs_setup as Resolved.

thanks for highlighting this @taavi, the change has been merged to move these entries to state: production, feel free to reopen that issue if needed.

Tue, Apr 7, 1:38 PM · collaboration-services
ABran-WMF added a comment to T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.

I forgot to add on my previous comment:

# for i in $(fdfind --type f --change-newer-than 2026-03-25 '^log$') ; do rg "GnuTLS recv error" -zli "${i}" -q && echo ${i} ; done|wc -l
114

There is a clear increase of these errors in the logs on contint, after we introduced Envoy and enabled connection-reuse.

Tue, Apr 7, 1:28 PM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF added a comment to T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.

To be clear, this is basically just speculation based on the limited information I can find; but I am wondering about whether this type of error might potentially be unrelated to the addition of Envoy to Gerrit's stack.
Searching Gerrit for the text GnuTLS recv error (-54) (to try and find instances where this CI error has been pasted into a Gerrit comment), it seems like an early singular occurrence of this issue may have been on 2026-02-18, with occurrences starting regularly from 2026-03-16 onwards. IIUC from T420909#11752446, Envoy was added to Gerrit's stack on 2026-03-26.

Tue, Apr 7, 12:44 PM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF changed the status of T422468: Gerrit load balancer services still in lvs_setup from Open to In Progress.
Tue, Apr 7, 9:06 AM · collaboration-services
ABran-WMF changed the status of T417996: Fix up Gerrit sshd.idleTimeout from Open to Stalled.
Tue, Apr 7, 7:10 AM · Continuous-Integration-Infrastructure (Zuul upgrade), collaboration-services, Release-Engineering-Team, Gerrit
ABran-WMF merged task T422356: apt-staging2001 - GitlabPackagePullerFailedOnPrepare into T422070: GitlabPackagePullerFailedOnRun - apt-staging2001:9100.
Tue, Apr 7, 6:45 AM · collaboration-services
ABran-WMF merged T422356: apt-staging2001 - GitlabPackagePullerFailedOnPrepare into T422070: GitlabPackagePullerFailedOnRun - apt-staging2001:9100.
Tue, Apr 7, 6:45 AM · collaboration-services
ABran-WMF merged T422456: GitlabPackagePullerFailedOnPrepare into T422070: GitlabPackagePullerFailedOnRun - apt-staging2001:9100.
Tue, Apr 7, 6:45 AM · collaboration-services
ABran-WMF merged task T422456: GitlabPackagePullerFailedOnPrepare into T422070: GitlabPackagePullerFailedOnRun - apt-staging2001:9100.
Tue, Apr 7, 6:45 AM · collaboration-services
ABran-WMF added a comment to T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.

It is not an issue with the software (git) being used on the client side (Jenkins) but with the infrastructure. Connections get aborted based on idle timeout, maximum duration or timeout to first response byte and there are four systems in the chain: ATS, Envoy, Apache, Gerrit.

Tue, Apr 7, 6:19 AM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit

Fri, Apr 3

ABran-WMF closed T422226: PuppetFailure - Puppet has failed on contint2003:9100 as Resolved.

I think that was resolved by @Dzahn

Fri, Apr 3, 8:23 AM · collaboration-services
ABran-WMF renamed T422226: PuppetFailure - Puppet has failed on contint2003:9100 from PuppetFailure to PuppetFailure - Puppet has failed on contint2003:9100.
Fri, Apr 3, 8:22 AM · collaboration-services
ABran-WMF added a comment to T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.

Thanks for the follow-up @SomeRandomDeveloper! Given the results of our recent tweaks on Envoy, it is a bit hard to debug further without more details.
Would it be possible to run that job with a bit more verbosity? Something like GIT_CURL_VERBOSE=1 GIT_TRACE_CURL=1 git $command -v.

Fri, Apr 3, 6:30 AM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit

Thu, Apr 2

ABran-WMF closed T411904: ATS/Gerrit: validate TLS hosts for gerrit (revert workaround that skips validation), a subtask of T411895: gerrit behind CDN, as Resolved.
Thu, Apr 2, 12:18 PM · Patch-For-Review, Gerrit, collaboration-services
ABran-WMF closed T411904: ATS/Gerrit: validate TLS hosts for gerrit (revert workaround that skips validation) as Resolved.

@Dzahn closing that one because we removed these lines with T420909: gerrit: Add Envoy in Gerrit's stack

Thu, Apr 2, 12:18 PM · Traffic, Gerrit, collaboration-services
ABran-WMF closed T246763: Jenkins job failing intermittently due to Gerrit HTTP 502 errors when interacting with repos as Resolved.

As mentioned in T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs:

[...] Please let me know if you still see these 502 errors.

Thu, Apr 2, 12:14 PM · collaboration-services, Release-Engineering-Team, Patch-For-Review, ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Gerrit
ABran-WMF changed the status of T422070: GitlabPackagePullerFailedOnRun - apt-staging2001:9100 from In Progress to Stalled.

We need to update permissions on wmf-navigator, marking this Stalled until then

Thu, Apr 2, 9:50 AM · collaboration-services
ABran-WMF moved T422070: GitlabPackagePullerFailedOnRun - apt-staging2001:9100 from Work in Progress to Awaiting Input on the collaboration-services board.
Thu, Apr 2, 9:49 AM · collaboration-services
ABran-WMF merged T422120: GitlabPackagePullerFailedOnPrepare into T422070: GitlabPackagePullerFailedOnRun - apt-staging2001:9100.
Thu, Apr 2, 9:48 AM · collaboration-services
ABran-WMF merged task T422120: GitlabPackagePullerFailedOnPrepare into T422070: GitlabPackagePullerFailedOnRun - apt-staging2001:9100.
Thu, Apr 2, 9:48 AM · collaboration-services
ABran-WMF added a comment to T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.

thanks for these @SomeRandomDeveloper @DLynch, I've merged a config update. Please let me know if you still see these 502 errors.

Thu, Apr 2, 9:45 AM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF added a comment to T422070: GitlabPackagePullerFailedOnRun - apt-staging2001:9100.

got it: 2026-04-02 07:44:44,661 ERROR [gitlab_package_puller.GitlabPackagePuller] Skipping project repos/projects/wmf-navigator after GitLab API error while preparing package fetch: 403: 403 Forbidden

Thu, Apr 2, 7:45 AM · collaboration-services
ABran-WMF changed the status of T422070: GitlabPackagePullerFailedOnRun - apt-staging2001:9100 from Open to In Progress.

agreed @Arnoldokoth, it seems a permission on GitLab might have been changed, preventing the script from accessing a repo. The error is thrown after the Zotero line:

Thu, Apr 2, 7:42 AM · collaboration-services
ABran-WMF renamed T422070: GitlabPackagePullerFailedOnRun - apt-staging2001:9100 from GitlabPackagePullerFailedOnRun to GitlabPackagePullerFailedOnRun - apt-staging2001:9100.
Thu, Apr 2, 7:36 AM · collaboration-services

Wed, Apr 1

ABran-WMF added a comment to T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.

The change has been merged, please let us know if that does not fix the situation.

Wed, Apr 1, 3:24 PM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF added a subtask for T379714: Upgrade to Gerrit 3.11: Unknown Object (Task).
Wed, Apr 1, 9:54 AM · Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥), Gerrit (Gerrit 3.11), collaboration-services
ABran-WMF added a comment to T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.

@ArthurTaylor @SomeRandomDeveloper is it OK for me to retry some of these jobs to test my change?

Wed, Apr 1, 9:53 AM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF reopened T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs, a subtask of T420909: gerrit: Add Envoy in Gerrit's stack, as Open.
Wed, Apr 1, 9:09 AM · Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF reopened T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs as "Open".

thanks for raising these, I'll check

Wed, Apr 1, 9:09 AM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF reopened T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs, a subtask of T421736: Daily mediawiki-core-phpmetrics job broken since March 26, 2026, as Open.
Wed, Apr 1, 9:09 AM · collaboration-services, Release-Engineering-Team, Continuous-Integration-Infrastructure
ABran-WMF updated the task description for T417996: Fix up Gerrit sshd.idleTimeout.
Wed, Apr 1, 8:39 AM · Continuous-Integration-Infrastructure (Zuul upgrade), collaboration-services, Release-Engineering-Team, Gerrit
ABran-WMF assigned T417996: Fix up Gerrit sshd.idleTimeout to hashar.

I'm not sure this is still required; I've updated the timers in 1266149. @hashar, please let me know if it is no longer required, and I'll drop the patch and mark this as Resolved.

Wed, Apr 1, 8:35 AM · Continuous-Integration-Infrastructure (Zuul upgrade), collaboration-services, Release-Engineering-Team, Gerrit
ABran-WMF closed T246763: Jenkins job failing intermittently due to Gerrit HTTP 502 errors when interacting with repos as Resolved.
Wed, Apr 1, 7:42 AM · collaboration-services, Release-Engineering-Team, Patch-For-Review, ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Gerrit
ABran-WMF closed T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs, a subtask of T420909: gerrit: Add Envoy in Gerrit's stack, as Resolved.
Wed, Apr 1, 6:50 AM · Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF closed T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs as Resolved.
Wed, Apr 1, 6:50 AM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF closed T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs, a subtask of T421736: Daily mediawiki-core-phpmetrics job broken since March 26, 2026, as Resolved.
Wed, Apr 1, 6:50 AM · collaboration-services, Release-Engineering-Team, Continuous-Integration-Infrastructure

Tue, Mar 31

ABran-WMF moved T246763: Jenkins job failing intermittently due to Gerrit HTTP 502 errors when interacting with repos from Work in Progress to Awaiting Input on the collaboration-services board.

follow up done in T421736: Daily mediawiki-core-phpmetrics job broken since March 26, 2026 and T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs:
Envoy upstream timeout has been updated to allow longer git fetch commands, let me know if that issue is still present after that.

Tue, Mar 31, 8:37 AM · collaboration-services, Release-Engineering-Team, Patch-For-Review, ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Gerrit
ABran-WMF moved T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs from Work in Progress to Awaiting Input on the collaboration-services board.

after merging that config change, mediawiki-core-phpmetrics (from T421736: Daily mediawiki-core-phpmetrics job broken since March 26, 2026) now shows a successful build, @Jdlrobson @DLynch please let me know if your issue is still present after that change.

Tue, Mar 31, 8:33 AM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF changed the status of T421736: Daily mediawiki-core-phpmetrics job broken since March 26, 2026 from Open to In Progress.

@dancy I've updated Envoy's configuration to increase the upstream_response_timeout from 120s to 300s. I then triggered a mediawiki-core-phpmetrics build and it went OK: https://integration.wikimedia.org/ci/job/mediawiki-core-phpmetrics/705/console

Tue, Mar 31, 8:29 AM · collaboration-services, Release-Engineering-Team, Continuous-Integration-Infrastructure
ABran-WMF added a parent task for T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs: T421736: Daily mediawiki-core-phpmetrics job broken since March 26, 2026.
Tue, Mar 31, 8:07 AM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF added a subtask for T421736: Daily mediawiki-core-phpmetrics job broken since March 26, 2026: T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.
Tue, Mar 31, 8:07 AM · collaboration-services, Release-Engineering-Team, Continuous-Integration-Infrastructure
ABran-WMF changed the status of T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs, a subtask of T420909: gerrit: Add Envoy in Gerrit's stack, from Open to In Progress.
Tue, Mar 31, 7:45 AM · Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF changed the status of T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs from Open to In Progress.
Tue, Mar 31, 7:45 AM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF closed T420909: gerrit: Add Envoy in Gerrit's stack, a subtask of T420189: Gerrit: Debug connection re-use on Gerrit's httpd causing Gerrit interface to be very slow, as Resolved.
Tue, Mar 31, 6:47 AM · Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF closed T420909: gerrit: Add Envoy in Gerrit's stack as Resolved.

Envoy has been added to our stack, we need to follow up in T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs to adapt configurations

Tue, Mar 31, 6:47 AM · Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF created T421827: gerrit: Adapt timeouts to avoid 502 errors in CI jobs.
Tue, Mar 31, 6:46 AM · ci-test-error (WMF-deployed Build Failure), Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit

Mon, Mar 30

ABran-WMF updated subscribers of T421736: Daily mediawiki-core-phpmetrics job broken since March 26, 2026.

thanks @dancy for highlighting this, that is probably stemming from T420909: gerrit: Add Envoy in Gerrit's stack.
I think operations/puppet/+/1262020 could fix it, wdyt @hashar?

Mon, Mar 30, 3:44 PM · collaboration-services, Release-Engineering-Team, Continuous-Integration-Infrastructure
ABran-WMF moved T246763: Jenkins job failing intermittently due to Gerrit HTTP 502 errors when interacting with repos from Work in Progress (Tracking tasks) to Work in Progress on the collaboration-services board.

Yes, you should have access to do that now.

Mon, Mar 30, 3:20 PM · collaboration-services, Release-Engineering-Team, Patch-For-Review, ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Gerrit
ABran-WMF added a subtask for T415237: etherpad table size is 233GB / plan to delete all etherpads in May 2026: T421315: Update the default wording shown at the top of all new WM-etherpads to explain the auto-deletion.
Mon, Mar 30, 12:30 PM · User-notice, collaboration-services, Wikimedia-Etherpad, Data-Persistence
ABran-WMF added a subtask for T420793: Warn Etherpad users of upcoming purge: T421315: Update the default wording shown at the top of all new WM-etherpads to explain the auto-deletion.
Mon, Mar 30, 12:29 PM · collaboration-services, Wikimedia-Etherpad
ABran-WMF added parent tasks for T421315: Update the default wording shown at the top of all new WM-etherpads to explain the auto-deletion: T420793: Warn Etherpad users of upcoming purge, T415237: etherpad table size is 233GB / plan to delete all etherpads in May 2026.
Mon, Mar 30, 12:29 PM · collaboration-services, Wikimedia-Etherpad
ABran-WMF added a comment to T402260: Replace Spamassassin with Rspam for VRTS on Postfix.

The new training flow keeps the existing VRTS export unchanged: vrts.TicketExport2Mbox.pl still produces mbox input in /var/spool/spam/{spam,ham}, and SpamAssassin is still trained with sa-learn --mbox and spamassassin --add-to-blacklist/--add-to-whitelist --mbox.

Mon, Mar 30, 12:22 PM · Patch-For-Review, collaboration-services, vrts, Znuny, Infrastructure-Foundations, Mail, SRE
ABran-WMF added a comment to T246763: Jenkins job failing intermittently due to Gerrit HTTP 502 errors when interacting with repos.

Try now please.

Mon, Mar 30, 6:40 AM · collaboration-services, Release-Engineering-Team, Patch-For-Review, ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Gerrit

Fri, Mar 27

ABran-WMF added a comment to T420909: gerrit: Add Envoy in Gerrit's stack.

we tweaked several knobs on httpd and Envoy and still have the same underlying issue, I think aligning Jetty with the rest of the timers could yield more results

Fri, Mar 27, 9:42 AM · Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF added a comment to T420909: gerrit: Add Envoy in Gerrit's stack.
profile::tlsproxy::envoy::upstream_tls: true
profile::tlsproxy::envoy::upstream_response_timeout: 120.0

We need the counterpart for downstream (Apache) which currently has a timeout of 122. Maybe:

profile::tlsproxy::envoy::downstream_idle_timeout: 110
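
A quick way to sanity-check such a pair of values is sketched below. It assumes the goal is for each hop listed first to time out before the hop listed after it (here 110s for Envoy against Apache's 122s); that ordering rule and the `check_timeout_chain` helper are assumptions, and only whole seconds are handled:

```shell
#!/usr/bin/env bash
# Hypothetical check: given NAME=SECONDS pairs in chain order, warn whenever
# an entry does not have a strictly shorter timeout than the entry after it.
# Usage: check_timeout_chain NAME=SECONDS [NAME=SECONDS...]
check_timeout_chain() {
    local prev_name="" prev_value="" entry name value rc=0
    for entry in "$@"; do
        name=${entry%%=*}     # text before the first "="
        value=${entry#*=}     # text after the first "="
        if [ -n "$prev_name" ] && [ "$prev_value" -ge "$value" ]; then
            echo "WARN: ${prev_name} (${prev_value}s) should be shorter than ${name} (${value}s)"
            rc=1
        fi
        prev_name=$name
        prev_value=$value
    done
    return "$rc"
}

# Example with the values discussed above:
# check_timeout_chain envoy=110 apache=122
```

The same call can be extended with the other hops in the chain once their timeouts are known.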
Fri, Mar 27, 6:30 AM · Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit
ABran-WMF added a comment to T246763: Jenkins job failing intermittently due to Gerrit HTTP 502 errors when interacting with repos.

Note: If you get a 404 when visiting the job URL, click the "Sign In" button on the top left of the page, then try again.

Fri, Mar 27, 6:15 AM · collaboration-services, Release-Engineering-Team, Patch-For-Review, ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Gerrit
ABran-WMF closed T417279: Harmonize DNS on all gerrit instances, a subtask of T387833: Gerrit switchover process, as Resolved.
Fri, Mar 27, 6:14 AM · Gerrit, Patch-For-Review, collaboration-services
ABran-WMF closed T417279: Harmonize DNS on all gerrit instances as Resolved.

We moved all gerrit instances behind CDN, they all have the same DNS pattern

Fri, Mar 27, 6:13 AM · Gerrit, collaboration-services

Thu, Mar 26

ABran-WMF added a comment to T402260: Replace Spamassassin with Rspam for VRTS on Postfix.

I've been able to test my change on Pontoon:

Thu, Mar 26, 3:59 PM · Patch-For-Review, collaboration-services, vrts, Znuny, Infrastructure-Foundations, Mail, SRE
ABran-WMF changed the status of T278495: Figure out plan for mailman IP situation from Stalled to Open.
Thu, Mar 26, 1:41 PM · collaboration-services, SRE, Wikimedia-Mailing-lists
ABran-WMF closed T246763: Jenkins job failing intermittently due to Gerrit HTTP 502 errors when interacting with repos as Resolved.

Hey @neriah, thanks for raising that issue. With {T420595} and T420909: gerrit: Add Envoy in Gerrit's stack, the situation should be improved.
I'm marking this task as resolved; please let us know if the situation has not improved, or has not improved enough, and feel free to reopen this task.

Thu, Mar 26, 7:26 AM · collaboration-services, Release-Engineering-Team, Patch-For-Review, ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Gerrit
ABran-WMF moved T246763: Jenkins job failing intermittently due to Gerrit HTTP 502 errors when interacting with repos from Incoming to Work in Progress (Tracking tasks) on the collaboration-services board.
Thu, Mar 26, 7:20 AM · collaboration-services, Release-Engineering-Team, Patch-For-Review, ci-test-error (WMF-deployed Build Failure), Release-Engineering-Team-TODO (2020-04 to 2020-06 (Q4)), Gerrit
ABran-WMF renamed T421335: SystemdUnitFailed - backup-restore.service on gitlab2002:9100 from SystemdUnitFailed to SystemdUnitFailed - backup-restore.service on gitlab2002:9100.
Thu, Mar 26, 7:20 AM · collaboration-services
ABran-WMF closed T420909: gerrit: Add Envoy in Gerrit's stack as Resolved.

gerrit-spare now uses Envoy to expose its service to the CDN.

so does gerrit-replica

Thu, Mar 26, 6:41 AM · Patch-For-Review, collaboration-services, Release-Engineering-Team, Traffic, Gerrit