Implement a retry policy for network errors in CI
Closed, Duplicate (Public)

Description

After T421827 (gerrit: Adapt timeouts to avoid 502 errors in CI jobs), it seemed that the best way to address the remaining network errors in CI jobs was to implement a retry policy on git network errors.

A basic CR created in that task tries to explore that solution.
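
For illustration only, a retry policy of this kind could look roughly like the sketch below. It is not the content of the CR: `run_with_retry`, `is_transient`, `TRANSIENT_MARKERS`, the attempt count and the delays are all hypothetical names and values. The GnuTLS marker is taken from the error reported in this task, and the 502 marker assumes git surfaces curl's usual HTTP error message.

```python
import subprocess
import time

# Substrings of git's stderr treated as transient network failures.
# Assumption: this list is illustrative, not what Quibble actually matches.
TRANSIENT_MARKERS = (
    "GnuTLS recv error",
    "The requested URL returned error: 502",
)


def is_transient(stderr):
    """Return True if stderr looks like a retryable network error."""
    return any(marker in stderr for marker in TRANSIENT_MARKERS)


def run_with_retry(cmd, attempts=3, base_delay=1.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying with exponential backoff on transient errors.

    runner and sleep are injectable so the policy can be tested
    without touching the network or actually waiting.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        # Give up on the last attempt, or when the failure is not
        # a recognised network error (e.g. a missing ref).
        if attempt == attempts or not is_transient(result.stderr):
            raise subprocess.CalledProcessError(
                result.returncode, cmd, result.stdout, result.stderr)
        # Wait base_delay, then 2x, 4x, ... before the next attempt.
        sleep(base_delay * 2 ** (attempt - 1))
```

A call such as `run_with_retry(["git", "fetch", "origin"])` would then retry only when stderr matches one of the markers, and fail immediately on any other git error so that real problems are not hidden behind retries.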

Related Objects

Status     Assigned
Resolved   Dzahn
Resolved   None
Resolved   CDanis
Resolved   ABran-WMF
Resolved   Jelto
Resolved   Dzahn
Resolved   Dzahn
Resolved   Jelto
Resolved   Vgutierrez
Resolved   ABran-WMF
Resolved   ABran-WMF
Resolved   ABran-WMF
Resolved   ABran-WMF
Resolved   ABran-WMF
Resolved   ABran-WMF
Resolved   dancy
Resolved   ABran-WMF
Duplicate  ABran-WMF
Open       None

Event Timeline

Change #1278483 had a related patch set uploaded (by Arnaudb; author: Arnaudb):

[integration/quibble@master] zuul: retry policy on network errors

https://gerrit.wikimedia.org/r/1278483

Example of this still happening: "unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/WikiEditor/': GnuTLS recv error (-54): Error in the pull function." at https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-vendor-mysql-php83/31065/console

Dreamy_Jazz subscribed.

And again: "fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/WikimediaCampaignEvents/': GnuTLS recv error (-54): Error in the pull function." at https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-vendor-mysql-php83/31294/console#console-section-4

If folks are planning to land a patch to Quibble to try and help with these CI errors, I wonder whether it might be worth taking the opportunity to also add some verbose logging to Quibble's git commands, to see if the additional logging could help to work out what's actually causing the issues in the first place (xref T421827#11785023).
I guess... whether or not they can be worked around in Wikimedia CI by adding a retry policy, it doesn't feel like the new layers put in front of Gerrit should have caused these additional errors, so (IMO) it would be ideal if we could find out what's causing them.
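
As a rough idea of what such logging could involve: git reads the `GIT_TRACE` and `GIT_CURL_VERBOSE` environment variables and, when they are set to `1`, writes trace and HTTP transfer details to stderr. The helper below is a hypothetical sketch (`git_debug_env` is not an existing Quibble function) of building an environment for git subprocesses with those variables set.

```python
import os


def git_debug_env(base_env=None):
    """Return a copy of the environment with git tracing switched on.

    Hypothetical helper: the name and the choice of variables are
    assumptions for illustration, not part of Quibble.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "GIT_TRACE": "1",         # general git trace messages on stderr
        "GIT_CURL_VERBOSE": "1",  # HTTP(S) transfer details from libcurl
    })
    return env
```

It could be passed as `subprocess.run(["git", "fetch"], env=git_debug_env())`. The output is verbose, so it may make sense to enable it only for retried attempts rather than for every git command.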

Another example (of several in recent days) that's notable because it's really burning up the current backport window: https://integration.wikimedia.org/ci/job/quibble-with-gated-extensions-vendor-mysql-php83/33059/console

(For this task's record: T420865: Fetches from Gerrit aborted due to: GnuTLS recv error (-54): Error in the pull function was reopened / unmerged as a duplicate of T420909: gerrit: Add Envoy in Gerrit's stack, so I assume that it might now be the 'canonical' task for this issue? But I'm not completely sure tbh; IMO the way the tasks about this problem have been organised has been a bit confusing.)

ABran-WMF claimed this task.
ABran-WMF added a subscriber: hashar.

> so I assume that it might now be the 'canonical' task for this issue?

Since @hashar reopened T420865#11897507, I think it will be less confusing to focus on a single task to track that effort, so I'll mark this one as resolved.

(Boldly closing as a duplicate instead, as IMO it is a bit confusing to have this task as 'resolved' when the retry policy patch hasn't actually been merged yet, FWICS.)

Change #1278483 had a related patch set uploaded (by Thcipriani; author: Arnaudb):

[integration/quibble@master] zuul: retry policy on network errors

https://gerrit.wikimedia.org/r/1278483