
publish-to-doc job is close to its timeout, with some builds lost
Closed, Resolved, Public

Description

It seems the rsync job currently has a timeout of 3 minutes. Successful runs now take about 2.5 minutes for MediaWiki core; some runs take slightly longer, hit the timeout, and their output is discarded.

https://integration.wikimedia.org/ci/job/publish-to-doc1001/73099/console

00:00:00.001 Started by upstream project "mediawiki-core-doxygen-docker" build number 16595
…
00:00:00.003 Building remotely on contint2001 (dockerPublish pipelinelib blubber productionAgents train) in workspace /srv/jenkins-slave/workspace/publish-to-doc1001
…
00:00:00.154 [publish-to-doc1001] …
00:00:00.157 Fetching from:
00:00:00.157 - Instance...: 172.16.7.208
00:00:00.157 - Workspace..: /srv/jenkins/workspace/workspace/mediawiki-core-doxygen-docker
00:00:00.157 - Subdir.....: log/build/html
00:00:00.158 + rsync --archive --compress '--rsh=/usr/bin/ssh -a -T -o ConnectTimeout=6 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no' jenkins-deploy@172.16.7.208:/srv/jenkins/workspace/workspace/mediawiki-core-doxygen-docker/log/build/html/. .
00:02:32.238 Creating remote directory mediawiki-core/master/php
00:02:32.594 sending incremental file list
00:02:32.903 
00:02:32.903 sent 110 bytes  received 27 bytes  91.33 bytes/sec
00:02:32.903 total size is 0  speedup is 0.00
00:02:32.906 Publishing ...
00:02:32.907 + rsync --archive --compress --delete-after . rsync://doc1001.eqiad.wmnet/doc/mediawiki-core/master/php
00:03:00.113 Build timed out (after 3 minutes). Marking the build as failed.
…
00:03:00.347 Finished: FAILURE
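
To make the timing in the console output above explicit, here is a minimal sketch that computes how long the fetch step took and how much time was left for the publish step before the abort. The timestamps are copied from the log; the helper name `parse_offset` and the variable names are illustrative only and are not part of the job.

```python
from datetime import timedelta


def parse_offset(stamp: str) -> timedelta:
    """Parse an 'HH:MM:SS.mmm' elapsed-time prefix from the console log."""
    hours, minutes, seconds = stamp.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


# Elapsed-time prefixes taken from the console output of build 73099.
fetch_started = parse_offset("00:00:00.158")    # first rsync (fetch) begins
fetch_finished = parse_offset("00:02:32.238")   # next log line after the fetch
publish_started = parse_offset("00:02:32.907")  # second rsync (publish) begins
timed_out = parse_offset("00:03:00.113")        # Jenkins aborts the build

fetch_seconds = (fetch_finished - fetch_started).total_seconds()
publish_seconds = (timed_out - publish_started).total_seconds()

print(f"fetch step: {fetch_seconds:.1f}s")
print(f"publish step before abort: {publish_seconds:.1f}s")
```

In this run the fetch from the CI instance consumes most of the 3-minute budget, leaving under half a minute for the publish step to doc1001.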

Event Timeline

Krinkle renamed this task from "publish-to-doc job it close to its timeout / sometimes aborted" to "publish-to-doc job is close to its timeout, with some builds lost". Jun 14 2020, 3:46 PM

I gave it a quick look this morning:

MediaWiki core refactored its hooks, and the HookRunner implements dozens of interfaces, so the Graphviz collaboration diagram for each of those classes ends up being a fairly large PNG file (up to 4 MB). As a result, the total size of the generated documentation has grown by an order of magnitude.

The labs instances have an egress traffic shaping set at 240MBits (can be seen via sudo tc class show dev eth0). We had the same issue with the castor instance (T232644).

We can thus solve this by:

A) bumping the timeout in the jjb job

B) setting labstore::traffic_shaping::egress: 100mbps in hiera (and running tc-setup on all instances).

Change 605549 had a related patch set uploaded (by Hashar; owner: Hashar):
[integration/config@master] Raise timeout for publish-to-doc1001

https://gerrit.wikimedia.org/r/605549

Change 605550 had a related patch set uploaded (by Hashar; owner: Hashar):
[operations/puppet@production] contint: raise egress traffic shaping limit

https://gerrit.wikimedia.org/r/605550

Change 605549 merged by jenkins-bot:
[integration/config@master] Raise timeout for publish-to-doc1001

https://gerrit.wikimedia.org/r/605549

The Jenkins job now times out after 6 minutes and I have changed the egress traffic shaping. Now pending review/merge of puppet patch:

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/605550/

Change 605550 merged by Bstorm:
[operations/puppet@production] contint: raise egress traffic shaping limit

https://gerrit.wikimedia.org/r/605550

Solved by raising the timeout from 3 minutes to 6 and tripling the available bandwidth.
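
As a rough check of the "tripling" figure, the sketch below compares the old and new shaping rates. It assumes both values use tc's rate suffixes, where "mbit" means megabits per second and "mbps" means megabytes per second; whether the hiera value is passed to tc unchanged is an assumption, not something stated in this task. The function name `tc_rate_to_mbit` is hypothetical.

```python
def tc_rate_to_mbit(rate: str) -> float:
    """Convert a tc-style rate string to megabits per second.

    Only the two suffixes relevant here are handled:
    'mbit' = megabits/s, 'mbps' = megabytes/s (8 bits per byte).
    """
    rate = rate.strip().lower()
    if rate.endswith("mbps"):
        return float(rate[:-4]) * 8
    if rate.endswith("mbit"):
        return float(rate[:-4])
    raise ValueError(f"unsupported rate suffix: {rate!r}")


# Previous limit as reported by `tc class show` (assumed to be in megabits/s).
old_mbit = tc_rate_to_mbit("240mbit")
# New hiera value, assumed to be interpreted by tc as megabytes/s.
new_mbit = tc_rate_to_mbit("100mbps")

ratio = new_mbit / old_mbit
print(f"old: {old_mbit:.0f} Mbit/s, new: {new_mbit:.0f} Mbit/s, ratio: {ratio:.2f}x")
```

Under these assumptions the new limit is a little over three times the old one, which is consistent with the summary above.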

Mentioned in SAL (#wikimedia-releng) [2020-06-18T18:28:04Z] <hashar> integration-castor03: remove labstore::traffic_shaping::egress: 100mbps in horizon. It is now applied project wide via puppet.git # T232644 T255371