
MediaWiki periodic Doxygen Jenkins job fails to publish
Closed, Resolved (Public)

Description

The mediawiki-core-doxygen-docker job runs hourly. It started failing on March 27th at 22:05 UTC.

The documentation itself seems to be properly generated:

22:11:13 ---------------------------------------------------
22:11:13 Doxygen execution finished.
22:11:13 Check above for possible errors.
22:11:13 
22:11:13 You might want to delete the temporary file:
22:11:13  /tmp/MWDocGen-xf2zTH
22:11:13 ---------------------------------------------------
22:11:14 [mediawiki-core-doxygen-docker] $ /bin/bash -xe /tmp/jenkins9741863308026659468.sh
22:11:14 + install -m 666 console.txt log/build/
22:11:14 + install -m 666 errors.txt log/build/

But the documentation publishing is failing:

22:11:14 [parameterized-trigger] Current build has no parameters.
22:11:14 Waiting for the completion of publish-to-doc
22:11:14 publish-to-doc #74266 started.
22:12:14 publish-to-doc #74266 completed. Result was FAILURE

The publish-to-doc job full output is:

Started by upstream project "mediawiki-core-doxygen-docker" build number 42088
originally caused by:
 Started by an SCM change
Running as SYSTEM
Building remotely on contint2001 (dockerPublish pipelinelib blubber productionAgents chartPromote train) in workspace /srv/jenkins-slave/workspace/publish-to-doc
[ssh-agent] Looking for ssh-agent implementation...
[ssh-agent]   Exec ssh-agent (binary ssh-agent on a remote machine)
$ ssh-agent
SSH_AUTH_SOCK=/tmp/ssh-kjhTP5Fi0Icr/agent.6772
SSH_AGENT_PID=6782
[ssh-agent] Started.
Running ssh-add (command line suppressed)
Identity added: /srv/jenkins-slave/workspace/publish-to-doc@tmp/private_key_11238384053154175065.key (/srv/jenkins-slave/workspace/publish-to-doc@tmp/private_key_11238384053154175065.key)
[ssh-agent] Using credentials jenkins-deploy (key to connect to labs instances set up with role::ci::slave::labs::common)
[WS-CLEANUP] Deleting project workspace...
[WS-CLEANUP] Deferred wipeout is used...
[publish-to-doc] $ /bin/bash -xe /tmp/jenkins2905738910039968438.sh
+ set -u
+ set +x
Fetching from:
- Instance...: 172.16.5.94
- Workspace..: /srv/jenkins/workspace/mediawiki-core-doxygen-docker
- Subdir.....: log/build/html
+ rsync --archive --stats --compress '--rsh=/usr/bin/ssh -a -T -o ConnectTimeout=6 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no' jenkins-deploy@172.16.5.94:/srv/jenkins/workspace/mediawiki-core-doxygen-docker/log/build/html/. .
Warning: Permanently added '172.16.5.94' (ECDSA) to the list of known hosts.

Number of files: 30,901 (reg: 30,899, dir: 2)
Number of created files: 30,900 (reg: 30,899, dir: 1)
Number of deleted files: 0
Number of regular files transferred: 30,899
Total file size: 592,199,969 bytes
Total transferred file size: 592,199,969 bytes
Literal data: 592,199,969 bytes
Matched data: 0 bytes
File list size: 1,709,485
File list generation time: 0.102 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 587,148
Total bytes received: 187,086,036

sent 587,148 bytes  received 187,086,036 bytes  6,824,479.42 bytes/sec
total size is 592,199,969  speedup is 3.16
Creating remote directory mediawiki-core/master/php
Publishing ...
+ rsync --archive --stats --compress --delete-after . rsync://doc.discovery.wmnet/doc/mediawiki-core/master/php
rsync: failed to set times on "/mediawiki-core/master/php/." (in doc): Operation not permitted (1)
rsync: failed to set times on "/mediawiki-core/master/php/search" (in doc): Operation not permitted (1)

Number of files: 30,901 (reg: 30,899, dir: 2)
Number of created files: 0
Number of deleted files: 0
Number of regular files transferred: 30,899
Total file size: 592,199,969 bytes
Total transferred file size: 592,199,969 bytes
Literal data: 6,798,363 bytes
Matched data: 585,401,606 bytes
File list size: 1,703,630
File list generation time: 0.216 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 5,656,391
Total bytes received: 5,344,249

sent 5,656,391 bytes  received 5,344,249 bytes  372,903.05 bytes/sec
total size is 592,199,969  speedup is 53.83
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1207) [sender=3.1.3]
Build step 'Execute shell' marked build as failure
$ ssh-agent -k
unset SSH_AUTH_SOCK;
unset SSH_AGENT_PID;
echo Agent pid 6782 killed;
[ssh-agent] Stopped.
[WS-CLEANUP] Deleting project workspace...
[WS-CLEANUP] Deferred wipeout is used...
[WS-CLEANUP] done
Finished: FAILURE

Possibly due to some permission issues?

22:11:42 Creating remote directory mediawiki-core/master/php
22:11:42 Publishing ...
22:11:42 + rsync --archive --stats --compress --delete-after . rsync://doc.discovery.wmnet/doc/mediawiki-core/master/php
22:11:43 rsync: failed to set times on "/mediawiki-core/master/php/." (in doc): Operation not permitted (1)
22:11:49 rsync: failed to set times on "/mediawiki-core/master/php/search" (in doc): Operation not permitted (1)
...

22:12:13 sent 5,656,391 bytes  received 5,344,249 bytes  372,903.05 bytes/sec
22:12:13 total size is 592,199,969  speedup is 53.83
22:12:13 rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1207) [sender=3.1.3]
22:12:13 Build step 'Execute shell' marked build as failure
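
For reference, exit code 23 is rsync's "partial transfer due to error" status. The stats above suggest the file data itself went through and only the attribute updates failed. A minimal sketch of a helper that names the few exit codes relevant here (`describe_rsync_exit` is a hypothetical name, not part of the publish-to-doc job; the descriptions follow the rsync man page):

```shell
# Hypothetical helper: map an rsync exit status to a short description.
# Only a handful of codes from the rsync man page are covered.
describe_rsync_exit() {
    case "$1" in
        0)  echo "success" ;;
        23) echo "partial transfer due to error" ;;
        24) echo "partial transfer due to vanished source files" ;;
        30) echo "timeout in data send/receive" ;;
        *)  echo "other rsync error (see EXIT VALUES in rsync(1))" ;;
    esac
}
```

Calling `describe_rsync_exit 23` prints the description matching the failure in the log above.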

Event Timeline

I am pretty sure that is due to fb0b73e29f06b2a5b4089f09a6de77806308d3ec for T319477. It sets the doc-uploader UID to 922:

modules/profile/manifests/doc.pp
-    user { 'doc-uploader':
-        ensure => present,
-        shell  => '/bin/false',
-        system => true,
+    systemd::sysuser { 'doc-uploader':
+      ensure      => present,
+      id          => '922:922',
+      description => 'doc-uploader system user',

The rsync module has:

/etc/rsync.d/frag-doc
[ doc ]
path            = /srv/doc
uid             = doc-uploader
gid             = doc-uploader
...

/srv/doc has been changed, I believe by Puppet, to be owned by uid 922 and gid 922:

$ ls -ldan /srv/doc
drwxr-xr-x 137 922 922 4096 Mar 12 17:00 /srv/doc/

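But the underlying files are still owned by the old uid. The rsync daemon, now running as uid 922, can still write there through the group permission, but setting explicit modification times requires owning the directory, which would explain the "failed to set times" errors: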

doc1002:~$ ls -lda /srv/doc/mediawiki-core
               vvv
drwxrwxr-x 116 499 doc-uploader 4096 Mar 14 03:31 /srv/doc/mediawiki-core/
               ^^^

Checking on the server, ownership is mixed: recently uploaded documentation is owned by doc-uploader (uid 922) while older files are still owned by the old uid 499. The ownership should be fixed recursively on all of the hosts (on doc1003 the files are owned by debmonitor).
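
To gauge how much of the tree is still on the old uid, something like the following could be run on each host. This is only a sketch: `count_not_owned_by` is a hypothetical helper name, and it assumes no path contains a newline, since each path is counted as one line.

```shell
# Hypothetical helper: count entries under ROOT that are NOT owned by OWNER.
# OWNER may be a user name or a numeric uid.
# -xdev keeps find on the same filesystem as ROOT.
# Assumes no path contains a newline (one line is counted per path).
count_not_owned_by() {
    root=$1
    owner=$2
    find "$root" -xdev ! -user "$owner" -print | wc -l | tr -d ' '
}
```

For example, `count_not_owned_by /srv/doc doc-uploader` would print 0 once every entry belongs to doc-uploader.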

Hosts affected (this probably matches the cumin query P:doc):

doc1002.eqiad.wmnet
doc1003.eqiad.wmnet
doc2001.codfw.wmnet
doc2002.codfw.wmnet

fixup
sudo chown -R doc-uploader:doc-uploader /srv/doc

(I am pretty sure Puppet should NOT recursively apply the permission on each run because there are a few million files in there.)
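
As a one-off alternative to a blanket chown -R, a sketch that only changes entries whose owner or group differs. `fix_ownership` and its optional `dry-run` argument are illustrative names, not an existing tool. find still walks the whole tree, so this only saves the chown calls on entries that are already correct.

```shell
# Hypothetical helper: chown only the entries under ROOT whose owner or
# group differs from OWNER:GROUP. With a fourth argument of "dry-run" the
# matching paths are printed and nothing is changed.
# chown -h changes a symlink itself rather than the file it points to.
fix_ownership() {
    root=$1
    owner=$2
    group=$3
    if [ "${4:-}" = "dry-run" ]; then
        find "$root" -xdev \( ! -user "$owner" -o ! -group "$group" \) -print
    else
        find "$root" -xdev \( ! -user "$owner" -o ! -group "$group" \) \
            -exec chown -h "$owner:$group" {} +
    fi
}
```

Running `fix_ownership /srv/doc doc-uploader doc-uploader dry-run` first would list what a real run is going to change.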

cgoubert@cumin1001:~$ sudo cumin 'P:doc' 'chown -R doc-uploader:doc-uploader /srv/doc'
4 hosts will be targeted:
doc[2001-2002].codfw.wmnet,doc[1002-1003].eqiad.wmnet
OK to proceed on 4 hosts? Enter the number of affected hosts to confirm or "q" to quit: 4
===== NO OUTPUT =====
PASS |█████████████████████████████████████████████████████████████████| 100% (4/4) [01:11<00:00, 17.79s/hosts]
FAIL |                                                                         |   0% (0/4) [01:11<?, ?hosts/s]
100.0% (4/4) success ratio (>= 100.0% threshold) for command: 'chown -R doc-upl...ploader /srv/doc'.
100.0% (4/4) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
hashar claimed this task.

That fixed the publishing of the MediaWiki documentation (https://integration.wikimedia.org/ci/job/mediawiki-core-doxygen-docker/ passed). Thank you @Clement_Goubert.