
scap should not run mediawiki-image-download on pooled=inactive servers
Closed, Duplicate · Public

Description

We have hit this a couple of times in the past (T363086, T362938):

When a Kubernetes node is set to pooled=inactive, scap still tries to run mediawiki-image-download on it and (apparently) fails if the server is unreachable:

15:08:17 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-05-01-150512-publish (ran as mwdeploy@mw2382.codfw.wmnet) returned [255]: ssh: connect to host mw2382.codfw.wmnet port 22: Connection timed out

IIUC, inactive MediaWiki appservers do not receive code via scap; I think it should behave the same way here.
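One way to get that behavior is to drop inactive hosts from the target list before running the download. This is a hypothetical sketch, not scap's actual code: it assumes each host's pool state is already available as a mapping of hostname to a conftool-style `pooled` value (`yes`, `no`, `inactive`). The function name and all hosts other than mw2382 are made up for illustration.

```python
from typing import Mapping

# Assumed conftool-style "pooled" value that marks a host as inactive.
INACTIVE = "inactive"


def filter_image_download_targets(pool_state: Mapping[str, str]) -> list[str]:
    """Return the hosts that should run mediawiki-image-download.

    Hosts whose pool state is "inactive" are skipped, mirroring how
    inactive appservers are excluded from code syncs. Depooled ("no")
    hosts are kept in this sketch.
    """
    return sorted(host for host, state in pool_state.items() if state != INACTIVE)


# Usage with hypothetical hosts (mw2382 is the one from the log above).
targets = filter_image_download_targets({
    "mw2381.codfw.wmnet": "yes",
    "mw2382.codfw.wmnet": "inactive",
    "mw2383.codfw.wmnet": "no",
})
print(targets)  # ['mw2381.codfw.wmnet', 'mw2383.codfw.wmnet']
```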

Alternatively, there could be a threshold of failed targets (e.g. the deploy is still considered fine if up to 10% of hosts fail mediawiki-image-download).
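A failure threshold could look roughly like the following hypothetical sketch. It assumes the per-host exit codes of mediawiki-image-download have been collected into a mapping (hostname to exit code, 0 meaning success) and treats the run as acceptable when the share of failed hosts is at most `max_failure_ratio`, which defaults to the 10% suggested above. The function name and host list are illustrative and not part of scap.

```python
from typing import Mapping


def image_download_ok(exit_codes: Mapping[str, int], max_failure_ratio: float = 0.10) -> bool:
    """Return True if the fraction of hosts with a non-zero exit code
    is at most max_failure_ratio. An empty result set counts as OK."""
    if not exit_codes:
        return True
    failed = sum(1 for code in exit_codes.values() if code != 0)
    return failed / len(exit_codes) <= max_failure_ratio


# Usage: 20 hypothetical hosts, all succeeding except mw2382.
results = {f"mw{n}.codfw.wmnet": 0 for n in range(2380, 2400)}
results["mw2382.codfw.wmnet"] = 255  # ssh timeout, as in the log above
print(image_download_ok(results))  # True: 1/20 = 5% <= 10%

results["mw2383.codfw.wmnet"] = 255
results["mw2384.codfw.wmnet"] = 255
print(image_download_ok(results))  # False: 3/20 = 15% > 10%
```

Counting every non-zero exit code as a failure treats an ssh timeout (255) the same as a genuine download error; a real implementation might want to tell unreachable hosts apart from failed downloads.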

Event Timeline

Change #1026446 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Remove mw2382 as kubernetes node to prevent scap failures

https://gerrit.wikimedia.org/r/1026446

Change #1026446 merged by JMeybohm:

[operations/puppet@production] Remove mw2382 as kubernetes node to prevent scap failures

https://gerrit.wikimedia.org/r/1026446

FYI, this happened to me again despite the above patch:

19:48:44 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-05-02-194555-publish (ran as mwdeploy@mw2382.codfw.wmnet) returned [255]: ssh: connect to host mw2382.codfw.wmnet port 22: Connection timed out