scap should not run mediawiki-image-download on pooled=inactive servers
Closed, Duplicate · Public

Assigned To

None

Authored By

	JMeybohm
	May 2 2024, 8:10 AM

Description

We have had this a couple of times in the past (T363086, T362938):

When a Kubernetes node is set to pooled=inactive, scap still tries to run mediawiki-image-download and fails (?) if the server is not reachable:

15:08:17 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-05-01-150512-publish (ran as mwdeploy@mw2382.codfw.wmnet) returned [255]: ssh: connect to host mw2382.codfw.wmnet port 22: Connection timed out

IIUC, inactive MediaWiki appservers do not receive code via scap. I think it should behave the same way in this case.
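To illustrate the behaviour I have in mind, here is a minimal sketch, not scap's actual code. It assumes the target list is available as a mapping of hostname to conftool pooled state; the function name, the mapping and the second hostname are hypothetical.

```python
def deployable_targets(pooled_state):
    """Return the hosts that should receive mediawiki-image-download.

    pooled_state is a hypothetical mapping of hostname -> pooled value
    ("yes", "no" or "inactive"). Hosts marked inactive are skipped, the
    same way inactive appservers are skipped for code syncs.
    """
    return sorted(
        host for host, state in pooled_state.items() if state != "inactive"
    )


# Illustrative data only: mw2382 is the host from the log line above,
# the other hostname is made up for the example.
example_state = {
    "mw2382.codfw.wmnet": "inactive",
    "example-worker.codfw.wmnet": "yes",
}
targets = deployable_targets(example_state)
```

With this filter, a depooled-but-reachable host (pooled=no) would still get the image, and only pooled=inactive hosts would be left out.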

Alternatively, there could be a threshold of failed targets (e.g. if 10% of hosts fail mediawiki-image-download, it's still fine).
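A rough sketch of what such a threshold check could look like, assuming scap knows the total number of targets and how many of them failed. The function name and the 10% default are illustrative and do not correspond to an existing scap option.

```python
def within_failure_threshold(total_targets, failed_targets, max_failed_ratio=0.10):
    """Return True if the share of failed targets is acceptable.

    max_failed_ratio is a hypothetical setting: the fraction of hosts
    that may fail mediawiki-image-download without failing the deploy.
    """
    if total_targets <= 0:
        raise ValueError("total_targets must be positive")
    if not 0 <= failed_targets <= total_targets:
        raise ValueError("failed_targets must be between 0 and total_targets")
    return failed_targets / total_targets <= max_failed_ratio


# One unreachable host out of 200 would be tolerated;
# two out of ten would still fail the run.
one_of_many = within_failure_threshold(200, 1)
two_of_ten = within_failure_threshold(10, 2)
```

The downside compared to skipping inactive hosts is that a tolerated failure still has to wait for the ssh timeout, so the deploy stays slow even though it no longer fails.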

Details

	Subject	Repo	Branch	Lines +/-
	Remove mw2382 as kubernetes node to prevent scap failures	operations/puppet	production	+0 -1


Related Objects

Mentioned In: T362938: Degraded RAID on mw2382
Mentioned Here: T367862: Use conftool to build scap's kubernetes_workers host list
T362938: Degraded RAID on mw2382
T363086: ManagementSSHDown parse1002.eqiad.wmnet

Event Timeline

JMeybohm created this task. May 2 2024, 8:10 AM

Restricted Application added a subscriber: Aklapper. May 2 2024, 8:10 AM

Change #1026446 had a related patch set uploaded (by JMeybohm; author: JMeybohm):

[operations/puppet@production] Remove mw2382 as kubernetes node to prevent scap failures

https://gerrit.wikimedia.org/r/1026446

gerritbot added a project: Patch-For-Review. May 2 2024, 8:14 AM

JMeybohm mentioned this in T362938: Degraded RAID on mw2382. May 2 2024, 8:14 AM

Change #1026446 merged by JMeybohm:

[operations/puppet@production] Remove mw2382 as kubernetes node to prevent scap failures

https://gerrit.wikimedia.org/r/1026446

Maintenance_bot removed a project: Patch-For-Review. May 2 2024, 8:30 AM

FYI, this happened to me again despite the above patch:

19:48:44 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-05-02-194555-publish (ran as mwdeploy@mw2382.codfw.wmnet) returned [255]: ssh: connect to host mw2382.codfw.wmnet port 22: Connection timed out

See https://phabricator.wikimedia.org/T367862 for a Puppet-based approach to fixing this issue.

jnuche closed this task as a duplicate of T367862: Use conftool to build scap's kubernetes_workers host list. Jun 20 2024, 10:59 AM
