wikikube-worker1001 failed to docker pull on two consecutive deployments
Closed, Resolved (Public)

Description

13:51:05 Started docker pull on k8s nodes
13:55:27 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-09-19-135040-publish (ran as mwdeploy@wikikube-worker1001.eqiad.wmnet) returned [255]: ssh: connect to host wikikube-worker1001.eqiad.wmnet port 22: Connection timed out

13:55:27 docker_pull_k8s: 100% (in-flight: 0; ok: 431; fail: 1; left: 0)
13:55:27 1 K8s nodes failed to pull the multiversion image
13:55:27 Finished docker pull on k8s nodes (duration: 04m 21s)
14:06:06 Started docker pull on k8s nodes
14:10:30 /usr/bin/sudo /usr/local/sbin/mediawiki-image-download 2024-09-19-140541-publish (ran as mwdeploy@wikikube-worker1001.eqiad.wmnet) returned [255]: ssh: connect to host wikikube-worker1001.eqiad.wmnet port 22: Connection timed out

14:10:30 docker_pull_k8s: 100% (in-flight: 0; ok: 431; fail: 1; left: 0)
14:10:30 1 K8s nodes failed to pull the multiversion image
14:10:30 Finished docker pull on k8s nodes (duration: 04m 24s)

No other errors during the rest of the deployment – I’m not sure if that means the deployment to wikikube-worker1001 still succeeded or if Kubernetes was okay with it failing.
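
For anyone triaging similar output, the failing hosts and the ok/fail counts can be pulled out of the log with a small script. This is only a sketch: the regular expressions assume the exact line format quoted above, and the function name `summarize_pull_log` is made up for illustration; none of it comes from the deployment tooling itself.

```python
import re

# Both patterns are assumptions based on the log lines quoted in this task,
# not on the deployment tool's source.
FAILURE_RE = re.compile(
    r"\(ran as \S+@(?P<host>[^)\s]+)\) returned \[(?P<code>\d+)\]: (?P<reason>.*)"
)
PROGRESS_RE = re.compile(r"ok: (?P<ok>\d+); fail: (?P<fail>\d+); left: (?P<left>\d+)")


def summarize_pull_log(text):
    """Return (failures, counts) for one docker-pull log excerpt.

    failures: list of (host, exit_code, reason) tuples, one per failed host.
    counts: dict with the last seen ok/fail/left numbers, or None if absent.
    """
    failures = []
    counts = None
    for line in text.splitlines():
        match = FAILURE_RE.search(line)
        if match:
            failures.append(
                (match["host"], int(match["code"]), match["reason"].strip())
            )
            continue
        match = PROGRESS_RE.search(line)
        if match:
            counts = {key: int(value) for key, value in match.groupdict().items()}
    return failures, counts
```

Run against either excerpt above, this would report a single failure for wikikube-worker1001.eqiad.wmnet with exit code 255, which is the exit code ssh itself returns when the connection fails.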

Event Timeline

I have frankly no idea what tags to add to this… ops-eqiad is just a wild guess. But given that it happened twice, it feels worth investigating by someone™.

taavi added subscribers: akosiaris, taavi.

This probably has something to do with it? /cc @akosiaris

The last Puppet run was at Mon Sep 16 09:50:38 UTC 2024 (4583 minutes ago). Puppet is disabled. alex fooing around - akosiaris
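
As a sanity check on the timeline, the timestamp and age in that message can be combined to see when the message was generated. The sketch below uses only the two values quoted above; that the deployment log timestamps are also in UTC is an assumption.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the host message quoted above.
last_puppet_run = datetime(2024, 9, 16, 9, 50, 38, tzinfo=timezone.utc)
age = timedelta(minutes=4583)

# When the message was generated: last run plus the reported age.
generated_at = last_puppet_run + age
print(generated_at.isoformat())  # 2024-09-19T14:13:38+00:00
```

If the deployment logs are in UTC, that is a few minutes after the second failed pull at 14:10:30, so Puppet had been disabled for roughly three days by the time both deployments ran.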
JMeybohm claimed this task.
JMeybohm subscribed.

I think I know what this was about... I re-enabled and ran Puppet, which should have fixed the firewall rules for SSH from the deploy hosts. Sorry for the trouble!
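
One way to confirm the fix from a deploy host is to check that TCP port 22 on the worker accepts connections again. This is a minimal sketch: the function name `port_reachable` and the 5-second timeout are illustrative choices, and it only tests TCP reachability, not SSH authentication or the download command itself.

```python
import socket


def port_reachable(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures.
        return False
```

Calling `port_reachable("wikikube-worker1001.eqiad.wmnet")` from a deploy host should return `True` once the firewall rules are back in place. While SSH traffic was being dropped, as the "Connection timed out" errors in the description suggest, it would have returned `False` after the timeout.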

Thanks for fixing it @JMeybohm. For posterity's sake, the "fooing" part was related to T374366 and trying to figure out the race condition(s).