integration-agent-docker-1061 is offline
Closed, Resolved (Public)

Description

For some reason integration-agent-docker-1061 went offline and can't be reconnected. https://integration.wikimedia.org/ci/computer/integration%2Dagent%2Ddocker%2D1061/log gives:

[05/07/25 06:26:54] [SSH] Opening SSH connection to 172.16.6.139:22.
[05/07/25 06:26:54] [SSH] SSH host key matches key seen previously for this host. Connection will be allowed.
ERROR: Server rejected the 1 private key(s) for jenkins-deploy (credentialId:jenkins-deploy-toolforge/method:publickey)
ERROR: Failed to authenticate as jenkins-deploy with credential=jenkins-deploy-toolforge
...
[05/07/25 06:26:55] [SSH] Authentication failed.
Authentication failed.
[05/07/25 06:26:55] Launch failed - cleaning up connection
[05/07/25 06:26:55] [SSH] Connection closed.
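
For triage, the failing credential can be pulled out of such a launch log mechanically. This is a minimal sketch, assuming the agent log has been saved as plain text; the pattern is derived only from the excerpt above, and `find_ssh_auth_failures` is a hypothetical helper, not part of Jenkins:

```python
import re

# Matches the "Failed to authenticate" error line from the excerpt above.
# Assumption: other Jenkins versions may word this line differently.
AUTH_FAILURE = re.compile(
    r"ERROR: Failed to authenticate as (?P<user>\S+) "
    r"with credential=(?P<credential>\S+)"
)


def find_ssh_auth_failures(log_text):
    """Return a (user, credential) pair for each authentication failure line."""
    return [
        (m.group("user"), m.group("credential"))
        for m in AUTH_FAILURE.finditer(log_text)
    ]


sample = """\
[05/07/25 06:26:54] [SSH] Opening SSH connection to 172.16.6.139:22.
ERROR: Failed to authenticate as jenkins-deploy with credential=jenkins-deploy-toolforge
[05/07/25 06:26:55] [SSH] Authentication failed.
"""

for user, credential in find_ssh_auth_failures(sample):
    print(f"user={user} credential={credential}")
```

Run on the excerpt, this reports the `jenkins-deploy` user together with the `jenkins-deploy-toolforge` credential.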

The instance console log at https://horizon.wikimedia.org/project/instances/69cb4a50-ab68-498e-9b7c-d9681d13ff0e/ shows (or showed) some issues:

The last Puppet run was at Thu Mar 20 22:12:06 UTC 2025 (7 minutes ago). 
Last Puppet commit: (08f11b4d25) Dzahn - cloud.yaml: Supply default for profile::tlsproxy::envoy::global_cert_name
Last login: Thu Mar 20 21:44:43 UTC 2025 on ttyS0
root@integration-agent-docker-1061:~#
[571761.397088] INFO: task kworker/u16:5:1084672 blocked for more than 120 seconds.
[571761.400288]       Not tainted 5.10.0-34-cloud-amd64 #1 Debian 5.10.234-1
[571761.403487] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[571761.407374] INFO: task kworker/u16:7:1091131 blocked for more than 120 seconds.
[571761.410920]       Not tainted 5.10.0-34-cloud-amd64 #1 Debian 5.10.234-1
[571761.412814] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
...
[572498.421035] systemd[1]: Failed to start Journal Service.
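
The hung-task warnings in the console output above can be summarised with a short script. This is only a sketch, assuming the console log has been copied to a string; the regular expression is based on the lines shown here, and `find_hung_tasks` is a hypothetical name:

```python
import re

# Kernel hung-task warning as it appears in the console output above.
# The task name may itself contain colons (e.g. "kworker/u16:5"), so the
# PID is taken to be the last colon-separated numeric field.
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<seconds>\d+) seconds"
)


def find_hung_tasks(console_text):
    """Return (task name, pid, seconds blocked) for each hung-task warning."""
    return [
        (m.group("name"), int(m.group("pid")), int(m.group("seconds")))
        for m in HUNG_TASK.finditer(console_text)
    ]


sample = """\
[571761.397088] INFO: task kworker/u16:5:1084672 blocked for more than 120 seconds.
[571761.400288]       Not tainted 5.10.0-34-cloud-amd64 #1 Debian 5.10.234-1
[571761.407374] INFO: task kworker/u16:7:1091131 blocked for more than 120 seconds.
"""

for name, pid, seconds in find_hung_tasks(sample):
    print(f"{name} (pid {pid}) blocked for more than {seconds}s")
```

For the two warnings quoted above, this lists `kworker/u16:5` and `kworker/u16:7` with their PIDs.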

The instance was created on March 20, 2025, 9:37 p.m. by @thcipriani.

I don't know what the jenkins-deploy-toolforge credential is for; it looks like a mistake: T393543

Event Timeline

Mentioned in SAL (#wikimedia-releng) [2025-05-07T07:03:12Z] <hashar> Hard rebooted integration-agent-docker-1061 via Horizon, the instance is not reachable by ssh and looks bricked # T393542

hashar claimed this task.

I have rebooted the instance. It might have been running just fine, since Puppet did run before the reboot. From the console I have:

The last Puppet run was at Wed May  7 06:57:20 UTC 2025 (5 minutes ago).

The Jenkins controller has SSHed into it, so it is back in the pool and I am not going to investigate further :)