Page MenuHomePhabricator

Agent deployment-deploy03 is offline
Closed, ResolvedPublic

Description

Agent deployment-deploy03 is offline, with states changing between This node is being launched. and This agent is offline because Jenkins failed to launch the agent process on it.. Unable to view the log.

[02/08/23 08:51:38] [SSH] Opening SSH connection to 172.16.4.233:22.
connect timed out

Occurred after/around this beta-code-update-eqiad run:

14:13:00 Started by timer
14:13:00 Running as SYSTEM
14:13:00 Building remotely on deployment-deploy03 (BetaClusterBastion) in workspace /srv/jenkins/home/jenkins-deploy/workspace/beta-code-update-eqiad
14:16:35 FATAL: java.io.IOException: Unexpected termination of the channel
14:16:35 java.io.EOFException
14:16:35 	at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2894)
14:16:35 	at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3389)
14:16:35 	at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:931)
14:16:35 	at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:374)
14:16:35 	at hudson.remoting.ObjectInputStreamEx.<init>(ObjectInputStreamEx.java:49)
14:16:35 	at hudson.remoting.Command.readFrom(Command.java:142)
14:16:35 	at hudson.remoting.Command.readFrom(Command.java:128)
14:16:35 	at hudson.remoting.AbstractSynchronousByteArrayCommandTransport.read(AbstractSynchronousByteArrayCommandTransport.java:35)
14:16:35 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:61)
14:16:35 Caused: java.io.IOException: Unexpected termination of the channel
14:16:35 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:75)
14:16:35 Also:   hudson.remoting.Channel$CallSiteStackTrace: Remote call to deployment-deploy03
14:16:35 		at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1784)
14:16:35 		at hudson.remoting.Request.call(Request.java:199)
14:16:35 		at hudson.remoting.Channel.call(Channel.java:999)
14:16:35 		at hudson.FilePath.act(FilePath.java:1186)
14:16:35 		at hudson.FilePath.act(FilePath.java:1175)
14:16:35 		at hudson.FilePath.mkdirs(FilePath.java:1366)
14:16:35 		at hudson.model.AbstractProject.checkout(AbstractProject.java:1235)
14:16:35 		at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:649)
14:16:35 		at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:85)
14:16:35 		at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:521)
14:16:35 		at hudson.model.Run.execute(Run.java:1900)
14:16:35 		at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
14:16:35 		at hudson.model.ResourceController.execute(ResourceController.java:107)
14:16:35 		at hudson.model.Executor.run(Executor.java:449)
14:16:35 Caused: hudson.remoting.RequestAbortedException
14:16:35 	at hudson.remoting.Request.abort(Request.java:346)
14:16:35 	at hudson.remoting.Channel.terminate(Channel.java:1080)
14:16:35 	at hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:90)
14:16:35 IRC notifier plugin: Sending notification to: #wikimedia-releng
14:39:22 IRC notifier plugin: There was an error sending notification to: #wikimedia-releng
14:39:22 java.lang.RuntimeException: Waiting for mode response interrupted
14:39:22 	at org.pircbotx.Channel.getMode(Channel.java:194)
14:39:22 	at hudson.plugins.ircbot.v2.IRCConnection.send(IRCConnection.java:414)
14:39:22 	at hudson.plugins.ircbot.v2.IRCConnection.send(IRCConnection.java:406)
14:39:22 	at hudson.plugins.im.IMPublisher.sendNotification(IMPublisher.java:488)
14:39:22 	at hudson.plugins.im.IMPublisher.notifyChatsOnBuildEnd(IMPublisher.java:718)
14:39:22 	at hudson.plugins.im.IMPublisher.notifyOnBuildEnd(IMPublisher.java:412)
14:39:22 	at hudson.plugins.im.IMPublisher.internalPerform(IMPublisher.java:364)
14:39:22 	at hudson.plugins.im.IMPublisher.perform(IMPublisher.java:342)
14:39:22 	at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
14:39:22 	at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:818)
14:39:22 	at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:767)
14:39:22 	at hudson.model.Build$BuildExecution.post2(Build.java:179)
14:39:22 	at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:711)
14:39:22 	at hudson.model.Run.execute(Run.java:1925)
14:39:22 	at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
14:39:22 	at hudson.model.ResourceController.execute(ResourceController.java:107)
14:39:22 	at hudson.model.Executor.run(Executor.java:449)
14:39:22 Caused by: java.lang.InterruptedException
14:39:22 	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1040)
14:39:22 	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
14:39:22 	at java.base/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
14:39:22 	at org.pircbotx.Channel.getMode(Channel.java:190)
14:39:22 	... 16 more

Event Timeline

TheresNoTime renamed this task from beta-code-update-eqiad: FATAL: java.io.IOException: Unexpected termination of the channel to Agent deployment-deploy03 is offline.Feb 7 2023, 3:27 PM
TheresNoTime added a project: Jenkins.
TheresNoTime updated the task description. (Show Details)
TheresNoTime triaged this task as Unbreak Now! priority.Feb 7 2023, 4:50 PM
TheresNoTime added a project: ci-test-error.

UBN!, affects beta deployment of merged patches/engineer testing :-(

Following https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Agent_remote_call_failed (not entirely related, but details an agent restart) shows the agent is not running:

[samtar@deployment-deploy03 ~]$ ps aux | grep -i jav[a]
[samtar@deployment-deploy03 ~]$

We can ssh from the bastion to the deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud instance when passing through the WMCS bastion. However from the CI server that times out at TCP level:

deployment-contint2001:~$ ssh -o ConnectTimeout=3 -v 172.16.4.233
OpenSSH_7.9p1 Debian-10+deb10u2, OpenSSL 1.1.1n  15 Mar 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 2: Applying options for *
debug1: Connecting to 172.16.4.233 [172.16.4.233] port 22.
debug1: connect to address 172.16.4.233 port 22: Connection timed out
ssh: connect to host 172.16.4.233 port 22: Connection timed out

I am guessing it is either a missing firewall rule on the instance / WMCS project or the routing/firewalling between production and WMCS is rejecting the traffic (in which case I would assume the rest of the agents to have the same issue since most are on WMCS).

That is the ferm/iptables configuration on the instance. It is rejecting traffic originating from contint2001.wikimedia.org. That can be seen in /var/log/syslog on the instance:

Feb  8 09:01:31 deployment-deploy03 ulogd[609]: [fw-in-drop] IN=eth0 OUT= MAC=fa:16:3e:75:7c:52:fa:16:3e:ae:f5:88:08:00 SRC=208.80.153.15 DST=172.16.4.233 LEN=60 TOS=00 PREC=0x00 TTL=59 ID=12226 DF PROTO=TCP SPT=46902 DPT=22 SEQ=2617083181 ACK=0 WINDOW=42340 SYN URGP=0 MARK=0

The last job build was on Feb 7th at 14:13 UTC?) and the first match for contint2001 ip is at 14:20:39 UTC:

grep 208.80.153.15 /var/log/syslog.1|head -n1
Feb  7 14:20:39 deployment-deploy03 ulogd[3696]: [fw-in-drop] IN=eth0 OUT= MAC=fa:16:3e:75:7c:52:fa:16:3e:ae:f5:88:08:00 SRC=208.80.153.15 DST=172.16.4.233 LEN=60 TOS=00 PREC=0x00 TTL=59 ID=27735 DF PROTO=TCP SPT=48540 DPT=22 SEQ=1580465003 ACK=0 WINDOW=42340 SYN URGP=0 MARK=0

/etc/ferm/conf.d last touched at Feb 3rd 19:54. The last file touched is 00_defs on January 31st 16:23. But maybe a file got deleted. Anyway looking at the puppet log:

puppet.log.5.gz:Feb  3 19:54:16 deployment-deploy03 puppet-agent[3210]: Computing checksum on file /etc/ferm/conf.d/10_ssh_pybal
puppet.log.5.gz:Feb  3 19:54:16 deployment-deploy03 puppet-agent[3210]: (/Stage[main]/Ferm/File[/etc/ferm/conf.d/10_ssh_pybal]) Filebucketed /etc/ferm/conf.d/10_ssh_pybal to puppet with sum 550a8b7d9d5d9d7b5816c1c84288a81d
puppet.log.5.gz:Feb  3 19:54:16 deployment-deploy03 puppet-agent[3210]: (/Stage[main]/Ferm/File[/etc/ferm/conf.d/10_ssh_pybal]/ensure) removed
puppet.log.5.gz:Feb  3 19:54:16 deployment-deploy03 puppet-agent[3210]: (/etc/ferm/conf.d) Scheduling refresh of Service[ferm]
puppet.log.5.gz:Feb  3 19:54:26 deployment-deploy03 puppet-agent[3210]: (/Stage[main]/Ferm/Service[ferm]/ensure) ensure changed 'stopped' to 'running' (corrective)
puppet.log.5.gz:Feb  3 19:54:26 deployment-deploy03 puppet-agent[3210]: (/Stage[main]/Ferm/Service[ferm]) Unscheduling refresh on Service[ferm]

So tentatively ferm has been broken for a while and resumed on Feb 3rd? What I imagine is the rule that allows ssh connections from contint machines has been removed a while ago. The connection from contint2001 to the instance was already established and thus stayed around. Yesterday the network operations would have stopped the interface/terminated the connection which now can no more connect.

Question: how was the ssh ferm rule defined and why isn't it applied anymore?

Mentioned in SAL (#wikimedia-releng) [2023-02-08T09:15:59Z] <hashar> deployment-prep: deployment-deploy03 systemctl stop ferm to allow jenkins to connect to the instance # T329056

We used to have a ferm rule to allow ssh connections. That got removed in June 2020 by https://gerrit.wikimedia.org/r/c/operations/puppet/+/606737 cause the CI instances do not have any ferm rules applied from any role/profile.

On deployment-deploy03, different set of profiles are applied and some of them do apply ferm rules. Hence ferm is activated and the above rule not being present anymore, the ssh connection is rejected. It must have been a problem since we have merged that change or because a profile applied to the deployment-deploy03 instance suddenly started adding ferm rule (and thus enabling the service).

We need to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/606737

Change 887363 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] Revert "contint: remove obsolete firewall rules from labs"

https://gerrit.wikimedia.org/r/887363

TheresNoTime lowered the priority of this task from Unbreak Now! to Needs Triage.Feb 8 2023, 9:37 AM

Dropping from UBN!, immediate fault resolved by @hashar and deployment jobs have run successfully.
Root cause resolution ongoing.

I have cherry picked https://gerrit.wikimedia.org/r/c/operations/puppet/+/887363 on deployment-puppetmaster04.deployment-prep.eqiad.wmflabs. Ran Puppet on deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud:

Notice: /Stage[main]/Profile::Ci::Slave::Labs::Common/Ferm::Service[contint1002_ssh_to_slaves]/File[/etc/ferm/conf.d/10_contint1002_ssh_to_slaves]/ensure: defined content as '{md5}ba030efcb65d8d744d545f8f68f39643'
Info: /Stage[main]/Profile::Ci::Slave::Labs::Common/Ferm::Service[contint1002_ssh_to_slaves]/File[/etc/ferm/conf.d/10_contint1002_ssh_to_slaves]: Scheduling refresh of Service[ferm]
Notice: /Stage[main]/Profile::Ci::Slave::Labs::Common/Ferm::Service[contint2001_ssh_to_slaves]/File[/etc/ferm/conf.d/10_contint2001_ssh_to_slaves]/ensure: defined content as '{md5}0f13c82a1e046b0a4f9f0b0cf419510d'
Info: /Stage[main]/Profile::Ci::Slave::Labs::Common/Ferm::Service[contint2001_ssh_to_slaves]/File[/etc/ferm/conf.d/10_contint2001_ssh_to_slaves]: Scheduling refresh of Service[ferm]
Notice: /Stage[main]/Ferm/Service[ferm]/ensure: ensure changed 'stopped' to 'running' (corrective)
Info: /Stage[main]/Ferm/Service[ferm]: Unscheduling refresh on Service[ferm]

iptables has the rules:

ACCEPT     tcp  --  208.80.154.132       0.0.0.0/0            tcp dpt:22
ACCEPT     tcp  --  208.80.153.15        0.0.0.0/0            tcp dpt:22

I have disconnected and reconnected the agent from https://integration.wikimedia.org/ci/manage/computer/deployment-deploy03/ and it managed to connect.

I have done the same on integration-puppetmaster-02.integration.eqiad.wmflabs, ran Puppet on integration-agent-docker-1023 but no ferm rules get applied. The instance has the role role::ci::slave::labs::docker which does include profile::ci::slave::labs::common. It does have ferm rules resources:

grep Ferm /var/lib/puppet/state/state.yaml 
Ferm::Conf[docker-ferm]:
Ferm::Conf[cumin-project-defs]:
Ferm::Rule[metricsinfra-prometheus-all]:
Ferm::Service[contint1002_ssh_to_slaves]:
Ferm::Service[contint2001_ssh_to_slaves]:

But somehow there is no ferm service:

systemctl status ferm
Unit ferm.service could not be found.

Else I guess we would had the issue on the integration project as well.

Essentially the issue is solved. It is pending the rule to be merged in puppet which is https://gerrit.wikimedia.org/r/c/operations/puppet/+/887363 and I need a follow up patch which refactor the commons bits shared with the two production hosts.

Change 887738 had a related patch set uploaded (by Hashar; author: Hashar):

[operations/puppet@production] contint: factor common firewalling rules

https://gerrit.wikimedia.org/r/887738

Change 887363 merged by Dzahn:

[operations/puppet@production] Revert "contint: remove obsolete firewall rules from labs"

https://gerrit.wikimedia.org/r/887363

Change 887738 merged by Dzahn:

[operations/puppet@production] contint: regroup common firewalling rules

https://gerrit.wikimedia.org/r/887738