
labs security rules are flappy / invalid cause network communications issues
Closed, Resolved · Public

Description

Since Friday, Feb 6th, 23:30 UTC, puppet runs on instances of the integration labs project have been failing. The instances' puppet agents point to integration-puppetmaster.eqiad.wmflabs but time out when connecting to it. The puppet agent local to integration-puppetmaster works fine, though.

On one of the instances:

Warning: Unable to fetch my node definition, but the agent run will continue:
Warning: Connection timed out - connect(2)
Info: Retrieving plugin
...
Error: Could not retrieve catalog from remote server: Connection timed out - connect(2)
Notice: Using cached catalog
Info: Applying configuration version '1423261026'

Additionally, the Jenkins master on gallium (208.80.154.135) is no longer able to ssh to integration-slave1001.eqiad.wmflabs (10.68.16.60) after I rebooted the instance.

gallium:~$ telnet 10.68.16.60 22
Trying 10.68.16.60...
telnet: Unable to connect to remote host: Connection timed out

I can ssh to it from the labs bastion, though. The instance has some iptables rules, but they allow connections from gallium on port 22 (ssh).

Event Timeline

hashar created this task. Feb 9 2015, 9:14 AM
hashar raised the priority of this task to High.
hashar updated the task description.
hashar added a subscriber: hashar.
Restricted Application added a subscriber: Aklapper. Feb 9 2015, 9:14 AM
hashar added a comment. Feb 9 2015, 9:46 AM

I have stopped the puppet master on integration-puppetmaster and started listening for connections using netcat -l 8140. From integration-slave1002 I tried telnet 10.68.16.96 8140 and got a connection timeout.

It seems there is some firewall rule in between that prevents network communication between the instances and the puppetmaster. I can't find any specific iptables rule that could cause the problem.

Maybe ops have some idea? :-(
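
For anyone repeating the check described above, here is a minimal Python sketch of the same probe. It is not what was actually run (that was netcat and telnet); the function name probe_tcp and the 5 second default timeout are assumptions made for illustration. The useful distinction is that a timeout suggests packets are being dropped somewhere along the path, while a refusal means the host was reached but nothing is listening on the port.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Classify a TCP connection attempt.

    Returns "open", "refused", "timeout", or "error: ..." for anything else.
    probe_tcp is a hypothetical helper, not part of any labs tooling.
    """
    try:
        # create_connection performs the same connect() that telnet does.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer at all: typically a firewall or security rule dropping packets.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but no listener on the port.
        return "refused"
    except OSError as exc:
        return f"error: {exc}"


# Example call, mirroring the telnet test from integration-slave1002:
#   probe_tcp("10.68.16.96", 8140)
```

Run from integration-slave1002 against 10.68.16.96 port 8140, a "timeout" result would match the telnet behaviour reported here.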

hashar renamed this task from "integration-puppetmaster does not respond to other instances" to "labs security rules are flappy / invalid cause network communications issues". Feb 9 2015, 10:27 AM
hashar updated the task description.
hashar set Security to None.
coren closed this task as Resolved. Feb 9 2015, 3:53 PM
coren claimed this task.
coren added a subscriber: coren.

The project security group did not include (was changed not to include?) a rule allowing ssh from gallium. Fixing that did the trick.

hashar reopened this task as Open. Feb 9 2015, 3:59 PM

The integration labs project was missing a security rule to allow ssh from gallium for some reason. I have added it back (allow from gallium to port 22, ssh), and that fixed the connection issue to integration-slave1001.eqiad.wmflabs.

The puppetmaster issue must be related, though I have yet to figure out the reason.
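
To illustrate how a missing rule blocks a connection, here is a small Python sketch of an allow-list check. It assumes the project security group is default-deny for inbound traffic and that a connection passes only if some rule matches both its source address and its destination port. That model, and the Rule and is_allowed names, are assumptions for illustration and not the actual labs implementation.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical inbound allow rule: source CIDR plus a destination port range."""
    cidr: str
    port_min: int
    port_max: int


def is_allowed(rules, src_ip, dst_port):
    """Return True if any rule permits src_ip to reach dst_port (default deny)."""
    src = ipaddress.ip_address(src_ip)
    return any(
        src in ipaddress.ip_network(rule.cidr)
        and rule.port_min <= dst_port <= rule.port_max
        for rule in rules
    )


# The rule that was added back: allow gallium (208.80.154.135) to port 22.
rules = [Rule("208.80.154.135/32", 22, 22)]

print(is_allowed(rules, "208.80.154.135", 22))   # True
print(is_allowed(rules, "10.68.16.60", 8140))    # False: no rule covers port 8140
print(is_allowed([], "208.80.154.135", 22))      # False: rule missing, nothing matches
```

Under this model, the puppet traffic on port 8140 would only pass if some other rule (for example a default one covering traffic inside the project) matches it, which is the part still to be figured out.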

coren added a comment. Feb 9 2015, 4:06 PM

The puppetmaster issue did appear related: adding an explicit rule to allow it fixed the immediate problem, but this seems to point at something having changed about the applied defaults. This will need further investigation.

coren closed this task as Resolved. Feb 9 2015, 4:07 PM
hashar added a comment. Feb 9 2015, 4:07 PM

The deployment-prep labs project also uses a local puppetmaster but it does not need any specific security rule to allow puppet connections (port 8140).

@hashar
root@integration-slave1002:~# telnet 10.68.16.96 8140
Trying 10.68.16.96...
Connected to 10.68.16.96.
Escape character is '^]'.

I assume everything is OK? (Or at least that is what my testing says.)

coren added a comment. Feb 9 2015, 4:11 PM

Yeah, things are working fine now with an explicit rule - but the necessity of having the explicit rule seems to be new.

The doc publishing jobs are failing as well and there is no workaround for it :( T89026