slow salt-call invocation on minions
Closed, Declined (Public)

Authored By

	fgiunchedi
	Nov 11 2015, 11:12 AM

Description

noticed this when first provisioning a new machine: salt-call tries to talk to tin over IPv6 on port 6379, the connection attempt times out, it falls back to IPv4 and then succeeds, which makes the invocation unnecessarily slow. I don't see ~~the salt master~~ redis bound on an IPv6 port, perhaps we could simply do that?

Details

Subject	Repo	Branch	Lines +/-
deployment: fix socket_connect_timeout argument	operations/puppet	production	+1 -1
deployment: fix pyredis timeout argument and timeout to 5s	operations/puppet	production	+2 -2
deployment: set socket_connect_timeout to 2s	operations/puppet	production	+4 -3
deployment: add redis socket_connect_timeout	operations/puppet	production	+10 -2

Related Objects

Mentioned Here: T115287: Move salt master to separate host from puppet master

Event Timeline

fgiunchedi created this task. Nov 11 2015, 11:12 AM

fgiunchedi assigned this task to ArielGlenn.

fgiunchedi raised the priority of this task to Needs Triage.

fgiunchedi updated the task description.

fgiunchedi added projects: SRE, Salt.

fgiunchedi subscribed.

Restricted Application added subscribers: StudiesWorld, Aklapper. Nov 11 2015, 11:12 AM

more context

root@restbase2001:~# ps fwaux | grep -i salt-call
root      2317  1.3  0.0 334692 50056 ?        Ssl  11:11   0:00      \_ /usr/bin/python /usr/bin/salt-call --log-level=quiet --out=json deploy.fetch cassandra/logstash-logback-encoder
root      2473  0.0  0.0  12720  2188 pts/1    S+   11:12   0:00      \_ grep -i salt-call
root@restbase2001:~# strace -f -p 2317
Process 2317 attached with 3 threads
[pid  2317] connect(12, {sa_family=AF_INET6, sin6_port=htons(6379), inet_pton(AF_INET6, "2620:0:861:101:10:64:0:196", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28 <unfinished ...>
[pid  2383] epoll_wait(9,  <unfinished ...>
[pid  2382] epoll_wait(7,

file descriptors

salt-call 2317 root  mem    REG    9,0   140928   915730 /lib/x86_64-linux-gnu/ld-2.19.so
salt-call 2317 root    0r   CHR    1,3      0t0     1028 /dev/null
salt-call 2317 root    1u   REG    9,0        0   392464 /tmp/puppet20151111-1073-1hqxoc1
salt-call 2317 root    2u   REG    9,0        0   392464 /tmp/puppet20151111-1073-1hqxoc1
salt-call 2317 root    3w   REG  253,0     3300 62128166 /var/log/salt/minion
salt-call 2317 root    4u  0000   0,10        0     7967 anon_inode
salt-call 2317 root    5r   CHR    1,9      0t0     1033 /dev/urandom
salt-call 2317 root    6u  0000   0,10        0     7967 anon_inode
salt-call 2317 root    7u  0000   0,10        0     7967 anon_inode
salt-call 2317 root    8u  0000   0,10        0     7967 anon_inode
salt-call 2317 root    9u  0000   0,10        0     7967 anon_inode
salt-call 2317 root   10u  0000   0,10        0     7967 anon_inode
salt-call 2317 root   11u  IPv4  67911      0t0      TCP restbase2001.codfw.wmnet:38302->palladium.eqiad.wmnet:4506 (ESTABLISHED)
salt-call 2317 root   12u  IPv6  69099      0t0      TCP [2620:0:860:102:3ea8:2aff:fe0a:eca0]:55278->tin.eqiad.wmnet:6379 (SYN_SENT)
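
for illustration, the fallback seen above (IPv6 attempt stuck in SYN_SENT, then IPv4) can be sketched roughly as below. This is a minimal sketch of the general "try each resolved address in order" pattern, not the actual salt or trebuchet code; `connect_first_reachable` and its `make_socket` parameter are hypothetical names introduced here. It shows why a per-attempt timeout bounds how long an unreachable IPv6 address can delay the call.

```python
import socket


def connect_first_reachable(addrinfos, timeout, make_socket=socket.socket):
    """Try each resolved address in order and return (sock, sockaddr)
    for the first one that connects.

    addrinfos is a list of tuples as returned by socket.getaddrinfo().
    timeout (seconds) applies to each attempt separately, so an
    unreachable first address costs at most `timeout` seconds.
    make_socket is a hypothetical hook so the logic can be exercised
    without touching the network.
    """
    last_error = None
    for family, socktype, proto, _canonname, sockaddr in addrinfos:
        sock = make_socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            return sock, sockaddr
        except OSError as exc:  # socket.timeout is a subclass of OSError
            last_error = exc
            sock.close()
    raise last_error or OSError("no addresses to try")
```

with real sockets the address list would come from something like `socket.getaddrinfo("tin.eqiad.wmnet", 6379, type=socket.SOCK_STREAM)`; without a timeout on each attempt, the IPv6 connect waits for the kernel's own SYN retry limit before the IPv4 address is tried.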

fgiunchedi updated the task description. Nov 11 2015, 11:15 AM

fgiunchedi set Security to None.

ArielGlenn triaged this task as Medium priority. Nov 11 2015, 11:43 AM

looking a bit more into this: redis on tin is 2:2.6.13-1+wmf1, but IPv6 support only landed in 2.8 as per https://github.com/antirez/redis/pull/61

proposed ad-hoc fix in trebuchet instead, https://github.com/trebuchet-deploy/trebuchet/pull/17

Change 254128 had a related patch set uploaded (by Filippo Giunchedi):
deployment: add redis socket_connect_timeout

https://gerrit.wikimedia.org/r/254128

gerritbot added a project: Patch-For-Review. Nov 19 2015, 11:05 AM

Change 254128 merged by Filippo Giunchedi:
deployment: add redis socket_connect_timeout

https://gerrit.wikimedia.org/r/254128

Change 255090 had a related patch set uploaded (by Filippo Giunchedi):
deployment: set socket_connect_timeout to 2s

https://gerrit.wikimedia.org/r/255090

Change 255090 merged by Filippo Giunchedi:
deployment: set socket_connect_timeout to 2s

https://gerrit.wikimedia.org/r/255090

Change 255092 had a related patch set uploaded (by Filippo Giunchedi):
deployment: fix pyredis timeout argument and timeout to 5s

https://gerrit.wikimedia.org/r/255092

Change 255092 merged by Filippo Giunchedi:
deployment: fix pyredis timeout argument and timeout to 5s

https://gerrit.wikimedia.org/r/255092

"fixed" as in: the socket_connect_timeout option wasn't introduced until pyredis 2.10 (that means jessie), so we are passing socket_timeout instead, which sets a timeout on the socket as a whole (not just on connect). Eventually, when the salt masters are upgraded to jessie (if ever), we can move back to just the connect timeout; thus "stalled", so we don't forget one way or another
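
a rough sketch of the choice described above, assuming the client version is available as a dotted string: pick socket_connect_timeout when the python redis client is 2.10 or newer, otherwise fall back to socket_timeout. The helper name `redis_timeout_kwargs` is hypothetical and this is not the code from the puppet patches.

```python
def redis_timeout_kwargs(client_version, timeout_s):
    """Return the keyword argument to pass to the redis client.

    client_version is a dotted version string such as "2.7.2".
    From 2.10 on the client accepts socket_connect_timeout, which
    limits only the connect step; older versions only accept
    socket_timeout, which applies to every socket operation.
    """
    major_minor = tuple(int(part) for part in client_version.split(".")[:2])
    if major_minor >= (2, 10):
        return {"socket_connect_timeout": timeout_s}
    return {"socket_timeout": timeout_s}


# Example use (needs the redis package installed, so left as a comment):
#   import redis
#   kwargs = redis_timeout_kwargs(redis.__version__, 2)
#   client = redis.StrictRedis(host="tin.eqiad.wmnet", port=6379, **kwargs)
```

the trade-off is that socket_timeout also cuts off slow reads and writes, not just a hanging connect, which is why the connect-only option is preferable once it is available.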

btw neodymium is jessie, as will be any other new syndics or masters. See T115287

Change 256403 had a related patch set uploaded (by Filippo Giunchedi):
deployment: fix socket_connect_timeout argument

https://gerrit.wikimedia.org/r/256403

Change 256403 merged by Filippo Giunchedi:
deployment: fix socket_connect_timeout argument