
slow salt-call invocation on minions
Closed, Declined (Public)

Description

Noticed this when first provisioning a new machine: salt-call would try to talk to tin on port 6379 over IPv6, fail with a timeout, fall back to IPv4 and then succeed, making the invocation unnecessarily slow. I don't see the salt master redis bound on an IPv6 port; perhaps we could simply do that?
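For illustration, the fallback behaviour described above can be sketched roughly as follows. This is not the actual salt or redis client code; `connect_with_fallback` and `per_attempt_timeout` are hypothetical names. It only shows how bounding each connection attempt limits the time lost on an address family that never answers.

```python
import socket
import time


def connect_with_fallback(host, port, per_attempt_timeout=2.0):
    """Try every resolved address in order, bounding each attempt.

    Hypothetical helper for illustration only: without a per-attempt
    timeout, an unanswered IPv6 SYN stalls the caller until the kernel
    gives up, and only then is the IPv4 address tried.
    """
    last_error = None
    # AF_UNSPEC asks the resolver for both IPv6 and IPv4 addresses.
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, socket.AF_UNSPEC, socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(per_attempt_timeout)
        started = time.time()
        try:
            sock.connect(sockaddr)
            # Report which family succeeded and how long the attempt took.
            return sock, family, time.time() - started
        except OSError as exc:
            # Timeout or refusal: remember the error, try the next address.
            last_error = exc
            sock.close()
    raise last_error or OSError("no addresses resolved for %s" % host)
```

With a 2 second bound, a host that silently drops IPv6 traffic to port 6379 would cost at most about 2 seconds before the IPv4 attempt starts.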

Event Timeline

fgiunchedi assigned this task to ArielGlenn.
fgiunchedi raised the priority of this task from to Needs Triage.
fgiunchedi updated the task description.
fgiunchedi added projects: SRE, Salt.
fgiunchedi subscribed.

more context

root@restbase2001:~# ps fwaux | grep -i salt-call
root      2317  1.3  0.0 334692 50056 ?        Ssl  11:11   0:00      \_ /usr/bin/python /usr/bin/salt-call --log-level=quiet --out=json deploy.fetch cassandra/logstash-logback-encoder
root      2473  0.0  0.0  12720  2188 pts/1    S+   11:12   0:00      \_ grep -i salt-call
root@restbase2001:~# strace -f -p 2317
Process 2317 attached with 3 threads
[pid  2317] connect(12, {sa_family=AF_INET6, sin6_port=htons(6379), inet_pton(AF_INET6, "2620:0:861:101:10:64:0:196", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28 <unfinished ...>
[pid  2383] epoll_wait(9,  <unfinished ...>
[pid  2382] epoll_wait(7,

file descriptors

salt-call 2317 root  mem    REG    9,0   140928   915730 /lib/x86_64-linux-gnu/ld-2.19.so
salt-call 2317 root    0r   CHR    1,3      0t0     1028 /dev/null
salt-call 2317 root    1u   REG    9,0        0   392464 /tmp/puppet20151111-1073-1hqxoc1
salt-call 2317 root    2u   REG    9,0        0   392464 /tmp/puppet20151111-1073-1hqxoc1
salt-call 2317 root    3w   REG  253,0     3300 62128166 /var/log/salt/minion
salt-call 2317 root    4u  0000   0,10        0     7967 anon_inode
salt-call 2317 root    5r   CHR    1,9      0t0     1033 /dev/urandom
salt-call 2317 root    6u  0000   0,10        0     7967 anon_inode
salt-call 2317 root    7u  0000   0,10        0     7967 anon_inode
salt-call 2317 root    8u  0000   0,10        0     7967 anon_inode
salt-call 2317 root    9u  0000   0,10        0     7967 anon_inode
salt-call 2317 root   10u  0000   0,10        0     7967 anon_inode
salt-call 2317 root   11u  IPv4  67911      0t0      TCP restbase2001.codfw.wmnet:38302->palladium.eqiad.wmnet:4506 (ESTABLISHED)
salt-call 2317 root   12u  IPv6  69099      0t0      TCP [2620:0:860:102:3ea8:2aff:fe0a:eca0]:55278->tin.eqiad.wmnet:6379 (SYN_SENT)

Looking a bit more into this: redis on tin is 2:2.6.13-1+wmf1, whereas IPv6 support landed in 2.8 as per https://github.com/antirez/redis/pull/61

Change 254128 had a related patch set uploaded (by Filippo Giunchedi):
deployment: add redis socket_connect_timeout

https://gerrit.wikimedia.org/r/254128

Change 254128 merged by Filippo Giunchedi:
deployment: add redis socket_connect_timeout

https://gerrit.wikimedia.org/r/254128

Change 255090 had a related patch set uploaded (by Filippo Giunchedi):
deployment: set socket_connect_timeout to 2s

https://gerrit.wikimedia.org/r/255090

Change 255090 merged by Filippo Giunchedi:
deployment: set socket_connect_timeout to 2s

https://gerrit.wikimedia.org/r/255090

Change 255092 had a related patch set uploaded (by Filippo Giunchedi):
deployment: fix pyredis timeout argument and timeout to 5s

https://gerrit.wikimedia.org/r/255092

Change 255092 merged by Filippo Giunchedi:
deployment: fix pyredis timeout argument and timeout to 5s

https://gerrit.wikimedia.org/r/255092

fgiunchedi changed the task status from Open to Stalled. Nov 24 2015, 11:18 AM

"fixed" in the sense that the socket_connect_timeout option wasn't introduced until pyredis 2.10 (which means jessie), so we are passing socket_timeout instead, which sets a timeout on the socket as a whole (not just on connect). Eventually, when the salt masters are upgraded to jessie (if ever), we can move back to just the connect timeout. Marking this "stalled" so we don't forget one way or the other.

btw neodymium is jessie, as will be any other new syndics or masters. See T115287
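A minimal sketch of how the timeout argument could be chosen per host, assuming the installed client exposes its version as a dotted string such as `redis.__version__`. The helper name `redis_timeout_kwargs` is hypothetical and is not what the merged patches do; the 5s default mirrors the value from change 255092.

```python
def redis_timeout_kwargs(redis_version, timeout_seconds=5):
    """Return the timeout keyword argument suited to a client version.

    Hypothetical helper: per the discussion above, socket_connect_timeout
    is only accepted from 2.10 on, so older clients get socket_timeout,
    which applies to every socket operation and not just connect.
    Assumes a plain numeric "X.Y[.Z]" version string.
    """
    major_minor = tuple(int(part) for part in redis_version.split(".")[:2])
    if major_minor >= (2, 10):
        return {"socket_connect_timeout": timeout_seconds}
    return {"socket_timeout": timeout_seconds}


# Possible usage (not executed here, needs the redis package and a server):
#   import redis
#   client = redis.StrictRedis(host="tin.eqiad.wmnet", port=6379,
#                              **redis_timeout_kwargs(redis.__version__))
```

That would let jessie hosts use the connect-only timeout while older hosts keep the whole-socket timeout.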

Change 256403 had a related patch set uploaded (by Filippo Giunchedi):
deployment: fix socket_connect_timeout argument

https://gerrit.wikimedia.org/r/256403

Change 256403 merged by Filippo Giunchedi:
deployment: fix socket_connect_timeout argument

https://gerrit.wikimedia.org/r/256403