Toolforge maintain-kubeusers doesn't fail well when LDAP servers are unreachable
Closed, Resolved · Public

Description

Initial description for reference:

It was reported by Magnus___ on #wikimedia-cloud that the tool home dir /data/project/genedb was not automatically created for the tool genedb. The directory was then created manually.

The directory is supposed to be created by maintain-kubeusers on tools-k8s-master-01, which, besides creating the home directory, also allocates Kubernetes credentials for the user and writes them into a directory named .kube inside it. The .kube directory is missing:

root@tools-bastion-05:~# ls -al /data/project/genedb
total 160
drwxr-xr-x    4 tools.genedb tools.genedb  4096 May 16 15:22 .
drwxr-xr-x 2046 root         root         69632 May 16 14:55 ..
-rw-r--r--    1 tools.genedb tools.genedb 53639 May 16 20:52 access.log
-rw-------    1 tools.genedb tools.genedb   906 May 16 15:22 .bash_history
-rw-r--r--    1 tools.genedb tools.genedb    49 May 16 14:58 error.log
drwxrwxr-x    2 tools.genedb tools.genedb  4096 May 16 15:05 public_html
-r--------    1 tools.genedb tools.genedb    52 May 16 14:55 replica.my.cnf
drwxr-xr-x    2 tools.genedb tools.genedb  4096 May 16 15:00 scripts
-rw-r--r--    1 tools.genedb tools.genedb   137 May 16 14:58 service.manifest
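
For context, the provisioning step described above is roughly this shape (a minimal sketch; the helper, paths, and kubectl call are assumptions for illustration, not the actual script):

import os
import subprocess

def provision_tool(tool, uid, gid, kubeconfig_text):
    # Create /data/project/<tool> and <home>/.kube, owned by the tool account
    home = os.path.join('/data/project', tool)
    kube = os.path.join(home, '.kube')
    os.makedirs(kube, exist_ok=True)
    for path in (home, kube):
        os.chown(path, uid, gid)
    # Write the Kubernetes credentials ("Wrote config in .../.kube/config")
    cfg = os.path.join(kube, 'config')
    with open(cfg, 'w') as f:
        f.write(kubeconfig_text)
    os.chown(cfg, uid, gid)
    os.chmod(cfg, 0o600)
    # Create the per-tool namespace ('namespace "<tool>" created')
    subprocess.check_output(['kubectl', 'create', 'namespace', tool])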

systemctl says maintain-kubeusers is running, but it generates no logs at all:

root@tools-k8s-master-01:~# systemctl status maintain-kubeusers
● maintain-kubeusers.service - "Create & Maintain kubernetes tool & infrastructure users"
   Loaded: loaded (/lib/systemd/system/maintain-kubeusers.service; enabled)
   Active: active (running) since Sat 2018-05-12 02:46:34 UTC; 4 days ago
 Main PID: 14152 (maintain-kubeus)
   CGroup: /system.slice/maintain-kubeusers.service
           └─14152 /usr/bin/python3 /usr/local/bin/maintain-kubeusers --infrastructure-users /etc/kubernetes/infrastructure-users --project ...

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
root@tools-k8s-master-01:~# journalctl -xu maintain-kubeusers
-- Logs begin at Wed 2018-05-16 13:50:53 UTC, end at Wed 2018-05-16 21:00:06 UTC. --

strace shows that it's probably stuck in an infinite loop of 10-second sleeps:

root@tools-k8s-master-01:~# strace -p 14152
Process 14152 attached
select(0, NULL, NULL, NULL, {8, 415880}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {10, 0})    = 0 (Timeout)
select(0, NULL, NULL, NULL, {10, 0})    = 0 (Timeout)
select(0, NULL, NULL, NULL, {10, 0}^CProcess 14152 detached
 <detached ...>

Current notes: We've since found that the maintain-kubeusers script doesn't fail hard enough when LDAP is unavailable: after a few failed tries it keeps running without any access to LDAP, which naturally doesn't work. It needs to maintain connections more intelligently or fail harder.
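
For reference, the endless retrying comes from ldap3's ServerPool defaults: active=True means "cycle over the pool forever". A pool configured to give up and raise instead, as a sketch with placeholder hostnames:

from ldap3 import Connection, Server, ServerPool, ROUND_ROBIN

# Placeholder hostnames; the real pool members come from the script's LDAP config.
servers = [Server(h, connect_timeout=1)
           for h in ('ldap-1.example.org', 'ldap-2.example.org')]
# active=3: at most 3 cycles over the pool, then raise
# LDAPServerPoolExhaustedError instead of sleeping and retrying forever.
# exhaust=60: keep a server marked offline out of rotation for 60 seconds.
pool = ServerPool(servers, ROUND_ROBIN, active=3, exhaust=60)
conn = Connection(pool, user='cn=proxyagent,ou=profile,dc=wikimedia,dc=org',
                  password='<redacted>', read_only=True, auto_bind=True)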

Event Timeline

I installed python3-dbg temporarily and was able to get a stack trace:

Traceback (most recent call first):
  File "/usr/lib/python3/dist-packages/ldap3/core/pooling.py", line 172, in find_active_server
    sleep(get_config_parameter('POOLING_LOOP_TIMEOUT'))
  File "/usr/lib/python3/dist-packages/ldap3/core/pooling.py", line 82, in get_server
    self.last_used_server = self.find_active_server(self.last_used_server + 1)
  File "/usr/lib/python3/dist-packages/ldap3/core/pooling.py", line 291, in get_server
    return self.pool_states[connection].get_server()
  File "/usr/lib/python3/dist-packages/ldap3/strategy/base.py", line 108, in open
    new_server = self.connection.server_pool.get_server(self.connection)  # get a server from the server_pool if available
  File "/usr/lib/python3/dist-packages/ldap3/strategy/sync.py", line 55, in open
    BaseStrategy.open(self, reset_usage, read_server_info)
  File "/usr/lib/python3/dist-packages/ldap3/core/connection.py", line 282, in __init__
    self.open(read_server_info=False)
  File "/usr/local/bin/maintain-kubeusers", line 371, in <module>
    password=ldapconfig['password']

(stashbot died... rip. will debug that later)

This find_active_server function:

def find_active_server(self, starting):
    counter = self.server_pool.active  # can be True for "forever" or the number of cycles to try
    if starting >= len(self.servers):
        starting = 0

    while counter:
        if log_enabled(NETWORK):
            log(NETWORK, 'entering loop number <%s> for finding active server in pool <%s>', counter, self)
        index = -1
        pool_size = len(self.servers)
        while index < pool_size - 1:
            index += 1
            offset = index + starting if index + starting < pool_size else index + starting - pool_size
            if not self.servers[offset][2]:  # server is offline
                if (isinstance(self.server_pool.exhaust, bool) and self.server_pool.exhaust) or (datetime.now() - self.servers[offset][1]).seconds < self.server_pool.exhaust:  # keeps server offline
                    if log_enabled(NETWORK):
                        if isinstance(self.server_pool.exhaust, bool):
                            log(NETWORK, 'server <%s> excluded from checking because is  offline', self.servers[offset][0])
                        else:
                            log(NETWORK, 'server <%s> excluded from checking because is offline for %d seconds', self.servers[offset][0], (self.server_pool.exhaust - (datetime.now() - self.servers[offset][1]).seconds))
                    continue
                if log_enabled(NETWORK):
                    log(NETWORK, 'server <%s> reinserted in pool', self.servers[offset][0])
            self.servers[offset][1] = datetime.now()
            if log_enabled(NETWORK):
                log(NETWORK, 'checking server <%s> for availability', self.servers[offset][0])
            if self.servers[offset][0].check_availability():
                self.servers[offset][2] = True
                return offset
            else:
                self.servers[offset][2] = False  # sets server offline

        if not isinstance(self.server_pool.active, bool):
            counter -= 1
        if log_enabled(NETWORK):
            log(NETWORK, 'waiting for %d seconds before retrying pool servers cycle', get_config_parameter('POOLING_LOOP_TIMEOUT'))
        sleep(get_config_parameter('POOLING_LOOP_TIMEOUT'))

    if log_enabled(ERROR):
        log(ERROR, 'no active server available in Server Pool <%s> after maximum number of tries', self)
    raise LDAPServerPoolExhaustedError('no active server available in server pool after maximum number of tries')

It contains quite a lot of logging. Let me see if I can get it to show anything by running the script manually.
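
(Note: those log(NETWORK, ...) calls only emit anything if ldap3's library-level logging is switched on, e.g.:)

import logging
from ldap3.utils.log import set_library_log_detail_level, NETWORK

# ldap3 routes messages through the stdlib logging module, gated by its
# own detail level; NETWORK includes the pool-cycling messages above.
logging.basicConfig(level=logging.DEBUG)
set_library_log_detail_level(NETWORK)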

@Bstorm or @aborrero, can you look at this? New tools will be in a bad way until this is fixed. LDAP issues...?

Which went too smoothly:

root@tools-k8s-master-01:~# /usr/bin/python3 /usr/local/bin/maintain-kubeusers --infrastructure-users /etc/kubernetes/infrastructure-users --project tools https://k8s-master.tools.wmflabs.org:6443 /etc/kubernetes/tokenauth /etc/kubernetes/abac
starting a run
Homedir already exists for /data/project/genedb
Wrote config in /data/project/genedb/.kube/config
(b'namespace "genedb" created\n', b'')
Provisioned creds for tool genedb
Wrote config in /data/project/wikiintent/.kube/config
(b'namespace "wikiintent" created\n', b'')
Provisioned creds for tool wikiintent
finished run, wrote 2 new accounts
^CTraceback (most recent call last):
  File "/usr/local/bin/maintain-kubeusers", line 405, in <module>
    time.sleep(args.interval)
KeyboardInterrupt

py-locals doesn't reveal anything interesting:

(gdb) py-locals
self = <ServerPoolState(servers=[[<Server(connect_timeout=1, _address_info=[[<AddressFamily(_name_='AF_INET', _value_=2, __objclass__=<EnumMeta(__doc__=None, _member_map_=<OrderedDict(_OrderedDict__root=<weakproxy at remote 0x7fc6c720ec28>, _OrderedDict__hardroot=<_Link at remote 0x7fc6c720bc60>, _OrderedDict__map={'AF_IRDA': <_Link at remote 0x7fc6c7215288>, 'AF_PPPOX': <_Link at remote 0x7fc6c72152d0>, 'AF_PACKET': <_Link at remote 0x7fc6c7215318>, 'AF_CAN': <_Link at remote 0x7fc6c7215828>, 'AF_INET6': <_Link at remote 0x7fc6c7215750>, 'AF_NETROM': <_Link at remote 0x7fc6c72153f0>, 'AF_BLUETOOTH': <_Link at remote 0x7fc6c7215438>, 'AF_APPLETALK': <_Link at remote 0x7fc6c7215480>, 'AF_ROSE': <_Link at remote 0x7fc6c72154c8>, 'AF_ASH': <_Link at remote 0x7fc6c7215510>, 'AF_SNA': <_Link at remote 0x7fc6c7215558>, 'AF_UNSPEC': <_Link at remote 0x7fc6c72155e8>, 'AF_ATMPVC': <_Link at remote 0x7fc6c7215630>, 'AF_ROUTE': <_Link at remote 0x7fc6c7215678>, 'AF_IPX': <_Link at remote 0x7fc6c72156c0>, 'AF_NETBEUI': <_Link a...(truncated)
starting = 1
counter = True
index = 1
pool_size = 2
offset = 0

Shall I just restart the service and hope it won't happen again?

(P.S. I hate how you can py-print self directly but not py-print self.server_pool, without messing with gdb internals.)

I would say restarting it could fix things simply because it appears to have been caused by all the network issues recently. I don't think we can clean up any cached wrong entries in the ldap module without restarting it. I'll take a look if there are any options that can be passed to that library to make it more resilient.

I went ahead and restarted it. Let's take a look at how that went.

I had to kill your gdb sessions to get it to restart @zhuyifei1999 :)

Ok, I can confirm that restarting the service did fix things. The sleeps make sense because this is a polling service. On the other hand, I did find that it logs to /var/log/syslog, so I was able to confirm it creating a new tool (some junk I created). It did take it a little while, but it worked. I also verified that I can run ldapsearch from the k8s master, just to be sure.

I had to kill your gdb sessions to get it to restart @zhuyifei1999 :)

Oops, sorry about that.

(Also uninstalled python3-dbg which I installed a while ago to debug this)

Vvjjkkii renamed this task from Toolforge maintain-kubeusers stauck in infinite sleeps of 10 seconds to kucaaaaaaa. Jul 1 2018, 1:10 AM
Vvjjkkii reopened this task as Open.
Vvjjkkii removed zhuyifei1999 as the assignee of this task.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from kucaaaaaaa to Toolforge maintain-kubeusers stauck in infinite sleeps of 10 seconds. Jul 2 2018, 6:43 AM
CommunityTechBot closed this task as Resolved.
CommunityTechBot assigned this task to zhuyifei1999.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-cloud) [2018-10-07T21:57:45Z] <zhuyifei1999_> restarted maintain-kubeusers on tools-k8s-master-01 T194859

zhuyifei1999 renamed this task from Toolforge maintain-kubeusers stauck in infinite sleeps of 10 seconds to Toolforge maintain-kubeusers stuck in infinite sleeps of 10 seconds. Oct 7 2018, 9:59 PM
zhuyifei1999 removed zhuyifei1999 as the assignee of this task.

It is supposed to sleep between polls. I don't think that's the error it's running into here. Previously, it started screwing up because it hit the outer edge of LDAP pagination, requiring restarts to attempt reordering (which I fixed for at least one query). This time I suspect something else... or the same thing in another place.
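
(For reference, the pagination in question is what ldap3's paged-search helper is meant to manage; a sketch with placeholder server, base, and filter:)

from ldap3 import Connection, Server

conn = Connection(Server('ldap-1.example.org'), auto_bind=True)  # placeholder server
# generator=True yields entries lazily, page by page, instead of
# buffering the whole result set (and its paging cookie) in memory.
for entry in conn.extend.standard.paged_search(
        search_base='ou=people,dc=wikimedia,dc=org',    # placeholder base
        search_filter='(objectClass=posixAccount)',
        attributes=['uid', 'uidNumber'],
        paged_size=500,
        generator=True):
    if entry.get('type') == 'searchResEntry':
        print(entry['dn'])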

Ah yes. It wasn't running at all :)

Oct  4 17:02:43 tools-k8s-master-01 maintain-kubeusers[9915]: starting a run
Oct  4 17:02:43 tools-k8s-master-01 maintain-kubeusers[9915]: finished run, wrote 0 new accounts
Oct  4 17:03:43 tools-k8s-master-01 maintain-kubeusers[9915]: starting a run
Oct  4 17:03:44 tools-k8s-master-01 maintain-kubeusers[9915]: Traceback (most recent call last):
Oct  4 17:03:44 tools-k8s-master-01 maintain-kubeusers[9915]: File "/usr/local/bin/maintain-kubeusers", line 433, in <module>
Oct  4 17:03:44 tools-k8s-master-01 maintain-kubeusers[9915]: main()
Oct  4 17:03:44 tools-k8s-master-01 maintain-kubeusers[9915]: File "/usr/local/bin/maintain-kubeusers", line 395, in main
Oct  4 17:03:44 tools-k8s-master-01 maintain-kubeusers[9915]: password=ldapconfig['password']
Oct  4 17:03:44 tools-k8s-master-01 maintain-kubeusers[9915]: File "/usr/lib/python3/dist-packages/ldap3/core/connection.py", line 282, in __init__
Oct  4 17:03:44 tools-k8s-master-01 maintain-kubeusers[9915]: self.open(read_server_info=False)
Oct  4 17:03:44 tools-k8s-master-01 maintain-kubeusers[9915]: File "/usr/lib/python3/dist-packages/ldap3/strategy/sync.py", line 55, in open
Oct  4 17:03:44 tools-k8s-master-01 maintain-kubeusers[9915]: BaseStrategy.open(self, reset_usage, read_server_info)
Oct  4 17:03:44 tools-k8s-master-01 maintain-kubeusers[9915]: File "/usr/lib/python3/dist-packages/ldap3/strategy/base.py", line 132, in open
Oct  4 17:03:44 tools-k8s-master-01 maintain-kubeusers[9915]: raise exception_history[0][1](exception_history[0][2])
Oct  4 17:03:44 tools-k8s-master-01 maintain-kubeusers[9915]: ldap3.core.exceptions.LDAPSocketOpenError: socket connection error: timed out
Oct  4 17:03:44 tools-k8s-master-01 systemd[1]: maintain-kubeusers.service: main process exited, code=exited, status=1/FAILURE
Oct  4 17:03:44 tools-k8s-master-01 systemd[1]: Unit maintain-kubeusers.service entered failed state.
Oct  4 17:03:44 tools-k8s-master-01 systemd[1]: maintain-kubeusers.service holdoff time over, scheduling restart.

It lost its LDAP connection days ago and eventually stopped restarting itself. That may have been caused by the network switch move that happened recently.
Let me check the timing.

Yes, it lost connection during the switch recabling on October 4th. I wasn't aware that it was going to impact this, but we did see unexpected DNS failures during that time, which might be part of this LDAP connection failure. @Andrew, does that jibe with what you saw then? Maybe what was killing off some DNS things was actually LDAP queries in scripts? Or maybe it was the other way around.

It lost its LDAP connection days ago and eventually stopped restarting itself.

It was running as PID 9962 today. I got the same backtrace as T194859#4211040.

Yeah, but that's clouding the actual issue with more information, basically. The death of the service was days ago. To find it, you need to go simple and just run:

grep maintain-kubeusers /var/log/syslog
grep maintain-kubeusers /var/log/syslog.1
zgrep maintain-kubeusers /var/log/syslog.2.gz
zgrep maintain-kubeusers /var/log/syslog.3.gz

It's obvious then that it stopped working due to lack of connection to LDAP on Oct 4th during the network outage. It never logged another message, so basically, I think the parent PID was shot to hell and may have been looping for whatever reason. It had stopped logging anything until you restarted it.

So, after a chat on IRC, it seems that the ultimate concern here is that sometimes LDAP isn't available, and this script's LDAP server pooling function does something weird in that case. It fails a few times, but then it ends up kind of like a zombie. It should either fail hard or restore one of the connections. I'm changing the description around that.
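
The "fail hard" variant would let the connection error escape (or convert it to a nonzero exit) so systemd's restart logic takes over; a sketch, not the actual script:

import sys
from ldap3 import Connection, Server
from ldap3.core.exceptions import LDAPServerPoolExhaustedError, LDAPSocketOpenError

def connect_or_die(host, user, password):
    try:
        return Connection(Server(host, connect_timeout=1), user=user,
                          password=password, read_only=True, auto_bind=True)
    except (LDAPSocketOpenError, LDAPServerPoolExhaustedError) as err:
        # Exit nonzero so systemd records status=1/FAILURE and restarts the
        # unit, instead of lingering as a zombie inside the retry loop.
        print('LDAP unreachable: %s' % err, file=sys.stderr)
        sys.exit(1)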

Bstorm renamed this task from Toolforge maintain-kubeusers stuck in infinite sleeps of 10 seconds to Toolforge maintain-kubeusers doesn't fail well when LDAP servers are unreachable. Oct 7 2018, 10:31 PM
Bstorm lowered the priority of this task from High to Medium.
Bstorm updated the task description.

Mentioned in SAL (#wikimedia-cloud) [2019-07-27T23:00:06Z] <zhuyifei1999_> a past probably related ticket: T194859

This might be happening again: SAL:

2019-07-27 22:57 zhuyifei1999_: maintain-kubeusers seems stuck. Traceback: https://phabricator.wikimedia.org/P8812, core dump: /root/core.17898. Restarting
2019-07-26 17:39 bstorm_: restarted maintain-kubeusers because it was suspiciously tardy and quiet

I caught maintain-kubeusers apparently stuck again today. sudo journalctl -u maintain-kubeusers.service --no-pager -f showed no new log output for more than 5 hours. ps axuwww | grep maintain showed the process running. sudo gdb python3 <PID> showed the following output:

$ sudo gdb python3 10167
# snipped out lots of symbol reading output...
_______________________________________________________________________________
     eax:FFFFFDFE ebx:010717D0  ecx:017F5873  edx:00000000     eflags:00000246
     esi:00000000 edi:00000000  esp:59528E08  ebp:01406C58     eip:Error while running hook_stop:
Value can't be converted to integer.
0x00007fac017f5873 in select () from target:/lib/x86_64-linux-gnu/libc.so.6
gdb> bt
#0  0x00007fac017f5873 in select () from target:/lib/x86_64-linux-gnu/libc.so.6
#1  0x00000000005c42fb in floatsleep (secs=<optimized out>) at ../Modules/timemodule.c:1583
#2  time_sleep.lto_priv () at ../Modules/timemodule.c:250
#3  0x00000000004ef9f2 in call_function (oparg=<optimized out>, pp_stack=0x7ffc59528f10) at ../Python/ceval.c:4234
#4  PyEval_EvalFrameEx () at ../Python/ceval.c:2836
#5  0x00000000004f3577 in fast_function (nk=<optimized out>, na=<optimized out>, n=0x2, pp_stack=0x7ffc59529010, func=<optimized out>) at ../Python/ceval.c:4331
#6  call_function (oparg=<optimized out>, pp_stack=0x7ffc59529010) at ../Python/ceval.c:4259
#7  PyEval_EvalFrameEx () at ../Python/ceval.c:2836
#8  0x00000000004f3577 in fast_function (nk=<optimized out>, na=<optimized out>, n=0x1, pp_stack=0x7ffc59529110, func=<optimized out>) at ../Python/ceval.c:4331
#9  call_function (oparg=<optimized out>, pp_stack=0x7ffc59529110) at ../Python/ceval.c:4259
#10 PyEval_EvalFrameEx () at ../Python/ceval.c:2836
#11 0x00000000004f3577 in fast_function (nk=<optimized out>, na=<optimized out>, n=0x2, pp_stack=0x7ffc59529210, func=<optimized out>) at ../Python/ceval.c:4331
#12 call_function (oparg=<optimized out>, pp_stack=0x7ffc59529210) at ../Python/ceval.c:4259
#13 PyEval_EvalFrameEx () at ../Python/ceval.c:2836
#14 0x00000000004ee3cc in PyEval_EvalCodeEx () at ../Python/ceval.c:3585
#15 0x00000000004f3f97 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffc59529400, func=<optimized out>) at ../Python/ceval.c:4341
#16 call_function (oparg=<optimized out>, pp_stack=0x7ffc59529400) at ../Python/ceval.c:4259
#17 PyEval_EvalFrameEx () at ../Python/ceval.c:2836
#18 0x00000000004ee3cc in PyEval_EvalCodeEx () at ../Python/ceval.c:3585
#19 0x00000000004f3f97 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffc595295f0, func=<optimized out>) at ../Python/ceval.c:4341
#20 call_function (oparg=<optimized out>, pp_stack=0x7ffc595295f0) at ../Python/ceval.c:4259
#21 PyEval_EvalFrameEx () at ../Python/ceval.c:2836
#22 0x000000000050b0db in PyEval_EvalCodeEx (closure=<optimized out>, kwdefs=<optimized out>, defcount=<optimized out>, defs=<optimized out>, kwcount=<optimized out>, kws=<optimized out>, argcount=<optimized out>, args=<optimized out>, locals=0x0, globals=<optimized out>, _co=<code at remote 0x7fabffb89930>) at ../Python/ceval.c:3585
#23 function_call.lto_priv () at ../Objects/funcobject.c:632
Python Exception <class 'ValueError'> invalid literal for int() with base 10: '0x2':
#24 0x0000000000513dbf in PyObject_Call (kw={'user': 'cn=proxyagent,ou=profile,dc=wikimedia,dc=org', 'read_only': True, 'password': 'Eche0ieng8UaNoo', 'auto_bind': True}, arg=, func=<function at remote 0x7fabff73d620>) at ../Objects/abstract.c:2067
#25 method_call.lto_priv () at ../Objects/classobject.c:347
Python Exception <class 'ValueError'> invalid literal for int() with base 10: '0x1':
#26 0x0000000000513351 in PyObject_Call (kw={'user': 'cn=proxyagent,ou=profile,dc=wikimedia,dc=org', 'read_only': True, 'password': 'Eche0ieng8UaNoo', 'auto_bind': True}, arg=, func=<method at remote 0x7fabfec570c8>) at ../Objects/abstract.c:2067
#27 slot_tp_init.lto_priv () at ../Objects/typeobject.c:6023
#28 0x00000000004f9273 in type_call.lto_priv () at ../Objects/typeobject.c:869
Python Exception <class 'ValueError'> invalid literal for int() with base 10: '0x1':
#29 0x00000000004f3c59 in PyObject_Call (kw={'user': 'cn=proxyagent,ou=profile,dc=wikimedia,dc=org', 'read_only': True, 'password': 'Eche0ieng8UaNoo', 'auto_bind': True}, arg=, func=<optimized out>) at ../Objects/abstract.c:2067
#30 do_call (nk=<optimized out>, na=0x1, pp_stack=0x7ffc59529860, func=<optimized out>) at ../Python/ceval.c:4463
#31 call_function (oparg=<optimized out>, pp_stack=0x7ffc59529860) at ../Python/ceval.c:4261
#32 PyEval_EvalFrameEx () at ../Python/ceval.c:2836
#33 0x00000000004f3577 in fast_function (nk=<optimized out>, na=<optimized out>, n=0x0, pp_stack=0x7ffc59529960, func=<optimized out>) at ../Python/ceval.c:4331
#34 call_function (oparg=<optimized out>, pp_stack=0x7ffc59529960) at ../Python/ceval.c:4259
#35 PyEval_EvalFrameEx () at ../Python/ceval.c:2836
#36 0x00000000004ee3cc in PyEval_EvalCodeEx () at ../Python/ceval.c:3585
Python Exception <class 'ValueError'> invalid literal for int() with base 10: '0x10':
Python Exception <class 'ValueError'> invalid literal for int() with base 10: '0x10':
#37 0x0000000000581e25 in PyEval_EvalCode (locals=, globals=, co=<code at remote 0x7fac0164a9c0>) at ../Python/ceval.c:773
#38 run_mod.lto_priv () at ../Python/pythonrun.c:2180
#39 0x0000000000597fc9 in PyRun_FileExFlags () at ../Python/pythonrun.c:2133
#40 0x0000000000596bb4 in PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:1606
#41 0x00000000004cd13b in run_file (p_cf=<optimized out>, filename=<optimized out>, fp=<optimized out>) at ../Modules/main.c:319
#42 Py_Main () at ../Modules/main.c:751
#43 0x00000000004cc5df in main () at ../Modules/python.c:69
#44 0x00007fac01735b45 in __libc_start_main () from target:/lib/x86_64-linux-gnu/libc.so.6
#45 0x00000000004cc424 in _start ()
gdb> py-bt
Traceback (most recent call first):
  (unable to read python frame information)
  (unable to read python frame information)
  (unable to read python frame information)
  (unable to read python frame information)
  (unable to read python frame information)
  (unable to read python frame information)
  (unable to read python frame information)
  (unable to read python frame information)
gdb>

Something is corrupted such that gdb can't figure out which source lines the stack corresponds to, but it seems pretty obviously related to LDAP connection activity. Also, before anyone freaks out, the password for the cn=proxyagent,ou=profile,dc=wikimedia,dc=org user shown in the stack traces is already public information.

Mentioned in SAL (#wikimedia-cloud) [2019-07-29T20:14:35Z] <bd808> Restarted maintain-kubeusers on tools-k8s-master-01 (T194859)

So it does sometimes handle LDAP issues gracefully:

Jul 29 00:42:59 tools-k8s-master-01 maintain-kubeusers[427]: starting a run
Jul 29 00:43:00 tools-k8s-master-01 maintain-kubeusers[427]: Traceback (most recent call last):
Jul 29 00:43:00 tools-k8s-master-01 maintain-kubeusers[427]: File "/usr/local/bin/maintain-kubeusers", line 435, in <module>
Jul 29 00:43:00 tools-k8s-master-01 maintain-kubeusers[427]: main()
Jul 29 00:43:00 tools-k8s-master-01 maintain-kubeusers[427]: File "/usr/local/bin/maintain-kubeusers", line 397, in main
Jul 29 00:43:00 tools-k8s-master-01 maintain-kubeusers[427]: password=ldapconfig['password']
Jul 29 00:43:00 tools-k8s-master-01 maintain-kubeusers[427]: File "/usr/lib/python3/dist-packages/ldap3/core/connection.py", line 282, in __init__
Jul 29 00:43:00 tools-k8s-master-01 maintain-kubeusers[427]: self.open(read_server_info=False)
Jul 29 00:43:00 tools-k8s-master-01 maintain-kubeusers[427]: File "/usr/lib/python3/dist-packages/ldap3/strategy/sync.py", line 55, in open
Jul 29 00:43:00 tools-k8s-master-01 maintain-kubeusers[427]: BaseStrategy.open(self, reset_usage, read_server_info)
Jul 29 00:43:00 tools-k8s-master-01 maintain-kubeusers[427]: File "/usr/lib/python3/dist-packages/ldap3/strategy/base.py", line 132, in open
Jul 29 00:43:00 tools-k8s-master-01 maintain-kubeusers[427]: raise exception_history[0][1](exception_history[0][2])
Jul 29 00:43:00 tools-k8s-master-01 maintain-kubeusers[427]: ldap3.core.exceptions.LDAPSocketOpenError: socket connection error: timed out
Jul 29 00:43:01 tools-k8s-master-01 systemd[1]: maintain-kubeusers.service: main process exited, code=exited, status=1/FAILURE
Jul 29 00:43:01 tools-k8s-master-01 systemd[1]: Unit maintain-kubeusers.service entered failed state.
Jul 29 00:43:01 tools-k8s-master-01 systemd[1]: maintain-kubeusers.service holdoff time over, scheduling restart.
Jul 29 00:43:01 tools-k8s-master-01 maintain-kubeusers[10080]: starting a run

But the latest issue before restart put out no log messages. It just hung there, so systemd didn't restart it.

The evidence:

Jul 29 14:51:34 tools-k8s-master-01 maintain-kubeusers[10167]: starting a run
Jul 29 14:51:35 tools-k8s-master-01 maintain-kubeusers[10167]: finished run, wrote 0 new accounts
Jul 29 14:52:35 tools-k8s-master-01 maintain-kubeusers[10167]: starting a run
Jul 29 14:52:36 tools-k8s-master-01 maintain-kubeusers[10167]: finished run, wrote 0 new accounts
Jul 29 14:53:36 tools-k8s-master-01 maintain-kubeusers[10167]: starting a run
Jul 29 20:13:43 tools-k8s-master-01 maintain-kubeusers[9691]: starting a run
Jul 29 20:13:44 tools-k8s-master-01 maintain-kubeusers[9691]: finished run, wrote 0 new accounts
Jul 29 20:14:44 tools-k8s-master-01 maintain-kubeusers[9691]: starting a run
Jul 29 20:14:44 tools-k8s-master-01 maintain-kubeusers[9691]: finished run, wrote 0 new accounts

Just for the sake of someone wondering about this.

Reading old issues of the software, it looks like we should be setting receive_timeout on the connection as well as our connect_timeout.
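
In ldap3, connect_timeout is a Server parameter and receive_timeout a Connection parameter, so the change is presumably along these lines (a sketch, not the merged patch):

from ldap3 import Connection, Server

server = Server('ldap-1.example.org', connect_timeout=1)   # cap TCP connect time
conn = Connection(server,
                  user='cn=proxyagent,ou=profile,dc=wikimedia,dc=org',
                  password='<redacted>',
                  read_only=True,
                  auto_bind=True,
                  receive_timeout=10)   # also cap time spent waiting on responses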

Change 526250 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] maintain-kubeusers: set timeout on reading LDAP data

https://gerrit.wikimedia.org/r/526250

Change 526250 merged by Bstorm:
[operations/puppet@production] maintain-kubeusers: set timeout on reading LDAP data

https://gerrit.wikimedia.org/r/526250

Change 526258 had a related patch set uploaded (by Bstorm; owner: Bstorm):
[operations/puppet@production] maintain-kubeusers: allow ldap3 to raise exceptions

https://gerrit.wikimedia.org/r/526258

The timeout effort ran into T214541.

Maybe letting LDAP raise exceptions in the connection object will do the thing.
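
With raise_exceptions=True, failed operations raise instead of merely returning False, so an LDAP outage can't be silently swallowed; roughly (a sketch):

from ldap3 import Connection, Server
from ldap3.core.exceptions import LDAPException

conn = Connection(Server('ldap-1.example.org', connect_timeout=1),
                  user='cn=proxyagent,ou=profile,dc=wikimedia,dc=org',
                  password='<redacted>', read_only=True, auto_bind=True,
                  raise_exceptions=True)
try:
    conn.search('dc=wikimedia,dc=org', '(uid=sometool)', attributes=['uid'])
except LDAPException as err:
    raise SystemExit('LDAP operation failed: %s' % err)  # die loudly, get restarted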

Change 526258 merged by Bstorm:
[operations/puppet@production] maintain-kubeusers: allow ldap3 to raise exceptions

https://gerrit.wikimedia.org/r/526258

OK, that's running. It's feasible that could fix things, but the errors are a bit mysterious. I preferred the receive_timeout option we can't use. 🙂

bd808 added a subscriber: Phamhi.

@Phamhi has been working on a band-aid fix for this problem. The systemd integration for maintain-kubeusers has been changed to use a systemd timer (similar to a cron job) rather than running in a continuous loop as a systemd service.

There appears to still be some work to do here as the timer seems not to be firing exactly as expected:

$ ssh tools-k8s-master-01.tools.eqiad.wmflabs
$ sudo systemctl list-timers --no-pager
NEXT                         LEFT          LAST                         PASSED    UNIT                               ACTIVATES
Thu 2019-09-05 23:51:00 UTC  14h ago       Thu 2019-09-05 23:51:08 UTC  14h ago   maintain-kubeusers-timer.timer     maintain-kubeusers-timer.service
Fri 2019-09-06 14:50:38 UTC  1min 39s left Fri 2019-09-06 14:35:38 UTC  13min ago prometheus-node-exporter-apt.timer prometheus-node-exporter-apt.service
Fri 2019-09-06 18:34:38 UTC  3h 45min left Thu 2019-09-05 18:34:38 UTC  20h ago   systemd-tmpfiles-clean.timer       systemd-tmpfiles-clean.service
Sat 2019-09-07 08:15:00 UTC  17h left      Fri 2019-09-06 08:15:11 UTC  6h ago    send_puppet_failure_emails.timer   send_puppet_failure_emails.service

4 timers listed.
Pass --all to see loaded but inactive timers, too.
$ sudo journalctl -u maintain-kubeusers-timer.service --no-pager --since "24 hours ago"
-- Logs begin at Thu 2019-09-05 13:52:02 UTC, end at Fri 2019-09-06 14:51:30 UTC. --
Sep 05 14:52:00 tools-k8s-master-01 systemd[1]: Starting Automate the process of generating users...
Sep 05 14:52:00 tools-k8s-master-01 systemd[1]: Started Automate the process of generating users.
Sep 05 14:52:01 tools-k8s-master-01 maintain-kubeusers[25495]: starting a run
Sep 05 14:52:01 tools-k8s-master-01 maintain-kubeusers[25495]: finished run, wrote 0 new accounts
...
Sep 05 23:50:00 tools-k8s-master-01 systemd[1]: Starting Automate the process of generating users...
Sep 05 23:50:00 tools-k8s-master-01 systemd[1]: Started Automate the process of generating users.
Sep 05 23:50:01 tools-k8s-master-01 maintain-kubeusers[10616]: starting a run
Sep 05 23:50:02 tools-k8s-master-01 maintain-kubeusers[10616]: finished run, wrote 0 new accounts
Sep 05 23:51:08 tools-k8s-master-01 systemd[1]: Starting Automate the process of generating users...
Sep 05 23:51:08 tools-k8s-master-01 systemd[1]: Started Automate the process of generating users.
Sep 05 23:51:09 tools-k8s-master-01 maintain-kubeusers[10635]: starting a run

So the run that started at 2019-09-05T23:51:08 appears to have gotten stuck.

$ ps axuww|grep maintain-kubeusers
bd808     4890  0.0  0.0  12728  2068 pts/0    S+   14:52   0:00 grep --color=auto maintain-kubeusers
root     10635  0.0  0.5  69852 21620 ?        Ss   Sep05   0:02 /usr/bin/python3 /usr/local/bin/maintain-kubeusers --once --infrastructure-users /etc/kubernetes/infrastructure-users --project tools https://k8s-master.tools.wmflabs.org:6443 /etc/kubernetes/tokenauth /etc/kubernetes/abac
$ sudo gdb python3 10635
...
Attaching to program: /usr/bin/python3, process 10635
...
_______________________________________________________________________________
     eax:FFFFFDFE ebx:018C8800  ecx:5B77F873  edx:00000000     eflags:00000246
     esi:00000000 edi:00000000  esp:CE8559B8  ebp:01C8C008     eip:Error while running hook_stop:
Value can't be converted to integer.
0x00007f895b77f873 in select () from /lib/x86_64-linux-gnu/libc.so.6
gdb> bt
#0  0x00007f895b77f873 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00000000005c42fb in floatsleep (secs=<optimized out>) at ../Modules/timemodule.c:1583
#2  time_sleep.lto_priv () at ../Modules/timemodule.c:250
#3  0x00000000004ef9f2 in call_function (oparg=<optimized out>, pp_stack=0x7ffece855ac0) at ../Python/ceval.c:4234
#4  PyEval_EvalFrameEx () at ../Python/ceval.c:2836
#5  0x00000000004f3577 in fast_function (nk=<optimized out>, na=<optimized out>, n=0x2, pp_stack=0x7ffece855bc0, func=<optimized out>) at ../Python/ceval.c:4331
#6  call_function (oparg=<optimized out>, pp_stack=0x7ffece855bc0) at ../Python/ceval.c:4259
#7  PyEval_EvalFrameEx () at ../Python/ceval.c:2836
#8  0x00000000004f3577 in fast_function (nk=<optimized out>, na=<optimized out>, n=0x1, pp_stack=0x7ffece855cc0, func=<optimized out>) at ../Python/ceval.c:4331
#9  call_function (oparg=<optimized out>, pp_stack=0x7ffece855cc0) at ../Python/ceval.c:4259
#10 PyEval_EvalFrameEx () at ../Python/ceval.c:2836
#11 0x00000000004f3577 in fast_function (nk=<optimized out>, na=<optimized out>, n=0x2, pp_stack=0x7ffece855dc0, func=<optimized out>) at ../Python/ceval.c:4331
#12 call_function (oparg=<optimized out>, pp_stack=0x7ffece855dc0) at ../Python/ceval.c:4259
#13 PyEval_EvalFrameEx () at ../Python/ceval.c:2836
#14 0x00000000004ee3cc in PyEval_EvalCodeEx () at ../Python/ceval.c:3585
#15 0x00000000004f3f97 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffece855fb0, func=<optimized out>) at ../Python/ceval.c:4341
#16 call_function (oparg=<optimized out>, pp_stack=0x7ffece855fb0) at ../Python/ceval.c:4259
#17 PyEval_EvalFrameEx () at ../Python/ceval.c:2836
#18 0x00000000004ee3cc in PyEval_EvalCodeEx () at ../Python/ceval.c:3585
#19 0x00000000004f3f97 in fast_function (nk=<optimized out>, na=<optimized out>, n=<optimized out>, pp_stack=0x7ffece8561a0, func=<optimized out>) at ../Python/ceval.c:4341
#20 call_function (oparg=<optimized out>, pp_stack=0x7ffece8561a0) at ../Python/ceval.c:4259
#21 PyEval_EvalFrameEx () at ../Python/ceval.c:2836
#22 0x000000000050b0db in PyEval_EvalCodeEx (closure=<optimized out>, kwdefs=<optimized out>, defcount=<optimized out>, defs=<optimized out>, kwcount=<optimized out>, kws=<optimized out>, argcount=<optimized out>, args=<optimized out>, locals=0x0, globals=<optimized out>, _co=<code at remote 0x7f8959b13930>) at ../Python/ceval.c:3585
#23 function_call.lto_priv () at ../Objects/funcobject.c:632
Python Exception <class 'ValueError'> invalid literal for int() with base 10: '0x2':
#24 0x0000000000513dbf in PyObject_Call (kw={'raise_exceptions': True, 'read_only': True, 'auto_bind': True, 'user': 'cn=proxyagent,ou=profile,dc=wikimedia,dc=org', 'password': 'Eche0ieng8UaNoo'}, arg=, func=<function at remote 0x7f89596c7620>) at ../Objects/abstract.c:2067
#25 method_call.lto_priv () at ../Objects/classobject.c:347
Python Exception <class 'ValueError'> invalid literal for int() with base 10: '0x1':
#26 0x0000000000513351 in PyObject_Call (kw={'raise_exceptions': True, 'read_only': True, 'auto_bind': True, 'user': 'cn=proxyagent,ou=profile,dc=wikimedia,dc=org', 'password': 'Eche0ieng8UaNoo'}, arg=, func=<method at remote 0x7f895b627ac8>) at ../Objects/abstract.c:2067
#27 slot_tp_init.lto_priv () at ../Objects/typeobject.c:6023
#28 0x00000000004f9273 in type_call.lto_priv () at ../Objects/typeobject.c:869
Python Exception <class 'ValueError'> invalid literal for int() with base 10: '0x1':
#29 0x00000000004f3c59 in PyObject_Call (kw={'raise_exceptions': True, 'read_only': True, 'auto_bind': True, 'user': 'cn=proxyagent,ou=profile,dc=wikimedia,dc=org', 'password': 'Eche0ieng8UaNoo'}, arg=, func=<optimized out>) at ../Objects/abstract.c:2067
#30 do_call (nk=<optimized out>, na=0x1, pp_stack=0x7ffece856410, func=<optimized out>) at ../Python/ceval.c:4463
#31 call_function (oparg=<optimized out>, pp_stack=0x7ffece856410) at ../Python/ceval.c:4261
#32 PyEval_EvalFrameEx () at ../Python/ceval.c:2836
#33 0x00000000004f3577 in fast_function (nk=<optimized out>, na=<optimized out>, n=0x0, pp_stack=0x7ffece856510, func=<optimized out>) at ../Python/ceval.c:4331
#34 call_function (oparg=<optimized out>, pp_stack=0x7ffece856510) at ../Python/ceval.c:4259
#35 PyEval_EvalFrameEx () at ../Python/ceval.c:2836
#36 0x00000000004ee3cc in PyEval_EvalCodeEx () at ../Python/ceval.c:3585
Python Exception <class 'ValueError'> invalid literal for int() with base 10: '0x10':
Python Exception <class 'ValueError'> invalid literal for int() with base 10: '0x10':
#37 0x0000000000581e25 in PyEval_EvalCode (locals=, globals=, co=<code at remote 0x7f895b5d49c0>) at ../Python/ceval.c:773
#38 run_mod.lto_priv () at ../Python/pythonrun.c:2180
#39 0x0000000000597fc9 in PyRun_FileExFlags () at ../Python/pythonrun.c:2133
#40 0x0000000000596bb4 in PyRun_SimpleFileExFlags () at ../Python/pythonrun.c:1606
#41 0x00000000004cd13b in run_file (p_cf=<optimized out>, filename=<optimized out>, fp=<optimized out>) at ../Modules/main.c:319
#42 Py_Main () at ../Modules/main.c:751
#43 0x00000000004cc5df in main () at ../Modules/python.c:69
#44 0x00007f895b6bfb45 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#45 0x00000000004cc424 in _start ()

Looks to be stuck in the same old place trying to read data from the LDAP directory. This is happening even though the service started by the timer is supposed to be killed after 5 minutes of runtime.

$ sudo systemctl cat maintain-kubeusers-timer.timer
# /lib/systemd/system/maintain-kubeusers-timer.timer
[Unit]
Description=Periodic execution of maintain-kubeusers-timer.service

[Timer]
Unit=maintain-kubeusers-timer.service
# Accuracy sets the maximum time interval around the execution time we want to a
AccuracySec=15sec
OnCalendar=*-*-* *:00/1:00

[Install]
WantedBy=multi-user.target
$ sudo systemctl cat maintain-kubeusers-timer.service
# /lib/systemd/system/maintain-kubeusers-timer.service
[Unit]
Description=Automate the process of generating users

[Service]
User=root
ExecStart=/usr/local/bin/maintain-kubeusers                         --once
RuntimeMaxSec=300

Mentioned in SAL (#wikimedia-cloud) [2019-09-06T15:11:33Z] <bd808> sudo kill -9 10635 on tools-k8s-master-01 (T194859)

After this kill, things picked up as expected:

$ sudo journalctl -u maintain-kubeusers-timer.service --no-pager --since "1 hours ago"
-- Logs begin at Thu 2019-09-05 13:52:02 UTC, end at Fri 2019-09-06 15:12:45 UTC. --
Sep 06 15:10:40 tools-k8s-master-01 systemd[1]: Starting Automate the process of generating users...
Sep 06 15:10:40 tools-k8s-master-01 systemd[1]: Started Automate the process of generating users.
Sep 06 15:10:41 tools-k8s-master-01 maintain-kubeusers[5844]: starting a run
Sep 06 15:10:41 tools-k8s-master-01 maintain-kubeusers[5844]: Wrote config in /data/project/wikiwho/.kube/config
Sep 06 15:10:41 tools-k8s-master-01 maintain-kubeusers[5844]: (b'namespace "wikiwho" created\n', b'')
Sep 06 15:10:41 tools-k8s-master-01 maintain-kubeusers[5844]: Provisioned creds for tool wikiwho
Sep 06 15:10:50 tools-k8s-master-01 maintain-kubeusers[5844]: finished run, wrote 1 new accounts
Sep 06 15:11:11 tools-k8s-master-01 systemd[1]: Starting Automate the process of generating users...
Sep 06 15:11:11 tools-k8s-master-01 systemd[1]: Started Automate the process of generating users.
Sep 06 15:11:11 tools-k8s-master-01 maintain-kubeusers[5913]: starting a run
Sep 06 15:11:11 tools-k8s-master-01 maintain-kubeusers[5913]: finished run, wrote 0 new accounts
Sep 06 15:12:00 tools-k8s-master-01 systemd[1]: Starting Automate the process of generating users...
Sep 06 15:12:00 tools-k8s-master-01 systemd[1]: Started Automate the process of generating users.
Sep 06 15:12:01 tools-k8s-master-01 maintain-kubeusers[5941]: starting a run
Sep 06 15:12:01 tools-k8s-master-01 maintain-kubeusers[5941]: finished run, wrote 0 new accounts

Looks like we need to dig into the proper RuntimeMaxSec configuration a bit deeper.

That explains why it suddenly stopped and then started working again. I was wondering about that (and commented on the merged task).

$ sudo grep RuntimeMaxSec /var/log/daemon.log
Sep  4 16:29:27 tools-k8s-master-01 systemd[1]: [/lib/systemd/system/maintain-kubeusers-timer.service:7] Unknown lvalue 'RuntimeMaxSec' in section 'Service'

As per https://github.com/systemd/systemd/blob/master/NEWS, RuntimeMaxSec was added in systemd version 229; our version is at 215:

$ systemd --version
systemd 215

As a workaround, I'm going to use the /usr/bin/timeout utility to wrap the command.

I was able to test this method successfully by emulating a very long-running maintain-kubeusers task.

As a workaround, I'm going to use the /usr/bin/timeout utility to wrap the command.

I was able to test this method successfully by emulating a very long-running maintain-kubeusers task.

https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/534848/
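
The wrapped unit presumably ends up roughly like this (a sketch of the idea, not the merged diff):

# /lib/systemd/system/maintain-kubeusers-timer.service
[Unit]
Description=Automate the process of generating users

[Service]
User=root
# systemd 215 ignores RuntimeMaxSec=, so enforce the cap with coreutils
# timeout: SIGTERM after 300s, SIGKILL 10s later if the process hangs on.
ExecStart=/usr/bin/timeout -k 10 300 /usr/local/bin/maintain-kubeusers --once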

The workaround seems to be working so far. I haven't seen evidence of a single hang, and the logs show it creating new accounts and running successfully. The only place it might have an issue is if it ever needs to create enough accounts to make a run take longer than a minute.

Hi @Bstorm... do you think it's safe to say that this workaround works and we can close this issue?

I think this crushed it! Good job.