
Manual update - stale file handle
Closed, Resolved · Public · BUG REPORT

Description

Steps to Reproduce:
Press "Manual Update" on this page: https://www.wikidata.org/wiki/User:99of9/Properties_dashboard

Actual Results:
Something catastrophic happened when processing page User:99of9/Properties_dashboard.

Please report this on Phabricator.

<class 'OSError'>

[Errno 116] Stale file handle
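
For context, errno 116 (ESTALE) is what the kernel returns when an NFS file handle no longer refers to a valid object on the server, e.g. after the underlying file was replaced or the export went away. A minimal illustration of how this surfaces in Python (hypothetical code, not the tool's actual implementation; retrying by re-opening is only one common mitigation):

import errno

def read_page_config(path):
    # Hypothetical helper, for illustration only.
    try:
        with open(path) as f:
            return f.read()
    except OSError as e:
        if e.errno == errno.ESTALE:
            # Errno 116: the NFS server no longer recognises the old file
            # handle; re-opening the path (fresh handle) is one common fix.
            with open(path) as f:
                return f.read()
        raise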

Event Timeline

Thanks for reporting.

Fairly sure this is Toolforge-related; I saw other mentions on Twitter (but, strangely enough, nothing on the mailing list nor on Phabricator?)

Let’s try a restart.

Mentioned in SAL (#wikimedia-cloud) [2019-05-30T10:59:03Z] <wm-bot> <jeanfred> Service stop, mv logs, service start for T224651

tools.integraality@tools-sgebastion-07:~$ webservice --backend=kubernetes python3.5 status
Your webservice of type python3.5 is running

Yet https://tools.wmflabs.org/integraality/ returns a 502, and there is also nothing in uwsgi.log (I hand-rotated the file to uwsgi.log.1, and it does not get recreated 🤔).
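
(Side note on the hand-rotation: a plain mv does not make a running process switch files, because it keeps writing through its already-open file descriptor, which now points at uwsgi.log.1; the old path only reappears once the process reopens its log. A small standalone sketch of that mechanism, independent of uwsgi:)

import os

with open("demo.log", "w") as log:
    log.write("before rename\n")
    log.flush()
    os.rename("demo.log", "demo.log.1")  # "hand-rotate" the file
    log.write("after rename\n")          # still lands in demo.log.1
    log.flush()

print(os.path.exists("demo.log"))        # False: the old path is not recreated
print(open("demo.log.1").read())         # both lines ended up here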

tools.integraality@tools-sgebastion-07:~$ kubectl get po
NAME                            READY     STATUS             RESTARTS   AGE
integraality-2123637710-p5b8o   0/1       CrashLoopBackOff   15         56m
tools.integraality@tools-sgebastion-07:~$ webservice --backend=kubernetes python3.5 stop
Stopping webservice
tools.integraality@tools-sgebastion-07:~$ kubectl get pod
<>
tools.integraality@tools-sgebastion-07:~$ webservice --backend=kubernetes python3.5 start
Starting webservice...
tools.integraality@tools-sgebastion-07:~$ kubectl get pod
NAME                            READY     STATUS    RESTARTS   AGE
integraality-2123637710-denbu   0/1       Error     2          30s

Grrr, I wanted to nuke the virtualenv and provision it from scratch, but:

tools.integraality@tools-sgebastion-07:~$ webservice --backend=kubernetes python3.5 shell
Pod is not ready in time
tools.integraality@tools-sgebastion-07:~$ webservice --backend=kubernetes python3.5 stop
Stopping webservice
tools.integraality@tools-sgebastion-07:~$ webservice --backend=kubernetes python3.5 shell
Traceback (most recent call last):
  File "/usr/local/bin/webservice", line 198, in <module>
    job.shell()
  File "/usr/lib/python2.7/dist-packages/toollabs/webservice/backends/kubernetesbackend.py", line 491, in shell
    pykube.Pod(self.api, podSpec).create()
  File "/usr/lib/python2.7/dist-packages/pykube/objects.py", line 76, in create
    self.api.raise_for_status(r)
  File "/usr/lib/python2.7/dist-packages/pykube/http.py", line 104, in raise_for_status
    raise HTTPError(payload["message"])
pykube.exceptions.HTTPError: object is being deleted: pods "interactive" already exists
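
(The "object is being deleted" error suggests the previous interactive pod was still terminating. A hypothetical workaround sketch, assuming kubectl access from the tool account, would be to poll until that pod is gone before retrying webservice shell; this is not part of the webservice tooling:)

import subprocess
import time

def wait_for_pod_gone(pod="interactive", timeout=120):
    # Poll kubectl until the named pod no longer exists (non-zero exit code).
    deadline = time.time() + timeout
    while time.time() < deadline:
        rc = subprocess.call(["kubectl", "get", "pod", pod],
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL)
        if rc != 0:
            return True
        time.sleep(5)
    return False

if wait_for_pod_gone():
    subprocess.call(["webservice", "--backend=kubernetes", "python3.5", "shell"])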

Mentioned in SAL (#wikimedia-cloud) [2019-05-30T12:08:06Z] <wm-bot> <jeanfred> Nuked the virtualenv and reinstalled all deps from scratch, in desperation for T224651

Even after nuking the venv and reinstalling everything, the service does not get healthy:

tools.integraality@tools-sgebastion-07:~$ kubectl get pod
NAME                            READY     STATUS             RESTARTS   AGE
integraality-2123637710-142v5   0/1       CrashLoopBackOff   3          1m
tools.integraality@tools-sgebastion-07:~$ kubectl get pod
NAME                            READY     STATUS             RESTARTS   AGE
integraality-2123637710-142v5   0/1       CrashLoopBackOff   6          6m
tools.integraality@tools-sgebastion-07:~$ kubectl logs integraality-2123637710-142v5
Traceback (most recent call last):
  File "/usr/bin/webservice-runner", line 20, in <module>
    tool = Tool.from_currentuser()
  File "/usr/lib/python2.7/dist-packages/toollabs/common/tool.py", line 96, in from_currentuser
    pwd_entry = pwd.getpwuid(os.geteuid())
KeyError: 'getpwuid(): uid not found: 54041'

This pod is running on a node I'm working on right now for T224558: sssd: support for Debian Jessie

root@tools-k8s-master-01:~# kubectl get pod --namespace integraality -o wide
NAME                            READY     STATUS             RESTARTS   AGE       IP              NODE
integraality-2123637710-142v5   0/1       CrashLoopBackOff   6          7m        192.168.174.3   tools-worker-1003.tools.eqiad.wmflabs

However, I just did:

aborrero@tools-worker-1003:~$ id 54041
uid=54041(tools.integraality) gid=54041(tools.integraality) groups=54041(tools.integraality)

aborrero triaged this task as High priority.

More tests:

aborrero@tools-worker-1003:~$ python
Python 2.7.9 (default, Sep 25 2018, 20:42:16) 
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.geteuid()
18194
>>> import pwd
>>> pwd.getpwuid(os.geteuid())
pwd.struct_passwd(pw_name='aborrero', pw_passwd='*', pw_uid=18194, pw_gid=500, pw_gecos='Arturo Borrero Gonzalez', pw_dir='/home/aborrero', pw_shell='/bin/bash')

Not sure why this wouldn't work inside the pod.
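
One way to narrow it down would be to run the same lookup from inside the affected environment; presumably the container only consults its own /etc/passwd and nsswitch.conf, so a uid that the host resolves via sssd would not be visible there. A small diagnostic sketch (it mirrors the check in toollabs/common/tool.py, but is not taken from it):

import os
import pwd

uid = os.geteuid()
try:
    entry = pwd.getpwuid(uid)
    print("uid %d resolves to %s" % (uid, entry.pw_name))
except KeyError:
    # Same failure as webservice-runner: NSS in this environment has no
    # passwd entry for the uid (a host-side sssd source is not visible here).
    print("uid %d has no passwd entry in this environment" % uid)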

Mentioned in SAL (#wikimedia-cloud) [2019-05-30T12:20:28Z] <arturo> cordon/drain tools-worker-1003 because T224651 and T224651

Mentioned in SAL (#wikimedia-cloud) [2019-05-30T12:22:52Z] <arturo> cordon/drain tools-worker-1029 because T224651 and T224651

Mentioned in SAL (#wikimedia-cloud) [2019-05-30T12:23:46Z] <arturo> cordon/drain tools-worker-1001 because T224651 and T224651

Mentioned in SAL (#wikimedia-cloud) [2019-05-30T12:25:11Z] <arturo> cordon/drain tools-worker-1002 because T224651 and T224651

I had to cordon/drain all the k8s worker nodes that were running sssd, because your pod was scheduled on them :-P

root@tools-k8s-master-01:~# kubectl get pod --namespace integraality -o wide
NAME                            READY     STATUS    RESTARTS   AGE       IP               NODE
integraality-2123637710-95bts   1/1       Running   0          1m        192.168.227.30   tools-worker-1028.tools.eqiad.wmflabs

Now I have to fix T224558: sssd: support for Debian Jessie. Closing this task now. Please reopen if you see more issues.

Mentioned in SAL (#wikimedia-cloud) [2019-05-30T12:29:29Z] <arturo> switch hiera setting back to classic/sudoldap for tools-worker because T224651 (T224558)

Thanks for investigating and fixing, @aborrero!