
Some Kubernetes tools were stopped on 2019-04-13 19:31 and can’t be restarted
Closed, Resolved · Public

Description

The QuickCategories tool was apparently stopped for unknown reasons yesterday:

uwsgi.log
…
*** Operational MODE: preforking ***
mounting /data/project/quickcategories/www/python/src/app.py on /quickcategories
WSGI app 0 (mountpoint='/quickcategories') ready in 33 seconds on interpreter 0x1c91300 pid: 1 (default app)
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 1)
spawned uWSGI worker 1 (pid: 8, cores: 1)
spawned uWSGI worker 2 (pid: 9, cores: 1)
spawned uWSGI worker 3 (pid: 10, cores: 1)
spawned uWSGI worker 4 (pid: 11, cores: 1)
SIGINT/SIGQUIT received...killing workers...
worker 1 buried after 1 seconds
worker 2 buried after 1 seconds
worker 3 buried after 1 seconds
worker 4 buried after 1 seconds
goodbye to uWSGI.

(Yes, there were no requests served since the last restart. That’s not a bug, the tool just isn’t overly popular yet.) The log isn’t timestamped, but the last modification to the file was on 2019-04-13 19:31 UTC. @Fnielsen’s Ordia tool is also down, and while I can’t read its uwsgi.log, it has the same modification time.
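Since the log itself carries no timestamps, the file's modification time is the only clue to when the service stopped. A minimal sketch of reading it in UTC follows; the helper name `last_modified_utc` is made up for illustration and is not part of any Toolforge tooling.

```python
import os
from datetime import datetime, timezone


def last_modified_utc(path):
    """Return the file's last-modification time as 'YYYY-MM-DD HH:MM UTC'."""
    mtime = os.stat(path).st_mtime  # seconds since the epoch
    stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%d %H:%M UTC")
```

Calling `last_modified_utc("uwsgi.log")` from the tool's home directory would give a value that can be compared across tools, which is how the matching 19:31 times above were noticed.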

kubectl get pods listed a “pending” pod that was 16 hours old (i.e. about as old as the last modification to uwsgi.log, if I’m not mistaken).

kubectl describe pod
Name:           quickcategories-654583560-xqip5              
Namespace:      quickcategories                                                      
Node:           /                                                                        
Labels:         name=quickcategories                                                
                pod-template-hash=654583560                                 
                tools.wmflabs.org/webservice=true                                                                       
                tools.wmflabs.org/webservice-version=1                                   
Status:         Pending                                                             
IP:                                                                                         
Controllers:    ReplicaSet/quickcategories-654583560                                 
Containers:                                                                          
  webservice:                                                                        
    Image:      docker-registry.tools.wmflabs.org/toollabs-python-web:latest        
    Port:       8000/TCP                                                                 
    Command:                                                                                                     
      /usr/bin/webservice-runner                             
      --type                                                                         
      uwsgi-python                                                                  
      --port                                                                        
      8000                                                                                       
    Limits:                                    
      cpu:      2                                                                                
      memory:   2Gi                                                                    
    Requests:                                                                                           
      cpu:      125m                                 
      memory:   256Mi                            
    Volume Mounts:                                    
      /data/project/ from home (rw)                             
      /data/scratch/ from scratch (rw)     
      /etc/ldap.conf from etcldap-conf-bzn58 (rw)                                                                                                                                                              
      /etc/ldap.yaml from etcldap-yaml-xaarl (rw)     
      /etc/novaobserver.yaml from etcnovaobserver-yaml-syao6 (rw)
      /etc/wmcs-project from wmcs-project (rw)                              
      /mnt/nfs/ from nfs (rw)                        
      /public/dumps/ from dumps (rw)  
      /var/run/nslcd/socket from varrunnslcdsocket-dhv68 (rw)        
    Environment Variables:                                                  
      HOME:     /data/project/quickcategories/                                                             
Conditions:                                                                                      
  Type          Status                                                                 
  PodScheduled  False                                                                                   
Volumes:
  dumps:
    Type:       HostPath (bare host directory volume)
    Path:       /public/dumps/
  home:
    Type:       HostPath (bare host directory volume)
    Path:       /data/project/
  wmcs-project:
    Type:       HostPath (bare host directory volume)
    Path:       /etc/wmcs-project
  nfs:
    Type:       HostPath (bare host directory volume)
    Path:       /mnt/nfs/
  scratch:
    Type:       HostPath (bare host directory volume)
    Path:       /data/scratch/
  etcldap-conf-bzn58:
    Type:       HostPath (bare host directory volume)
    Path:       /etc/ldap.conf
  etcldap-yaml-xaarl:
    Type:       HostPath (bare host directory volume)
    Path:       /etc/ldap.yaml
  etcnovaobserver-yaml-syao6:
    Type:       HostPath (bare host directory volume)
    Path:       /etc/novaobserver.yaml
  varrunnslcdsocket-dhv68:
    Type:       HostPath (bare host directory volume)
    Path:       /var/run/nslcd/socket
QoS Class:      Burstable
Tolerations:    <none>
No events.

According to @Chicocvenancio on IRC, there should probably be some illuminating events at the bottom, but I lack the permissions to see them. webservice restart had no effect; deleting the pod brought up another one in the same situation (pending forever).
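The symptom described here (status Pending, PodScheduled False) can also be picked out of machine-readable output. The following is only a sketch: it assumes the pod list was obtained with something like `kubectl get pods -o json`, the function name is hypothetical, and the sample data is hand-written for illustration (the second pod name is invented).

```python
import json


def unscheduled_pending_pods(pod_list):
    """Return names of pods that are Pending and have PodScheduled == False."""
    names = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        for cond in status.get("conditions", []):
            if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
                names.append(pod["metadata"]["name"])
                break
    return names


# Hand-written sample shaped like a Kubernetes pod list; not real cluster output.
sample = json.loads("""
{
  "items": [
    {
      "metadata": {"name": "quickcategories-654583560-xqip5"},
      "status": {
        "phase": "Pending",
        "conditions": [{"type": "PodScheduled", "status": "False"}]
      }
    },
    {
      "metadata": {"name": "example-running-pod"},
      "status": {
        "phase": "Running",
        "conditions": [{"type": "PodScheduled", "status": "True"}]
      }
    }
  ]
}
""")

print(unscheduled_pending_pods(sample))
```

A pod that stays in this list after being deleted and recreated, as happened here, points at the scheduler or the worker nodes rather than at the tool itself.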

A custom Kubernetes deployment in the same tool, quickcategories.background-runner, appears to be functional as far as I can tell, though it doesn’t have a whole lot to do if the web frontend that starts background runs isn’t running.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 14 2019, 12:41 PM
Fnielsen added a comment. · Edited · Apr 14 2019, 1:17 PM

Ordia's uwsgi.log looks basically the same as that of QuickCategories. And after a restart I get the same message.

LucasWerkmeister triaged this task as High priority. · Edited · Apr 14 2019, 3:03 PM

Prioritizing as “high” since we have users depending on these tools (and an unknown number of other tools could also be affected).

Edit: I realize this kinda contradicts my above remark that “the tool just isn’t overly popular yet”, but the last restart was only last night and I now have a user reporting that they need the tool :P

LucasWerkmeister added a subscriber: Magnus. · Edited · Apr 14 2019, 3:29 PM

@Magnus’ Reasonator is apparently also affected; its error.log was also last modified at 19:31.

tail -1 error.log
2019-04-13 19:31:48: (server.c.1558) server stopped by UID = 0 PID = 0

Is it a python3.4/python3.5 issue? The virtualenv I can construct is 3.5. Kubernetes starts Python 3.4.2.

That timing would put it right around the cloudvirt1015 reboot.

Is it a python3.4/python3.5 issue? The virtualenv I can construct is 3.5. Kubernetes starts Python 3.4.2.

As far as I know Reasonator is client-side only, so it probably runs on lighttpd (see also the server.c in the error.log). By the way, the version mismatch is why you should set up the virtualenv inside a webservice --backend=kubernetes python shell. (Though Python 3.5 will hopefully be available soon, see T219091.)
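The version mismatch mentioned here can be checked from inside the container. The sketch below assumes the usual POSIX virtualenv layout, where packages live under `lib/pythonX.Y/` in the environment directory; both function names are hypothetical.

```python
import os
import re
import sys


def venv_python_versions(venv_dir):
    """Return (major, minor) pairs for each lib/pythonX.Y directory in the venv."""
    lib_dir = os.path.join(venv_dir, "lib")
    versions = []
    if not os.path.isdir(lib_dir):
        return versions
    for entry in sorted(os.listdir(lib_dir)):
        match = re.fullmatch(r"python(\d+)\.(\d+)", entry)
        if match:
            versions.append((int(match.group(1)), int(match.group(2))))
    return versions


def venv_matches_interpreter(venv_dir, running=None):
    """Check whether the venv was built for the given (or current) interpreter."""
    if running is None:
        running = sys.version_info[:2]
    return tuple(running) in venv_python_versions(venv_dir)
```

Running `venv_matches_interpreter("www/python/venv")` with the container's own python3 would return False for a virtualenv built with Python 3.5 on a host outside the container, which is the situation described above.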

QuickCategories is working again:

uwsgi.log
*** Starting uWSGI 2.0.7-debian (64bit) on [Sun Apr 14 15:57:05 2019] ***
compiled with version: 4.9.2 on 17 March 2018 15:40:38
os: Linux-4.9.0-0.bpo.6-amd64 #1 SMP Debian 4.9.88-1+deb9u1~bpo8+1 (2018-05-13)
nodename: quickcategories-654583560-ko87f
…

Sounds like this is/was related to T220853, and possibly this SAL entry by @Andrew?

SAL (#wikimedia-cloud) [2019-04-14T16:23:11Z] <andrewbogott> moved all tools-worker nodes off of cloudvirt1015 and uncordoned them

Commands such as these seem to have fixed it:

webservice --backend=kubernetes python shell
cd www/python
python3 /usr/lib/python3/dist-packages/virtualenv.py --python=python3.4 venv3.4
ln -s venv3.4 venv
cd ordia ; pip install -r requirements.txt

LucasWerkmeister closed this task as Resolved. · Apr 14 2019, 4:56 PM

Reasonator is also back. Since that’s all the affected tools I’m aware of, let’s close the task.