
Webservice failing, but not quite
Closed, Declined · Public

Description

The webservice for my "wikidata-todo" tool, which is a container for several smaller tools, keeps falling into a "forever loading" coma. It does not serve the page, but does not fail/restart either.

This behaviour is new, probably since last night. I have already restarted it three times today, because people keep telling me it's not working.
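
For reference, the restarts were just the standard Toolforge commands run from the bastion, roughly like this (a sketch; exact flags may differ from what I actually typed):

# On the bastion: switch to the tool account, then bounce the webservice
become wikidata-todo
webservice restart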

As a guess, it might be related to the general Toolforge file system slowness I have been experiencing across tools today.

Event Timeline

Hm, I haven't done anything yet and it's loading for me at the moment: https://tools.wmflabs.org/wikidata-todo/

tools.wikidata-todo@tools-bastion-03:~$ kubectl get pods
NAME                             READY     STATUS    RESTARTS   AGE
wikidata-todo-1911024282-ulx15   1/1       Running   0          7h
tools.wikidata-todo@tools-bastion-03:~$ kubectl describe pod wikidata-todo-1911024282-ulx15
Name:		wikidata-todo-1911024282-ulx15
Namespace:	wikidata-todo
Node:		tools-worker-1016.tools.eqiad.wmflabs/10.68.21.253
Start Time:	Wed, 15 Nov 2017 08:42:42 +0000
Labels:		name=wikidata-todo
		pod-template-hash=1911024282
		tools.wmflabs.org/webservice=true
		tools.wmflabs.org/webservice-version=1
Status:		Running
IP:		192.168.178.9
Controllers:	ReplicaSet/wikidata-todo-1911024282
Containers:
  webservice:
    Container ID:	docker://ba97df2eed67c628457da4eb2d2aef0f2fe170331987e32e43b18f0209b0884e
    Image:		docker-registry.tools.wmflabs.org/toollabs-php-web:latest
    Image ID:		docker://sha256:2dddcb2d8c0a794fde6217e0638391ee9c5e47dcc49951d6f793e1c30828a6cf
    Port:		8000/TCP
    Command:
      /usr/bin/webservice-runner
      --type
      lighttpd
      --port
      8000
    Limits:
      cpu:	2
      memory:	2Gi
    Requests:
      cpu:		125m
      memory:		256Mi
    State:		Running
      Started:		Wed, 15 Nov 2017 08:42:44 +0000
    Ready:		True
    Restart Count:	0
    Volume Mounts:
      /data/project/ from home (rw)
      /data/scratch/ from scratch (rw)
      /etc/ldap.conf from etcldap-conf-ja9md (rw)
      /etc/ldap.yaml from etcldap-yaml-cfmch (rw)
      /etc/novaobserver.yaml from etcnovaobserver-yaml-1dz39 (rw)
      /public/dumps/ from dumps (rw)
      /var/run/nslcd/socket from varrunnslcdsocket-bve39 (rw)
    Environment Variables:
      HOME:	/data/project/wikidata-todo/
Conditions:
  Type		Status
  Initialized 	True
  Ready 	True
  PodScheduled 	True
Volumes:
  dumps:
    Type:	HostPath (bare host directory volume)
    Path:	/public/dumps/
  home:
    Type:	HostPath (bare host directory volume)
    Path:	/data/project/
  scratch:
    Type:	HostPath (bare host directory volume)
    Path:	/data/scratch/
  etcldap-conf-ja9md:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/ldap.conf
  etcldap-yaml-cfmch:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/ldap.yaml
  etcnovaobserver-yaml-1dz39:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/novaobserver.yaml
  varrunnslcdsocket-bve39:
    Type:	HostPath (bare host directory volume)
    Path:	/var/run/nslcd/socket
QoS Class:	Burstable
Tolerations:	<none>
No events.
tools.wikidata-todo@tools-bastion-03:~$

That worker isn't overloaded or anything; I'm going to poke a bit and see if I can turn up anything. The error log seems to be full of:

2017-11-15 15:57:37: (mod_fastcgi.c.2702) FastCGI-stderr: PHP Notice: Undefined property: stdClass::$missing in /data/project/wikidata-todo/public_html/missing_wp_animal_audio.php on line 58
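
For the record, the "poking" was nothing fancier than roughly this (a sketch; the error log path is assumed to be the tool's error.log in its project directory):

# Spot-check load on the worker node and skim the tool's error log
ssh tools-worker-1016.tools.eqiad.wmflabs uptime
tail -n 100 /data/project/wikidata-todo/error.log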

It came back after the bastion reboot. Maybe slow bastion => slow filesystem => slow tool/webservice?
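
A rough way to sanity-check the filesystem theory from the bastion would be something like this (just a sketch; the probe file name is made up):

# Time a directory listing and a small synced write on the shared project directory
time ls -la /data/project/wikidata-todo/ > /dev/null
time dd if=/dev/zero of=/data/project/wikidata-todo/nfs_probe bs=1M count=10 conv=fsync
rm /data/project/wikidata-todo/nfs_probe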

And it's stuck again...

This is like the toolserver, where I had to manually restart things every few hours :-(

@chasemp The "missing animal audio" thing should be fixed now.

And now it's back. Something weird is going on. Is there a load record for that tool's webservice?
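
Even a current snapshot would already help, assuming metrics are enabled on the cluster (a sketch; pod and node names taken from the output above):

# Current resource usage of the pod, and how loaded its worker node is
kubectl top pod wikidata-todo-1911024282-ulx15
kubectl describe node tools-worker-1016.tools.eqiad.wmflabs | grep -A 5 'Allocated resources'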

bd808 subscribed.

Closing for inactivity. The entire Kubernetes cluster has been rebuilt and many, many changes have been made to webservice since the last activity on this task. Please reopen if there is a recent reproduction case for this issue.