Apparent issues in Toolforge Kubernetes
Closed, Invalid (Public)

Description

Issues this morning seemed unrelated to images and webservice (though they originally appeared to be). They were instead caused by random filesystem errors in toolsbeta and possibly totally different errors in tools itself.

original desc:

In trying to switch to the new cluster, folks are noticing that some containers won't come up on either cluster. They all seem to be buster containers, but the issue is more likely in the webservice tooling, where several changes have been made recently.

Event Timeline

Bstorm triaged this task as Unbreak Now! priority. Jan 13 2020, 4:13 PM
Bstorm created this task.

In my current test case, I'm using the python37 image:
image: docker-registry.tools.wmflabs.org/toolforge-python37-web:latest

It's sitting there waiting to "become ready" via its readiness probe.
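The full pod description below was presumably captured with something along these lines (the pod name and the test namespace are taken from the output itself):

# Describe the stuck pod to see its state, mounts, and conditions
kubectl describe pod test-603267139-qykpz --namespace=test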

Name:		test-603267139-qykpz
Namespace:	test
Node:		toolsbeta-worker-1001.toolsbeta.eqiad.wmflabs/172.16.6.179
Start Time:	Mon, 13 Jan 2020 16:13:03 +0000
Labels:		name=test
		pod-template-hash=603267139
		toolforge=tool
		tools.wmflabs.org/webservice=true
		tools.wmflabs.org/webservice-version=1
Status:		Pending
IP:
Controllers:	ReplicaSet/test-603267139
Containers:
  webservice:
    Container ID:
    Image:		docker-registry.tools.wmflabs.org/toolforge-python37-web:latest
    Image ID:
    Port:		8000/TCP
    Command:
      /usr/bin/webservice-runner
      --type
      uwsgi-python
      --port
      8000
    Limits:
      cpu:	2
      memory:	2Gi
    Requests:
      cpu:		125m
      memory:		256Mi
    State:		Waiting
      Reason:		ContainerCreating
    Ready:		False
    Restart Count:	0
    Volume Mounts:
      /data/project/ from home (rw)
      /data/scratch/ from scratch (rw)
      /etc/ldap.conf from etcldap-conf-4flw6 (rw)
      /etc/ldap.yaml from etcldap-yaml-w0gxm (rw)
      /etc/novaobserver.yaml from etcnovaobserver-yaml-jdvz8 (rw)
      /etc/wmcs-project from wmcs-project (rw)
      /mnt/nfs/ from nfs (rw)
      /public/dumps/ from dumps (rw)
      /var/run/nslcd/socket from varrunnslcdsocket-kojy1 (rw)
    Environment Variables:
      HOME:	/data/project/test/
Conditions:
  Type		Status
  Initialized 	True
  Ready 	False
  PodScheduled 	True
Volumes:
  dumps:
    Type:	HostPath (bare host directory volume)
    Path:	/public/dumps/
  home:
    Type:	HostPath (bare host directory volume)
    Path:	/data/project/
  wmcs-project:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/wmcs-project
  nfs:
    Type:	HostPath (bare host directory volume)
    Path:	/mnt/nfs/
  scratch:
    Type:	HostPath (bare host directory volume)
    Path:	/data/scratch/
  etcldap-conf-4flw6:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/ldap.conf
  etcldap-yaml-w0gxm:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/ldap.yaml
  etcnovaobserver-yaml-jdvz8:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/novaobserver.yaml
  varrunnslcdsocket-kojy1:
    Type:	HostPath (bare host directory volume)
    Path:	/var/run/nslcd/socket
QoS Class:	Burstable
Tolerations:	<none>
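
To dig into why the container is stuck in ContainerCreating, the namespace events are usually the next place to look; a minimal sketch, assuming the same test namespace:

# List recent events for the namespace, newest last
kubectl get events --namespace=test --sort-by='.lastTimestamp'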

Note: I'm troubleshooting on the old cluster to eliminate variables right now, but I may switch to the new one because I can get more information from it.

Fun thing: on the old cluster, the pods aren't fully deleting; they end up stuck in "Terminating". Note that I'm testing this in toolsbeta, so this has nothing to do with the particular clusters (they are just set up similarly). It's the container images.
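
A sketch of how the stuck pods can be listed and, if necessary, force-removed (the namespace and pod name are the ones from the test case above; the force flags skip graceful shutdown, so use them sparingly):

# Find pods stuck in Terminating
kubectl get pods --namespace=test | grep Terminating
# Force-delete a pod that will not finish terminating
kubectl delete pod test-603267139-qykpz --namespace=test --grace-period=0 --force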

Some of what I'm seeing appears to be related to a bad filesystem on the worker node. Well, that really doesn't help, and it doesn't relate to Toolforge in any way.
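
A rough sketch of how a bad filesystem can be confirmed on the worker itself (run on the node, e.g. toolsbeta-worker-1001; the grep pattern is just a guess at the usual symptoms):

# Look for I/O or ext4 errors in the kernel log
sudo dmesg -T | grep -iE 'i/o error|ext4|read-only'
# Check whether anything has been remounted read-only
mount | grep ' ro,'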

OK, now that I've rebooted that node, it's fine on the old cluster. The image may be fine.

Got it running on the new cluster as well. The image is ok.

Bstorm renamed this task from "Issue with some buster docker images in Toolforge Kubernetes" to "Apparent issues in Toolforge Kubernetes". Jan 13 2020, 4:51 PM
Bstorm lowered the priority of this task from Unbreak Now! to High.
Bstorm updated the task description. (Show Details)
Bstorm lowered the priority of this task from High to Medium. Jan 13 2020, 5:50 PM

I'm now convinced that what I've seen so far has nothing to do with the images or with the infrastructure. I'm very much hoping that this gets closed as invalid.

From what I can tell, the issues with glamtools are a failure to read a config, with consistent errors across PHP versions and clusters (which probably would only take effect on a restart).
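
A hedged sketch of how that kind of failure can be confirmed for a tool (the pod name, namespace, and log path are placeholders, not the actual glamtools values):

# Check the webservice container's logs for the config-read error
kubectl logs <pod-name> --namespace=<tool-namespace>
# Or tail the error log in the tool's home, if the webservice writes one
tail -n 50 /data/project/<tool>/error.log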

There may be additional issues, but T242559: Partialy setup tools-k8s-worker instances created by novaadmin causing problems would have caused many problems for people, and they would have presented as random (or nearly so), with between 4 and 9 of the 2020 Kubernetes cluster nodes in a state where $HOME would not have mounted into the pods.
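
A sketch of how that state could be spotted, assuming a running pod in the affected tool's namespace (names in angle brackets are placeholders):

# From inside a pod, check whether the NFS home directory actually mounted
kubectl exec <pod-name> --namespace=<tool-namespace> -- ls /data/project/
# On a suspect worker, list the NFS mounts that should be present
mount -t nfs,nfs4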

I think that in toolsbeta we had a filesystem issue on the only old-cluster node, and the partially set up nodes from T242559 were also causing issues. This ticket is basically not needed.