
tools-k8s: Nagf down after rolling-update
Closed, Resolved (Public)

Description

[17:25 UTC] krinkle at tools-k8s-bastion-01.tools.eqiad.wmflabs in ~
 $ sudo -iu tools.nagf
[18:06 UTC] tools.nagf at tools-k8s-bastion-01.tools.eqiad.wmflabs in ~
$ kubectl get pods
NAME         READY     STATUS    RESTARTS   AGE
nagf-wu6jm   1/1       Running   0          4d
[18:06 UTC] tools.nagf at tools-k8s-bastion-01.tools.eqiad.wmflabs in ~
$ kubectl rolling-update nagf --image=yuvipanda/nagf
Created nagf-6899d476598983c219c3beb256c8445f
Scaling up nagf-6899d476598983c219c3beb256c8445f from 0 to 1, scaling down nagf from 1 to 0 (keep 1 pods available, don't exceed 2 pods)
Scaling nagf-6899d476598983c219c3beb256c8445f up to 1
Scaling nagf down to 0
Update succeeded. Deleting old controller: nagf
Renaming nagf-6899d476598983c219c3beb256c8445f to nagf
replicationcontroller "nagf" rolling updated

[18:23 UTC] tools.nagf at tools-k8s-bastion-01.tools.eqiad.wmflabs in ~
$ kubectl get pods
NAME                                          READY     STATUS    RESTARTS   AGE
nagf-6899d476598983c219c3beb256c8445f-hr235   1/1       Running   0          5m

About halfway through the above, Nginx at https://tools.wmflabs.org/nagf/ started responding with HTTP 500 Internal Server Error, and it hasn't recovered since.

Event Timeline

Krinkle triaged this task as Unbreak Now! priority. Mar 16 2016, 6:22 PM
Krinkle updated the task description.
valhallasw@tools-proxy-01:~$ redis-cli hgetall prefix:nagf
1) ".*"
2) "http://192.168.0.224:8080"
valhallasw@tools-proxy-01:~$ curl http://192.168.0.224:8080
<!DOCTYPE html><title>Error - Nagf</title><pre>Exception: Unable to write to cache file
 in /data/project/nagf/inc/WebCache.php:57

#0 /data/project/nagf/inc/Graphite.php(14): WebCache::get('wikitech-v1-pro...', 'https://wikitec...')
#1 /data/project/nagf/inc/NagfView.php(14): Graphite::getProjects()
#2 /data/project/nagf/inc/NagfView.php(227): NagfView-&gt;getProjectMenu()
#3 /data/project/nagf/public_html/index.php(13): NagfView-&gt;output()

So this seems to be a backend issue in nagf rather than in k8s/webproxy.

Krinkle claimed this task.

Thanks. I pushed a bad Docker image (it had a non-empty cache directory on my dev machine, which apparently becomes inaccessible to the docker guest once spawned on the other side).

Confirmed by shelling into the Docker image and manually removing the cache files.

I cleared my local cache, pushed the updated image, and did another rolling update. It's all okay now.

The cache re-created at run-time within the instance itself is accessible and working as expected.
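
To avoid shipping a stale cache again, a guard along these lines could be run before building the image. This is only a minimal sketch and not something that was used for this task: the function name `check_cache_empty` and the `./cache` path in the usage note are hypothetical, since the actual cache location is not stated here.

```shell
#!/bin/sh
# Hypothetical pre-build guard (not part of nagf): fail if the local cache
# directory still holds files that would be baked into the image.
# Returns 0 when the directory is missing or empty, 1 otherwise.
check_cache_empty() {
    dir="$1"
    # A missing directory counts as clean.
    [ -d "$dir" ] || return 0
    # ls -A lists all entries, including dotfiles, except "." and "..".
    if [ -n "$(ls -A "$dir")" ]; then
        echo "cache directory '$dir' is not empty; clear it before building" >&2
        return 1
    fi
    return 0
}
```

It could then gate the build, for example `check_cache_empty ./cache && docker build -t yuvipanda/nagf .`, where `./cache` stands in for wherever the cache lives on the dev machine.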