tools-k8s: Nagf down after rolling-update
Closed, Resolved · Public

Assigned To

Authored By

	Krinkle
	Mar 16 2016, 6:22 PM

Description

[17:25 UTC] krinkle at tools-k8s-bastion-01.tools.eqiad.wmflabs in ~
$ sudo -iu tools.nagf
[18:06 UTC] tools.nagf at tools-k8s-bastion-01.tools.eqiad.wmflabs in ~
$ kubectl get pods
NAME         READY     STATUS    RESTARTS   AGE
nagf-wu6jm   1/1       Running   0          4d
[18:06 UTC] tools.nagf at tools-k8s-bastion-01.tools.eqiad.wmflabs in ~
$ kubectl rolling-update nagf --image=yuvipanda/nagf
Created nagf-6899d476598983c219c3beb256c8445f
Scaling up nagf-6899d476598983c219c3beb256c8445f from 0 to 1, scaling down nagf from 1 to 0 (keep 1 pods available, don't exceed 2 pods)
Scaling nagf-6899d476598983c219c3beb256c8445f up to 1
Scaling nagf down to 0
Update succeeded. Deleting old controller: nagf
Renaming nagf-6899d476598983c219c3beb256c8445f to nagf
replicationcontroller "nagf" rolling updated

[18:23 UTC] tools.nagf at tools-k8s-bastion-01.tools.eqiad.wmflabs in ~
$ kubectl get pods
NAME                                          READY     STATUS    RESTARTS   AGE
nagf-6899d476598983c219c3beb256c8445f-hr235   1/1       Running   0          5m

About halfway through the above, Nginx at https://tools.wmflabs.org/nagf/ started responding with HTTP 500 Internal Server Error, and it has not recovered since.

Event Timeline

Krinkle created this task. · Mar 16 2016, 6:22 PM

Restricted Application added a subscriber: Aklapper. · Mar 16 2016, 6:22 PM

Krinkle triaged this task as Unbreak Now! priority. · Mar 16 2016, 6:22 PM

Krinkle updated the task description.

valhallasw@tools-proxy-01:~$ redis-cli hgetall prefix:nagf
1) ".*"
2) "http://192.168.0.224:8080"
valhallasw@tools-proxy-01:~$ curl http://192.168.0.224:8080
<!DOCTYPE html><title>Error - Nagf</title><pre>Exception: Unable to write to cache file
 in /data/project/nagf/inc/WebCache.php:57

#0 /data/project/nagf/inc/Graphite.php(14): WebCache::get('wikitech-v1-pro...', 'https://wikitec...')
#1 /data/project/nagf/inc/NagfView.php(14): Graphite::getProjects()
#2 /data/project/nagf/inc/NagfView.php(227): NagfView-&gt;getProjectMenu()
#3 /data/project/nagf/public_html/index.php(13): NagfView-&gt;output()

So this seems to be a backend issue in Nagf rather than in k8s or the web proxy.
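
The "Unable to write to cache file" exception suggests checking, from inside the container, whether the web process can actually create files in the cache directory. A minimal sketch of such a probe follows; the function name `cache_dir_writable` is hypothetical, and the cache path has to be supplied by the caller because the task does not say where WebCache writes.

```shell
#!/bin/sh
# Hypothetical probe: succeed only if the current user can create a file in DIR.
cache_dir_writable() {
    dir="$1"
    probe="$dir/.write-probe.$$"
    # Attempt the write in a subshell so a failed redirection cannot
    # abort the calling shell; discard the error message.
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    return 1
}

# Usage (replace the argument with the real cache directory):
# cache_dir_writable /path/to/cache || echo "cache directory is not writable" >&2
```

The probe has to run as the same user as the web process, since a root shell would usually succeed even where the application fails.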

Thanks. I pushed a bad Docker image: it was built with a non-empty cache directory from my dev machine, which apparently becomes inaccessible to the Docker guest once it is spawned on the other side.

Confirmed by shelling into the Docker container and manually removing the cache files.

I cleared my local cache, pushed the updated image, and did another rolling update. It's all okay now.

The cache re-created at run-time within the instance itself is accessible and working as expected.
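
One way to avoid repeating this is to refuse to build the image while the local cache directory still has files in it. The following is only a sketch: the function name `cache_dir_is_clean` and the `./cache` path in the usage line are assumptions, not something taken from the Nagf repository.

```shell
#!/bin/sh
# Hypothetical pre-build guard: succeed only if the cache directory is
# absent or empty, so stale cache files never end up in the image.
cache_dir_is_clean() {
    dir="$1"
    # A missing directory is fine: the cache is re-created at run time.
    [ -d "$dir" ] || return 0
    # Any entry, including dotfiles, means the directory is not clean.
    [ -z "$(ls -A "$dir")" ]
}

# Usage (the ./cache path is an assumption):
# cache_dir_is_clean ./cache || { echo "cache not empty, aborting build" >&2; exit 1; }
```

Excluding the cache directory from the build context would address the same problem at the Docker level; the guard above merely makes the mistake visible before pushing.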

Phabricator_maintenance removed a subscriber: yuvipanda. · Jun 7 2017, 6:43 PM

Restricted Application added subscribers: Jay8g, TerraCodes. · Jun 7 2017, 6:43 PM

Nintendofan885 edited projects, added Nagf; removed Cloud-Services. · Jul 30 2020, 3:56 PM

Krinkle moved this task from Inbox to Confirmed Problem on the Nagf board. · Aug 14 2020, 2:00 PM
