
dplbot webservice on Tools Labs fails repeatedly
Closed, Resolved · Public

Description

https://tools.wmflabs.org/dplbot/

The webservice for the project 'dplbot' on Tool Labs has been failing repeatedly for several days. If you attempt to access any of the project's pages, you get the "No Webservice" page. But if I log in to the project, "qstat" shows the webservice is running, and the error logs don't indicate any problem. "webservice status" reports "Your webservice is running." (Not true.) "webservice restart" works, and the webpages then become available, but this typically only lasts a few minutes and then they go down again.

Event Timeline

russblau created this task. Oct 11 2015, 9:20 PM
russblau raised the priority of this task to Needs Triage.
russblau updated the task description.
russblau added a project: Toolforge.
russblau added a subscriber: russblau.
Restricted Application added a project: Cloud-Services. Oct 11 2015, 9:20 PM
Restricted Application added a subscriber: Aklapper.

This is possibly related to T115225? How long has this been happening?

I'm not entirely sure because I have been mostly offline myself for several days, but I noticed it definitely on Saturday 10 Oct.

valhallasw triaged this task as Unbreak Now! priority. Oct 12 2015, 7:31 PM
valhallasw added a subscriber: valhallasw.

At the moment this seems to work, so I'm closing this. If it happens again, please re-open it (with the same priority).

valhallasw updated the task description. Oct 12 2015, 7:34 PM
valhallasw set Security to None.
yuvipanda closed this task as Resolved. Oct 16 2015, 4:55 PM
yuvipanda claimed this task.

Actually closing.

russblau reopened this task as Open. Jul 19 2016, 5:27 PM

Reopening; same symptoms are occurring again.

Restricted Application added subscribers: Luke081515, TerraCodes. Jul 19 2016, 5:27 PM

It seems to be running now - did someone manually start it?

Works for me.

Yes, I had to manually restart it twice today. The automatic webservice restarter is not working.

I moved it to Kubernetes and also fixed the issue with the webservice restarter. Can you verify it works fine under Kubernetes? (No changes are required from your perspective, since it's still running lighttpd + php.)

russblau added a comment. Edited Jul 19 2016, 6:38 PM

It is down at the moment. "webservice status" says it is running, but "qstat" shows no server process running.

UPDATE: It did restart itself a few moments after I wrote the above. Still nothing visible under 'qstat'; is that to be expected?

I can see it running now? http://tools.wmflabs.org/dplbot/

Now that I've switched it to Kubernetes, it will no longer show up in qstat (use
'kubectl get pod' for the equivalent).

yuvipanda lowered the priority of this task from Unbreak Now! to Normal. Jul 20 2016, 2:42 PM
Gorthian added a subscriber: Gorthian. Edited Jul 20 2016, 10:23 PM

This has been failing over and over yesterday and today. It is intermittent. As a frequent user of dplbot, I have to say that the solution has not been found yet.

UPDATE: An hour later, and it is still down.

It is currently down again. Shell shows the following:

tools.dplbot@tools-bastion-03:~$ kubectl get pod
NAME                      READY     STATUS    RESTARTS   AGE
dplbot-1445756605-f0mpw   1/1       Running   0          1d
tools.dplbot@tools-bastion-03:~$ webservice status
Your webservice is running

I will wait a short time before restarting it manually.
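The mismatch reported here (a pod in the Running state while the page itself is down) can be sketched as a small check. This is only an illustrative sketch: `parse_pods` and `diagnose` are hypothetical helper names, not part of the `webservice` tooling, and whether the page is reachable is passed in as a plain boolean rather than measured.

```python
def parse_pods(text):
    """Parse the whitespace-separated table printed by `kubectl get pod`
    into a list of dicts keyed by the header row."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def diagnose(pods, page_reachable):
    """Classify the tool's state (hypothetical helper).

    pods           -- output of parse_pods()
    page_reachable -- True if the tool's URL served the page (assumed input)
    """
    running = any(p.get("STATUS") == "Running" for p in pods)
    if running and page_reachable:
        return "healthy"
    if running:
        return "pod running but not reachable through the proxy"
    return "no running pod"


# Table copied from the shell output above.
sample = """
NAME                      READY     STATUS    RESTARTS   AGE
dplbot-1445756605-f0mpw   1/1       Running   0          1d
"""

pods = parse_pods(sample)
print(diagnose(pods, page_reachable=False))
```

A result of "pod running but not reachable through the proxy" points away from the tool's own code and toward the routing layer in front of it.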

Don't, am looking at it just now.

Hmm, I fixed it (required a restart of kube2proxy layer). I'll file a separate bug to investigate this just now.

It's still been unavailable intermittently through the day. It's down at the moment.

Seems to be https://phabricator.wikimedia.org/T140988 again, I restarted that and it's back up.

Is there some custom code that's attempting to autorestart this tool?

(am asking since no other tools seem to be suffering this right now, so need to figure out what makes this tool special)

I'm afraid I know nothing about the bot. The nominal maintainer, JaGa, hasn't been active on en.wiki for a month. RussBlau, listed as a maintainer, opened and re-opened this ticket. There's Dispenser, who requested to be a maintainer a few months back, so might be one now.

Should I keep posting notices when it fails, or is that just redundant? (Totally a newbie on the technical side)

Is there some custom code that's attempting to autorestart this tool?

Not that I am aware of.

yuvipanda renamed this task from Webservice on Tools Labs fails repeatedly to dplbot webservice on Tools Labs fails repeatedly. Jul 22 2016, 4:22 PM

This problem is continuing, intermittently, every day. I haven't been chiming in because it would get irritating, yet I want to keep this on the radar.

yuvipanda removed yuvipanda as the assignee of this task. Aug 6 2016, 9:56 PM

I've been checking it on and off for the last week or so and seems stable...

It's down right now. It has been intermittently down and back up since I last posted here.

strange, I just checked it and it's up...

I've had to restart it manually at least once a day for the past several days, although not today (so far).

It is down again and not restarting. It just went down at approx. 10:20 GMT today (Thursday).

scfc added a subscriber: scfc. Dec 4 2016, 9:29 PM

ATM, http://tools.wmflabs.org/dplbot/ returns 503 ("No webservice"). There is a pod running:

tools.dplbot@tools-bastion-03:~$ kubectl get pod
NAME                      READY     STATUS    RESTARTS   AGE
dplbot-1445756605-gamjq   1/1       Running   0          16d
tools.dplbot@tools-bastion-03:~$

On tools-proxy-01, Redis had no key for prefix:dplbot, but tools-proxy-02 had:

scfc@tools-proxy-01:~$ redis-cli 
127.0.0.1:6379> HGETALL prefix:dplbot
(empty list or set)
127.0.0.1:6379> 
scfc@tools-proxy-01:~$
scfc@tools-proxy-02:~$ redis-cli
127.0.0.1:6379> HGETALL prefix:dplbot
1) ".*"
2) "http://192.168.0.33:8000"
127.0.0.1:6379> 
scfc@tools-proxy-02:~$

lynx http://192.168.0.33:8000 shows the dplbot page.
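The comparison between the two proxies can be sketched in Python. This is a minimal sketch that assumes the `HGETALL prefix:dplbot` result from each proxy has already been loaded into a dictionary; `diff_routes` is a hypothetical helper name, not an existing tool.

```python
def diff_routes(primary, replica):
    """Return {pattern: (primary_backend, replica_backend)} for every routing
    pattern whose backend differs between the two proxies.
    A pattern that is missing on one side is reported as None."""
    mismatches = {}
    for pattern in sorted(set(primary) | set(replica)):
        a = primary.get(pattern)
        b = replica.get(pattern)
        if a != b:
            mismatches[pattern] = (a, b)
    return mismatches


# Values taken from the redis-cli output above.
proxy_01 = {}  # HGETALL returned "(empty list or set)"
proxy_02 = {".*": "http://192.168.0.33:8000"}

print(diff_routes(proxy_01, proxy_02))
```

Any non-empty result means the two proxies disagree about where requests for the tool should go.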

scfc added a comment. Dec 4 2016, 9:33 PM

I have now restarted the webservice with webservice restart. Now Redis on tools-proxy-01 points to the new Kubernetes pod (http://192.168.0.26:8000), while tools-proxy-02 continues to point to the (now dysfunctional) http://192.168.0.33:8000.

scfc added a comment. Dec 4 2016, 9:38 PM

/etc/active-proxy is tools-proxy-01 on tools-bastion-03.

scfc closed this task as Resolved. Dec 4 2016, 9:52 PM
scfc claimed this task.

I've restarted the webservice yet again, and now the entry on tools-proxy-01 points to http://192.168.0.50:8000, with the entry on tools-proxy-02 still unchanged. So the Redis replication from tools-proxy-01 to tools-proxy-02 is broken, and I have filed T152356 for that.

However, I'm closing this task for the time being so that we have a baseline of when the webservice worked and with which parameters. If it fails again, please reopen this task so that it can be investigated.

(I've fixed the replication between the proxies with
https://gerrit.wikimedia.org/r/#/c/325751/)