Page MenuHomePhabricator

dplbot webservice on Toolforge repeatedly have its dynamicproxy entry removed (because qsub schedules tasks to webgrid queues, causing portreleaser to run as job epilogue)
Closed, InvalidPublic

Description

https://tools.wmflabs.org/dplbot/

The webservice for the project 'dplbot' on Tool Labs has been failing repeatedly for several days. If you attempt to access any of the project's pages, you get the "No Webservice" page. But if I log in to the project, "qstat" shows the webservice is running, and the error logs don't indicate any problem. "webservice status" reports "Your webservice is running." (Not true.) "webservice restart" works, and the webpages then become available, but this typically only lasts a few minutes and then they go down again.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

At the moment this seems to work, so I'm closing this. If it happens again, please re-open it (with the same priority).

valhallasw updated the task description. (Show Details)Oct 12 2015, 7:34 PM
valhallasw set Security to None.
yuvipanda closed this task as Resolved.Oct 16 2015, 4:55 PM
yuvipanda claimed this task.

Actually closing.

russblau reopened this task as Open.Jul 19 2016, 5:27 PM

Reopening; same symptoms are occurring again.

Restricted Application added subscribers: Luke081515, TerraCodes. · View Herald TranscriptJul 19 2016, 5:27 PM

It seems to be running now - did someone manually start it?

Works for me.

Yes, I had to manually restart it twice today. The automatic webservice restarter is not working.

I moved it to kuberenetes and also fixed the issue with the webservice restarter. Can you verify it works fine under kubernetes? (no changes required from your perspective, since it's still running lighttpd + php)

russblau added a comment.EditedJul 19 2016, 6:38 PM

It is down at the moment. "webservice status" says it is running, but "qstat" shows no server process running.

UPDATE: It did restart itself a few moments after I wrote the above. Still nothing visible under 'qstat'; is that to be expected?

I can see it running now? http://tools.wmflabs.org/dplbot/

When I switched it to kubernetes it'll no longer show up in qstat (use
'kubectl get pod' for equivalent).

yuvipanda lowered the priority of this task from Unbreak Now! to Medium.Jul 20 2016, 2:42 PM
Gorthian added a subscriber: Gorthian.EditedJul 20 2016, 10:23 PM

This has been failing over and over yesterday and today. It is intermittent. As a frequent user of dplbot, I have to say that the solution has not been found yet.

UPDATE: An hour later, and it is still down.

It is currently down again. Shell shows the following:

tools.dplbot@tools-bastion-03:~$ kubectl get pod
NAME                      READY     STATUS    RESTARTS   AGE
dplbot-1445756605-f0mpw   1/1       Running   0          1d
tools.dplbot@tools-bastion-03:~$ webservice status
Your webservice is running

I will wait a short time before restarting it manually.

Don't, am looking at it just now.

Hmm, I fixed it (required a restart of kube2proxy layer). I'll file a separate bug to investigate this just now.

It's still been unavailable intermittently through the day. It's down at the moment.

Seems to be https://phabricator.wikimedia.org/T140988 again, I restarted that and it's back up.

Is there some custom code that's attempting to autorestart this tool?

(am asking since no other tools seem to be suffering this right now, so need to figure out what makes this tool special)

I'm afraid I know nothing about the bot. The nominal maintainer, JaGa, hasn't been active on en.wiki for a month. RussBlau, listed as a maintainer, opened and re-opened this ticket. There's Dispenser, who requested to be a maintainer a few months back, so might be one now.

Should I keep posting notices when it fails, or is that just redundant? (Totally a newbie on the technical side)

Is there some custom code that's attempting to autorestart this tool?

Not that I am aware of.

yuvipanda renamed this task from Webservice on Tools Labs fails repeatedly to dplbot webservice on Tools Labs fails repeatedly.Jul 22 2016, 4:22 PM

This problem is continuing, intermittently, every day. I haven't been chiming in because it would get irritating, yet I want to keep this on the radar.

yuvipanda removed yuvipanda as the assignee of this task.Aug 6 2016, 9:56 PM

I've been checking it on and off for the last week or so and seems stable...

It's down right now. It has been intermittently down and back up since I last posted here.

strange, I just checked it and it's up...

I've had to restart it manually at least once a day for the past several days, although not today (so far).

It is down again and not restarting. It just went down at approx. 10:20 GMT today (Thursday).

scfc added a subscriber: scfc.Dec 4 2016, 9:29 PM

ATM, http://tools.wmflabs.org/dplbot/ returns 503 ("No webservice"). There is a pod running:

tools.dplbot@tools-bastion-03:~$ kubectl get pod
NAME                      READY     STATUS    RESTARTS   AGE
dplbot-1445756605-gamjq   1/1       Running   0          16d
tools.dplbot@tools-bastion-03:~$

On tools-proxy-01, Redis had no key for prefix:dplbot, but tools-proxy-02 had:

scfc@tools-proxy-01:~$ redis-cli 
127.0.0.1:6379> HGETALL prefix:dplbot
(empty list or set)
127.0.0.1:6379> 
scfc@tools-proxy-01:~$
127.0.0.1:6379> HGETALL prefix:dplbot
1) ".*"
2) "http://192.168.0.33:8000"
127.0.0.1:6379> 
scfc@tools-proxy-02:~$

lynx http://192.168.0.33:8000 shows the dplbot page.

scfc added a comment.Dec 4 2016, 9:33 PM

I have now restarted the webservice with webservice restart. Now Redis on tools-proxy-01 points to the new Kubernetes pod (http://192.168.0.26:8000), while tools-proxy-02 continues to point to the (now dysfunctional) http://192.168.0.33:8000.

scfc added a comment.Dec 4 2016, 9:38 PM

/etc/active-proxy is tools-proxy-01 on tools-bastion-03.

scfc closed this task as Resolved.Dec 4 2016, 9:52 PM
scfc claimed this task.

I've restarted the webservice yet again, and now the entry on tools-proxy-01 points to http://192.168.0.50:8000, with the entry on tools-proxy-02 still unchanged. So the Redis replication from tools-proxy-01 to tools-proxy-02 is broken, and I have filed T152356 for that.

However, I'm closing this task for the time being so that we have a base line when the webservice worked with which parameters. If it fails again, please reopen this task so that it can be investigated.

(I've fixed the replication between the proxies with
https://gerrit.wikimedia.org/r/#/c/325751/)

russblau reopened this task as Open.Dec 29 2019, 12:47 AM

The webservice for dplbot now rarely stays up for more than 10-15 minutes at a time. There are still no error messages whatsoever in the logs giving any clue as to the cause of the outages. Note that, as reported originally, the lighttpd process continues to run during these outages, but all http requests return an error 503. (I have tried using webservice start --backend=kubernetes, but this results in all requests throwing an error 502.) I have also reopened T218915, a related issue.

not familiar with the internals of our webservice script yet but is it possible it's due to the dplbot pod being in terminating status?

tools.dplbot@tools-sgebastion-07:~$ kubectl get pod
NAME                      READY     STATUS        RESTARTS   AGE
dplbot-3025215167-geq9d   0/1       Terminating   0          4h
interactive               0/1       Terminating   0          4h

I just tried to start a k8s webservice of my own, pod is stuck in Pending (age 1m) - suspect our k8s cluster is broken. Probably a separate task.

Found a problematic host in the k8s cluster which was hosting dplbot, cordoned it in T241523.
Running happily for 21m now:

tools.dplbot@tools-sgebastion-07:~$ kubectl get pod
NAME                      READY     STATUS    RESTARTS   AGE
dplbot-3025215167-fwi5u   1/1       Running   0          21m

I just looked again and this time the pod is running and has been for 10h, but I get 503 service unavailable. It's not registered on the proxy for some reason:

krenair@tools-proxy-05:~$ redis-cli
127.0.0.1:6379> hgetall prefix:dplbot
(empty list or set)
bd808 removed scfc as the assignee of this task.Jan 1 2020, 5:53 PM

I could really use some help trying to resolve this issue. The webservice is somewhat more stable now that I am running it with the Kubernetes backend instead of on the grid, but this just means that it fails every few hours instead of every few minutes, on average.

The symptom of failure is that all requests get rejected with 503 service unavailable. This generally happens almost immediately after submitting a request, so it doesn't seem to be a timeout problem (and I would expect a different error code if it were). The lighttpd process continues to run (which is why I am calling this a "failure" rather than a "crash"), which means kubernetes doesn't detect any problem and doesn't restart the server, but the server does not respond to any requests. Every time this happens, I have to restart the webservice manually; I haven't found any other way of doing it, since a cronjob apparently can't interact with kubernetes.

Review of the tool's logs is unenlightening, at least to me. I have tried both

debug.log-request-handling = "enable"

and

fastcgi.debug = 1

but haven't learned anything useful. As far as I can tell, all requests are handled correctly until they aren't, and there is nothing in the logs that leaps out to me as indicating any problem. With "debug.log-request-handling" enabled, the failure is preceded by a series of lines that looks like this (sorry for the length, but this is all handling a single request, except for the manual restart in the last three lines):

2019-12-28 22:13:27: (response.c.350) -- splitting Request-URI
2019-12-28 22:13:27: (response.c.351) Request-URI     :  /dplbot/dab_fix_list.php?title=African
2019-12-28 22:13:27: (response.c.352) URI-scheme      :  http
2019-12-28 22:13:27: (response.c.353) URI-authority   :  www.tools.wmflabs.org
2019-12-28 22:13:27: (response.c.354) URI-path (raw)  :  /dplbot/dab_fix_list.php
2019-12-28 22:13:27: (response.c.355) URI-path (clean):  /dplbot/dab_fix_list.php
2019-12-28 22:13:27: (response.c.356) URI-query       :  title=African
2019-12-28 22:13:27: (mod_access.c.148) -- mod_access_uri_handler called
2019-12-28 22:13:27: (response.c.490) -- before doc_root
2019-12-28 22:13:27: (response.c.491) Doc-Root     : /data/project/dplbot/public_html
2019-12-28 22:13:27: (response.c.492) Rel-Path     : /dplbot/dab_fix_list.php
2019-12-28 22:13:27: (response.c.493) Path         :
2019-12-28 22:13:27: (response.c.542) -- after doc_root
2019-12-28 22:13:27: (response.c.543) Doc-Root     : /data/project/dplbot/public_html
2019-12-28 22:13:27: (response.c.544) Rel-Path     : /dplbot/dab_fix_list.php
2019-12-28 22:13:27: (response.c.545) Path         : /data/project/dplbot/public_html/dplbot/dab_fix_list.php
2019-12-28 22:13:27: (response.c.562) -- logical -> physical
2019-12-28 22:13:27: (response.c.563) Doc-Root     : /data/project/dplbot/public_html
2019-12-28 22:13:27: (response.c.564) Basedir      : /data/project/dplbot/public_html/
2019-12-28 22:13:27: (response.c.565) Rel-Path     : /dplbot/dab_fix_list.php
2019-12-28 22:13:27: (response.c.566) Path         : /data/project/dplbot/public_html//dab_fix_list.php
2019-12-28 22:13:27: (response.c.583) -- handling physical path
2019-12-28 22:13:27: (response.c.584) Path         : /data/project/dplbot/public_html//dab_fix_list.php
2019-12-28 22:13:27: (response.c.591) -- file found
2019-12-28 22:13:27: (response.c.592) Path         : /data/project/dplbot/public_html//dab_fix_list.php
2019-12-28 22:13:27: (response.c.753) -- handling subrequest
2019-12-28 22:13:27: (response.c.754) Path         : /data/project/dplbot/public_html//dab_fix_list.php
2019-12-28 22:13:27: (mod_access.c.148) -- mod_access_uri_handler called
2019-12-28 22:13:27: (mod_compress.c.839) -- handling file as static file
2019-12-28 22:13:27: (mod_fastcgi.c.3500) handling it in mod_fastcgi
2019-12-29 01:24:56: (server.c.1751) [note] graceful shutdown started
2019-12-29 01:24:56: (server.c.1828) server stopped by UID = 51290 PID = 11187
2019-12-29 01:25:18: (log.c.164) server started

With "fastcgi.debug" on, I get log entries like the following (even longer):

2020-01-08 00:23:21: (response.c.339) -- splitting Request-URI
2020-01-08 00:23:21: (response.c.340) Request-URI     :  /dplbot/disambigs_in_an_article.php?title=American_Samurai
2020-01-08 00:23:21: (response.c.341) URI-scheme      :  http
2020-01-08 00:23:21: (response.c.342) URI-authority   :  www.tools.wmflabs.org
2020-01-08 00:23:21: (response.c.343) URI-path (raw)  :  /dplbot/disambigs_in_an_article.php
2020-01-08 00:23:21: (response.c.344) URI-path (clean):  /dplbot/disambigs_in_an_article.php
2020-01-08 00:23:21: (response.c.345) URI-query       :  title=American_Samurai
2020-01-08 00:23:21: (mod_access.c.135) -- mod_access_uri_handler called
2020-01-08 00:23:21: (response.c.473) -- before doc_root
2020-01-08 00:23:21: (response.c.474) Doc-Root     : /data/project/dplbot/public_html
2020-01-08 00:23:21: (response.c.475) Rel-Path     : /dplbot/disambigs_in_an_article.php
2020-01-08 00:23:21: (response.c.476) Path         :
2020-01-08 00:23:21: (response.c.524) -- after doc_root
2020-01-08 00:23:21: (response.c.525) Doc-Root     : /data/project/dplbot/public_html
2020-01-08 00:23:21: (response.c.526) Rel-Path     : /dplbot/disambigs_in_an_article.php
2020-01-08 00:23:21: (response.c.527) Path         : /data/project/dplbot/public_html/dplbot/disambigs_in_an_article.php
2020-01-08 00:23:21: (response.c.544) -- logical -> physical
2020-01-08 00:23:21: (response.c.545) Doc-Root     : /data/project/dplbot/public_html
2020-01-08 00:23:21: (response.c.546) Basedir      : /data/project/dplbot/public_html/
2020-01-08 00:23:21: (response.c.547) Rel-Path     : /dplbot/disambigs_in_an_article.php
2020-01-08 00:23:21: (response.c.548) Path         : /data/project/dplbot/public_html//disambigs_in_an_article.php
2020-01-08 00:23:21: (response.c.565) -- handling physical path
2020-01-08 00:23:21: (response.c.566) Path         : /data/project/dplbot/public_html//disambigs_in_an_article.php
2020-01-08 00:23:21: (response.c.573) -- file found
2020-01-08 00:23:21: (response.c.574) Path         : /data/project/dplbot/public_html//disambigs_in_an_article.php
2020-01-08 00:23:21: (response.c.725) -- handling subrequest
2020-01-08 00:23:21: (response.c.726) Path         : /data/project/dplbot/public_html//disambigs_in_an_article.php
2020-01-08 00:23:21: (mod_access.c.135) -- mod_access_uri_handler called
2020-01-08 00:23:21: (mod_compress.c.744) -- handling file as static file
2020-01-08 00:23:21: (mod_fastcgi.c.3700) handling it in mod_fastcgi
2020-01-08 00:23:21: (mod_fastcgi.c.3061) got proc: pid: 8 socket: unix:/var/run/lighttpd/php.socket.dplbot-1 load: 1
2020-01-08 00:23:21: (mod_fastcgi.c.2702) FastCGI-stderr: PHP Notice:  Undefined index: HTTP_REFERER in /data/project/dplbot/public_html/v2/disambigs_in_an_article.php on line 8
2020-01-08 00:23:21: (mod_fastcgi.c.2702) FastCGI-stderr: PHP Stack trace:
2020-01-08 00:23:21: (mod_fastcgi.c.2702) FastCGI-stderr: PHP   1. {main}() /data/project/dplbot/public_html/v2/disambigs_in_an_article.php:0
2020-01-08 00:23:21: (mod_fastcgi.c.1505) released proc: pid: 8 socket: unix:/var/run/lighttpd/php.socket.dplbot-1 load: 0
2020-01-08 00:35:17: (server.c.1558) server stopped by UID = 0 PID = 0
2020-01-08 00:35:22: (log.c.164) server started

From the "load: 0" at the end of the last line before the restart, I infer that the request had been handled and response sent, and the process released; am I missing something there? As far as I can tell, everything in the log indicates requests being handled normally. There are some PHP Notice and Warning messages, but nothing indicating a fatal error in an application. I also enabled access logging, and it appears that all requests that are being logged are being handled; for example, around the time of the previous error.log entries, access.log shows as follows:

192.168.207.0 www.tools.wmflabs.org - [08/Jan/2020:00:22:18 +0000] "GET /dplbot/disambigs_in_an_article.php?title=2019_World_Para_Athletics_Championships_%E2%80%93_Men%27s_long_jump HTTP/1.1" 200 1632 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
192.168.207.0 www.tools.wmflabs.org - [08/Jan/2020:00:23:21 +0000] "GET /dplbot/disambigs_in_an_article.php?title=American_Samurai HTTP/1.1" 200 1452 "-" "Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html)"
192.168.207.0 tools.wmflabs.org - [08/Jan/2020:00:35:54 +0000] "GET /dplbot/server-status HTTP/1.1" 200 4711 "-" "curl/7.52.1"
192.168.207.0 www.tools.wmflabs.org - [08/Jan/2020:00:36:22 +0000] "GET /dplbot/simple/templates_with_dab_links.php?limit=100&offset=0 HTTP/1.1" 200 11877 "-" "Mozilla/5.0 (compatible; SemrushBot/6~bl; +http://www.semrush.com/bot.html)"
192.168.207.0 tools.wmflabs.org - [08/Jan/2020:00:37:24 +0000] "GET /dplbot/templates_with_dab_links.php HTTP/1.1" 200 3950 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/79.0.3945.73 Mobile/15E148 Safari/604.1"

The request at 00:23:21 for /disambigs_in_an_article.php corresponds to the last request shown in the error.log before the restart, and does not seem to have encountered any errors. The next access.log entries are after the restart (and are almost certainly me checking the service after restarting it).

Is there any way to figure out what is going wrong inside lighttpd? Is there something I should be looking at that I have missed? This is seriously impairing my tool's usefulness to its users, and I would really like to come up with some solution. Any suggestions would be greatly appreciated.

This is happening.

06:31:03 0 ✓ zhuyifei1999@tools-sgebastion-08: ~$ sudo become dplbot kubectl get pods -o wide
NAME                      READY     STATUS    RESTARTS   AGE       IP              NODE
dplbot-3854638888-hgtf0   1/1       Running   0          2m        192.168.187.5   tools-worker-1039.tools.eqiad.wmflabs

The pod was restarted very recently, will try to figure out why.

The pod seems to be crashing every around 3 mins.

The lightttpd process is stopped by signal. SI_USER says it's userspace but with a PID of 0... which I believe is an indication that the sender of the signal is outside of the container.

epoll_wait(7, {}, 65537, 1000)          = 0
wait4(8, 0x7fff6aee7d54, WNOHANG, NULL) = 0
wait4(7, 0x7fff6aee7d54, WNOHANG, NULL) = 0
epoll_wait(7, {}, 65537, 1000)          = 0
wait4(8, 0x7fff6aee7d54, WNOHANG, NULL) = 0
wait4(7, 0x7fff6aee7d54, WNOHANG, NULL) = 0
epoll_wait(7, 7feb8006b010, 65537, 1000) = -1 EINTR (Interrupted system call)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=0, si_uid=0} ---
rt_sigreturn()                          = -1 EINTR (Interrupted system call)
ftruncate(3, 0)                         = 0
close(3)                                = 0
unlink("/var/run/lighttpd/dplbot.pid")  = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
write(5, "2020-01-27 06:42:03: (server.c.1828) server stopped by UID = 0 PID = 0 \n", 72) = 72
close(5)                                = 0
epoll_ctl(7, EPOLL_CTL_DEL, 4, 7fff6aee7d60) = 0
close(4)                                = 0
close(6)                                = 0
kill(8, SIGTERM)                        = 0
unlink("/var/run/lighttpd/php.socket.dplbot-1") = 0
kill(7, SIGTERM)                        = 0
unlink("/var/run/lighttpd/php.socket.dplbot-0") = 0
munmap(0x7feb7dfb3000, 2105360)         = 0
[...]
munmap(0x7feb7c349000, 2105360)         = 0
close(7)                                = 0
munmap(0x7feb8006b000, 790528)          = 0
munmap(0x7feb8012c000, 528384)          = 0
exit_group(0)                           = ?
+++ exited with 0 +++

Confirming signal sent from outside container causes si_pid=0:

06:53:13 0 ✓ zhuyifei1999@tools-worker-1037: ~$ sudo kill -CHLD 29720

causes:

wait4(7, 0x7ffebce0e0a4, WNOHANG, NULL) = 0
epoll_wait(7, 7fcf20594010, 65537, 1000) = -1 EINTR (Interrupted system call)
--- SIGCHLD {si_signo=SIGCHLD, si_code=SI_USER, si_pid=0, si_uid=0} ---
rt_sigreturn()                          = -1 EINTR (Interrupted system call)
wait4(8, 0x7ffebce0e0a4, WNOHANG, NULL) = 0

06:54:25 1 ✗ zhuyifei1999@tools-worker-1037: ~$ sudo nsenter -p -m -t 29720
root@tools-worker-1037:/# kill -CHLD 1

causes:

wait4(7, 0x7ffebce0e0a4, WNOHANG, NULL) = 0
epoll_wait(7, 7fcf20594010, 65537, 1000) = -1 EINTR (Interrupted system call)
--- SIGCHLD {si_signo=SIGCHLD, si_code=SI_USER, si_pid=44, si_uid=0} ---
rt_sigreturn()                          = -1 EINTR (Interrupted system call)
epoll_wait(7, {}, 65537, 1000)          = 0

Mentioned in SAL (#wikimedia-cloud) [2020-01-27T07:01:43Z] <zhuyifei1999_> apt installing bcc on tools-worker-1037 to see who is sending SIGTERM, will uninstall after done. dependency: bin86. T115231

Mentioned in SAL (#wikimedia-cloud) [2020-01-27T07:05:43Z] <zhuyifei1999_> wrong package. uninstalled. the correct one is bpfcc-tools and seems only available in buster+. T115231

Gorthian removed a subscriber: Gorthian.Jan 27 2020, 7:20 AM

Considering bcc (which I use to globally trace events on a host) is unavailable, I attched strace to two k8s threads, one of which is the main thread of docker-containerd-shim. What is interesting is that at the point of exit it was something like

futex(0x... <unfinished>
+++ exited with 0 +++

There was no signals so another thread must have ran exit_group(2) syscall.

Most threads are running futex except one, which runs on read(2) on fd6, which looks like a docker control file. I straced that specifically but no that did not do:

07:19:13 0 ✓ zhuyifei1999@tools-worker-1039: ~$ sudo strace -p 16421
Process 16421 attached
read(6,  <unfinished ...>
+++ exited with 0 +++

So then I traced every thread but still no, the child exited first:

07:20:52 0 ✓ zhuyifei1999@tools-worker-1039: ~$ sudo strace -fp 20189
Process 20189 attached with 11 threads
[pid 20216] futex(0xc8200ec508, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 20211] futex(0xc820027908, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 20196] futex(0xc8200ec108, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 20195] futex(0xc820096108, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 20197] read(6,  <unfinished ...>
[pid 20194] futex(0xc820027508, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 20193] futex(0xc820027108, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 20191] futex(0x87de80, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 20192] futex(0xc820026d08, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 20190] restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[pid 20189] futex(0x863e08, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 20190] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 20190] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 20190] futex(0x8635f0, FUTEX_WAIT, 0, {60, 0}) = -1 ETIMEDOUT (Connection timed out)
[pid 20190] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 20190] futex(0x8635f0, FUTEX_WAIT, 0, {60, 0} <unfinished ...>
[pid 20189] <... futex resumed> )       = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
[pid 20189] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=20205, si_uid=51290, si_status=0, si_utime=8, si_stime=8} ---
[pid 20189] futex(0x87de80, FUTEX_WAKE, 1 <unfinished ...>
[pid 20191] <... futex resumed> )       = 0
[pid 20189] <... futex resumed> )       = 1
[pid 20191] futex(0x8635f0, FUTEX_WAKE, 1) = 1
[pid 20190] <... futex resumed> )       = 0
[pid 20189] rt_sigreturn( <unfinished ...>
[pid 20191] futex(0xc820026d08, FUTEX_WAKE, 1 <unfinished ...>
[pid 20190] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 20192] <... futex resumed> )       = 0
[pid 20191] <... futex resumed> )       = 1
[pid 20192] select(0, NULL, NULL, NULL, {0, 100} <unfinished ...>
[pid 20191] wait4(-1,  <unfinished ...>
[pid 20189] <... rt_sigreturn resumed> ) = 202
[pid 20189] futex(0x863e08, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 20191] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, {ru_utime={0, 84000}, ru_stime={0, 84000}, ...}) = 20205
[pid 20192] <... select resumed> )      = 0 (Timeout)
[pid 20191] wait4(-1,  <unfinished ...>
[pid 20192] futex(0xc820027508, FUTEX_WAKE, 1 <unfinished ...>
[pid 20191] <... wait4 resumed> 0xc820045ac4, WNOHANG, 0xc820045b38) = -1 ECHILD (No child processes)
[pid 20194] <... futex resumed> )       = 0
[pid 20192] <... futex resumed> )       = 1
[pid 20194] futex(0xc820027508, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 20192] futex(0x87de80, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 20191] openat(AT_FDCWD, "exitStatus", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0666 <unfinished ...>
[pid 20190] <... select resumed> )      = 0 (Timeout)
[pid 20191] <... openat resumed> )      = 7
[pid 20190] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 20191] write(7, "0", 1)            = 1
[pid 20191] close(7 <unfinished ...>
[pid 20190] <... select resumed> )      = 0 (Timeout)
[pid 20191] <... close resumed> )       = 0
[pid 20190] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 20191] close(5)                    = 0
[pid 20191] close(8 <unfinished ...>
[pid 20190] <... select resumed> )      = 0 (Timeout)
[pid 20191] <... close resumed> )       = 0
[pid 20190] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 20191] write(4, "{\"level\": \"warn\",\"msg\": \"invalid"..., 43) = 43
[pid 20190] <... select resumed> )      = 0 (Timeout)
[pid 20190] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 20191] close(6)                    = 0
[pid 20191] exit_group(0)               = ?
[pid 20211] +++ exited with 0 +++
[pid 20216] +++ exited with 0 +++
[pid 20197] +++ exited with 0 +++
[pid 20196] +++ exited with 0 +++
[pid 20195] +++ exited with 0 +++
[pid 20194] +++ exited with 0 +++
[pid 20192] +++ exited with 0 +++
[pid 20191] +++ exited with 0 +++
[pid 20190] +++ exited with 0 +++
[pid 20193] +++ exited with 0 +++
+++ exited with 0 +++

By the way, @Bstorm, I remember T242397: Make webservice grid jobs "non-rerunable", considering that when grid reschedules a webservice the webservice registration breaks, does that also apply in the case of the old k8s? As in, if the old k8s reschedules a pod, would registration break?

The SIGTERM is being sent when I run ‘webservice restart’ on one of the login bastions, which I do only _after_ receiving a 503 response.

The SIGTERM is being sent when I run ‘webservice restart’ on one of the login bastions, which I do only _after_ receiving a 503 response.

You are doing the restarts manually once every 3-10 minutes? I was not finding and script in the cron or something that does obvious checking for 503.

Might be relevant:

zhuyifei1999@tools-proxy-05:~$ sudo journalctl -u kube2proxy | grep dpl -B 1 -A 1
Jan 27 10:20:48 tools-proxy-05 kube2proxy[15077]: 2020-01-27 10:20:48,865 Service dimastbkbot/dimastbkbot/.* is ADDED
Jan 27 10:20:48 tools-proxy-05 kube2proxy[15077]: 2020-01-27 10:20:48,866 Service dplbot/dplbot/.* is ADDED
Jan 27 10:20:48 tools-proxy-05 kube2proxy[15077]: 2020-01-27 10:20:48,866 Service dtz/dtz/.* is ADDED
--
Jan 27 11:03:52 tools-proxy-05 kube2proxy[15077]: 2020-01-27 11:03:52,095 Service dimastbkbot/dimastbkbot/.* is ADDED
Jan 27 11:03:52 tools-proxy-05 kube2proxy[15077]: 2020-01-27 11:03:52,095 Service dplbot/dplbot/.* is ADDED
Jan 27 11:03:52 tools-proxy-05 kube2proxy[15077]: 2020-01-27 11:03:52,096 Service dtz/dtz/.* is ADDED
--
Jan 27 11:43:14 tools-proxy-05 kube2proxy[15077]: 2020-01-27 11:43:14,129 Service dimastbkbot/dimastbkbot/.* is ADDED
Jan 27 11:43:14 tools-proxy-05 kube2proxy[15077]: 2020-01-27 11:43:14,129 Service dplbot/dplbot/.* is ADDED
Jan 27 11:43:14 tools-proxy-05 kube2proxy[15077]: 2020-01-27 11:43:14,130 Service dtz/dtz/.* is ADDED
--
Jan 27 12:30:53 tools-proxy-05 kube2proxy[15077]: 2020-01-27 12:30:53,958 Service dimastbkbot/dimastbkbot/.* is ADDED
Jan 27 12:30:53 tools-proxy-05 kube2proxy[15077]: 2020-01-27 12:30:53,958 Service dplbot/dplbot/.* is ADDED
Jan 27 12:30:53 tools-proxy-05 kube2proxy[15077]: 2020-01-27 12:30:53,959 Service dtz/dtz/.* is ADDED
--
Jan 27 13:04:07 tools-proxy-05 kube2proxy[15077]: 2020-01-27 13:04:07,297 Service dimastbkbot/dimastbkbot/.* is ADDED
Jan 27 13:04:07 tools-proxy-05 kube2proxy[15077]: 2020-01-27 13:04:07,297 Service dplbot/dplbot/.* is ADDED
Jan 27 13:04:07 tools-proxy-05 kube2proxy[15077]: 2020-01-27 13:04:07,298 Service dtz/dtz/.* is ADDED
--
Jan 27 13:04:07 tools-proxy-05 kube2proxy[15077]: 2020-01-27 13:04:07,543 Service zoomproof/zoomproof/.* is ADDED
Jan 27 13:32:03 tools-proxy-05 kube2proxy[15077]: 2020-01-27 13:32:03,716 Service dplbot/dplbot/.* is DELETED
Jan 27 13:32:27 tools-proxy-05 kube2proxy[15077]: 2020-01-27 13:32:27,477 Service dplbot/dplbot/.* is ADDED
Jan 27 13:37:46 tools-proxy-05 kube2proxy[15077]: 2020-01-27 13:37:46,347 Service abbreviso/abbreviso/.* is ADDED
--
Jan 27 13:37:46 tools-proxy-05 kube2proxy[15077]: 2020-01-27 13:37:46,411 Service dimastbkbot/dimastbkbot/.* is ADDED
Jan 27 13:37:46 tools-proxy-05 kube2proxy[15077]: 2020-01-27 13:37:46,411 Service dplbot/dplbot/.* is ADDED
Jan 27 13:37:46 tools-proxy-05 kube2proxy[15077]: 2020-01-27 13:37:46,413 Service dtz/dtz/.* is ADDED
--
Jan 27 14:21:21 tools-proxy-05 kube2proxy[15077]: 2020-01-27 14:21:21,720 Service dimastbkbot/dimastbkbot/.* is ADDED
Jan 27 14:21:21 tools-proxy-05 kube2proxy[15077]: 2020-01-27 14:21:21,720 Service dplbot/dplbot/.* is ADDED
Jan 27 14:21:21 tools-proxy-05 kube2proxy[15077]: 2020-01-27 14:21:21,720 Service dtz/dtz/.* is ADDED
--
Jan 27 14:55:25 tools-proxy-05 kube2proxy[15077]: 2020-01-27 14:55:25,150 Service dimastbkbot/dimastbkbot/.* is ADDED
Jan 27 14:55:25 tools-proxy-05 kube2proxy[15077]: 2020-01-27 14:55:25,151 Service dplbot/dplbot/.* is ADDED
Jan 27 14:55:25 tools-proxy-05 kube2proxy[15077]: 2020-01-27 14:55:25,151 Service dtz/dtz/.* is ADDED

The timestamp for most ADDs has nothing to do with the restarts, considering other tools with close names are being added a lot in the same milliseconds.

Ok lets say in the current situation all SIGTERMs are sent by you.

2019-12-29 01:24:56: (server.c.1751) [note] graceful shutdown started
2019-12-29 01:24:56: (server.c.1828) server stopped by UID = 51290 PID = 11187
2019-12-29 01:25:18: (log.c.164) server started
2020-01-08 00:35:17: (server.c.1558) server stopped by UID = 0 PID = 0
2020-01-08 00:35:22: (log.c.164) server started

What about these?

Also, I cannot debug this if the service is manually restarted over and over. Any hints of what is going on would either be flushed away by the restarts or just be a needle in a haystack of restarts.

Ok lets say in the current situation all SIGTERMs are sent by you.

2019-12-29 01:24:56: (server.c.1751) [note] graceful shutdown started
2019-12-29 01:24:56: (server.c.1828) server stopped by UID = 51290 PID = 11187
2019-12-29 01:25:18: (log.c.164) server started
2020-01-08 00:35:17: (server.c.1558) server stopped by UID = 0 PID = 0
2020-01-08 00:35:22: (log.c.164) server started

What about these?

Well, UID 51290 is tools.dplbot; I don't know why that particular restart got logged differently than the others.

Also, I cannot debug this if the service is manually restarted over and over. Any hints of what is going on would either be flushed away by the restarts or just be a needle in a haystack of restarts.

OK, I will not restart the service manually pending further notice here.

ok, at the moment https://tools.wmflabs.org/dplbot/ is a 503. restarting.

Mentioned in SAL (#wikimedia-cloud) [2020-01-27T17:46:09Z] <zhuyifei1999_> restarted webservice via webservice --backend kubernetes php7.3 stop webservice --backend kubernetes php7.3 start T115231

zhuyifei1999 added a comment.EditedJan 27 2020, 5:54 PM

Running locally: while true; do if [[ $(curl -s -o /dev/null -I -w "%{http_code}" https://tools.wmflabs.org/dplbot/) != '200' ]]; then date; alert 'dplbot dplbot dplbot'; fi; sleep 60; done. Waiting for it to fail.

For future reference, in case I come back and lose this information:

tools.dplbot@tools-sgebastion-08:~$ kubectl get pods -o wide
NAME                      READY     STATUS    RESTARTS   AGE       IP              NODE
dplbot-3840810906-0qkte   1/1       Running   0          20m       192.168.143.5   tools-worker-1037.tools.eqiad.wmflabs
tools.dplbot@tools-sgebastion-08:~$ kubectl describe pod dplbot-3840810906-0qkte
Name:		dplbot-3840810906-0qkte
Namespace:	dplbot
Node:		tools-worker-1037.tools.eqiad.wmflabs/172.16.1.151
Start Time:	Mon, 27 Jan 2020 17:45:15 +0000
Labels:		name=dplbot
		pod-template-hash=3840810906
		toolforge=tool
		tools.wmflabs.org/webservice=true
		tools.wmflabs.org/webservice-version=1
Status:		Running
IP:		192.168.143.5
Controllers:	ReplicaSet/dplbot-3840810906
Containers:
  webservice:
    Container ID:	docker://0862980f14e2012b27defa7cfb540d08cb74c30ce9a51b56c3bc612e724b107c
    Image:		docker-registry.tools.wmflabs.org/toolforge-php73-web:latest
    Image ID:		docker://sha256:8fd33c8f43a03cbf53aa1faeb173f203e9be6202b682e7a70518f1d02d4f1d4b
    Port:		8000/TCP
    Command:
      /usr/bin/webservice-runner
      --type
      lighttpd
      --port
      8000
    Limits:
      cpu:	2
      memory:	2Gi
    Requests:
      cpu:		125m
      memory:		256Mi
    State:		Running
      Started:		Mon, 27 Jan 2020 17:45:18 +0000
    Ready:		True
    Restart Count:	0
    Volume Mounts:
      /data/project/ from home (rw)
      /data/scratch/ from scratch (rw)
      /etc/ldap.conf from etcldap-conf-mo4k0 (rw)
      /etc/ldap.yaml from etcldap-yaml-fkjk0 (rw)
      /etc/novaobserver.yaml from etcnovaobserver-yaml-2lvtl (rw)
      /etc/wmcs-project from wmcs-project (rw)
      /mnt/nfs/ from nfs (rw)
      /public/dumps/ from dumps (rw)
      /var/run/nslcd/socket from varrunnslcdsocket-qgx1l (rw)
    Environment Variables:
      HOME:	/data/project/dplbot/
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	True 
  PodScheduled 	True 
Volumes:
  dumps:
    Type:	HostPath (bare host directory volume)
    Path:	/public/dumps/
  home:
    Type:	HostPath (bare host directory volume)
    Path:	/data/project/
  wmcs-project:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/wmcs-project
  nfs:
    Type:	HostPath (bare host directory volume)
    Path:	/mnt/nfs/
  scratch:
    Type:	HostPath (bare host directory volume)
    Path:	/data/scratch/
  etcldap-conf-mo4k0:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/ldap.conf
  etcldap-yaml-fkjk0:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/ldap.yaml
  etcnovaobserver-yaml-2lvtl:
    Type:	HostPath (bare host directory volume)
    Path:	/etc/novaobserver.yaml
  varrunnslcdsocket-qgx1l:
    Type:	HostPath (bare host directory volume)
    Path:	/var/run/nslcd/socket
QoS Class:	Burstable
Tolerations:	<none>
No events.tools.dplbot@tools-sgebastion-08:~$ kubectl get services -o wide
NAME      CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE       SELECTOR
dplbot    192.168.3.150   <none>        8000/TCP   21m       name=dplbot
tools.dplbot@tools-sgebastion-08:~$ kubectl describe service dplbot
Name:			dplbot
Namespace:		dplbot
Labels:			name=dplbot
			toolforge=tool
			tools.wmflabs.org/webservice=true
			tools.wmflabs.org/webservice-version=1
Selector:		name=dplbot
Type:			ClusterIP
IP:			192.168.3.150
Port:			http	8000/TCP
Endpoints:		<none>
Session Affinity:	None

Has failed.

tools.dplbot@tools-sgebastion-08:~$ kubectl get pods -o wide
NAME                      READY     STATUS    RESTARTS   AGE       IP              NODE
dplbot-3840810906-0qkte   1/1       Running   0          32m       192.168.143.5   tools-worker-1037.tools.eqiad.wmflabs

Same pod.

tools.dplbot@tools-sgebastion-08:~$ tail error.log
[...]
2020-01-27 17:44:58: (server.c.2059) server stopped by UID = 0 PID = 0 
2020-01-27 17:45:18: (server.c.1464) server started (lighttpd/1.4.53)

No restarts.

zhuyifei1999@tools-proxy-05:~$ sudo journalctl | grep dpl -B 1 -A 1
Jan 27 17:11:52 tools-proxy-05 kube2proxy[15077]: 2020-01-27 17:11:52,158 Service dimastbkbot/dimastbkbot/.* is ADDED
Jan 27 17:11:52 tools-proxy-05 kube2proxy[15077]: 2020-01-27 17:11:52,158 Service dplbot/dplbot/.* is ADDED
Jan 27 17:11:52 tools-proxy-05 kube2proxy[15077]: 2020-01-27 17:11:52,159 Service dtz/dtz/.* is ADDED
--
Jan 27 17:44:56 tools-proxy-05 flanneld[3635]: I0127 17:44:56.364831 03635 vxlan.go:340] Ignoring not a miss: aa:79:16:d0:50:24, 192.168.139.21
Jan 27 17:44:58 tools-proxy-05 kube2proxy[15077]: 2020-01-27 17:44:58,107 Service dplbot/dplbot/.* is DELETED
Jan 27 17:44:58 tools-proxy-05 flanneld[3635]: I0127 17:44:58.412814 03635 vxlan.go:340] Ignoring not a miss: 8a:59:d0:00:15:6d, 192.168.243.13
--
Jan 27 17:45:14 tools-proxy-05 flanneld[3635]: I0127 17:45:14.540811 03635 vxlan.go:340] Ignoring not a miss: aa:6a:3b:d6:1b:17, 192.168.216.11
Jan 27 17:45:15 tools-proxy-05 kube2proxy[15077]: 2020-01-27 17:45:15,352 Service dplbot/dplbot/.* is ADDED
Jan 27 17:45:15 tools-proxy-05 flanneld[3635]: I0127 17:45:15.847443 03635 vxlan.go:345] L3 miss: 192.168.218.17
--
Jan 27 17:54:31 tools-proxy-05 kube2proxy[15077]: 2020-01-27 17:54:31,472 Service dimastbkbot/dimastbkbot/.* is ADDED
Jan 27 17:54:31 tools-proxy-05 kube2proxy[15077]: 2020-01-27 17:54:31,473 Service dplbot/dplbot/.* is ADDED
Jan 27 17:54:31 tools-proxy-05 kube2proxy[15077]: 2020-01-27 17:54:31,473 Service dtz/dtz/.* is ADDED

kube2proxy not involved.

zhuyifei1999 added a comment.EditedJan 27 2020, 6:33 PM

Following https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Dynamicproxy#debug

zhuyifei1999@tools-proxy-05:~$ grep /var/lib/redis/tools-proxy-05-6379.aof -e 'dplbot' -C 5 --text | tail -n 100
--
http://192.168.107.55:8000
*4
$4
HSET
$13
prefix:dplbot
$2
.*
$26
http://192.168.55.160:8000
*4
--
http://192.168.97.25:8000
*3
$4
HDEL
$13
prefix:dplbot
$2
.*
*4
$4
HSET
--
http://192.168.107.55:8000
*4
$4
HSET
$13
prefix:dplbot
$2
.*
$26
http://192.168.55.160:8000
*4
--
http://192.168.97.25:8000
*3
$4
HDEL
$13
prefix:dplbot
$2
.*
*3
$4
HDEL
--
$4
SREM
$12
k8s_services
$16
dplbot/dplbot/.*
*4
$4
HSET
$13
prefix:dplbot
$2
.*
$25
http://192.168.3.150:8000
*3
$4
SADD
$12
k8s_services
$16
dplbot/dplbot/.*
*4
$4
HSET
$16
prefix:abbreviso
--
http://192.168.107.55:8000
*4
$4
HSET
$13
prefix:dplbot
$2
.*
$25
http://192.168.3.150:8000
*4
--
http://172.16.1.76:40771
*3
$4
HDEL
$13
prefix:dplbot
$2
.*
*3
$4
HDEL

Yes, HDEL was issued to redis. Seems under normal case it would also get a SREM on k8s webservice shutdown but not in this case.

zhuyifei1999@tools-proxy-05:~$ sudo tail -fn 100 /var/log/redis/tcp_6379.log 
[...]
467:M 27 Jan 2020 17:46:55.044 * 1 changes in 900 seconds. Saving...
467:M 27 Jan 2020 17:46:55.045 * Background saving started by pid 25390
25390:C 27 Jan 2020 17:46:55.051 * DB saved on disk
25390:C 27 Jan 2020 17:46:55.054 * RDB: 0 MB of memory used by copy-on-write
467:M 27 Jan 2020 17:46:55.145 * Background saving terminated with success
467:M 27 Jan 2020 17:54:31.454 * 10 changes in 300 seconds. Saving...
467:M 27 Jan 2020 17:54:31.454 * Background saving started by pid 31039
31039:C 27 Jan 2020 17:54:31.458 * DB saved on disk
31039:C 27 Jan 2020 17:54:31.459 * RDB: 0 MB of memory used by copy-on-write
467:M 27 Jan 2020 17:54:31.554 * Background saving terminated with success
467:M 27 Jan 2020 17:59:32.053 * 10 changes in 300 seconds. Saving...
467:M 27 Jan 2020 17:59:32.054 * Background saving started by pid 1852
1852:C 27 Jan 2020 17:59:32.057 * DB saved on disk
1852:C 27 Jan 2020 17:59:32.058 * RDB: 0 MB of memory used by copy-on-write
467:M 27 Jan 2020 17:59:32.155 * Background saving terminated with success
467:M 27 Jan 2020 18:14:33.091 * 1 changes in 900 seconds. Saving...
467:M 27 Jan 2020 18:14:33.092 * Background saving started by pid 11448
11448:C 27 Jan 2020 18:14:33.096 * DB saved on disk
11448:C 27 Jan 2020 18:14:33.097 * RDB: 0 MB of memory used by copy-on-write
467:M 27 Jan 2020 18:14:33.192 * Background saving terminated with success
467:M 27 Jan 2020 18:29:34.092 * 1 changes in 900 seconds. Saving...
467:M 27 Jan 2020 18:29:34.093 * Background saving started by pid 21837
21837:C 27 Jan 2020 18:29:34.096 * DB saved on disk
21837:C 27 Jan 2020 18:29:34.097 * RDB: 0 MB of memory used by copy-on-write
467:M 27 Jan 2020 18:29:34.193 * Background saving terminated with success

The changes might be isolated. The tools-proxy-05-6379.aof data around the HDEL are:

1HSET
2$30
3prefix:video-cut-tool-back-end
4$2
5.*
6$25
7http://172.16.7.131:45961
8*3
9$4
10HDEL
11$21
12prefix:dewikinews-rss
13$2
14.*
15*3
16$4
17HDEL
18$22
19prefix:random-featured
20$2
21.*
22*4
23$4
24HSET
25$21
26prefix:dewikinews-rss
27$2
28.*
29$24
30http://172.16.1.76:40771
31*3
32$4
33HDEL
34$13
35prefix:dplbot
36$2
37.*
38*3
39$4
40HDEL
41$13
42prefix:jembot
43$2
44.*
45*4
46$4
47HSET
48$13
49prefix:jembot
50$2
51.*
52$24
53http://172.16.1.68:43261
54*4
55$4
56HSET
57$16
58prefix:abbreviso
59$2
60.*
61$26
62http://192.168.42.204:8000
63*4
64$4
65HSET
66$22
67prefix:account-creator
68$2
69.*
70$26
71http://192.168.75.112:8000
72*4
73$4
74HSET
75$14
76prefix:ace2018
77$2
78.*
79$26
80http://192.168.60.149:8000
81*4
82$4
83HSET
84$14
85prefix:actrial
86$2
87.*
88$26
89http://192.168.76.190:8000
90*4
91$4
92HSET
93$11
94prefix:adas
95$2
96.*
97$25
98http://192.168.83.61:8000
99*4
100$4
101HSET
102$15
103prefix:addshore
104$2
105.*
106$25
107http://192.168.67.58:8000

After out HDEL, there was a grid webservice restart, then a flood of k8s that are probably around 18:29, which as a side effect made https://tools.wmflabs.org/dplbot/ accessible again. The HDEL was between 18:13 and 18:16.

Failed same time this hour. Between 19:14 and 19:15. I did a strace on redis, grepping accept and HDEL. This was a new connection:

[pid   467] accept(6, {sa_family=AF_INET, sin_port=htons(42550), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 9
[pid   467] accept(6, 0x7fffc9f02200, [128]) = -1 EAGAIN (Resource temporarily unavailable)
[pid   467] read(9, "*3\r\n$4\r\nHDEL\r\n$13\r\nprefix:dplbot\r\n$2\r\n.*\r\n", 16384) = 42
[pid   467] write(20, "*3\r\n$4\r\nHDEL\r\n$13\r\nprefix:dplbot\r\n$2\r\n.*\r\n", 42) = 42
[pid   467] write(16, "*3\r\n$4\r\nHDEL\r\n$13\r\nprefix:dplbot\r\n$2\r\n.*\r\n", 42 <unfinished ...>
[pid   467] accept(6, {sa_family=AF_INET, sin_port=htons(42554), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 9
zhuyifei1999 added a comment.EditedJan 27 2020, 7:25 PM

I lsof-ed redis and there were 18 connections with nginx, 1 replication to tools-proxy-06, 1 kube2proxy, and 1 unaccounted for (doesn't have the other end):

redis-ser   467        redis   42u  IPv4 1976019609      0t0  TCP localhost:6379->localhost:58830 (ESTABLISHED)

Considering that the fd was 42 and the next highest was 33, this is probably some unrelated stale connection.

The connect-hdel-close behavior is consistent with proxylistener, and indeed:

zhuyifei1999@tools-proxy-05:~$ cat /var/log/proxylistener | grep dplbot | tail
2020-01-27 17:24:14,784 Cleaning up request from dplbot for .*
2020-01-27 17:24:14,786 Removed redis key prefix:dplbot with key .*
2020-01-27 18:15:08,738 Cleaning up request from dplbot for .*
2020-01-27 18:15:08,739 Removed redis key prefix:dplbot with key .*
2020-01-27 18:24:14,081 Cleaning up request from dplbot for .*
2020-01-27 18:24:14,082 Removed redis key prefix:dplbot with key .*
2020-01-27 19:14:58,216 Cleaning up request from dplbot for .*
2020-01-27 19:14:58,218 Removed redis key prefix:dplbot with key .*
2020-01-27 19:24:14,049 Cleaning up request from dplbot for .*
2020-01-27 19:24:14,049 Removed redis key prefix:dplbot with key .*

What is this?

proxylistener is a quiet service. stracing with sudo strace -e '!select' -fp 313

07:49:55 0 ✓ zhuyifei1999@tools-sgebastion-08: ~$ grep unregister ~tools.dplbot/v2/logs/sdab_stderr.txt -B 10 -A 10
[...]
Traceback (most recent call last):
  File "/usr/local/bin/portreleaser", line 8, in <module>
    portgrabber.unregister()
  File "/usr/local/lib/python2.7/dist-packages/portgrabber.py", line 46, in unregister
    sock.connect((proxy, 8282))
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 113] No route to host
[...]

huh

zhuyifei1999 closed this task as Invalid.Jan 27 2020, 8:39 PM
tools.dplbot@tools-sgebastion-08:~$ while true; do date; qstat; sleep 30; done
[...]
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
 229985 0.25014 late_dab2_ tools.dplbot r     01/27/2020 18:36:12 task@tools-sgeexec-0911.tools.     1        
 234003 0.25000 sdab_procs tools.dplbot r     01/27/2020 20:14:12 webgrid-generic@tools-sgewebgr     1
tools.dplbot@tools-sgebastion-08:~$ crontab -l
[...]
 24  *   *   *   *    qsub [...]
 14  *   *   *   *    qsub [...]
[...]
08:26:22 0 ✓ zhuyifei1999@tools-sgebastion-08: /var/lib/gridengine$ grep port etc/queues/webgrid-generic 
epilog                /usr/local/bin/portreleaser

We should consider adding something like this to docs:

If you use qsub to submit a job, you must specify a queue or else complex interactions between different systems can mess you over.

zhuyifei1999 renamed this task from dplbot webservice on Tools Labs fails repeatedly to dplbot webservice on Toolforge repeatedly have its dynamicproxy removed (because qsub schedules tasks to webgrid queues, causing portreleaser to run as job epilogue).Jan 27 2020, 11:07 PM
zhuyifei1999 renamed this task from dplbot webservice on Toolforge repeatedly have its dynamicproxy removed (because qsub schedules tasks to webgrid queues, causing portreleaser to run as job epilogue) to dplbot webservice on Toolforge repeatedly have its dynamicproxy entry removed (because qsub schedules tasks to webgrid queues, causing portreleaser to run as job epilogue).
zhuyifei1999 removed a project: Cloud-Services.