Page MenuHomePhabricator

restbase endpoints health checks timing out
Closed, ResolvedPublic

Description

starting today at ~6 UTC icinga started failing restbase endpoints health checks, specifically the swagger-based nrpe checks.

running strace it seems mobile-html endpoint can take an unusually long time and thus timeout

restbase1001:~$ strace -r -s 128 -f -e sendto,poll,recvfrom /usr/bin/time /usr/local/lib/nagios/plugins/service_checker -t 5 127.0.0.1 http://127.0.0.1:7231/en.wikipedia.org/v1 2>&1 
Process 39227 attached
Process 39228 attached
Process 39229 attached
[pid 39229]      0.000000 +++ exited with 0 +++
[pid 39228]      0.000046 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=39229, si_uid=4849, si_status=0, si_utime=0, si_stime=0} ---
[pid 39228]      0.000198 +++ exited with 0 +++
[pid 39227]      0.000012 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=39228, si_uid=4849, si_status=0, si_utime=0, si_stime=0} ---
[pid 39227]      0.025061 poll([{fd=3, events=POLLOUT}], 1, 5000) = 1 ([{fd=3, revents=POLLOUT}])
[pid 39227]      0.000105 poll([{fd=3, events=POLLOUT}], 1, 5000) = 1 ([{fd=3, revents=POLLOUT}])
[pid 39227]      0.000035 sendto(3, "GET /en.wikipedia.org/v1/?spec HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\n\r\n", 87, 0, NULL, 0) = 87
[pid 39227]      0.000143 poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])
[pid 39227]      0.004312 recvfrom(3, "HTTP/1.1 200 \r\ncache-control: private, max-age=0, s-maxage=0, must-revalidate\r\naccess-control-allow-origin: *\r\naccess-control-al"..., 8192, 0, NULL, NULL) = 8192
[pid 39227]      0.000501 poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])
[pid 39227]      0.000033 recvfrom(3, ",\"x-monitor\":true,\"x-amples\":[{\"title\":\"Get html by title from storage\",\"request\":{\"params\":{\"title\":\"Foobar\"}},\"response\":{\"sta"..., 33017, 0, NULL, NULL) = 33017
[pid 39227]      0.010091 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000105 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000046 sendto(4, "GET /en.wikipedia.org/v1/page/html/Foobar HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\n\r\n", 98, 0, NULL, 0) = 98
[pid 39227]      0.000132 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39227]      1.321712 recvfrom(4, "HTTP/1.1 200 \r\netag: \"687168867/36d80d22-79b9-11e5-89e1-16c0642eec28\"\r\ncontent-type: text/html;profile=\"mediawiki.org/specs/html"..., 8192, 0, NULL, NULL) = 8192
[pid 39227]      0.001004 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39227]      0.000073 recvfrom(4, "\" typeof=\"mw:Image/Thumb\" id=\"mwHQ\"><a href=\"./File:Smokeycover.jpg\" id=\"mwHg\"><img resource=\"./File:Smokeycover.jpg\" src=\"//upl"..., 36180, 0, NULL, NULL) = 36180
[pid 39227]      0.003087 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000136 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000076 sendto(4, "GET /en.wikipedia.org/v1/page/data-parsoid/Foobar HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\n\r\n", 106, 0, NULL, 0) = 106
[pid 39227]      0.000240 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39227]      0.054008 recvfrom(4, "HTTP/1.1 200 \r\netag: \"687168867/36d80d22-79b9-11e5-89e1-16c0642eec28\"\r\ncontent-type: application/json;profile=\"mediawiki.org/spe"..., 8192, 0, NULL, NULL) = 8192
[pid 39227]      0.001296 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39227]      0.000068 recvfrom(4, "\"cite_ref-16\":{\"dsr\":[6401,6542,5,6]},\"mwVw\":{\"stx\":\"simple\",\"a\":{\"href\":\"./Category:All_articles_with_specifically_marked_wease"..., 11983, 0, NULL, NULL) = 11983
[pid 39227]      0.011168 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000201 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000052 sendto(4, "POST /en.wikipedia.org/v1/transform/wikitext/to/html/Foobar HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 45\r\nHost: 127.0"..., 230, 0, NULL, 0) = 230
[pid 39227]      0.000221 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39227]      0.020208 recvfrom(4, "HTTP/1.1 200 \r\ncontent-type: text/html;profile=\"mediawiki.org/specs/html/1.1.0\";charset=utf-8\r\ncache-control: private, max-age=0"..., 8192, 0, NULL, NULL) = 882
[pid 39227]      0.003165 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000135 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000148 sendto(4, "GET /en.wikipedia.org/v1/page/title/Foobar HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\n\r\n", 99, 0, NULL, 0) = 99
[pid 39227]      0.000222 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39227]      0.004363 recvfrom(4, "HTTP/1.1 200 \r\netag: \"687168867/366722de-79b9-11e5-9058-99e99c77e71c\"\r\ncache-control: private, max-age=0, s-maxage=0, must-reval"..., 8192, 0, NULL, NULL) = 1085
[pid 39227]      0.003348 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000149 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000050 sendto(4, "GET /en.wikipedia.org/v1/page/revision/642497713 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\n\r\n", 105, 0, NULL, 0) = 105
[pid 39227]      0.000207 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39227]      0.010147 recvfrom(4, "HTTP/1.1 200 \r\netag: \"642497713/76df71b6-6365-11e5-8aa1-6636877c1130\"\r\ncache-control: private, max-age=0, s-maxage=0, must-reval"..., 8192, 0, NULL, NULL) = 1218
[pid 39227]      0.003272 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000129 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000052 sendto(4, "GET /en.wikipedia.org/v1/page/mobile-html/Main_Page HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\ncache-control: no-cach"..., 133, 0, NULL, 0) = 133
[pid 39227]      0.000163 poll([{fd=4, events=POLLIN}], 1, 5000) = 0 (Timeout)
[pid 39227]      5.006778 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000264 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000113 sendto(4, "GET /en.wikipedia.org/v1/page/mobile-html/Main_Page HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\ncache-control: no-cach"..., 133, 0, NULL, 0) = 133
[pid 39227]      0.000328 poll([{fd=4, events=POLLIN}], 1, 5000) = 0 (Timeout)
[pid 39227]      5.004579 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000100 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000036 sendto(4, "GET /en.wikipedia.org/v1/page/mobile-html/Main_Page HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\ncache-control: no-cach"..., 133, 0, NULL, 0) = 133
[pid 39227]      0.000117 poll([{fd=4, events=POLLIN}], 1, 5000) = 0 (Timeout)
[pid 39227]      5.002472 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000214 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000054 sendto(4, "GET /en.wikipedia.org/v1/page/mobile-html/Main_Page HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\ncache-control: no-cach"..., 133, 0, NULL, 0) = 133
[pid 39227]      0.000237 poll([{fd=4, events=POLLIN}], 1, 5000) = 0 (Timeout)
[pid 39227]      5.001967 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000105 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000031 sendto(4, "GET /en.wikipedia.org/v1/page/mobile-html/Main_Page HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\n\r\n", 108, 0, NULL, 0) = 108
[pid 39227]      0.000156 poll([{fd=4, events=POLLIN}], 1, 5000) = 0 (Timeout)
[pid 39227]      5.005620 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000127 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000033 sendto(4, "GET /en.wikipedia.org/v1/page/mobile-html/Main_Page HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\n\r\n", 108, 0, NULL, 0) = 108
[pid 39227]      0.000171 poll([{fd=4, events=POLLIN}], 1, 5000) = 0 (Timeout)
[pid 39227]      5.004101 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000222 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000079 sendto(4, "GET /en.wikipedia.org/v1/page/mobile-html/Main_Page HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\n\r\n", 108, 0, NULL, 0) = 108
[pid 39227]      0.000252 poll([{fd=4, events=POLLIN}], 1, 5000) = 0 (Timeout)
[pid 39227]      5.006296 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000124 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000032 sendto(4, "GET /en.wikipedia.org/v1/page/mobile-html/Main_Page HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\n\r\n", 108, 0, NULL, 0) = 108
[pid 39227]      0.000149 poll([{fd=4, events=POLLIN}], 1, 5000) = 0 (Timeout)
[pid 39227]      5.005361 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000152 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000035 sendto(4, "GET /en.wikipedia.org/v1/page/graph/png/User%3APchelolo%2FGraph/670213569/1533aaad45c733dcc7e07614b54cbae4119a6747.png HTTP/1.1\r"..., 175, 0, NULL, 0) = 175
[pid 39227]      0.000148 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39227]      0.148107 recvfrom(4, "HTTP/1.1 200 \r\naccess-control-allow-origin: *\r\naccess-control-allow-headers: accept, content-type\r\nx-xss-protection: 1; mode=blo"..., 8192, 0, NULL, NULL) = 5797
[pid 39227]      0.002900 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000103 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000026 sendto(4, "POST /en.wikipedia.org/v1/media/math/svg HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 21\r\nHost: 127.0.0.1\r\nContent-Type:"..., 187, 0, NULL, 0) = 187
[pid 39227]      0.000130 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39227]      0.143450 recvfrom(4, "HTTP/1.1 200 \r\ncontent-type: image/svg+xml\r\naccess-control-allow-origin: *\r\naccess-control-allow-headers: accept, content-type\r\n"..., 8192, 0, NULL, NULL) = 4589
[pid 39227]      0.002822 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000117 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39227]      0.000064 sendto(4, "POST /en.wikipedia.org/v1/media/math/mml HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 21\r\nHost: 127.0.0.1\r\nContent-Type:"..., 187, 0, NULL, 0) = 187
[pid 39227]      0.000191 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39227]      0.037974 recvfrom(4, "HTTP/1.1 200 \r\ncontent-type: application/mathml+xml\r\naccess-control-allow-origin: *\r\naccess-control-allow-headers: accept, conte"..., 8192, 0, NULL, NULL) = 1153
All endpoints are healthy
[pid 39227]      0.010761 +++ exited with 0 +++
     0.000119 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=39227, si_uid=4849, si_status=0, si_utime=8, si_stime=5} ---
0.09user 0.05system 0:41.95elapsed 0%CPU (0avgtext+0avgdata 16688maxresident)k
0inputs+0outputs (0major+4560minor)pagefaults 0swaps
     0.002286 +++ exited with 0 +++

other times it'd take ~2.5s or so, sometimes enough to trigger nrpe timeout (10s)

restbase1001:~$ strace -r -s 128 -f -e sendto,poll,recvfrom /usr/bin/time /usr/local/lib/nagios/plugins/service_checker -t 5 127.0.0.1 http://127.0.0.1:7231/en.wikipedia.org/v1 2>&1 
Process 39490 attached
Process 39491 attached
Process 39492 attached
[pid 39492]      0.000000 +++ exited with 0 +++
[pid 39491]      0.000068 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=39492, si_uid=4849, si_status=0, si_utime=0, si_stime=0} ---
[pid 39491]      0.000445 +++ exited with 0 +++
[pid 39490]      0.000052 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=39491, si_uid=4849, si_status=0, si_utime=0, si_stime=0} ---
[pid 39490]      0.042721 poll([{fd=3, events=POLLOUT}], 1, 5000) = 1 ([{fd=3, revents=POLLOUT}])
[pid 39490]      0.000226 poll([{fd=3, events=POLLOUT}], 1, 5000) = 1 ([{fd=3, revents=POLLOUT}])
[pid 39490]      0.000073 sendto(3, "GET /en.wikipedia.org/v1/?spec HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\n\r\n", 87, 0, NULL, 0) = 87
[pid 39490]      0.000266 poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])
[pid 39490]      0.008098 recvfrom(3, "HTTP/1.1 200 \r\ncache-control: private, max-age=0, s-maxage=0, must-revalidate\r\naccess-control-allow-origin: *\r\naccess-control-al"..., 8192, 0, NULL, NULL) = 8192
[pid 39490]      0.000944 poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])
[pid 39490]      0.000072 recvfrom(3, ",\"x-monitor\":true,\"x-amples\":[{\"title\":\"Get html by title from storage\",\"request\":{\"params\":{\"title\":\"Foobar\"}},\"response\":{\"sta"..., 33017, 0, NULL, NULL) = 33017
[pid 39490]      0.018113 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000198 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000087 sendto(4, "GET /en.wikipedia.org/v1/page/html/Foobar HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\n\r\n", 98, 0, NULL, 0) = 98
[pid 39490]      0.000266 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.014717 recvfrom(4, "HTTP/1.1 200 \r\netag: \"687168867/36d80d22-79b9-11e5-89e1-16c0642eec28\"\r\ncontent-type: text/html;profile=\"mediawiki.org/specs/html"..., 8192, 0, NULL, NULL) = 8192
[pid 39490]      0.001141 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.000097 recvfrom(4, "\" typeof=\"mw:Image/Thumb\" id=\"mwHQ\"><a href=\"./File:Smokeycover.jpg\" id=\"mwHg\"><img resource=\"./File:Smokeycover.jpg\" src=\"//upl"..., 36180, 0, NULL, NULL) = 36180
[pid 39490]      0.003758 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000192 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000079 sendto(4, "GET /en.wikipedia.org/v1/page/data-parsoid/Foobar HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\n\r\n", 106, 0, NULL, 0) = 106
[pid 39490]      0.000249 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.024616 recvfrom(4, "HTTP/1.1 200 \r\netag: \"687168867/36d80d22-79b9-11e5-89e1-16c0642eec28\"\r\ncontent-type: application/json;profile=\"mediawiki.org/spe"..., 8192, 0, NULL, NULL) = 8192
[pid 39490]      0.001046 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.000104 recvfrom(4, "\"cite_ref-16\":{\"dsr\":[6401,6542,5,6]},\"mwVw\":{\"stx\":\"simple\",\"a\":{\"href\":\"./Category:All_articles_with_specifically_marked_wease"..., 11983, 0, NULL, NULL) = 11983
[pid 39490]      0.009191 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000168 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000078 sendto(4, "POST /en.wikipedia.org/v1/transform/wikitext/to/html/Foobar HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 45\r\nHost: 127.0"..., 230, 0, NULL, 0) = 230
[pid 39490]      0.000255 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.024152 recvfrom(4, "HTTP/1.1 200 \r\ncontent-type: text/html;profile=\"mediawiki.org/specs/html/1.1.0\";charset=utf-8\r\ncache-control: private, max-age=0"..., 8192, 0, NULL, NULL) = 882
[pid 39490]      0.003879 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000179 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000074 sendto(4, "GET /en.wikipedia.org/v1/page/title/Foobar HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\n\r\n", 99, 0, NULL, 0) = 99
[pid 39490]      0.000259 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.004795 recvfrom(4, "HTTP/1.1 200 \r\netag: \"687168867/366722de-79b9-11e5-9058-99e99c77e71c\"\r\ncache-control: private, max-age=0, s-maxage=0, must-reval"..., 8192, 0, NULL, NULL) = 1085
[pid 39490]      0.004466 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000169 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000105 sendto(4, "GET /en.wikipedia.org/v1/page/revision/642497713 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\n\r\n", 105, 0, NULL, 0) = 105
[pid 39490]      0.000279 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.013579 recvfrom(4, "HTTP/1.1 200 \r\netag: \"642497713/76df71b6-6365-11e5-8aa1-6636877c1130\"\r\ncache-control: private, max-age=0, s-maxage=0, must-reval"..., 8192, 0, NULL, NULL) = 1218
[pid 39490]      0.003924 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000166 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000079 sendto(4, "GET /en.wikipedia.org/v1/page/mobile-html/Main_Page HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\ncache-control: no-cach"..., 133, 0, NULL, 0) = 133
[pid 39490]      0.000257 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      2.707647 recvfrom(4, "HTTP/1.1 200 \r\naccess-control-allow-origin: *\r\naccess-control-allow-headers: accept, content-type\r\nx-xss-protection: 1; mode=blo"..., 8192, 0, NULL, NULL) = 8192
[pid 39490]      0.001020 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.000063 recvfrom(4, "\"100\"></a></span>\n</div><span about=\"#mwt4\">\n</span><p about=\"#mwt4\"><b><a rel=\"mw:WikiLink\" href=\"./Oerip_Soemohardjo\" title=\"O"..., 66570, 0, NULL, NULL) = 66570
[pid 39490]      0.000127 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.000045 recvfrom(4, "\r\n", 2, 0, NULL, NULL) = 2
[pid 39490]      0.000046 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.000042 recvfrom(4, "0\r\n\r\n", 8192, 0, NULL, NULL) = 5
[pid 39490]      0.001530 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000093 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000036 sendto(4, "GET /en.wikipedia.org/v1/page/mobile-html/Main_Page HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: 127.0.0.1\r\n\r\n", 108, 0, NULL, 0) = 108
[pid 39490]      0.000107 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.525016 recvfrom(4, "HTTP/1.1 200 \r\naccess-control-allow-origin: *\r\naccess-control-allow-headers: accept, content-type\r\nx-xss-protection: 1; mode=blo"..., 8192, 0, NULL, NULL) = 8192
[pid 39490]      0.000729 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.000052 recvfrom(4, "\"100\"></a></span>\n</div><span about=\"#mwt4\">\n</span><p about=\"#mwt4\"><b><a rel=\"mw:WikiLink\" href=\"./Oerip_Soemohardjo\" title=\"O"..., 66570, 0, NULL, NULL) = 66570
[pid 39490]      0.000113 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.000033 recvfrom(4, "\r\n", 2, 0, NULL, NULL) = 2
[pid 39490]      0.000042 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.000048 recvfrom(4, "0\r\n\r\n", 8192, 0, NULL, NULL) = 5
[pid 39490]      0.001547 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000089 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000026 sendto(4, "GET /en.wikipedia.org/v1/page/graph/png/User%3APchelolo%2FGraph/670213569/1533aaad45c733dcc7e07614b54cbae4119a6747.png HTTP/1.1\r"..., 175, 0, NULL, 0) = 175
[pid 39490]      0.000142 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.118737 recvfrom(4, "HTTP/1.1 200 \r\naccess-control-allow-origin: *\r\naccess-control-allow-headers: accept, content-type\r\nx-xss-protection: 1; mode=blo"..., 8192, 0, NULL, NULL) = 5797
[pid 39490]      0.002595 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000121 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000049 sendto(4, "POST /en.wikipedia.org/v1/media/math/svg HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 21\r\nHost: 127.0.0.1\r\nContent-Type:"..., 187, 0, NULL, 0) = 187
[pid 39490]      0.000136 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.012898 recvfrom(4, "HTTP/1.1 200 \r\ncontent-type: image/svg+xml\r\naccess-control-allow-origin: *\r\naccess-control-allow-headers: accept, content-type\r\n"..., 8192, 0, NULL, NULL) = 4589
[pid 39490]      0.002872 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000121 poll([{fd=4, events=POLLOUT}], 1, 5000) = 1 ([{fd=4, revents=POLLOUT}])
[pid 39490]      0.000040 sendto(4, "POST /en.wikipedia.org/v1/media/math/mml HTTP/1.1\r\nAccept-Encoding: identity\r\nContent-Length: 21\r\nHost: 127.0.0.1\r\nContent-Type:"..., 187, 0, NULL, 0) = 187
[pid 39490]      0.000178 poll([{fd=4, events=POLLIN}], 1, 5000) = 1 ([{fd=4, revents=POLLIN}])
[pid 39490]      0.012142 recvfrom(4, "HTTP/1.1 200 \r\ncontent-type: application/mathml+xml\r\naccess-control-allow-origin: *\r\naccess-control-allow-headers: accept, conte"..., 8192, 0, NULL, NULL) = 1153
All endpoints are healthy
[pid 39490]      0.010277 +++ exited with 0 +++
     0.000087 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=39490, si_uid=4849, si_status=0, si_utime=14, si_stime=6} ---
0.15user 0.06system 0:03.74elapsed 5%CPU (0avgtext+0avgdata 16708maxresident)k
0inputs+0outputs (0major+4555minor)pagefaults 0swaps
     0.002051 +++ exited with 0 +++

there's also a bug in service_checker, in case of endpoint timeout it still exited 0 considering all endpoints healthy

Event Timeline

fgiunchedi raised the priority of this task from to Medium.
fgiunchedi updated the task description. (Show Details)
fgiunchedi subscribed.

judging from strace timing it seems mobileapps endpoint is slow, in the ~2s range and sometimes takes more than the service_checker timeout to reply (5s).
running ab against mobileapps shows a tail of 4s requests sometimes, not always reproducible when testing over one minute

$ ab -n 4096 -c 1 -t 60 mobileapps.svc.eqiad.wmnet:8888/en.wikipedia.org/v1/page/mobile-html/Main_Page
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking mobileapps.svc.eqiad.wmnet (be patient)
Finished 71 requests


Server Software:        
Server Hostname:        mobileapps.svc.eqiad.wmnet
Server Port:            8888

Document Path:          /en.wikipedia.org/v1/page/mobile-html/Main_Page
Document Length:        60950 bytes

Concurrency Level:      1
Time taken for tests:   60.343 seconds
Complete requests:      71
Failed requests:        0
Write errors:           0
Total transferred:      4381055 bytes
HTML transferred:       4327450 bytes
Requests per second:    1.18 [#/sec] (mean)
Time per request:       849.903 [ms] (mean)
Time per request:       849.903 [ms] (mean, across all concurrent requests)
Transfer rate:          70.90 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       1
Processing:   518  850 596.7    707    4713
Waiting:      518  848 596.6    707    4712
Total:        519  850 596.7    708    4713

Percentage of the requests served within a certain time (ms)
  50%    705
  66%    746
  75%    777
  80%    783
  90%   1086
  95%   1694
  98%   2978
  99%   4713
 100%   4713 (longest request)

Change 249388 had a related patch set uploaded (by Giuseppe Lavagetto):
service_checker: correctly return an error in case of timeouts

https://gerrit.wikimedia.org/r/249388

Change 249388 merged by Giuseppe Lavagetto:
service_checker: correctly return an error in case of timeouts

https://gerrit.wikimedia.org/r/249388

Joe claimed this task.
Joe set Security to None.

this isn't resolved as mobileapps still seems slow, related T116770: service_checker reports success even on endpoints timing out is resolved though

I think this was caused by a bug in the mobile app end point monitoring spec. See https://github.com/wikimedia/restbase/pull/389 for a fix.

The fix above (as well as one for T116911) were deployed this morning.