
Uptimerobot monitoring for the Articlerequest tool flaps
Open, Needs Triage, Public

Description

I have a monitor on the articlerequest tool. As of late, it's been going up and down quite frequently with connection timeouts. I have not modified the articlerequest tool in some time, so this is obviously an infrastructure problem.

A sample of the logs (times are in Mountain Time):

Down 	2017-09-11 12:26:11	Connection Timeout	0 hrs, 0 mins
Up 	2017-09-11 11:28:46	OK (200)	0 hrs, 57 mins
Down 	2017-09-11 11:25:08	Connection Timeout	0 hrs, 3 mins
Up 	2017-09-11 02:29:16	OK (200)	8 hrs, 55 mins
Down 	2017-09-11 02:28:13	Connection Timeout	0 hrs, 1 mins
Up 	2017-09-11 00:27:26	OK (200)	2 hrs, 0 mins
Down 	2017-09-11 00:25:50	Connection Timeout	0 hrs, 1 mins
Up 	2017-09-10 18:28:49	OK (200)	5 hrs, 57 mins
Down 	2017-09-10 18:27:25	Connection Timeout	0 hrs, 1 mins
Up 	2017-09-10 15:29:03	OK (200)	2 hrs, 58 mins
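To put the sample in perspective, the rows can be totalled by status. The following is only a rough sketch: it assumes the four-column layout shown above (status, timestamp, reason, duration), and the names `SAMPLE`, `ROW` and `minutes_by_status` are made up for illustration. Durations are taken as reported, rounded to whole minutes, and the newest Down row is still in progress.

```python
import re

# Rows copied from the sample above (tab-separated in the monitor's export).
SAMPLE = (
    "Down\t2017-09-11 12:26:11\tConnection Timeout\t0 hrs, 0 mins\n"
    "Up\t2017-09-11 11:28:46\tOK (200)\t0 hrs, 57 mins\n"
    "Down\t2017-09-11 11:25:08\tConnection Timeout\t0 hrs, 3 mins\n"
    "Up\t2017-09-11 02:29:16\tOK (200)\t8 hrs, 55 mins\n"
    "Down\t2017-09-11 02:28:13\tConnection Timeout\t0 hrs, 1 mins\n"
    "Up\t2017-09-11 00:27:26\tOK (200)\t2 hrs, 0 mins\n"
    "Down\t2017-09-11 00:25:50\tConnection Timeout\t0 hrs, 1 mins\n"
    "Up\t2017-09-10 18:28:49\tOK (200)\t5 hrs, 57 mins\n"
    "Down\t2017-09-10 18:27:25\tConnection Timeout\t0 hrs, 1 mins\n"
    "Up\t2017-09-10 15:29:03\tOK (200)\t2 hrs, 58 mins\n"
)

# status, "date time", reason, hours, minutes
ROW = re.compile(r"^(Up|Down)\s+(\S+ \S+)\s+(.+?)\s+(\d+) hrs, (\d+) mins$")


def minutes_by_status(text):
    """Sum the reported duration, in minutes, for Up rows and for Down rows."""
    totals = {"Up": 0, "Down": 0}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # skip blank or unrecognised lines
        status, _when, _reason, hours, minutes = match.groups()
        totals[status] += int(hours) * 60 + int(minutes)
    return totals


print(minutes_by_status(SAMPLE))
```

For this sample the reported downtime adds up to 6 minutes against 1,247 minutes of uptime.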

Event Timeline

Restricted Application added a subscriber: Aklapper.

98% (9,838 of the last 10,000) of requests to the articlerequest tool are from http://www.uptimerobot.com/. Each hit by uptimerobot is actually 2 requests: the first for http://tools.wmflabs.org/articlerequest, which receives a 301 redirect, and the second for http://tools.wmflabs.org/articlerequest/. The request counts are not completely symmetrical: there are 4,928 301 responses and only 4,910 200 responses, so 18 redirects have no matching 200. It seems possible that either intermittent routing failures or other request timeout limits account for the discrepancy. Without detailed traceroute information from the monitoring endpoint to the Toolforge ingress server, it is very difficult to determine where the networking error may be.
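The arithmetic behind those figures can be reproduced with a short tally. This is a sketch only: it assumes the status codes of the uptimerobot requests have already been extracted from the access log (the log format is not shown in this task), and `summarize` is a hypothetical helper name.

```python
from collections import Counter


def summarize(statuses, window=10_000):
    """Tally 301 and 200 responses and compare them against the log window."""
    counts = Counter(statuses)
    redirects = counts[301]
    oks = counts[200]
    return {
        # share of the whole window taken up by the monitor's requests
        "monitor_share_pct": round(100 * (redirects + oks) / window, 2),
        # redirects that were never followed by a successful second request
        "unmatched_redirects": redirects - oks,
    }


# Counts quoted in the comment above; the list stands in for parsed log rows.
statuses = [301] * 4928 + [200] * 4910
print(summarize(statuses))
```

With the quoted counts this gives a monitor share of 98.38% and 18 unmatched redirects, which is small compared with the number of probes but not zero.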

Based on the /data/project/articlerequest/service.log file, it does not look like there is an actual problem with the uptime of the Kubernetes container that is hosting the application.

bd808 renamed this task from Articlerequest tool goes up and down often to Uptimerobot monitoring for the Articlerequest tool flaps. Sep 12 2017, 11:43 PM

From my experience with uptimerobot, it's quite flaky - we got one or two false positive alerts each day from it. I've switched to using Pingdom for that reason.

I've switched the monitor to use the http://tools.wmflabs.org/articlerequest/ link, though I have the same monitor running for articlerequest-dev with no issues.

If the issue persists, I'll contact them for more details with regard to a traceroute.

As for the 98% of requests: yes, articlerequest is not currently in use as a tool (it's still under development), so that traffic is expected.

@yuvipanda I have subscribed to Pingdom, and it reports the exact same problem much more frequently. The error text is "Socket timeout, unable to connect to server".

I'll keep an eye on the issue, but I still strongly suspect it's a Toolforge problem, not a monitoring system problem.

Traceroute from Pingdom:

traceroute to 208.80.155.131 (208.80.155.131), 30 hops max, 60 byte packets
1 184.75.214.65 (184.75.214.65) 0.182 ms 0.441 ms 0.582 ms
2 te0-7-0-9.221.ccr22.yyz02.atlas.cogentco.com (38.122.69.121) 0.631 ms 0.775 ms 0.902 ms
3 be2994.ccr22.cle04.atlas.cogentco.com (154.54.31.233) 7.739 ms 7.947 ms 8.121 ms
4 be2718.ccr42.ord01.atlas.cogentco.com (154.54.7.129) 14.969 ms 15.087 ms 15.183 ms
5 be2766.ccr41.ord03.atlas.cogentco.com (154.54.46.178) 15.013 ms 15.125 ms 15.206 ms
6 zayo.ord03.atlas.cogentco.com (154.54.9.38) 14.807 ms 14.831 ms 14.941 ms
7 ae17.cr2.ord2.us.zip.zayo.com (64.125.31.82) 15.048 ms 15.014 ms 15.148 ms
8 ae27.cs2.ord2.us.eth.zayo.com (64.125.30.244) 28.082 ms 28.149 ms 28.304 ms
9 ae3.cs2.lga5.us.eth.zayo.com (64.125.29.212) 28.158 ms 28.161 ms 28.159 ms
10 ae0.cs1.lga5.us.eth.zayo.com (64.125.29.186) 28.180 ms 28.341 ms 28.496 ms
11 ae4.cs1.dca2.us.eth.zayo.com (64.125.29.203) 28.457 ms 28.175 ms 28.437 ms
12 ae27.cr1.dca2.us.zip.zayo.com (64.125.30.247) 29.345 ms 28.822 ms 29.070 ms
13 ae6.er1.iad10.us.zip.zayo.com (64.125.20.118) 69.324 ms 35.092 ms 35.286 ms
14 64.125.192.142.IPYX-125449-001-ZYO.zip.zayo.com (64.125.192.142) 28.156 ms 28.191 ms 28.501 ms
15 tools.wmflabs.org (208.80.155.131) 30.726 ms 30.798 ms 31.068 ms
16 www.tools.wmflabs.org (208.80.155.131) 31.267 ms 31.070 ms 31.106 ms
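For comparing several traceroutes over time, a small parser can flag hops whose three probes disagree noticeably. The sketch below assumes the Linux `traceroute` line layout seen above (hop number, hostname, IP in parentheses, three RTTs); `parse_hops`, `jittery_hops` and the 10 ms threshold are illustrative choices, not anything Pingdom or Toolforge provides. A flagged hop is only a hint: a single slow probe from an intermediate router may just reflect how that router prioritises its replies.

```python
import re

# Three hops copied from the traceroute above, plus its header line.
SAMPLE = """\
traceroute to 208.80.155.131 (208.80.155.131), 30 hops max, 60 byte packets
12 ae27.cr1.dca2.us.zip.zayo.com (64.125.30.247) 29.345 ms 28.822 ms 29.070 ms
13 ae6.er1.iad10.us.zip.zayo.com (64.125.20.118) 69.324 ms 35.092 ms 35.286 ms
15 tools.wmflabs.org (208.80.155.131) 30.726 ms 30.798 ms 31.068 ms
"""

HOP = re.compile(r"^\s*(\d+)\s+(\S+)\s+\((\d+\.\d+\.\d+\.\d+)\)\s+(.*)$")
RTT = re.compile(r"(\d+(?:\.\d+)?) ms")


def parse_hops(text):
    """Return (hop number, hostname, [rtt_ms, ...]) for each recognised hop line."""
    hops = []
    for line in text.splitlines():
        match = HOP.match(line)
        if not match:
            continue  # header line or a hop that did not answer
        rtts = [float(value) for value in RTT.findall(match.group(4))]
        if rtts:
            hops.append((int(match.group(1)), match.group(2), rtts))
    return hops


def jittery_hops(hops, spread_ms=10.0):
    """List hops whose slowest and fastest probes differ by more than spread_ms."""
    flagged = []
    for number, host, rtts in hops:
        spread = max(rtts) - min(rtts)
        if spread > spread_ms:
            flagged.append((number, host, spread))
    return flagged


for number, host, spread in jittery_hops(parse_hops(SAMPLE)):
    print(f"hop {number} ({host}): probe spread {spread:.3f} ms")
```

On the excerpt above only hop 13 is flagged. Since this trace was presumably taken while the path was working, it cannot by itself show where the intermittent timeouts occur.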