CopyPatrol going down intermittently
Open, Medium, Public

Description

From UptimeRobot:

Up    2017-12-12 19:43:19  OK (200)                     0 hrs, 0 mins
Down  2017-12-12 19:36:32  Internal Server Error (500)  0 hrs, 6 mins
Up    2017-12-12 14:46:16  OK (200)                     4 hrs, 50 mins
Down  2017-12-12 14:45:14  Internal Server Error (500)  0 hrs, 1 mins
Up    2017-12-09 02:08:22  OK (200)                     84 hrs, 36 mins
Down  2017-12-09 02:07:21  Internal Server Error (500)  0 hrs, 1 mins
Up    2017-12-05 20:33:22  OK (200)                     77 hrs, 33 mins
Down  2017-12-05 20:32:20  Internal Server Error (500)  0 hrs, 1 mins
Up    2017-12-02 05:41:23  OK (200)                     86 hrs, 50 mins
Down  2017-12-02 05:40:22  Internal Server Error (500)  0 hrs, 1 mins

The error logs aren't too telling. I have a hunch this might be a Toolforge issue, because I have a similar problem with Topviews, where it goes down for a minute or so at a time, and at around the same intervals. The exact times Topviews goes down are not the same as CopyPatrol, however. The one thing these two tools have in common is they make queries to the tools-db on page load. None of my other tools do this, and they all have 100% uptime, so that's my guess... something with tools-db.
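
To check whether the outages really do recur at similar intervals, the gaps between the "Down" events can be computed directly from the log. This is only a rough sketch: the timestamps are copied from the UptimeRobot entries above, and the function name `gaps_in_hours` is made up for illustration.

```python
from datetime import datetime

# Start times of the "Down" events, copied from the UptimeRobot log above.
DOWN_EVENTS = [
    "2017-12-02 05:40:22",
    "2017-12-05 20:32:20",
    "2017-12-09 02:07:21",
    "2017-12-12 14:45:14",
    "2017-12-12 19:36:32",
]


def gaps_in_hours(timestamps):
    """Return the hours elapsed between each pair of consecutive timestamps."""
    times = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps)
    return [
        round((later - earlier).total_seconds() / 3600, 1)
        for earlier, later in zip(times, times[1:])
    ]


if __name__ == "__main__":
    print(gaps_in_hours(DOWN_EVENTS))
```

Running the same function over the Topviews log would show whether the two tools fail on a similar cadence even though the exact times differ.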

Event Timeline

I wanted to report the same issue. Could Uptime Robot be adjusted to not report downtime unless it's more than 5 minutes?

It looks as though you cannot :( But I have changed the monitoring interval to every 15 minutes instead of 10, and I've deleted the monitor for the Leaderboard page. It's odd that it sometimes goes down independently of the feed... I think it all has to do with response time. More than 5 seconds and UptimeRobot considers it down. In reality I think the queries are just slow sometimes. On the other hand, a "timeout" is not the same as a 500 response code, which is what UptimeRobot reports. Who knows!
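
One way to settle the timeout-versus-500 question would be to run our own probe that records the two cases separately. Below is a minimal sketch using only the Python standard library. The names `probe` and `classify` are hypothetical, and the 5-second default is taken from my guess above about UptimeRobot's threshold, not from anything they document.

```python
import socket
import urllib.error
import urllib.request


def classify(exc):
    """Map an exception raised by urlopen to a coarse outage category."""
    # HTTPError must be tested first because it is a subclass of URLError.
    if isinstance(exc, urllib.error.HTTPError):
        return "http_%d" % exc.code
    # A read timeout is raised directly as a timeout exception.
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timeout"
    # A connect timeout arrives wrapped in URLError.reason.
    if isinstance(exc, urllib.error.URLError) and isinstance(
        exc.reason, (socket.timeout, TimeoutError)
    ):
        return "timeout"
    return "unreachable"


def probe(url, timeout=5.0):
    """Fetch url once and return a category such as 'http_200' or 'timeout'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return "http_%d" % response.status
    except OSError as exc:  # URLError and timeout errors are OSError subclasses
        return classify(exc)
```

If a probe like this logged `http_500` during an outage, the errors would be genuine server errors; if it logged `timeout`, slow queries would be the more likely cause.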

We might try to find a different monitoring service that better suits our needs.

Hmm, I'm seeing ErrorException: count(): Parameter must be an array or an object that implements Countable and Undefined index: HTTP_ACCEPT_LANGUAGE on GET, so maybe those are worth fixing. I've submitted a PR for the latter at https://github.com/wikimedia/CopyPatrol/pull/60
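
For illustration only, here is the general defensive pattern behind both fixes, sketched in Python rather than PHP. This is not the code from the PR: `preferred_language`, `safe_count`, and the `"en"` fallback are all assumptions made for the example. The `environ` dict stands in for the request's server variables.

```python
def preferred_language(environ, default="en"):
    """Return the first language in Accept-Language, or default if absent."""
    # .get() avoids the "undefined index" failure when the client
    # (for example a monitoring bot) sends no Accept-Language header.
    header = environ.get("HTTP_ACCEPT_LANGUAGE", "")
    first = header.split(",")[0].split(";")[0].strip()
    code = first.split("-")[0].lower()
    return code if code.isalpha() else default


def safe_count(value):
    """Return len(value), treating None and unsized values as empty."""
    try:
        return len(value)
    except TypeError:
        return 0
```

Monitoring requests often omit browser headers, so an unguarded lookup can fail only for the monitor while regular visitors see nothing wrong.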

Nice catch! Both of these were just warnings, I hope? The former was a regression from https://github.com/wikimedia/CopyPatrol/pull/58. I don't think either has to do with the downtime. Definitely still worth fixing, though :)

@MusikAnimal I don't have the logs in front of me but I'm pretty sure they are handled as HTTP 500 errors, and that aligns with the emails I get from Uptime Robot: CopyPatrol (http://tools.wmflabs.org/copypatrol) is currently DOWN (HTTP 500 - Internal Server Error)

Also, I didn't test PR 60 as I don't have a local environment, so please double check on staging that it doesn't break anything before deploying :)

PR 60 works in my testing, and I've deployed it. I made sure to browse to the app with a different HTTP_ACCEPT_LANGUAGE, etc.

I can also confirm the Undefined index: HTTP_ACCEPT_LANGUAGE was thrown as a PHP notice, and ErrorException: count() was apparently handled by SlimApp as a "level": "ERROR" (I can only assume there's a FATAL). If you happen to check the logs now you may not see these because I just truncated the files (they were growing to be quite large, and we don't have any log rotation).

I did some reading at https://uptimerobot.com/about, and they're saying:

If the status code is ~400+ and 500+ ... Uptime Robot makes several more checks in the next 30 seconds. If the site is still down, it sends an alert.

If this is true (downtime spans 30+ seconds), I think we can conclude the 500 errors to be authentic, and that they probably don't have anything to do with response time.
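
The behaviour they describe amounts to "confirm before alerting". Here is a rough sketch of that logic, with loud caveats: the quote only says "several more checks in the next 30 seconds", so the retry count of 3, the even spacing, and the name `confirmed_down` are assumptions for illustration, not UptimeRobot's actual implementation.

```python
import time


def confirmed_down(check, retries=3, window=30.0, sleep=time.sleep):
    """Return True only if the first check and every retry report an error.

    check() must return an HTTP status code. retries and window are
    assumed values; the real service does not publish them.
    """
    if check() < 400:
        return False
    pause = window / retries
    for _ in range(retries):
        sleep(pause)
        if check() < 400:
            # The site recovered within the window, so no alert is sent.
            return False
    return True
```

Under this model a single slow or failed request would never trigger an email, which is why an alert implies the errors lasted for the whole confirmation window.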

I am skeptical that it's really our code causing the problem. The copyvio records aren't tended to super duper often, so it would be odd to get 500s at one moment and then, a minute later, have everything be fine when neither the data being presented nor the code being executed has changed.

Part of it might be Toolforge. I use Uptime Robot for my other apps, and they experience brief periods of downtime, too.

I'm not sure if this is still a problem. Boldly closing, anyone please re-open if you feel this still needs attention.

I regularly (most days in March) see errors where CopyPatrol reports a 500 error for a few minutes at a time.