
Monitor Copypatrol/EranBot uptime
Closed, Resolved | Public | Oct 21 2020

Description

Currently, users re-open task "T256501: CopyPatrol has stopped" when they think that new items are not being imported.
It's not practical: a user has to be looking at the tool at that moment, know Phabricator and have an account on it, know which task to re-open, and have time to report it :)
A more sustainable long-term approach would be an automatic alert as soon as the problem appears, for example by checking whether new items were created in the last 4 hours. This may involve notifying a Community-Tech member directly or automatically re-opening task T256501.

Details

Due Date
Oct 21 2020, 4:00 AM

Event Timeline

Thanks for creating this! It's been on the mental to-dos for some time. We definitely need some means to better monitor the whole pipeline. Usually, by the time we get around to looking into it, the issue magically fixed itself. The error logs don't usually reveal much either. I know that if the bot dies, it restarts on its own within 10 minutes. It is somewhat rare that we actually have to manually kill the bot and restart it (the most recent incident is an example where manual intervention was needed). This leads me to believe that perhaps sometimes there just really are periods that long without any copyvios? But it could also be the Turnitin service itself, or any number of other things. It's hard to pinpoint when there are three completely separate components (Turnitin, the bot, and the web application). We do have monitoring on the web app, at least, see https://stats.uptimerobot.com/BN16RUOP5/784331770

@MusikAnimal May I suggest a quick temporary solution? As far as I can see, UptimeRobot monitors only the website's uptime, not its content. To have it trigger alerts when no new entries have been detected for some time, you can add a special endpoint to the webservice that serves a GOOD or BAD keyword. This health check would be defined by an SQL query along the lines of "is there any new entry in the last 4 hours for each wiki". UptimeRobot can be configured to assert the presence of a unique keyword: in this case you can monitor for the presence of GOOD in the health check response. If no new entries are added, you'll then receive an alert. That solves this task, but in a hacky way, I agree.

In my opinion UptimeRobot is a good temporary solution, although it runs on an external platform. Sadly, without central monitoring for the tool platform (T53434: Establish an internal system or a recommended external system for monitoring user-created Toolforge web services), that looks to me like the only way to meet this need.
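To make the keyword idea above concrete, here is a minimal sketch in Python. It is not CopyPatrol's actual code: the `cases` table, its `lang` and `created` columns, the 14-digit timestamp format, and the `health_keyword` function are all assumptions made for illustration, and an in-memory SQLite database stands in for the real one.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def health_keyword(conn, lang, hours=4, now=None):
    """Return "GOOD" if `lang` has a case newer than `hours` hours, else "BAD"."""
    now = now or datetime.now(timezone.utc)
    # Assumption: timestamps are stored as 14-digit YYYYMMDDHHMMSS strings,
    # so a plain string comparison orders them chronologically.
    cutoff = (now - timedelta(hours=hours)).strftime("%Y%m%d%H%M%S")
    row = conn.execute(
        "SELECT COUNT(*) FROM cases WHERE lang = ? AND created >= ?",
        (lang, cutoff),
    ).fetchone()
    return "GOOD" if row[0] > 0 else "BAD"


# Hypothetical data: one recent English case, one stale French case.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (lang TEXT, created TEXT)")
now = datetime(2020, 10, 21, 12, 0, 0, tzinfo=timezone.utc)
conn.execute("INSERT INTO cases VALUES ('en', '20201021110000')")  # 1 hour old
conn.execute("INSERT INTO cases VALUES ('fr', '20201021060000')")  # 6 hours old

print(health_keyword(conn, "en", now=now))
print(health_keyword(conn, "fr", now=now))
```

A keyword monitor would then alert whenever the served page stops containing GOOD. Checking each wiki separately, as suggested, means one monitor per language.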

To have it trigger alerts when no new entries have been detected for some time, you can add a special endpoint to the webservice that serves a GOOD or BAD keyword. This health check would be defined by an SQL query along the lines of "is there any new entry in the last 4 hours for each wiki"

Clever! Yes we could do this, and make use of our existing UptimeRobot service, returning a 500-level error when there hasn't been much activity. I'll look into this soon.

I can modify the bot to give some indication of the last title/edit requests sent to Turnitin, or the last Turnitin response.
Just let me know what would be easiest to monitor.

This is live on the staging tool. There are two URL params that can be passed, lang (default en) and offset (default 4), where offset is the number of hours.
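As a rough illustration of how such an endpoint might treat the two parameters, here is a sketch in Python. Only the parameter names `lang` and `offset` and their defaults come from the comment above. The `check_health` function, the `latest_case` lookup, the exact status codes (500 for stale, 400 for a bad `offset`), and the response bodies are assumptions, not the tool's real behaviour.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs


def check_health(query_string, latest_case, now=None):
    """Return (status_code, body) for a health request.

    `latest_case` is a hypothetical mapping of language code to the
    timestamp of the newest case for that wiki.
    """
    now = now or datetime.now(timezone.utc)
    params = parse_qs(query_string)
    lang = params.get("lang", ["en"])[0]  # default: en
    try:
        offset = int(params.get("offset", ["4"])[0])  # default: 4 hours
    except ValueError:
        return 400, "offset must be an integer number of hours"

    newest = latest_case.get(lang)
    if newest is not None and now - newest <= timedelta(hours=offset):
        return 200, "OK"
    # Assumption: a plain 500 signals "no recent activity" to the monitor.
    return 500, f"no cases for {lang} in the last {offset} hours"


# Hypothetical data: newest English case is 1 hour old, newest French is 6.
now = datetime(2020, 10, 21, 12, 0, tzinfo=timezone.utc)
latest = {"en": now - timedelta(hours=1), "fr": now - timedelta(hours=6)}

print(check_health("", latest, now))
print(check_health("lang=fr", latest, now))
print(check_health("lang=fr&offset=8", latest, now))
```

Returning an error status instead of a keyword lets an ordinary HTTP uptime monitor pick up the failure without any keyword configuration.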

Examples:

ARamirez_WMF changed the subtype of this task from "Task" to "Deadline".

This is now live. Maintainers of CopyPatrol should get an email when there are no cases less than 4 hours old. @Diannaa you no longer need to ping us, but I'm not saying you can't, because of course you still can :) I should also point out that it seems you work on a slightly earlier schedule than most of us here at CommTech, and historically by the time I wake up the bot has already fixed itself. But at least now we have some automation in place and may be able to catch things earlier. The next step obviously is to figure out why it keeps going down and what we can do to mitigate it. More on that soon, I hope!

There's nothing to QA here, so I think this task can be resolved.