Currently, users re-open task T256501: CopyPatrol has stopped whenever they think that new items are no longer being imported.
This is not practical: a user has to be looking at the tool at that moment, know Phabricator, have an account on it, know the task in question, and have time to point it out :)
A more sustainable long-term approach would be an automatic alert as soon as the problem appears, for example by checking whether any new items were created in the last 4 hours. This may involve notifying Community-Tech members directly or automatically re-opening task T256501.
Description
Details
- Due Date: Oct 21 2020, 4:00 AM
Status | Assigned | Task
---|---|---
Resolved | eranroz | T256501 CopyPatrol has stopped
Resolved | MusikAnimal | T262767 Monitor Copypatrol/EranBot uptime
Event Timeline
Other tools/WMCS monitoring-related tasks I'm aware of:
- T53434: Establish an internal system or a recommended external system for monitoring user-created Toolforge web services, the big goal for tools platform
- T256361: PAWS: get new service and cluster metrics into prometheus
- T205150: Develop the monitoring of Quarry
- T152235: Simple logrotate service for users of Tools as stopgap before central logging, which talks about "future central logging", but I can't find any task for it
Thanks for creating this! It's been on the mental to-dos for some time. We definitely need some means to better monitor the whole pipeline. Usually, by the time we get around to looking into it, the issue magically fixed itself. The error logs don't usually reveal much either. I know that if the bot dies, it restarts on its own within 10 minutes. It is somewhat rare that we actually have to manually kill the bot and restart it (the most recent incident is an example where manual intervention was needed). This leads me to believe that perhaps sometimes there just really are periods that long without any copyvios? But it could also be the Turnitin service itself, or any number of other things. It's hard to pinpoint when there are three completely separate components (Turnitin, the bot, and the web application). We do have monitoring on the web app, at least, see https://stats.uptimerobot.com/BN16RUOP5/784331770
@MusikAnimal May I suggest a quick temporary solution? As far as I can see, UptimeRobot monitors only the website's uptime, not its content. To have it trigger alerts when no new entries have been detected for some time, you can add a special endpoint to the webservice that serves a GOOD or BAD keyword. This health check would be defined by an SQL query along the lines of "is there any new entry in the last 4 hours for each wiki?". UptimeRobot can be configured to assert the presence of a unique keyword: in this case you can monitor for the presence of GOOD in the health check. If no new entries are added, you'll then receive an alert. That solves this task, but in a hacky way, I agree.
In my opinion UptimeRobot is a good temporary solution, though it is an external platform. Sadly, without central monitoring for the tools platform (T53434: Establish an internal system or a recommended external system for monitoring user-created Toolforge web services), it looks to me like the only way to meet this need.
To have it trigger alerts when no new entries have been detected for some time, you can add a special endpoint to the webservice that serves a GOOD or BAD keyword. This health check would be defined by an SQL query along the lines of "is there any new entry in the last 4 hours for each wiki?".
Clever! Yes we could do this, and make use of our existing UptimeRobot service, returning a 500-level error when there hasn't been much activity. I'll look into this soon.
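For illustration, the core decision behind such a check could look roughly like the sketch below. It is written in Python purely as an example and is not the CopyPatrol code; the function name `activity_status` and the idea of passing in the timestamp of the newest case are assumptions made for this sketch. In a real endpoint, that timestamp would come from a database query for the given wiki.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def activity_status(
    latest_case: Optional[datetime],
    now: datetime,
    offset_hours: int = 4,
) -> int:
    """Return the HTTP status a health-check endpoint could respond with.

    latest_case is the timestamp of the newest recorded case (hypothetical
    input, e.g. the result of a MAX(timestamp) query), or None if there are
    no cases at all. A case inside the window yields 200, otherwise 500 so
    that an uptime monitor treats the check as failed.
    """
    if latest_case is None:
        return 500
    window = timedelta(hours=offset_hours)
    return 200 if now - latest_case <= window else 500


# Example with fixed timestamps so the outcome does not depend on the clock.
now = datetime(2020, 10, 21, 12, 0, tzinfo=timezone.utc)
recent = activity_status(now - timedelta(hours=1), now)  # inside the window
stale = activity_status(now - timedelta(hours=5), now)   # outside the window
```

Returning an error status rather than a BAD keyword means the existing uptime monitor needs no keyword configuration: any non-2xx response already counts as down.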
I can modify the bot to give some indication of the last title/edit requests sent to Turnitin, or of the last Turnitin response.
Just let me know what would be easiest to monitor
This is live on the staging tool. There are two URL params that can be passed, lang (default en) and offset (default 4), where offset is the number of hours.
Examples:
- https://plagiabot.toolforge.org/activity_check => should give a 200 status code, blank page
- https://plagiabot.toolforge.org/activity_check?lang=cs&offset=1 => most likely throws a 500. If it's a 200 then verify there's a case from the last hour at https://copypatrol.toolforge.org/cs/
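For anyone who wants to poll the endpoint from a script rather than a browser, here is a minimal Python sketch. The helper names `build_url` and `has_recent_activity` are made up for this example; only the URL and the `lang` and `offset` parameters come from the comment above. It assumes a 200 means recent activity and a 5xx means none.

```python
import urllib.error
import urllib.request
from urllib.parse import urlencode

# Staging URL taken from the examples above.
BASE_URL = "https://plagiabot.toolforge.org/activity_check"


def build_url(lang: str = "en", offset: int = 4) -> str:
    """Build the check URL; offset is the look-back window in hours."""
    return f"{BASE_URL}?{urlencode({'lang': lang, 'offset': offset})}"


def has_recent_activity(url: str, opener=urllib.request.urlopen) -> bool:
    """Return True on a 200 response and False on a 5xx response.

    opener is injectable so the function can be exercised without
    network access; by default it is urllib.request.urlopen.
    """
    try:
        with opener(url, timeout=30) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if 500 <= err.code < 600:
            return False
        raise  # other errors (e.g. 404) are not an activity signal
```

Calling `has_recent_activity(build_url("cs", 1))` would then mirror the second example above.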
This is now live. Maintainers of CopyPatrol should get an email when there are no cases younger than 4 hours old. @Diannaa you no longer need to ping us, but I'm not saying you can't, because of course you still can :) I should also point out that it seems you work on a slightly earlier schedule than most of us here at CommTech, and historically by the time I wake up the bot had fixed itself. But at least now we have some automation in place and may be able to catch things earlier. The next step obviously is to figure out why it keeps going down and what we can do to mitigate it. More on that soon, I hope!
There's nothing to QA here, so I think this task can be resolved.