
Copypatrol is down
Closed, Resolved (Public)

Description

The CopyPatrol bot has stopped; there have been no new reports for about 5 hours. There were also two outages earlier today:

From 00:17 to 03:16 (3 hours) and from 09:46 to 12:42 (3 hours). After the second outage, 5 reports were posted, and then the bot went down again at 12:51 (over 5 hours ago). Thank you!

Event Timeline

It looks like there must have been some reports in a buffer or something - reports with timestamps of 17:05 through 17:54 were not on the board when I filed this report. Then the bot apparently went down again, from 17:54 to 21:03. It appears to be running properly again at the moment.

Another outage - on 3/13, from 02:30 to 09:11. Nearly 7 hours. If someone could look into this problem that would be great. Thanks!

Just an update - we are still getting daily outages ranging in length from 2.5 to 7 hours. The bot is currently down (5 hours). Any assistance with this problem would be appreciated. Thank you!

I'm afraid I've done all I can do as far as debugging the issue with the bot. We need Python expertise and someone familiar with the code, which does not include anyone at Community Tech :( Or, we could rewrite it entirely, but that's a rather large project and definitely not something we'll be able to do in the near-term.

What I can do, however, is maybe introduce a way for Diannaa to restart the bot herself. We could have a secret URL somewhere like https://copypatrol.toolforge.org/en/restart that only works if the authenticated user is a sysop. It's a hacky workaround, but that gives you the power to do the only thing I know to do when there is downtime -- restart the bot.

However:

Another outage - on 3/13, from 02:30 to 09:11. Nearly 7 hours.

As we've learned, the bot auto-restarts roughly every 4 hours. So restarting apparently wouldn't have fixed it in this case.

CopyPatrol was originally developed in response to a Community wish list request. According to the documentation at https://meta.wikimedia.org/wiki/CopyPatrol, the tool is maintained by the Wikimedia Foundation's Community Tech team, so I am surprised that nobody at Community Tech actually knows how to do so.

Re-writing the bot will definitely have to be undertaken at some point to make it possible for Community Tech team programmers to look after maintenance, and to keep this important bot active in the long term.

The ability for me or other sysops to restart the bot would be a good interim solution in my opinion.

Community Tech built the web tool that we call CopyPatrol. It has had close to 100% uptime going back several months according to https://stats.uptimerobot.com/BN16RUOP5/778436724. The issue is with the bot that provides the data exposed in CopyPatrol, and that bot is maintained by @eranroz. You've been reporting bugs as CopyPatrol bugs, which is totally fine – because what good is CopyPatrol without any data to show, right? :) We've invested significant time already into figuring out why Eranbot so frequently goes down or is "stuck", and while we have managed to fix some bugs, it seems the larger overall issue of instability remains, and I'm not really sure what else to do at this point. I say this with no disrespect to Eran, who graciously volunteered their time to write the bot, and probably has limited time to invest in debugging issues. This is why I think long-term, a complete rewrite might be worthwhile.

I am pretty confident I can make a "restart bot" button work fairly easily though, if that's acceptable for the time being. @eranroz do you have any concerns with this idea? Again we will restrict this functionality to just sysops on the applicable wiki.
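To make the proposed restriction concrete, here is a minimal sketch of the access check in Python. The function name `may_restart_bot` and the `user_groups` argument are hypothetical; the sketch assumes the tool can obtain the list of group names for the authenticated user on the applicable wiki, and it does not show how the restart itself would be triggered.

```python
def may_restart_bot(user_groups, required_group="sysop"):
    """Return True only if the authenticated user is in the required group.

    `user_groups` is assumed to be the list of group names the wiki reports
    for the logged-in user; an unauthenticated request would pass None or [].
    """
    if not user_groups:
        return False
    return required_group in user_groups
```

The restart URL would only act when this check passes, and would refuse the request otherwise.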

I would be cautious with restarting. If I'm not mistaken, the bot loses any queued pages when it restarts.

A rewrite is definitely worthwhile. Currently, it uses Python 2, which is past EOL.
Python 3.7+ would be ideal, but that is not available on the Toolforge bastions/grid.

I would be cautious with restarting. If I'm not mistaken, the bot loses any queued pages when it restarts.

If I understand what's being said here correctly, restarting the bot is the only way to get it back up right now.

A rewrite is definitely worthwhile. Currently, it uses Python 2, which is past EOL.
Python 3.7+ would be ideal, but that is not available on the Toolforge bastions/grid.

Per https://wikitech.wikimedia.org/wiki/Help:Toolforge/Python, Python 3 is supported on Toolforge.

I would be cautious with restarting. If I'm not mistaken, the bot loses any queued pages when it restarts.

If I understand what's being said here correctly, restarting the bot is the only way to get it back up right now.

Yes, if it is actually down. If an admin restarts it while it is still running (they cannot tell whether it is from the CopyPatrol interface), it may lose its page queue (the edits that still need to be checked).
(The automatic restarts every 4 hours could already cause queue loss if it isn't empty.)
Currently, the bot is running and its queue is approaching 600 pages.
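One way a rewritten bot could avoid losing its queue on restart is to snapshot the queue to disk and reload it at startup. The sketch below is only an illustration under stated assumptions: it targets Python 3, the function names `save_queue` and `load_queue` and the JSON file are hypothetical, and nothing here reflects how the current bot stores its queue.

```python
import json
import os
import tempfile


def save_queue(path, page_queue):
    """Write the pending queue to `path` via a temp file and an atomic rename,
    so a restart in the middle of a write cannot leave a half-written snapshot."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(list(page_queue), handle)
    os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem


def load_queue(path):
    """Return the saved queue, or an empty list if no snapshot exists yet."""
    if not os.path.exists(path):
        return []
    with open(path) as handle:
        return json.load(handle)
```

Calling `save_queue` whenever the queue changes, and `load_queue` once at startup, would make both manual and automatic restarts safer in this scenario.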

Python 3.7+ would be ideal, but that is not available on the Toolforge bastions/grid.

Per https://wikitech.wikimedia.org/wiki/Help:Toolforge/Python, Python 3 is supported on Toolforge.

Only 3.5 is available on the Toolforge bastions/grid. It is also past EOL. 3.7 is only on Toolforge k8s.

Yes, restarting does lose whatever edits were queued up. However, from what we've seen, the bot often won't recover at all until it is restarted. As I recall, there is a way to make it re-scan edits starting from a given iThenticate ID, so that you can pick up where it left off, though I don't think we've tested that feature.

At any rate, it seems a restart button needs more thought. We could perhaps also (or instead) show the queue size, the last restart time, and maybe the last diff that was scanned. Whenever I check the logs manually, I usually go by the revision IDs to see how long ago the last scan was. If it was very recent, that tells me the bot is still running. But as I said, hours can go by with no progress before you finally resort to restarting.
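As a rough sketch of how such a status indicator might decide what to show, the Python function below classifies the bot from the time of the last scanned diff and the queue size. The name `bot_status`, its parameters, and the one-hour stall threshold are all assumptions for illustration; how the real values would be read from the bot's logs is not shown.

```python
from datetime import datetime, timedelta


def bot_status(last_scan_time, queue_size, now, stall_after=timedelta(hours=1)):
    """Classify the bot as running, stalled, or idle.

    - running: a diff was scanned within `stall_after`
    - stalled: no recent scan while pages are still waiting in the queue
    - idle:    no recent scan, but nothing is queued either
    """
    idle_for = now - last_scan_time
    if idle_for <= stall_after:
        state = "running"
    elif queue_size > 0:
        state = "stalled"
    else:
        state = "idle"
    return {
        "state": state,
        "idle_minutes": int(idle_for.total_seconds() // 60),
        "queue_size": queue_size,
    }
```

A restart button could then be offered only when the state is "stalled", which would reduce the risk of an admin discarding the queue of a bot that is actually working.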

Currently, it uses Python 2, which is past EOL.

This is a good point. According to https://wikitech.wikimedia.org/wiki/Help:Toolforge/Python#Deprecating_Python_2 it seems Toolforge's official support will end in 2022? A rewrite might be forced on us sooner rather than later.

3.7 is only on Toolforge k8s.

Kubernetes cron jobs are now officially supported if I'm reading correctly, so this might be a viable option. Or we could use a different language. But I personally wouldn't mind getting my feet wet with some Python :) Plus obviously we have the current codebase to go off of.

Diannaa claimed this task.

The bot seems to have been functioning well for quite a few days. Closing ticket. Thanks to everybody who had a look.