
CopyPatrol has stopped
Closed, ResolvedPublic

Description

The CopyPatrol bot has stopped, as there's been no new reports for about 16 hours. MusikAnimal is investigating log-in issues.

Event Timeline


The tool has been functioning properly for several hours - closing again for now. Thanks.

The bot appears to have stopped again; there's been no new reports for about 16 hours. Thanks.

Same issue -- all sessions were forced to logout recently due to a security issue. I used history to find the command Eran used to reinitialize the credentials, but I need the password to proceed. @eranroz could you create a file with the credentials on Toolforge (and of course run chmod 400 on it :) ?

restarted the session, should work now.

The CopyPatrol bot appears to have stopped again. There's been no new reports for 4 hours (the last report was posted at 09:53 UTC August 30). Thanks.

Activity has been continuous for several hours - closing. Thanks.

The bot has stopped again - no new reports since 05:09 UTC (7 hours ago). Thank you!

Still no new reports; it's now been down for 17 hours. Any assistance would be appreciated. Thanks.

I restarted it not long after your report at T256501#6450200; however, it didn't actually restart as I thought it did. Rather, the job was "stuck" entirely. I had to get a Toolforge admin to kill it for me.

It looks like everything's back to normal now. Apologies for the downtime!

The CopyPatrol bot has stopped again. There's been no new reports for 12 hours (since 23:50 UTC yesterday - Sept 30), iThenticate reports are failing to load, and the buttons are not working. Thank you in advance for your help.

I've restarted the bot; we'll see if that does anything. As for buttons not working, I suspect that was related to the general instability of Toolforge we've been experiencing the last few days, see https://stats.uptimerobot.com/BN16RUOP5 (the only tools at 100% uptime are on VPS, not Toolforge).

I haven't forgotten about T262767 either. I already have a patch in progress and will try to finish that up soon, though in this case the flood of emails from all of our tools going down might have obscured the one about the CopyPatrol feed being out of date.

Still no new reports, and iThenticate links are failing to load. I wonder if we are out of credits at the Turnitin service?

Still no new reports, and iThenticate links are failing to load. I wonder if we are out of credits at the Turnitin service?

I tried logging in with the bot's iThenticate credentials, and I get error code E503. Same as T205917 apparently. @eranroz How do we go about reactivating the account? I assume you probably got an email.

Following a recent change, all users were forced to log out - https://lists.wikimedia.org/pipermail/wikitech-l/2020-October/093922.html
I had to log in again to set the session - it should start to work soon.

There's still no new reports - it's been three hours. The iThenticate links are not working either.

Following a recent change, all users were forced to log out - https://lists.wikimedia.org/pipermail/wikitech-l/2020-October/093922.html
I had to log in again to set the session - it should start to work soon.

That needed to be done too, for sure, but as Diannaa says, it appears we are still having issues with iThenticate. The error message suggests it's the same issue as T205917. Did you get an email from Turnitin?


I didn't get an email from iThenticate, but it seems I can't connect, so this looks like a similar issue. @MusikAnimal @Niharika do you remember how it was solved last time?

I tried to log in to the iThenticate account, but it says "Please contact your account administrator" with the code E503. I think you need to shoot them an email like last time, @eranroz. I see a past email where we were talking to Mark Ricksen about this.

Turnitin helped me get the bot back up, and it should start working now.

The bot has been filing reports successfully for a couple hours, time to close the ticket. Thank you!

The bot appears to have stopped again, as there's been no new reports for 4 hours (since 20:50 UTC). Thanks.

Looks like activity has long since resumed. I'm happy to report, by the way, that our monitoring of the feed (T262767) will be deployed soon, and you will no longer need to inform us when there is no activity :)

The system has been behaving erratically for the last day or so with long gaps where no reports are filed. Outages from 2020-10-27 16:36 UTC to 2020-10-27 20:43; from 2020-10-28 00:23 to 2020-10-28 04:03; from 2020-10-28 04:03 to 2020-10-28 08:33; and from 2020-10-28 09:50 to present. All of these gaps are approximately 4 hours long except for the current outage, which is around 6 hours so far. Thanks for any assistance!

I changed our monitoring to email us after 6 hours (which we did receive). This is because when it was set to 4 hours, we were getting a lot of emails, and every time I checked it appeared the bot was working properly.

However, this time is different. I see there were over 1,000 pages in the buffer, and that number was only growing. I restarted the bot and it seems to actually be processing things again.

Unfortunately, neither I nor anyone at Community Tech is familiar with the Eranbot code base, nor with the Python programming language for that matter. @eranroz, would you be okay with receiving emails about downtime too, and perhaps assisting with debugging the persistent downtime?

My next idea is to automatically restart the bot every N hours. Very cheap, and this would cause brief periods of downtime from when the bot dies and starts back up, but it still seems better than the status quo.

Is it possible that the bot is running but actually missing things during these gaps? Because it doesn't make sense to me that it would find 4 to 8 things to report each hour every hour for days, weeks, or even months at a time and then suddenly have a 4-hour gap or a series of several four-hour gaps.

In the last 30 days there have been slightly more reports than the count handled by the top 10 volunteers, which is 2,282.

That works out to 3.17 per hour.

While the reports are probably not evenly spread, I think it would be extremely rare to go four hours without a single report. Maybe not impossible, but it should be very rare, excluding the obvious examples where the system is down. Note that my numbers are based upon English only, which I think is the bulk of the reports, but I assume the email should be triggered by zero reports in four hours across all languages, which means it should be even less likely.

I'm trying to figure out how to square this with the observation that there have been multiple reports of four-hour gaps, each of which turned out to be a false positive.

I note that the other language queues have lower activity. Is there any possibility that the monitoring that automatically emails you is based upon activity separately by language? Because it is highly likely that there are four-hour gaps in the other languages. In fact, I noticed that there are zero cases closed in the last 30 days for the Czech Wikipedia and the French Wikipedia, which is troubling, although it may be an unrelated issue.

I'll also note, and it may be coincidence, that I received the following error message several times in the last 48 hours:

502 Bad Gateway

Phil

There were also Bad Gateway timeouts at Earwig's tool; perhaps there was an issue at Toolforge.

I like the idea of automatic re-starts of the bot.

Is there any possibility that the monitoring that automatically emails you is based upon activity separately by language? Because it is highly likely that there are four-hour gaps in the other languages.

We only get emails for English for this very reason. You can play around with the endpoint yourself. lang is the language, offset is the number of hours to check. You'll see a message with a 500 response code if there are no cases, or a blank page if there are some. For instance:

We then use UptimeRobot to ping those endpoints (currently only English), which emails us if there's a 500 response.
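The check described above boils down to a simple rule: a 500 response means the feed is stale. A minimal sketch of a client for it, assuming a placeholder base URL and the `lang`/`offset` parameter names mentioned above (the real endpoint URL is not shown in this thread):

```python
import urllib.error
import urllib.request


def feed_is_stale(status_code: int) -> bool:
    """Interpret the check endpoint the way UptimeRobot does:
    a 500 means no cases were filed within the offset window."""
    return status_code == 500


def check_feed(base_url: str, lang: str = "en", offset_hours: int = 4) -> bool:
    """Ping a hypothetical staleness endpoint and report whether it is stale."""
    url = f"{base_url}?lang={lang}&offset={offset_hours}"
    try:
        with urllib.request.urlopen(url) as resp:
            return feed_is_stale(resp.status)
    except urllib.error.HTTPError as e:
        # urllib raises on 5xx; the status code still tells us what we need.
        return feed_is_stale(e.code)
```

UptimeRobot then just needs to treat "500" as down and everything else as up, which matches its default HTTP monitor behavior.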

Unfortunately, as we've discovered, no reports doesn't always mean the bot is truly "down". Today's incident was different, but the other times I checked, it was very clearly still communicating with Turnitin and crawling recent edits. But as I said before, I'm not ruling out some other issue with the bot; I just personally don't know what more to do as far as debugging :( One thing I have gathered is that there is a "buffer", or a queue so to speak, of diffs to process. Sometimes that gets a bit backed up, and that appears to be what happened today. Restarting the bot clears the queue.

I'm certainly okay with auto-restarts, if you all are. I suspect the resultant downtime from a single restart will be no more than ~12 minutes: we would kill the job at, say, 00:08 UTC, then the cron should restart it two minutes later, but if it doesn't (Toolforge is slow and the job didn't completely die in time), then you'd have to wait no more than 10 additional minutes. So on average, I would say just 2 minutes of downtime. Multiply that by however many times a day we want to restart, which is up for debate. How does 4 times a day sound?

I think 8 times a day would be better (every 3 hours rather than every 6 hours). How does that sound?

I'm fine with the auto restart. My hope is that this will occur as close to the top of the hour as possible, so that if the system seems to be unavailable at the top of the hour, we can conclude it's in the middle of a restart and will be available again soon, probably no more than 12 minutes later.

Phil

restarting

My next idea is to automatically restart the bot every N hours. Very cheap, and this would cause brief periods of downtime from when the bot dies and starts back up, but it still seems better than the status quo.

this is already happening (or should happen) - the crontab command that launches the bot already does it, e.g.:

*/10 * * * * jsub -N enwiki -mem 500m -l h_rt=4:05:00 -once -quiet ...

Which means the bot should restart itself every 4 hours and 5 minutes.
(More precisely: try to launch the bot every 10 minutes if it is not running, and don't run it for more than 4 hours and 5 minutes.)

queue overflow

However, this time is different. I see there were over 1,000 pages in the buffer, and that number was only growing. I restarted the bot and it seems to actually be processing things again.

Unfortunately, neither I nor anyone at Community Tech is familiar with the Eranbot code base, nor with the Python programming language for that matter. @eranroz, would you be okay with receiving emails about downtime too, and perhaps assisting with debugging the persistent downtime?

Based on this information, it sounds to me like a code issue in the bot itself (not a service issue or Toolforge issue). It would be best to split it into a separate Phabricator ticket, which I would like to claim and assign to myself.

I usually have more time during the weekend to address issues, so to address them effectively I would like to have context for offline debugging. What I need to get started is the context of the execution, e.g.:

  • What is the last copyvio that was reported successfully? This helps me find the context of the bot; it is best that it come from @Diannaa or @Sphilbrick. When I look for it afterwards, I will have a stronger pointer than just a time.
  • Out and err logs (/data/project/eranbot/outs/<EXECUTION ID> and "enwiki.err" with the relevant output of the execution). @MusikAnimal, when you find the bot in a zombie state, can you please dump the result of "tail -n 10000 enwiki.err" to a separate file, look for the last "EXECUTION ID", and mention it?

Based on that I can further explore the code for relevant issues (or expand the logging in suspicious areas).

The recent outages occurred:

from 2020-10-27 16:36 UTC to 2020-10-27 20:43;
from 2020-10-28 00:23 to 2020-10-28 04:03;
from 2020-10-28 04:03 to 2020-10-28 08:33;
and from 2020-10-28 09:50 to 2020-10-28 16:40.

Is this the information you are looking for, @eranroz ?

For the most recent period (2020-10-28 09:50 to 2020-10-28 16:40) the last item processed was:
Article - Gebhard Leberecht von Blücher
Diff - https://en.wikipedia.org/w/index.php?title=Gebhard_Leberecht_von_Bl%FCcher&diff=985851997
Time - 2020-10-28 09:50

Phil

For the prior period (2020-10-28 04:03 to 2020-10-28 08:33):
Article - Refractive index
Diff - https://en.wikipedia.org/w/index.php?title=Refractive_index&diff=985821711
Time - 2020-10-28 04:03

Phil

The reported issue was resolved - closing. Further improvements to the bot can be worked on, but we don't need this ticket to be kept open for that to happen. Thanks again folks.

Hello again. There have been no new posts at CopyPatrol since 15:33 UTC (7 hours ago). Not sure what's going on, since there's new items posted at https://en.wikipedia.org/wiki/Special:Log/EranBot . So it looks to me like the bot is running, but the items it's finding are not getting posted at CopyPatrol. Thank you!

One step ahead of you! I was debugging based on T256501#6590756.

...
MusikAnimal, when you find the bot in a zombie state, can you please dump the result of "tail -n 10000 enwiki.err" to a separate file, look for the last "EXECUTION ID", and mention it?

The last execution files I see in the outs/ directory are as follows (with sizes/timestamps):

137  Nov 10 19:30 enwiki.o2267813
531  Nov 10 22:11 enwiki.o2270002

I did not find 2270002 or 2267813 anywhere in enwiki.err, or any mention of an "EXECUTION ID". Not sure if I'm doing this right, but at any rate, the files in outs/ seem to be very recent.

My next idea is to automatically restart the bot every N hours. Very cheap, and this would cause brief periods of downtime from when the bot dies and starts back up, but it still seems better than the status quo.

this is already happening (or should happen) - the crontab command that launches the bot already does it, e.g.:

*/10 * * * * jsub -N enwiki -mem 500m -l h_rt=4:05:00 -once -quiet ...

Which means the bot should restart itself every 4 hours and 5 minutes.
(More precisely: try to launch the bot every 10 minutes if it is not running, and don't run it for more than 4 hours and 5 minutes.)

You blew my mind telling me about this. Unfortunately, I can confirm -l h_rt=4:05:00 did not actually kill the job for my bot. Maybe it doesn't work here either. From now on I'll also check qstat to see how long the job has been running.

That said, as I was doing this debugging the enwiki job died on its own and restarted. So we should be back to normal soon. The enwiki.err looks normal, and looked normal when I first started debugging too. So this definitely isn't the same "zombie state" as last time where >1000 pages were in the buffer.


UPDATE: expected downtime due to toolsdb being in read-only mode https://lists.wikimedia.org/pipermail/cloud-announce/2020-November/000340.html

Adding: When I try to mark an assessment for a case already on the board I get an error message "There was an error in connecting to the database"

Not sure what's going on, since there's new items posted at https://en.wikipedia.org/wiki/Special:Log/EranBot . So it looks to me like the bot is running, but the items it's finding are not getting posted at CopyPatrol. Thank you!

Adding: When I try to mark an assessment for a case already on the board I get an error message "There was an error in connecting to the database"

Ohh, I forgot; toolsdb has been under maintenance for much of today and in read-only mode (see email from Cloud Services). That is why EranBot appears to be running from Special:Log (thank you for pointing that out, by the way!), but nothing is appearing in CopyPatrol, and you're unable to mark assessments. I'm led to believe the maintenance will be completed in some number of hours, but I really am not sure.

So never mind T256501#6617925; this is expected downtime outside our control. Once you stop seeing the error you describe at T256501#6617958, you'll know toolsdb is back and you should start seeing new cases show up.

I guess there's not much we can do, but unless I'm misreading timestamps, it has been down for 8 hours. My understanding is that when it is restarted, it doesn't go back, so that's quite a few potential violations that will have to be caught by other means.

Phil

The bot appears to have stopped or stalled again, as there's been no new reports for 4.5 hours. Thank you.

Another stoppage - no new reports for 4.5 hours. Adding: activity resumed a few minutes ago.

We've had several stoppages in the last 24 hours:

  • From 21:27 on 12/6 to 01:16 on 12/7 - 4 hours 20 minutes.
  • The bot then appears to have almost immediately stopped again, filing a few reports and then stopping again at around 01:22. Activity resumed at 09:41. This outage was 8 hours and 20 minutes.
  • The bot again stopped, at 10:02, and appears to still be down (no reports for over 2 hours).

The outage that started at 10:02 on the 7th ended up being nearly 11 hours long, as activity (other than one post) did not resume until 20:55. The bot only functioned for about another hour before it went down again for another 4 hours. It only posted 2 reports and then went down for another 8 hours. It posted reports for quite a while on December 8, from 09:20 until 21:38, when it went down for another 3 hours. I would appreciate it if the bot maintainers could have a look, as we're into the third day of erratic service. Thank you.

First off, sorry for the late reply. We have been swamped with conducting our annual Wishlist Survey.

I looked at the logs just now, and saw the buffer size was almost at 200 pages, so perhaps the same issue described at T256501#6585457 is to blame. I followed @eranroz's debugging tips at T256501#6590756. The last execution ID was 3780039 (enwiki.o3780039 to be exact). I did not find any mention of this anywhere in the enwiki.err file.

I'm afraid at this point I personally can't help much further. I am hoping Eran as the bot operator can take a closer look. My "solution" when this has happened in the past was to simply restart the bot. However, I can confirm the bot has been auto-restarting every 4 hours as @eranroz reported at T256501#6590756, and it has been all this time. So my "fix" wasn't ever really a fix. Rather, it was just a workaround that came at the cost of losing reports of N possible copyvios, depending on how big the buffer was at the time. Though, it doesn't explain the 8+ hour outage that Dianna mentioned in T256501#6672880. The bot should have auto-restarted twice during that time, so maybe there's more at play here than just the buffer size becoming too large.

The honest truth is that neither I nor anyone at Community Tech is familiar with how the bot works, and we aren't Python experts either :( Sorry! But we certainly would love to resolve this issue and are happy to help if given some direction.

https://github.com/valhallasw/plagiabot/pull/19 is for one issue that I found. This causes noticeable downtime.

There are also some other issues that I haven't figured out a solution for. These don't cause noticeable downtime, but pages are lost from the buffer.

UnicodeEncodeError latin-1
WARNING: API error bad-pagetriage-page: The page specified does not exist in pagetriage queue
Triage triage 991702530: bad-pagetriage-page: The page specified does not exist in pagetriage queue [help:See https://en.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce&gt; for notice of API deprecations and breaking changes.]
Traceback (most recent call last):
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 854, in <module>
    main()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 849, in main
    bot.run()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 610, in run
    self.report_uploads()  # report checked edits
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 503, in report_uploads
    self.report_log.add_report(rep['new'], rep['diff_date'], rep['title_no_ns'], rep['ns'], rep['report_id'], rep['source'])
  File "/mnt/nfs/labstore-secondary-tools-project/eranbot/gitPlagiabot/plagiabot/report_logger.py", line 90, in add_report
    report))
  File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 207, in execute
    args = tuple(map(db.literal, args))
  File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 304, in literal
    s = self.escape(o, self.encoders)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 222, in unicode_literal
    return db.literal(u.encode(unicode_literal.charset))
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2013' in position 20: ordinal not in range(256)

Something is failing to encode with latin-1 when trying to write to the database.
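The failure is easy to reproduce in isolation: U+2013 (the en dash in the traceback) has no latin-1 code point, so the driver's encode step raises. A minimal demonstration, with the likely fix of encoding as UTF-8 instead (whether the table's collation accepts UTF-8 is an assumption):

```python
title = "2013\u20132014 season"  # contains an en dash, as in the traceback

# latin-1 covers only U+0000..U+00FF, so encoding the en dash raises:
try:
    title.encode("latin-1")
except UnicodeEncodeError as err:
    print("latin-1 cannot represent the en dash:", err)

# utf-8 covers the full range of MediaWiki page titles:
data = title.encode("utf-8")
assert data.decode("utf-8") == title
```

With MySQLdb, the cleaner route is usually to open the connection with a UTF-8 charset (e.g. `charset='utf8mb4'` on `connect()`) so the driver never attempts latin-1 at all, though that depends on how the CopyPatrol tables are defined.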

UnicodeDecodeError ascii
Traceback (most recent call last):
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 854, in <module>
    main()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 849, in main
    bot.run()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 622, in run
    self.process_changes()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 422, in process_changes
    pywikibot.output("Error occurred - skipping: %s" % str(e))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)

Same for 0xe2 and 0xc4

Some other exception is caught first; this UnicodeDecodeError is then raised while attempting to print that exception to stdout before continuing.
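Under Python 2, `str(e)` yields bytes, and interpolating those bytes into a unicode format string triggers the implicit ASCII decode seen here. A defensive formatter (shown in Python 3 syntax as a sketch; the `pywikibot.output` call in the comment is the one from the traceback) avoids the secondary crash:

```python
def safe_str(obj, encoding="utf-8"):
    """Render any object for logging without risking a second Unicode error:
    byte strings are decoded explicitly (undecodable bytes replaced),
    and repr() is the last-resort fallback."""
    try:
        if isinstance(obj, bytes):
            return obj.decode(encoding, errors="replace")
        return str(obj)
    except Exception:
        return repr(obj)


# The bot's except-block could then use:
#   pywikibot.output("Error occurred - skipping: %s" % safe_str(e))
```

The same helper covers the 0xe2 and 0xc4 variants mentioned below, since any undecodable byte becomes U+FFFD instead of raising.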

ConnectionError: Connection reset by peer
ERROR: Traceback (most recent call last):
  File "/data/project/eranbot/pywikibot/core/pywikibot/data/api.py", line 1969, in submit
    body=body, headers=headers)
  File "/data/project/eranbot/pywikibot/core/pywikibot/tools/__init__.py", line 1488, in wrapper
    return obj(*__args, **__kw)
  File "/data/project/eranbot/pywikibot/core/pywikibot/comms/http.py", line 317, in request
    r = fetch(baseuri, method, params, body, headers, **kwargs)
  File "/data/project/eranbot/pywikibot/core/pywikibot/comms/http.py", line 519, in fetch
    error_handling_callback(request)
  File "/data/project/eranbot/pywikibot/core/pywikibot/comms/http.py", line 405, in error_handling_callback
    raise request.data
ConnectionError: ('Connection aborted.', error(104, 'Connection reset by peer'))

May not be an issue. It appears to be retried successfully.

ReadTimeout (EventStreams)
Traceback (most recent call last):
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 854, in <module>
    main()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 849, in main
    bot.run()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 598, in run
    for page in live_gen:
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 588, in <genexpr>
    filter_gen = lambda gen: (p for p in gen if self.page_filter(p))
  File "/data/project/eranbot/pywikibot/core/pywikibot/pagegenerators.py", line 2601, in LiveRCPageGenerator
    for entry in site_rc_listener(site, total=total):
  File "/data/project/eranbot/pywikibot/core/pywikibot/comms/eventstreams.py", line 272, in __iter__
    event = next(self.source)
  File "/data/project/eranbot/.local/lib/python2.7/site-packages/sseclient.py", line 72, in __next__
    self._connect()
  File "/data/project/eranbot/.local/lib/python2.7/site-packages/sseclient.py", line 47, in _connect
    self.resp = requester.get(self.url, stream=True, **self.requests_kwargs)
  File "/data/project/eranbot/.local/lib/python2.7/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/data/project/eranbot/.local/lib/python2.7/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/data/project/eranbot/.local/lib/python2.7/site-packages/requests/sessions.py", line 502, in request
    resp = self.send(prep, **send_kwargs)
  File "/data/project/eranbot/.local/lib/python2.7/site-packages/requests/sessions.py", line 612, in send
    r = adapter.send(request, **kwargs)
  File "/data/project/eranbot/.local/lib/python2.7/site-packages/requests/adapters.py", line 516, in send
    raise ReadTimeout(e, request=request)
ReadTimeout: HTTPSConnectionPool(host='stream.wikimedia.org', port=443): Read timed out. (read timeout=30)

HTTPError: 429 Client Error: Too Many Requests (EventStreams)
Traceback (most recent call last):
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 854, in <module>
    main()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 849, in main
    bot.run()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 598, in run
    for page in live_gen:
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 588, in <genexpr>
    filter_gen = lambda gen: (p for p in gen if self.page_filter(p))
  File "/data/project/eranbot/pywikibot/core/pywikibot/pagegenerators.py", line 2601, in LiveRCPageGenerator
    for entry in site_rc_listener(site, total=total):
  File "/data/project/eranbot/pywikibot/core/pywikibot/comms/eventstreams.py", line 272, in __iter__
    event = next(self.source)
  File "/data/project/eranbot/.local/lib/python2.7/site-packages/sseclient.py", line 72, in __next__
    self._connect()
  File "/data/project/eranbot/.local/lib/python2.7/site-packages/sseclient.py", line 52, in _connect
    self.resp.raise_for_status()
  File "/data/project/eranbot/.local/lib/python2.7/site-packages/requests/models.py", line 937, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
HTTPError: 429 Client Error: Too Many Requests for url: https://stream.wikimedia.org/v2/stream/recentchange

ResponseNotReady
Traceback (most recent call last):
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 854, in <module>
    main()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 849, in main
    bot.run()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 626, in run
    while not self.uploads_ready(): continue
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 198, in uploads_ready
    document_get_response = self.server.document.get({'id': upload_id, 'sid': self.sid})
  File "/usr/lib/python2.7/xmlrpclib.py", line 1243, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1602, in __request
    verbose=self.__verbose
  File "/usr/lib/python2.7/xmlrpclib.py", line 1283, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib/python2.7/xmlrpclib.py", line 1313, in single_request
    response = h.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1124, in getresponse
    raise ResponseNotReady()
ResponseNotReady

Looks like https://stackoverflow.com/questions/3231543/python-httplib-responsenotready

HTTPError: 429 Client Error: Too Many Requests (EventStreams)

One of my bots has been hitting this lately. There might be a larger issue with EventStreams

HTTPError: 429 Client Error: Too Many Requests (EventStreams)

One of my bots has been hitting this lately. There might be a larger issue with EventStreams

Yea, I think the ReadTimeouts cause it to attempt to reconnect. Repeatedly doing that could lead to 429s.
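If reconnect storms after ReadTimeouts are what trips the 429 limit, capped exponential backoff between attempts is the usual mitigation. A sketch (the reconnect call in the comment is a placeholder, not sseclient's actual API):

```python
import random
import time


def backoff_delays(base=1.0, cap=300.0, attempts=8):
    """Yield delays of base * 2^n seconds, capped at `cap`,
    with up to 10% random jitter to avoid synchronized retries."""
    for n in range(attempts):
        delay = min(cap, base * (2 ** n))
        yield delay + random.uniform(0, delay / 10)


# Usage sketch (reconnect_eventstream is hypothetical):
# for delay in backoff_delays():
#     try:
#         reconnect_eventstream()
#         break
#     except Exception:
#         time.sleep(delay)
```

Spacing retries out this way keeps the client under the EventStreams rate limit even when stream.wikimedia.org is slow to recover.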

@JJMC89 good catch - thank you for the pull request - I've merged it, and we shall see how this affects the bot's stability.

  • For the encoding issues - I saw them in the past and solved some of them, though there are probably more. I couldn't find a good failing example in the log to test and fix, but I think they are quite rare.
  • I suspect there may be a more general issue related to EventStreams. The bot has some legacy code that allows it to run with irc.wikimedia.org to get events. I prefer to keep using the newer EventStreams code, which is more modern, but if we see the issue keep popping up, we can also try the legacy IRC mode and see how it does with that (it may have a dependency on an IRC library).

The bot appears to have stopped again; there's been no new reports for over two hours after coming in at a rapid pace all day (3 to 10 reports per hour). Thanks.

The bot appears to have stopped again; there's been no new reports for over two hours after coming in at a rapid pace all day (3 to 10 reports per hour). Thanks.

It is running. The only things that I see in the logs between Draft:Greco-Japanese wars and Draft:Carroll Stevens are non-violations and what looks like the automatic restart.

Now I am getting an error message when I try to load the page: "Error: A website error has occurred. The website administrator has been notified of the issue. Sorry for the temporary inconvenience."

Now I am getting an error message when I try to load the page: "Error: A website error has occurred. The website administrator has been notified of the issue. Sorry for the temporary inconvenience."

The database backing CopyPatrol is down for maintenance to address T266587: ToolsDB replication is broken.

The maintenance seems to be complete, but the bot has not yet re-started.

The maintenance seems to be complete, but the bot has not yet re-started.

It is running, but it hasn't found any violations since the maintenance ended.

Diannaa reopened this task as Open. Edited Feb 27 2021, 9:21 PM

Hi everybody. We've had some erratic service from this bot in the last few days, with numerous outages ranging from 1.5 to 4 hours in length.

Most recently the bot stopped posting at 11:28 UTC and resumed at 13:56, posted two reports, and stopped again at 13:58. It resumed at 18:00 and stopped at 18:48 after posting only six reports. The bot is currently stopped. Any assistance would be appreciated. Thanks.

The logs don't have timestamps, which makes investigating issues difficult.


Between John 1:6 (2021-02-27 11:28) and Draft:Dr. Shulamith Reich Elster (2021-02-27 13:56)
Traceback (most recent call last):
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 861, in <module>
    main()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 856, in main
    bot.run()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 633, in run
    while not self.uploads_ready(): continue
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 193, in uploads_ready
    if time.time()-self.last_uploads_status < 45:
KeyboardInterrupt

This is probably the forced restart due to the max runtime. We should try removing the max runtime.


Before 26 Martyrs of Japan (2021-02-27 18:00)
Traceback (most recent call last):
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 861, in <module>
    main()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 856, in main
    bot.run()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 633, in run
    while not self.uploads_ready(): continue
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 200, in uploads_ready
    raise Exception('Error retreving document {}. Response status: {}'.format(upload_id, document_get_response['status']))
Exception: Error retreving document 69421521. Response status: 401

This should be caught somewhere so that it doesn't crash the bot.
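The guard meant here might look like the following. The `server.document.get({'id': ..., 'sid': ...})` shape follows the traceback; treating a non-200 status (such as the 401 from an expired session) as "retry later" rather than fatal is my assumption:

```python
def get_document_safely(server, sid, upload_id):
    """Fetch one iThenticate document record via XML-RPC; return None on
    errors (e.g. a 401 from an expired session) instead of crashing the bot."""
    try:
        response = server.document.get({"id": upload_id, "sid": sid})
    except Exception:
        return None  # network/transport failure: let the caller retry later
    if response.get("status") != 200:
        return None  # e.g. 401: session expired, caller should re-login
    return response
```

The caller in `uploads_ready()` could then skip that document for this polling cycle and re-authenticate if `None` keeps coming back, instead of taking the whole bot down.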


No issues in the log between 2021-02-27 18:48 and 2021-02-27 19:13.


Last in log after Maharashtracha Superstar (2021-02-27 19:13)
Traceback (most recent call last):
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 861, in <module>
    main()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 856, in main
    bot.run()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 629, in run
    self.process_changes()
  File "/data/project/eranbot/gitPlagiabot/plagiabot/plagiabot.py", line 429, in process_changes
    pywikibot.output("Error occurred - skipping: %s" % str(e))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 37: ordinal not in range(128)

Another decoding issue

The 19:13 report was actually posted to CopyPatrol about 2 hours after it occurred. It's the most recent report on the board, circa 3.5 hours ago.

It would be welcome if new tasks were created for new incidents, with old tasks from months ago kept resolved - thanks!

Okay I will do that in the future, and add the same subscribers.

Has been working ok for a few days, closing.