Pywikibot stops when finding the character \uFFFD - 'REPLACEMENT CHARACTER'
Open, High, Public

Description

I have a Pywikibot script that adds tens of thousands of files to a category. But when the script gets to a file containing the character \uFFFD, it stops and says:

"Title contains illegal char (\uFFFD 'REPLACEMENT CHARACTER')"

I believe the script should not stop in such a case; it should simply report such files to an error log, and maybe skip them. In any case, the script should not stop.

My script is this:
python pwb.py category add -pt:1 -file:yH5.txt -to:"Taken with Sony DSC-H5" -summary:"Bot: Adding category [[:Category:Taken with Sony DSC-H5|Taken with Sony DSC-H5]] using Pywikibot in automatic mode"

And it stops at this file:
[[:commons:File:A Warli painting by Jivya Soma Mashe, Thane district.jpg]]

Event Timeline

It is [[Category:Mus�e du quai Branly]] that causes the problem, in page.py(5362)__init__():

(Pdb) '\ufffd' in text
True
(Pdb) text
'Category:Mus�e du quai Branly'
(Pdb)

Maybe

if u'\ufffd' in t:
    raise pywikibot.Error(
        "Title contains illegal char (\\uFFFD 'REPLACEMENT CHARACTER')")

should raise InvalidTitle instead.
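
The practical difference between the two raises can be sketched roughly as follows. This is only a self-contained illustration: the `Error` and `InvalidTitle` classes and the `check_title` / `is_valid` helpers are stand-ins defined here, not Pywikibot's actual code, and the sketch assumes that InvalidTitle is a subclass of the generic Error.

```python
class Error(Exception):
    """Stand-in for a generic framework error (hypothetical)."""


class InvalidTitle(Error):
    """Stand-in for a title-specific error (hypothetical)."""


def check_title(t):
    """Raise InvalidTitle if the title contains U+FFFD, else return it."""
    if '\ufffd' in t:
        raise InvalidTitle(
            "Title contains illegal char (\\uFFFD 'REPLACEMENT CHARACTER')")
    return t


def is_valid(t):
    """Return False only for bad titles; any other error still propagates."""
    try:
        check_title(t)
    except InvalidTitle:
        return False
    return True
```

With a specific exception class, a caller can react to bad titles alone, without also swallowing every other error that happens to share the generic base class.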

TheSandDoctor renamed this task from Pywikibot stops when finding the character \uFFFD - 'REPLACEMENT CHARACTER' to Pywikibot stops when finding the character \uFFFD - 'REPLACEMENT CHARACTER' in category add. Feb 21 2020, 3:53 PM
TheSandDoctor subscribed.
This comment was removed by TheSandDoctor.

@zhuyifei1999 Do you think that such a raise could be made? The problem that I see with both handlings though is that the titles are not "invalid" as they are the valid image titles on the wiki(?). I am also having this issue when it comes to my Commons Corruption Checking task.

TheSandDoctor renamed this task from Pywikibot stops when finding the character \uFFFD - 'REPLACEMENT CHARACTER' in category add to Pywikibot stops when finding the character \uFFFD - 'REPLACEMENT CHARACTER'. Feb 21 2020, 4:00 PM

\ufffd is invalid: no user can create a page with it in the title or upload an image with it in the name. I would raise an InvalidTitle error too, though.

Change 574111 had a related patch set uploaded (by Dvorapa; owner: Dvorapa):
[pywikibot/core@master] [bugfix] Throw InvalidTitle for title containing illegal char

https://gerrit.wikimedia.org/r/574111

@Dvorapa But what would cause it to return it then when looking at images? I am sort of confused here.

Also: thanks for switching it to a catchable exception! :)

TheSandDoctor triaged this task as High priority.

Also: thanks for switching it to a catchable exception! :)

Yeah, the error should be better after the patch.

@Dvorapa But what would cause it to return it then when looking at images? I am sort of confused here.

I can see the issue (it's in the text, not in the title), but I don't know which part of Pywikibot makes it worse. Could you describe in detail the steps you took to observe this issue?

Change 574111 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] Raise InvalidTitle for title containing illegal char

https://gerrit.wikimedia.org/r/574111

@Dvorapa Just trying to create a file page reference given a name of a file from recent changes. Here is the error log.

The line that causes the problem is file_page = pywikibot.FilePage(site, change.title) of rcworker.py

I'm sorry, I don't understand what your script does or how. There should be no page title on Commons with that character. If there are links to a page title with that character, or someone tried to upload a file with that character in its title using Pywikibot, scripts should always check for the newly raised InvalidTitle error.

So what else do you think we should fix in Pywikibot? After the patch it fails correctly: FilePage('invalid characters') should raise this error. Pywikibot's category.py now correctly tries to remove the category with an invalid title (I'm not sure if this is desired, but no one has complained yet, I think).

Dvorapa changed the task status from Open to Stalled. Edited Feb 22 2020, 7:36 PM

@Fructibus @TheSandDoctor Anything else we should fix in Pywikibot regarding this issue?

@Dvorapa The script grabs the file from recent changes using site_rc_listener (script, ImageObj) and then sends it to rcworker (linked above) via redis. rcworker then creates a pwb FilePage object from the title in the recent changes log and processes the file. So site_rc_listener must be what is handing it the invalid image titles? Something just doesn't add up for me here: it doesn't make sense that the script is being given invalid image titles by pwb's site_rc_listener.

I don't understand it either. Could you find out which file/image causes this (or is it random files/images? or all files/images?) and also tell me which wiki you are checking? (Commons?)

@Dvorapa Commons. The issue appears to happen at random. I have improved the ordering of my logs so next time it happens it should hopefully actually tell me the file name at issue (configured to log the file name before trying to make a FilePage object out of it, hopefully it will do that before crashing). Given that the files are only run from recent changes if they are new uploads, this isn't something easily repeatable and does appear to happen at random. I will update here when I have more logs. Thank you for your patch to make the exception catchable.

Now we need to know why this is happening, because I don't think there are that many new files on Commons with this character. Perhaps there is an issue with sseclient, who knows?

What is the traceback?

Any minimal reproducible test case (even if it takes a long time to reproduce that's still something)?

@zhuyifei1999 The only log currently available is as follows (and linked above):

2020-02-18 18:45:16,662 __main__    : INFO File:PICT0430 - 301032 - onroerenderfgoed.jpg :Not corrupt. Stored
Traceback (most recent call last):
  File "rcworker.py", line 221, in <module>
    main()
  File "rcworker.py", line 214, in main
    run_worker()
  File "rcworker.py", line 61, in run_worker
    file_page = pywikibot.FilePage(site, change.title)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 2478, in __init__
    super(FilePage, self).__init__(source, title, 6)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 2327, in __init__
    super(Page, self).__init__(source, title, ns)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 200, in __init__
    self._link = Link(title, source=source, default_namespace=ns)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 6029, in __init__
    raise pywikibot.Error(
pywikibot.exceptions.Error: Title contains illegal char (\uFFFD 'REPLACEMENT CHARACTER')

I have improved the logging so that it logs the name before attempting to generate a FilePage object, instead of doing so afterwards. The current code setup is basically a minimal test case, in the sense that rcwatcher.py would not change and rcworker.py would just be reduced to a few lines. The best way forward would be to just wait for it to crash again, as that way it is still doing work in the meantime. I will post here when the next log is available.

rcworker.py would just be reduced to a few lines

Is the redis queue critical to reproducing the issue? If not, it is an extra layer of complexity, and a minimal reproducible test case does not need it.

@zhuyifei1999 Unknown at this point. Implemented and running alongside it now. If/when either it or any of the 5 workers crash, I will report back here.

@zhuyifei1999 For both of these, it is worth noting that I have not updated to the latest version with the change in behaviour merged for this task.

From a normal worker:

2020-02-24 08:42:16,590 __main__    : INFO File:�রপক্ষ মন্দির থেকে ভক্তদের বের হওয়াjpgশ্য..
Traceback (most recent call last):
  File "rcworker.py", line 221, in <module>
    main()
  File "rcworker.py", line 214, in main
    run_worker()
  File "rcworker.py", line 63, in run_worker
    file_page = pywikibot.FilePage(site, change.title)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 2478, in __init__
    super(FilePage, self).__init__(source, title, 6)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 2478, in __init__
    super(FilePage, self).__init__(source, title, 6)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 2327, in __init__
    super(Page, self).__init__(source, title, ns)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 200, in __init__
    self._link = Link(title, source=source, default_namespace=ns)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 6029, in __init__
    raise pywikibot.Error(
pywikibot.exceptions.Error: Title contains illegal char (\uFFFD 'REPLACEMENT CHARACTER')

From another worker:

2020-02-23 19:44:35,182 __main__    : INFO File:Januš Radzivił. Януш �адзівіл (1646-53).jpg
Traceback (most recent call last):
  File "rcworker.py", line 221, in <module>
    main()
  File "rcworker.py", line 214, in main
    run_worker()
  File "rcworker.py", line 63, in run_worker
    file_page = pywikibot.FilePage(site, change.title)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 2478, in __init__
    super(FilePage, self).__init__(source, title, 6)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 2327, in __init__
    super(Page, self).__init__(source, title, ns)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 200, in __init__
    self._link = Link(title, source=source, default_namespace=ns)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 6029, in __init__
    raise pywikibot.Error(
pywikibot.exceptions.Error: Title contains illegal char (\uFFFD 'REPLACEMENT CHARACTER')

From minimal reproducible test case:

2020-02-24 19:52:12,881 __main__    : DEBUG File:Мавзолей Д�тэ Мицумуне в храме Энцуин.jpg
Traceback (most recent call last):
  File "test_rc.py", line 68, in <module>
    main()
  File "test_rc.py", line 60, in main
    run_watcher()
  File "test_rc.py", line 51, in run_watcher
    file_page = pywikibot.FilePage(site, change['title'])
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 2478, in __init__
    super(FilePage, self).__init__(source, title, 6)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 2327, in __init__
    super(Page, self).__init__(source, title, ns)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 200, in __init__
    self._link = Link(title, source=source, default_namespace=ns)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 6029, in __init__
    raise pywikibot.Error(
pywikibot.exceptions.Error: Title contains illegal char (\uFFFD 'REPLACEMENT CHARACTER')

They appear to have stopped on three separate files, none of which exist.

This seems like an encoding issue

Would you mind posting the code of the 'minimal test case' somewhere?

@Dvorapa all encoding is set to UTF-8.

@zhuyifei1999 Oops, thought I had. see here

Python 2 or 3?

Also, what is the output of echo $LC_CTYPE and echo $LANG? It is weird that this happens only for some letters. I cannot reproduce it on my machine. If you use Pywikibot packaged with scripts, could you share the output of python pwb.py version? And what are your versions of requests and sseclient?

@Dvorapa echo $LC_CTYPE just returns blank (null?). echo $LANG returns en_US.UTF-8. For this issue to appear when running the minimal test case, it could take many hours or days (uninterrupted). It works fine 99% of the time, but it seems to crash every couple of days due to this issue.

The first traceback above
2020-02-22 20:02:09 <- start
2020-02-24 08:42:16 <- crash

The second traceback I posted above
2020-02-22 20:02:20 <-start
2020-02-23 19:44:35 <-crash

2020-02-23 07:46:35 <- test_rc.py launched
2020-02-24 19:52:12 <- crash

python3 pwb.py version
Pywikibot: [https] r-pywikibot-core.git (a511f01, g11811, 2020/01/20, 16:27:32, OUTDATED)
Release version: 3.1.dev0
requests version: 2.22.0

I have the latest version of sseclient installed per the below.

sudo pip3 show sseclient | grep Version
WARNING: The directory '/home/thesanddoctor/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Version: 0.0.24

And the second latest version of requests (new one released 6 days ago)

sudo pip3 show requests | grep Version
WARNING: The directory '/home/thesanddoctor/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Version: 2.22.0

Hmm so the first one crashed during the third's run? I would assume that if events were global then all readers would crash at once.

@zhuyifei1999 the first and second traceback are from "production" worker instances and pop items off of the same redis queue (all fed by a single instance of rcwatcher.py), thus they wouldn't get the same image. So it isn't feasible that they would crash all at once. They basically get images first-come, first-serve from recent changes.

So while the event data are loaded from JSON, hex-escaping non-ASCII characters is optional:

>>> json.loads('{"a":"File:Januš Radzivił. Януш Радзівіл (1646-53).jpg"}')
{'a': 'File:Januš Radzivił. Януш Радзівіл (1646-53).jpg'}
>>> json.loads('{"a":"File:Janu\u0161 Radzivi\u0142. \u042f\u043d\u0443\u0448 \u0420\u0430\u0434\u0437\u0456\u0432\u0456\u043b (1646-53).jpg"}')
{'a': 'File:Januš Radzivił. Януш Радзівіл (1646-53).jpg'}

This leads me to believe the input to the decoder is already corrupted. Possibly at a packet / chunk boundary?
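
To illustrate why a chunk boundary is a plausible culprit, here is a minimal sketch using only the Python standard library. It does not reproduce sseclient's code; the title is adapted from the log above and the split position is chosen by hand. It only shows that decoding each chunk of a UTF-8 byte stream on its own turns a multi-byte character cut in half into U+FFFD, while an incremental decoder carries the partial bytes over to the next chunk.

```python
import codecs

title = "File:Януш Радзівіл.jpg"
raw = title.encode("utf-8")

# Cut the byte stream in the middle of the two-byte sequence for 'Р'.
cut = raw.index("Р".encode("utf-8")) + 1
chunks = [raw[:cut], raw[cut:]]

# Naive approach: decode every chunk independently.
naive = "".join(c.decode("utf-8", errors="replace") for c in chunks)

# Incremental approach: the decoder buffers the incomplete sequence.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
incremental = "".join(decoder.decode(c) for c in chunks)
incremental += decoder.decode(b"", final=True)

print(repr(naive))
print(repr(incremental))
```

Because the damage depends on where the boundary happens to fall, this kind of bug would only affect titles with non-ASCII characters and would look random, which matches the reports above.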

This reminds me of https://github.com/btubbs/sseclient/issues/38 and https://github.com/btubbs/sseclient/pull/39. @TheSandDoctor Could you see if the issue is still there after the patch? CC @Count_Count

@zhuyifei1999 the first and second traceback are from "production" worker instances and pop items off of the same redis queue (all fed by a single instance of rcwatcher.py), thus they wouldn't get the same image. So it isn't feasible that they would crash all at once. They basically get images first-come, first-serve from recent changes.

The first and third are two separate consumers of rcstreams right?

@zhuyifei1999 Yes, the first and third are two separate consumers. The first and second are working with the same consumer. The third (code) just prints the file name straight from the recent changes listener and tries to turn it into a file page (until it fails).

Yes, what I was saying was: the first and the third are two separate consumers, so events received by the first should also be received by the third. If there were something fundamentally wrong with the event data, then both would crash. Since this is not the case, there is nothing fundamentally wrong with the event data, and therefore the error must be elsewhere, such as in transmission or decoding, which leads to the linked bug report / PR.

In any case, could you see if that fixes the issue?

@zhuyifei1999 requests has been updated & the workers/feeder all restarted. I have re-started test_rc.py and will post back here if anything crashes. If it is good in a few days/week or something like that I think we could consider this resolved. Thanks for your help so far!

Hopefully this helps. There were more issues with sseclient 0.0.23 and 0.0.24 (https://github.com/wikimedia/pywikibot/commit/6cf25c0fe51991a892486a32211a434c695357b6), hopefully 0.0.25 will solve them all

Since updating to the latest master version of sseclient (post-fix merges) more of the workers crashed than usual (4 of the 5). 3 of the 4 crashes were due to the same issue.

The 4th was due to the page slipping through because of a move. I will have to add a catch for that.

2020-02-27 23:07:01,324 __main__    : DEBUG None
2020-02-27 23:07:01,324 __main__    : INFO File:Vanessa Mai at Gruenspan 2019 (3).png
Traceback (most recent call last):
  File "rcworker.py", line 221, in <module>
    main()
  File "rcworker.py", line 214, in main
    run_worker()
  File "rcworker.py", line 110, in run_worker
    revision = change.getRevision(file_page)
  File "/home/ccc/Commons-image-corruption-detector/Image.py", line 29, in getRevision
    revision = file_page.get_file_history()[pywikibot.Timestamp.fromtimestampformat(self.log_timestamp)]
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 2533, in get_file_history
    self.site.loadimageinfo(self, history=True)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/site.py", line 3104, in loadimageinfo
    raise PageRelatedError(
pywikibot.exceptions.PageRelatedError: loadimageinfo: Query on [[commons:File:Vanessa Mai at Gruenspan 2019 (3).png]] returned no imageinfo
2020-02-27 21:50:56,281 __main__    : INFO File:Classic Remise Berlin, Wiebestra�e 36, Berlin-Moabit, Bild 15.jpg
Traceback (most recent call last):
  File "rcworker.py", line 221, in <module>
    main()
  File "rcworker.py", line 214, in main
    run_worker()
  File "rcworker.py", line 63, in run_worker
    file_page = pywikibot.FilePage(site, change.title)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 2478, in __init__
    super(FilePage, self).__init__(source, title, 6)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 2327, in __init__
    super(Page, self).__init__(source, title, ns)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 200, in __init__
    self._link = Link(title, source=source, default_namespace=ns)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 6029, in __init__
    raise pywikibot.Error(
pywikibot.exceptions.Error: Title contains illegal char (\uFFFD 'REPLACEMENT CHARACTER')

But an interesting log here...

2020-02-28 18:51:30,439 __main__    : INFO File:Paris FC-AC Ajaccio Stade Charl�ty 16.jpg
Traceback (most recent call last):
  File "rcworker.py", line 169, in run_worker
    store_image(file_page.title(), False, img_hash=change.hash)  # store in database
  File "/home/ccc/Commons-image-corruption-detector/database_stuff.py", line 67, in store_image
    page_id = manapi.getPageID(title)
  File "/home/ccc/Commons-image-corruption-detector/manapi.py", line 38, in getPageID
    return int(getImageInfo(title)['pageid'])
  File "/home/ccc/Commons-image-corruption-detector/manapi.py", line 27, in getImageInfo
    for _, v in response.json()['query']['pages'].items():
  File "/home/ccc/.local/lib/python3.8/site-packages/requests/models.py", line 898, in json
    return complexjson.loads(self.text, **kwargs)
  File "/usr/local/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "rcworker.py", line 221, in <module>
    main()
  File "rcworker.py", line 214, in main
    run_worker()
  File "rcworker.py", line 63, in run_worker
    file_page = pywikibot.FilePage(site, change.title)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 2478, in __init__
    super(FilePage, self).__init__(source, title, 6)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 2327, in __init__
    super(Page, self).__init__(source, title, ns)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 200, in __init__
    self._link = Link(title, source=source, default_namespace=ns)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 6029, in __init__
    raise pywikibot.Error(
pywikibot.exceptions.Error: Title contains illegal char (\uFFFD 'REPLACEMENT CHARACTER')
2020-02-28 19:00:41,492 __main__    : INFO File:�იქაელი.jpg
Traceback (most recent call last):
  File "rcworker.py", line 221, in <module>
    main()
  File "rcworker.py", line 214, in main
    run_worker()
  File "rcworker.py", line 63, in run_worker
    file_page = pywikibot.FilePage(site, change.title)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 2478, in __init__
    super(FilePage, self).__init__(source, title, 6)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 2327, in __init__
    super(Page, self).__init__(source, title, ns)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 200, in __init__
    self._link = Link(title, source=source, default_namespace=ns)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/tools/__init__.py", line 1744, in wrapper
    return obj(*__args, **__kw)
  File "/usr/local/lib/python3.8/site-packages/pywikibot/page.py", line 6029, in __init__
    raise pywikibot.Error(
pywikibot.exceptions.Error: Title contains illegal char (\uFFFD 'REPLACEMENT CHARACTER')

I tried this file and a random file to compare and Pywikibot is correct. The file has no imageinfo as it is a redirect. See the file page: https://commons.wikimedia.org/wiki/File:Vanessa_Mai_at_Gruenspan_2019_(3).png

pywikibot.exceptions.Error: Title contains illegal char (\uFFFD 'REPLACEMENT CHARACTER')

This should not happen after the patch to sseclient anymore? Is the patch working correctly?

@Dvorapa I saw that myself & just updated my above comment prior to seeing your response. It appears to have slipped through; I will have to add a catch for that. Others still relevant though.

pywikibot.exceptions.Error: Title contains illegal char (\uFFFD 'REPLACEMENT CHARACTER')

This should not happen after the patch anymore?

No, it shouldn't. sseclient's master just had a patch merged prior to my updating that should've fixed it (it was the one that @zhuyifei1999 linked).

I will be running the script with pdb + save all sseclient trace over the weekend.

Running test_rc.py, right? And thanks. That's a good length -- as I've said before, it can take anywhere from a few hours to a couple of days to run into the issue.

I just discovered that rcwatcher.py crashed at some point within the past couple of days. Interesting.

Traceback (most recent call last):
  File "rcwatcher.py", line 65, in <module>
    main()
  File "rcwatcher.py", line 57, in main
    run_watcher()
  File "rcwatcher.py", line 41, in run_watcher
    for change in rc:
  File "/usr/local/lib/python3.8/site-packages/pywikibot/comms/eventstreams.py", line 291, in __iter__
    self.source = EventSource(**self.sse_kwargs)
  File "/home/thesanddoctor/sseclient/sseclient.py", line 48, in __init__
    self._connect()
  File "/home/thesanddoctor/sseclient/sseclient.py", line 63, in _connect
    self.resp.raise_for_status()
  File "/home/ccc/.local/lib/python3.8/site-packages/requests/models.py", line 941, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://stream.wikimedia.org/v2/stream/recentchange

test_rc.py is still going strong.

I just discovered that rcwatcher.py crashed at some point within the past couple of days. Interesting.

[...]

requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://stream.wikimedia.org/v2/stream/recentchange

This is not the fault of the sseclient library. It happens if more than a few (I don't know the exact limit) connections are made to the Wikimedia event streaming servers.
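
Nobody in this thread proposed a fix for the 429, but one common way to keep a watcher from dying on it is to retry the connection with an increasing delay. The following is a rough sketch under stated assumptions: `TooManyRequests` and `connect_with_backoff` are hypothetical names defined here, the `connect` callable stands for whatever opens the stream, and the real rate limits of the streaming servers are not known from this discussion.

```python
import time


class TooManyRequests(Exception):
    """Stand-in for an HTTP 429 error (hypothetical)."""


def connect_with_backoff(connect, max_attempts=5, base_delay=1.0,
                         sleep=time.sleep):
    """Call connect(), waiting 1x, 2x, 4x... base_delay between retries."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except TooManyRequests:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper easy to exercise without actually waiting. Reducing the number of simultaneous connections would address the cause rather than the symptom.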

TheSandDoctor closed this task as Resolved. Edited Mar 11 2020, 7:45 PM

Going to resolve this then, given that it appears unavoidable. At least now that it throws an exception, it can be handled (i.e. skip to the next item in the queue).
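
A skip-and-continue loop of the kind described could look roughly like this. It is a sketch, not the actual rcworker.py: `InvalidTitle`, `make_file_page` and `run_worker` are hypothetical stand-ins defined here so that the example runs without Pywikibot or redis, and a plain list replaces the queue.

```python
import logging

logger = logging.getLogger(__name__)


class InvalidTitle(Exception):
    """Stand-in for the catchable title error (hypothetical)."""


def make_file_page(title):
    """Hypothetical stand-in for constructing a file page from a title."""
    if '\ufffd' in title:
        raise InvalidTitle(title)
    return title


def run_worker(titles):
    """Process each title; log and skip the ones that cannot be parsed."""
    processed, skipped = [], []
    for title in titles:
        logger.info(title)  # log first, so the offending name is recorded
        try:
            page = make_file_page(title)
        except InvalidTitle:
            logger.warning("Skipping invalid title: %r", title)
            skipped.append(title)
            continue
        processed.append(page)
    return processed, skipped
```

Note that skipping only keeps the worker alive: a title corrupted in transit still refers to a real upload that goes unchecked, so keeping the skipped list around for a later retry may be worthwhile.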

zhuyifei1999 claimed this task.

Sorry, I was extremely busy the last two weeks. I think if it's a bug it should stay open. I'll work on it next week.

Thanks @zhuyifei1999! Have you been able to work on this at all?

Sorry, I think I was working on it last year and then forgot about this ticket. I'll check what I was doing back then.

Is this bug still reproducible?

Removing task assignee due to inactivity as this open task has been assigned for more than two years. See the email sent to the task assignee on August 22nd, 2022.
Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome!
If this task has been resolved in the meantime, or should not be worked on ("declined"), please update its task status via "Add Action… 🡒 Change Status".
Also see https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator. Thanks!