
pywikibot replace.py hangs on certain conditions
Closed, Resolved (Public)

Description

Log from a debug run on Ubuntu 12.04
Before I updated to the latest core version, the script worked for some files and crashed with a similar error on others. Now it does not work at all.

$python  pwb.py replace.py -regex '(?s)^(.*)$' "{{యాంత్రిక అనువాదం}}\1" -file:"/home/arjun/RCourse/tewsn/gtp2mk2.txt" -simulate -debug -v -log
The base directory is /home/arjun/corenew/core
=== Pywikibot framework v2.0 -- Logging header ===
COMMAND: ['replace.py', '-regex', '(?s)^(.*)$', '{{\xe0\xb0\xaf\xe0\xb0\xbe\xe0\xb0\x82\xe0\xb0\xa4\xe0\xb1\x8d\xe0\xb0\xb0\xe0\xb0\xbf\xe0\xb0\x95 \xe0\xb0\x85\xe0\xb0\xa8\xe0\xb1\x81\xe0\xb0\xb5\xe0\xb0\xbe\xe0\xb0\xa6\xe0\xb0\x82}}\\1', '-file:/home/arjun/RCourse/tewsn/gtp2mk2.txt', '-simulate', '-debug', '-v', '-log']
DATE: 2015-05-14 03:56:51.045862 UTC
VERSION: [https] r-pywikibot-core.git (41901c7, g5504, 2015/05/11, 20:05:08, n/a)
SYSTEM: ('Linux', 'arjun-945GCM-S2L', '3.2.48-ctl471', '#1 SMP Thu Aug 15 09:42:30 IST 2013', 'i686')
CONFIG FILE DIR: /home/arjun/corenew/core
PACKAGES:
  Tkinter (/usr/lib/python2.7/lib-tk/Tkinter.pyc) = $Revision: 81008 $
  distutils (/usr/lib/python2.7/distutils/) = 2.7.3
  email (/usr/lib/python2.7/email/) = 4.0.3
  json (/usr/lib/python2.7/json/) = 2.0.9
  logging (/usr/lib/python2.7/logging/) = 0.5.1.2
  mpl_toolkits (/usr/lib/pymodules/python2.7/mpl_toolkits/) = ??
  mwparserfromhell: No module named mwparserfromhell
  pickle (/usr/lib/python2.7/pickle.pyc) = $Revision: 72223 $
  pywikibot ([path unknown]) = ??
  re (/usr/lib/python2.7/re.pyc) = 2.2.1
  setuptools (/usr/lib/python2.7/dist-packages/setuptools/) = 0.6
  urllib (/usr/lib/python2.7/urllib.pyc) = 1.17
  urllib2 (/usr/lib/python2.7/urllib2.pyc) = 2.7
MODULES:
  /home/arjun/corenew/core/pywikibot/textlib.py 462afa2 2015-05-10 17:15:28.994976
  /home/arjun/corenew/core/pywikibot/data/api.py c6cbf67 2015-05-10 17:15:28.922976
  /home/arjun/corenew/core/pywikibot/userinterfaces/__init__.py 43eceeb 2015-05-10 17:15:28.998976
  /home/arjun/corenew/core/pywikibot/i18n.py 77e57c6 2015-05-10 17:15:28.946976
  /home/arjun/corenew/core/pywikibot/comms/threadedhttp.py 69cf1f8 2015-05-13 15:21:59.643962
  /home/arjun/corenew/core/pywikibot/date.py 262e786 2015-05-10 17:15:28.930976
  /home/arjun/corenew/core/pywikibot/data/__init__.py 44183c7 2015-05-10 17:15:28.914976
  /home/arjun/corenew/core/pywikibot/fixes.py d84788c 2015-05-10 17:15:28.946976
  /home/arjun/corenew/core/pywikibot/exceptions.py 4a30d02 2015-05-13 15:21:59.655962
  /home/arjun/corenew/core/pywikibot/site.py eae3500 2015-05-13 15:21:59.699962
  /home/arjun/corenew/core/pywikibot/bot.py de68ebd 2015-05-10 17:15:28.902976
  /home/arjun/corenew/core/pywikibot/__init__.py d6ea7ec 2015-05-13 15:21:59.639962
  /home/arjun/corenew/core/pywikibot/throttle.py 4157254 2015-05-10 17:15:28.994976
  /home/arjun/corenew/core/pywikibot/page.py db5a1ef 2015-05-13 15:21:59.675962
  /home/arjun/corenew/core/pywikibot/editor.py 7d0aa1b 2015-05-10 17:15:28.934976
  /home/arjun/corenew/core/pywikibot/family.py b978cc6 2015-05-13 15:21:59.663963
  /home/arjun/corenew/core/pywikibot/plural.py c9edb6b 2015-05-10 17:15:28.970976
  /home/arjun/corenew/core/pywikibot/version.py 8de383e 2015-05-10 17:15:29.010976
  /home/arjun/corenew/core/pywikibot/userinterfaces/terminal_interface.py 9a5fbf1 2015-05-10 17:15:28.998976
  /home/arjun/corenew/core/pywikibot/config2.py 971b19d 2015-05-10 17:15:28.910976
  /home/arjun/corenew/core/pywikibot/tools/ip.py 808c0cc 2015-05-10 17:15:28.998976
  /home/arjun/corenew/core/pywikibot/comms/http.py b336a0a 2015-05-13 15:21:59.643962
  /home/arjun/corenew/core/pywikibot/userinterfaces/terminal_interface_base.py 968a14b 2015-05-10 17:15:29.002976
  /home/arjun/corenew/core/pywikibot/pagegenerators.py 12e7523 2015-05-13 15:21:59.679962
  /home/arjun/corenew/core/pywikibot/userinterfaces/terminal_interface_unix.py 60d8cb2 2015-05-10 17:15:29.002976
  /home/arjun/corenew/core/pywikibot/tools/__init__.py 692fc89 2015-05-13 15:21:59.699962
  /home/arjun/corenew/core/pywikibot/diff.py 015dcbd 2015-05-10 17:15:28.934976
  /home/arjun/corenew/core/pywikibot/login.py 70f3f31 2015-05-13 15:21:59.663963
  /home/arjun/corenew/core/pywikibot/comms/__init__.py 747d0a7 2015-05-10 17:15:28.902976
  /home/arjun/corenew/core/pywikibot/userinterfaces/transliteration.py efd4103 2015-05-10 17:15:29.010976
=== === === === === === === === === === === === === === 
Pywikibot rd6ea7ece4f4d7867f211e16c13e62c9366627207
Python 2.7.3 (default, Dec 18 2014, 19:03:52) 
[GCC 4.6.3]
The summary message for the command line replacements will be something like: Bot: Automated text replacement  (-(?s)^(.*)$ +{{యాంత్రిక అనువాదం}}\1)
Press Enter to use this automatic message, or enter a description of the
changes your bot will make: +{{యాంత్రిక అనువాదం}}
LOADING SITE wikipedia:te VERSION: 1.26wmf4
Found 1 wikipedia:te processes running, including this one.
Retrieving 50 pages from wikipedia:te.
^CTraceback (most recent call last):
  File "pwb.py", line 239, in <module>
    if not main():
  File "pwb.py", line 233, in main
    run_python_file(filename, argv, argvu, file_package)
  File "pwb.py", line 88, in run_python_file
    main_mod.__dict__)
  File "./scripts/replace.py", line 947, in <module>
    main()
  File "./scripts/replace.py", line 938, in main
    bot.run()
  File "./scripts/replace.py", line 589, in run
    new_text = self.apply_replacements(last_text, applied)
  File "./scripts/replace.py", line 516, in apply_replacements
    allowoverlap=self.allowoverlap, site=self.site)
  File "/home/arjun/corenew/core/pywikibot/textlib.py", line 308, in replaceExcept
    (match.group(groupID) or '') +
KeyboardInterrupt
Dropped throttle(s).
<type 'exceptions.KeyboardInterrupt'>
CRITICAL: Waiting for 1 network thread(s) to finish. Press ctrl-c to abort
All threads finished.

Event Timeline

Arjunaraoc raised the priority of this task to Needs Triage.
Arjunaraoc updated the task description.
Arjunaraoc added a project: Pywikibot.
Arjunaraoc subscribed.
Restricted Application added subscribers: Aklapper, Unknown Object (MLST). (May 14 2015, 4:02 AM)

Investigation showed that the problem occurs only when replace.py is run with the -file:"filename" parameter.

The problem was traced to specific pages that had been translated using Google Translate.

A test page is created here
https://te.wikipedia.org/wiki/user:Arjunaraoc/sandbox_https://phabricator.wikimedia.org/T99032
Another page
https://te.wikipedia.org/wiki/user:Arjunaraoc/sandbox2_https://phabricator.wikimedia.org/T99032
One more page
https://te.wikipedia.org/wiki/user:Arjunaraoc/sandbox3_https://phabricator.wikimedia.org/T99032

In my experience, if such a page is the first one to be processed, the script hangs. If it is a later one, the script crashes. (Note: compat processes the items in the order given in the file; core processes them in some unknown order.)

Regarding the order in core, https://gerrit.wikimedia.org/r/#/c/199631/ might fix this. replace.py preloads the pages, and core returns them in the order in which the API returned them.

Yes, the order in core is working now.

It looks like, in at least one instance, replace.py crashes if the text contains malformed template code such as a missing brace ( {{...} ).

Can you post the page and the command to reproduce it?

@Mpaa: Most of this is in the opening post and in the comment that mentions the sandbox page, but here it is for everyone to copy:

python pwb.py replace.py -regex '(?s)^(.*)$' "{{యాంత్రిక అనువాదం}}\1" -simulate -lang:te -family:wikipedia -page:'user:Arjunaraoc/sandbox_https://phabricator.wikimedia.org/T99032' -debug -v -log

As far as I can see, it also hangs for me on Python 3.4.2.

I also tried to reproduce your other issue, the crash. I personally don't understand why it would hang in one situation and crash in another. With the following command, I am first asked about the Main Page (which works), and then it hangs again:

python pwb.py replace.py -regex '(?s)^(.*)$' "{{యాంత్రిక అనువాదం}}\1" -simulate -lang:te -family:wikipedia -page:Main\ Page -page:'user:Arjunaraoc/sandbox_https://phabricator.wikimedia.org/T99032' -debug -v -log

So either the crash is related to something else, or it is not simply triggered by an earlier replacement. (Note that I did not save the changes, and I was using simulation mode anyway.)

It seems to be related to replaceExcept: at least in my tests and in the original post, the traceback ends (or starts) in replaceExcept. It might also be related to the fact that you try to match everything, and maybe some of the exceptions it uses have a problem with that. Anyway, that command reproduces it, which makes it easier to look into.

Regarding my patch: it has not been merged, so you need to download it to get its “features”. It should only fix the order in which the bot works on the pages. If it also fixed the issues in this bug report, please tell us how you downloaded the patch, because with that patch applied it still hangs on my computer.

XZise renamed this task from "pywikibot replace.py hangs for any operation from linux/windows." to "pywikibot replace.py hangs on certain conditions". (May 23 2015, 5:10 PM)
XZise updated the task description.
XZise set Security to None.

Ah! The first sandbox page contains \0, and replaceExcept searches for a backslash followed by a number in order to replace it with the corresponding group (as you did with {{...}}\1), but this time the sequence is part of the page text. As far as I can see, the first pass works fine and generates {{...}}REST OF PAGE. Then, for some reason, it searches the new result for references again and finds \0. That code only stops looping once it no longer finds any references, so here it never terminates: https://git.wikimedia.org/blob/pywikibot%2Fcore.git/7f50b4ed6bdf27e7e29e6de0c3cf7999d267337e/pywikibot%2Ftextlib.py#L299
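
To make the mechanism concrete, here is a minimal sketch of the rescanning behaviour described above. It is not the textlib.py code: the names GROUP_REF and expand_by_rescanning, the reference pattern, and the max_rounds guard (added so the example stops instead of hanging) are all assumptions for illustration.

```python
import re

# Pattern for group references such as \1 or \g<name>. This is an
# assumption based on the description above, not copied from textlib.py.
GROUP_REF = re.compile(r'\\(?P<number>\d+)|\\g<(?P<name>.+?)>')


def expand_by_rescanning(match, new, max_rounds=20):
    """Expand group references by rescanning the whole result each round.

    Returns (text, finished). finished is False when max_rounds was
    reached, which stands in for the hang reported in this task.
    """
    replacement = new
    for _ in range(max_rounds):
        ref = GROUP_REF.search(replacement)
        if ref is None:
            return replacement, True
        group_id = ref.group('name') or int(ref.group('number'))
        # The inserted group text is scanned again in the next round,
        # so a literal \0 inside the page is treated as a reference.
        replacement = (replacement[:ref.start()]
                       + (match.group(group_id) or '')
                       + replacement[ref.end():])
    return replacement, False


# Hypothetical page content containing a literal backslash followed by 0.
page_text = 'some text \\0 more text'
match = re.search(r'(?s)^(.*)$', page_text)
result, finished = expand_by_rescanning(match, '{{tag}}\\1')
print(finished, len(result))
```

In this sketch every round replaces the \0 found in the result with group 0, which is the whole page and therefore contains \0 again, so the text keeps growing and the loop never ends on its own. With a page text that contains no backslash-number sequence, the same function finishes after one substitution.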

I might have a solution, so I am claiming this task for now before someone else spends time on it.

Change 212978 had a related patch set uploaded (by XZise):
[FIX] replaceExcept: Replace references iteratively

https://gerrit.wikimedia.org/r/212978

Change 212978 merged by jenkins-bot:
[FIX] replaceExcept: Replace references iteratively

https://gerrit.wikimedia.org/r/212978
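
The merged change itself is not reproduced here. As a rough sketch of what replacing references iteratively could look like, the function below walks over the references in the replacement template exactly once and never rescans text taken from the page. The names GROUP_REF and expand_once and the reference pattern are hypothetical and only serve the illustration.

```python
import re

# Pattern for group references such as \1 or \g<name>; an assumption
# for this sketch, not taken from the merged patch.
GROUP_REF = re.compile(r'\\(?P<number>\d+)|\\g<(?P<name>.+?)>')


def expand_once(match, new):
    """Expand the group references found in `new` in a single pass."""
    parts = []
    last = 0
    # finditer only looks at the template, so text inserted from the
    # matched page is never interpreted as a reference.
    for ref in GROUP_REF.finditer(new):
        group_id = ref.group('name') or int(ref.group('number'))
        parts.append(new[last:ref.start()])
        parts.append(match.group(group_id) or '')
        last = ref.end()
    parts.append(new[last:])
    return ''.join(parts)


# Hypothetical page content containing a literal backslash followed by 0.
page_text = 'some text \\0 more text'
match = re.search(r'(?s)^(.*)$', page_text)
print(expand_once(match, '{{tag}}\\1'))
```

Because the loop is driven by the template rather than by the growing result, it terminates after as many steps as the template has references, and a literal \0 in the page is copied through unchanged.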

Is this fixed now, or are there other known corner cases?

I've added quite a few corner cases to the tests, so I think this has been resolved.