Apparently "text-contains" exceptions in user-fixes.py are ignored
Open, High, Public

Description

Hi! I noticed that replace.py does not apply the exceptions "text-contains" listed in user-fixes.py.

I'm running Python 3.5.2 and an updated pywikibot-core on Win7.

Pywikibot: [https] r-pywikibot-core.git (6a84859, g7382, 2016/08/06, 00:29:37, n/a)
Release version: 3.0-dev
requests version: 2.9.1
  cacerts: C:\Users\Dave\AppData\Local\Programs\Python\Python35\lib\site-packages\requests\cacert.pem
    certificate test: ok
Python: 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:18:55) [MSC v.1900 64 bit (AMD64)]
PYWIKIBOT2_DIR: Not set
PYWIKIBOT2_DIR_PWB: C:\pywikicore
PYWIKIBOT2_NO_USER_CONFIG: Not set
Config base dir: C:\pywikicore
Usernames for family "wikipedia":
        it: FrescoBot (no sysop configured)
        en: FrescoBot (no sysop configured)
        es: FrescoBot (no sysop configured)
Usernames for family "commons":
        commons: FrescoBot (no sysop configured)

For example this is a sample fix in my user-fixes.py:

fixes['test_exc'] = {
    'regex': True,
    'recursive': False,
    'msg': {
        'en': 'Bot: test',
        'it': 'Bot: test',
    },
    'replacements': [
        ('great', r'neat'),
    ],
    'exceptions': {
        'inside-tags': [
            'hyperlink',
            # 'template',
            'comment',
            'timeline',
            # 'gallery',
            'math',
            'pre',
            # 'startspace',
            'source',
            'nowiki',
        ],
        'text-contains': [
            r'test',
        ],
    },
}

If I simply run replace.py with this fix on a page that contains only "this is a great test...", the bot tries to replace "great". The page should be skipped instead. Is it my fault?

Event Timeline

Change 304596 had a related patch set uploaded (by Dalba):
replace.py: check 'text-contains' exceptions for each user-defined fix

https://gerrit.wikimedia.org/r/304596

Dalba triaged this task as Medium priority. Aug 14 2016, 8:06 AM

Just noting that I can reproduce the problem.
This feels very much like 20b289d3da, in that the fix exceptions are not being used by isTextExcepted, and isTextExcepted is not being called for each fix.
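
To illustrate the behaviour described above, here is a minimal sketch of a per-fix 'text-contains' check. It is not the code in replace.py: the names is_text_excepted and applicable_replacements are hypothetical, and the patterns are assumed to be already compiled.

```python
import re


def is_text_excepted(text, exceptions):
    """Return True if any 'text-contains' pattern matches the page text."""
    for pattern in exceptions.get('text-contains', []):
        if pattern.search(text):
            return True
    return False


def applicable_replacements(text, fixes):
    """Yield (old, new) pairs only from fixes whose exceptions do not match.

    The check runs once per fix, so each fix uses its own exceptions.
    """
    for fix in fixes:
        # A fix may have no 'exceptions' key at all; treat that as empty.
        if is_text_excepted(text, fix.get('exceptions') or {}):
            continue
        for old, new in fix['replacements']:
            yield old, new


# Hypothetical fix mirroring the 'test_exc' example from the description.
fix = {
    'replacements': [(re.compile('great'), 'neat')],
    'exceptions': {'text-contains': [re.compile('test')]},
}
skipped = list(applicable_replacements('this is a great test...', [fix]))
applied = list(applicable_replacements('this is a great day', [fix]))
```

With this approach the page from the description yields no replacements, while a page without "test" still gets the 'great' replacement.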

Dalba removed Dalba as the assignee of this task. Sep 18 2016, 4:14 AM
Dalba added a subscriber: Dalba.
Basilicofresco raised the priority of this task from Medium to High. Edited Sep 25 2016, 3:16 PM

Can I triage this as high?
In my opinion it is a sneaky trap for bot operators that can lead to subtle errors around the encyclopedia.

I'm having the same problem; inside-tags and inside seem to be ignored as well.

@Nemo_bis: Could you explain a little more? This task is about the 'text-contains' exception, but your issue is about 'inside' and 'inside-tags' and the corresponding options -exceptinside and -exceptinsidetag.

The problem is that ReplaceRobot.exceptions is empty. It does not contain the exceptions declared in the fixes file. A print statement inside isTextExcepted() shows this:

C:\pwb\GIT\core>py -2 pwb.py replace -fix:test -page:user:xqt/Test -simulate

{u'inside': [],
 u'inside-tags': [],
 u'require-title': [],
 u'text-contains': [],
 u'title': []}
This comment was removed by Xqt.

Change 353714 had a related patch set uploaded (by Xqt; owner: Xqt):
[pywikibot/core@master] [bugfix] Enable exceptions from fixes file

https://gerrit.wikimedia.org/r/353714

I've looked into the code more yesterday, but I don't understand if the "inside" exceptions are ever actually passed to textlib.

Change 353714 merged by jenkins-bot:
[pywikibot/core@master] [bugfix] Enable exceptions from fixes file

https://gerrit.wikimedia.org/r/353714

After the patch, exceptions entered via CLI are erased.
And if the fix contains no 'exceptions' key, the script crashes:

fixes['example'] = {
    'regex': True,
    'msg': {
        '_default': u'no summary specified',
    },
    'replacements': [
        (r'\bword\b', u'two words'),
    ],
}
user@pc:~/python/core {master}$ python scripts/replace.py -fix:example -pt:0 -exceptinsidetag:table -exceptinsidetag:template -exceptinsidetag:comment -multiline -prefixindex:"Page:Cuthbert Bede--Little Mr Bouncer and Tales of College Life.djvu"
Retrieving 50 pages from wikisource:en.
Traceback (most recent call last):
  File "scripts/replace.py", line 1177, in <module>
    main()
  File "scripts/replace.py", line 1168, in main
    bot.run()
  File "scripts/replace.py", line 720, in run
    if self.isTitleExcepted(page.title()):
  File "scripts/replace.py", line 591, in isTitleExcepted
    if 'title' in exceptions:
TypeError: argument of type 'NoneType' is not iterable
<type 'exceptions.TypeError'>
CRITICAL: Closing network session.
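
Both symptoms above (CLI exceptions being erased, and the TypeError when a fix has no 'exceptions' key) could be avoided along the following lines. This is only a sketch under assumptions: merge_exceptions and is_title_excepted are hypothetical helpers, not the functions in replace.py, and the patterns are plain strings.

```python
import re


def merge_exceptions(cli_exceptions, fix_exceptions):
    """Combine CLI and fix exceptions without overwriting either side.

    Both arguments may be None (e.g. a fix without an 'exceptions' key).
    """
    merged = {key: list(values)
              for key, values in (cli_exceptions or {}).items()}
    for key, values in (fix_exceptions or {}).items():
        merged.setdefault(key, []).extend(values)
    return merged


def is_title_excepted(title, exceptions):
    """Return True if a 'title' pattern matches; tolerate exceptions=None."""
    for pattern in (exceptions or {}).get('title', []):
        if re.search(pattern, title):
            return True
    return False


# Exceptions as they might be collected from -exceptinsidetag options.
cli = {'inside-tags': ['table', 'template', 'comment']}
# A fix like 'example' above: no 'exceptions' key at all.
fix = {'regex': True, 'replacements': [(r'\bword\b', 'two words')]}

merged = merge_exceptions(cli, fix.get('exceptions'))
no_crash = is_title_excepted('Page:Some title', None)
```

Normalising None to an empty dict in one place keeps the later membership tests such as 'title' in exceptions from failing.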

Change 361306 had a related patch set uploaded (by Mpaa; owner: Mpaa):
[pywikibot/core@master] replace.py: do not overwrite exceptions given via CLI

https://gerrit.wikimedia.org/r/361306

Change 361306 merged by jenkins-bot:
[pywikibot/core@master] replace.py: do not overwrite exceptions given via CLI

https://gerrit.wikimedia.org/r/361306

Hi! I tested it again with the latest version and it still doesn't work as expected. The problem is now slightly different: the text-contains exceptions in user-fixes.py are always treated as non-regex, while other exceptions such as inside and title work as regex, as expected. Oddly, passing the -regex parameter on the command line works around the problem. Apparently these exceptions are precompiled by the call precompile_exceptions(exceptions, regex, flags) in main(), so they use the CLI regex parameter instead of the 'regex' flag in user-fixes.py.
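
A sketch of the behaviour I would expect: the fix's own 'regex' flag decides how its exceptions are compiled, and the CLI value is only a fallback. The helper names below are hypothetical and this is not the actual precompile_exceptions from replace.py; falling back to the CLI flag is an assumption.

```python
import re


def precompile(patterns, use_regex, flags=0):
    """Compile patterns; escape them first when they are plain text."""
    compiled = []
    for pattern in patterns:
        if not use_regex:
            pattern = re.escape(pattern)
        compiled.append(re.compile(pattern, flags))
    return compiled


def compile_fix_exceptions(fix, cli_regex=False, flags=0):
    """Compile a fix's 'text-contains' exceptions using the fix's own flag."""
    # Assumption: the per-fix 'regex' value takes precedence over the CLI.
    use_regex = fix.get('regex', cli_regex)
    exceptions = fix.get('exceptions') or {}
    return precompile(exceptions.get('text-contains', []), use_regex, flags)


regex_fix = {'regex': True, 'exceptions': {'text-contains': [r'te.t']}}
plain_fix = {'regex': False, 'exceptions': {'text-contains': [r'te.t']}}

as_regex = compile_fix_exceptions(regex_fix)
as_plain = compile_fix_exceptions(plain_fix)
```

Here as_regex[0] matches "text" because the dot is a wildcard, while as_plain[0] only matches the literal string "te.t".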

I'm subscribing because I don't have this problem. The command python pwb.py version gives me:

Pywikibot: [ssh] pywikibot-core.git (9060d67, g8565, 2017/09/01, 18:04:14, n/a)
Release version: 3.0-dev
requests version: 2.13.0
  cacerts: C:\Program Files\Python36-32\lib\site-packages\requests\cacert.pem
    certificate test: ok
Python: 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 17:54:52) [MSC v.1900 32 bit (Intel)]
PYWIKIBOT2_DIR: Not set
PYWIKIBOT2_DIR_PWB:
PYWIKIBOT2_NO_USER_CONFIG: Not set
Config base dir: C:\Users\Zoran\Documents\GitHub\core
Usernames for family "wikipedia":
        sr: ZoranBot (no sysop configured)

Did you download the latest version from Gerrit?

Removing task assignee due to inactivity, as this open task has been assigned for more than two years (see emails sent to assignee on May 26 and Jun 17, and T270544). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be very welcome!

(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips how to best manage your individual work in Phabricator.)