Page MenuHomePhabricator

replace doesn't support optional groups
Closed, InvalidPublic

Description

Originally from: http://sourceforge.net/p/pywikipediabot/bugs/1484/
Reported by: Anonymous user
Created on: 2012-07-02 13:25:23
Subject: replace doesn't support optional groups
Original description:
textlib.py \(method replaceExcept\) doesn't support optional capturing groups in regex.

I tried to run replace.py with the following regex: "RISHMI\(T |IM\)?" => "RISHMI\1"
when running it on a page containing the following text "SOMETHING RISHMI SOMETHING"
it crashes with the following error:
textlib.py, line 178, in replaceExcept
match.group\(groupID\) + \
TypeError: coercing to Unicode: need string or buffer, NoneType found

line 178 contains the statement:
replacement = replacement\[:groupMatch.start\(\)\] + \
match.group\(groupID\) + \
replacement\[groupMatch.end\(\):\]

textlib.py should check for match.group\(groupID\) ==None and if so, add here empty string instead of match.group\(groupID\)


Version: unspecified
Severity: normal
See Also:
https://sourceforge.net/p/pywikipediabot/bugs/1484

Details

Reference
bz55184

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 2:25 AM
bzimport set Reference to bz55184.
bzimport added a subscriber: Unknown Object (????).

The group must exist to reuse it. What should this regex do in your opinion. What about RISHMI\(T |IM|\)" or RISHM\(\(?:T |IM\)?\)"? Errors should never pass silently unless explicitly silenced \(PEP 20\). Maybe replacing empty strings could lead to unwanted side effects but I have'nt thought about it.

This regex here is just an example, and probably a bad one \(as the regex it does nothing by this replacement\). Your suggestion regarding the specific regex \(to use inner optional group within group\) would probably fix this specific regex, but this is workaround - replace.py should support replacing capturing optional capturing group the same way re.findall behaves.

The behaviour of replacing None to empty string is compatible with the behaviour of re.findall \(re.findall\('a\(b\)?\(c\)','ac'\) => \[\('', 'c'\)\]\) and with regex engines of most languages \(in JS: 'ac'.replace\(/a\(b\)?\(c\)/,'a$1c'\)\), though python re isn't consistent here \(re.sub\('a\(b\)?\(c\)','X\\\1','ac'\) - is error\).

I had my own fights with this problem and my conclusion was that there's nothing to do about it but rewriting your regexes, it's how python works.
Mostly, what's nasty is the idiotic error message.