
template.py fails removing a template
Closed, Duplicate

Description

template.py fails when removing a template; the bot also removes the table that follows it:

https://de.wikipedia.org/w/index.php?title=Flughafen_Tivat&diff=190185925&oldid=189561758&diffmode=source

Probably this is caused by cosmetic_changes.py

Event Timeline

Xqt triaged this task as High priority. Jul 6 2019, 7:19 PM

Something similar happened to me lately, I think there could be some issue with TemplateMatchBuilder in textlib.py?

Probably yes. I never trust them because there is a restriction on nested templates. Maybe we should use mwparserfromhell or @Dalba's wikitextparser and make it mandatory.

Okay, just tested, this is the issue:

template.py
builder = textlib._MultiTemplateMatchBuilder(self.site)
template_regex = builder.pattern(old)
elif self.getOption('remove'):
    separate_line_regex = re.compile(
        r'^[*#:]* *{0} *\n'.format(template_regex.pattern),
        re.DOTALL | re.MULTILINE)
    replacements.append((separate_line_regex, ''))

    spaced_regex = re.compile(
        r' +{0} +'.format(template_regex.pattern),
        re.DOTALL)
    replacements.append((spaced_regex, ' '))

    replacements.append((template_regex, ''))

template.py compiles new regexes from the builder's pattern: it first removes the template plus a trailing newline, then the template surrounded by spaces, and finally the template itself. This is not a good approach, because the regex from textlib is not designed to be extended like this. With re.DOTALL, the lazy .+? in the parameters group keeps expanding until the mandatory newline/space at the end can be fulfilled, so it matches far more than it should:

$ python pwb.py shell
Welcome to the Pywikibot interactive shell!
>>> s=pywikibot.Site('de')
>>> p=pywikibot.Page(s, 'Wikipedia:Spielwiese')
>>> from pywikibot import textlib
>>> builder = textlib._MultiTemplateMatchBuilder(s)
>>> t='Flughafen-Verkehrsaufkommen'
>>> template_regex = builder.pattern(t)
>>> import re
>>> separate_line_regex = re.compile(
...                     r'^[*#:]* *{0} *\n'.format(template_regex.pattern),
...                     re.DOTALL | re.MULTILINE)
>>> spaced_regex = re.compile(
...                     r' +{0} +'.format(template_regex.pattern),
...                     re.DOTALL)
>>> l=str(p.text)
>>> re.search(spaced_regex, l)
>>> re.search(separate_line_regex, l)
<re.Match object; span=(4903, 5663), match='{{Flughafen-Verkehrsaufkommen|iata="TIV"|Legende=>
>>> re.search(template_regex, l)
<re.Match object; span=(4903, 4964), match='{{Flughafen-Verkehrsaufkommen|iata="TIV"|Legende=>
>>> re.search(template_regex, l).group(0)
'{{Flughafen-Verkehrsaufkommen|iata="TIV"|Legende=|width=800}}'
>>> re.search(separate_line_regex, l).group(0)
'{{Flughafen-Verkehrsaufkommen|iata="TIV"|Legende=|width=800}}<!--  ENDE der Grafikdefinition   -->\n<!--                              -->\n\n{| class="wikitable sortable zebra" style="text-align:right;"\n|+ Flughafen Tivat – Verkehrszahlen 2005–2017<ref name="statistics" />\n|-\n! Jahr !! Fluggastaufkommen !! Flugbewegungen\n|-\n| 2017 || 1.129.716 || 6.323\n|-\n| 2016 || 979.432 || 5.985\n|-\n| 2015 || 895.050 || 5.422\n|-\n| 2014 || 910.264 || 5.281\n|-\n| 2013 || 868.343 || 5.198\n|-\n| 2012 || 725.412 || 4.605\n|-\n| 2011 || 647.184 || 4.531\n|-\n| 2010 || 541.870 || 4.017\n|-\n| 2009 || 532.080 || 4.226\n|-\n| 2008 || 570.636 || 4.630\n|-\n| 2007 || 574.011 || 4.079\n|-\n| 2006 || 451.289 || 3.261\n|-\n| 2005 || 377.013 || 2.522\n|}\n\n== Weblinks ==\n{{commonscat|Tivat Airport}}\n'
>>> separate_line_regex
re.compile('^[*#:]* *\\{\\{ *([Vv][Oo][Rr][Ll][Aa][Gg][Ee]:|[Tt][Ee][Mm][Pp][Ll][Aa][Tt][Ee]:|[mM][sS][gG]:)?[Ff]lughafen\\-Verkehrsaufkommen(?P<parameters>\\s*\\|.+?|) *}} *\\n', re.MULTILINE|re.DOTALL)
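The over-matching in the session above can be reproduced without a wiki or Pywikibot. The following is a minimal sketch: the pattern is a simplified stand-in for what builder.pattern() returns (namespace prefixes omitted), and the template name Foo and the sample text are made up for illustration.

```python
import re

# Simplified stand-in for the pattern returned by builder.pattern();
# the template name "Foo" is hypothetical.
TEMPLATE = r'\{\{ *Foo(?P<parameters>\s*\|.+?|) *}}'

template_regex = re.compile(TEMPLATE, re.DOTALL)
# Same extension that template.py applies for the "separate line" case.
separate_line_regex = re.compile(
    r'^[*#:]* *{0} *\n'.format(TEMPLATE),
    re.DOTALL | re.MULTILINE)

text = (
    '{{Foo|a=1}}<!-- comment -->\n'
    '{| class="wikitable"\n'
    '| cell\n'
    '|}\n'
    '{{Bar}}\n'
    'trailing text\n'
)

# The plain pattern stops at the first "}}".
plain = template_regex.search(text).group(0)

# The extended pattern needs "}}" followed by a newline. The first "}}" is
# followed by a comment, so the lazy ".+?" keeps growing (DOTALL lets it
# cross newlines) until it reaches the "}}" of {{Bar}}.
extended = separate_line_regex.search(text).group(0)

print(repr(plain))
print(repr(extended))
```

The second match swallows the comment, the whole table and the unrelated {{Bar}} template, which mirrors what happened on Flughafen Tivat.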

I never trust them because there is a restriction on nested templates.

Me too

Maybe we should use mwparserfromhell or @Dalba's wikitextparser and make it mandatory.

Maybe in the future, I like the idea of using and cooperating with other py-wiki projects.

The regex from textlib is not prepared to be extended like this. It then tries to fulfill the mandatory newline/space at the end and matches way more than it should.

Okay, this will need a better approach. On both template.py and textlib.py sides we cannot do much. We can:
a) prepare a better regex in template.py in-place
b) use mwparser/wtparser here instead (adds a mandatory dependency)
c) fix the TemplateMatchBuilder regex for these cases
d) use recursive patterns (?R) from the PyPI regex library in textlib.py (adds a mandatory dependency)
Any other possibilities?
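One possible shape for option a) is to leave the textlib pattern unextended and inspect the text around each match separately. The sketch below only illustrates that idea: remove_template is a hypothetical helper, not existing Pywikibot code, and it inherits the base pattern's limitation with nested templates.

```python
import re


def remove_template(text, template_regex):
    """Remove every match of template_regex and tidy up around it.

    Hypothetical helper: the line/space context is checked outside the
    regex instead of being compiled into it.
    """
    parts = []
    pos = 0
    for match in template_regex.finditer(text):
        start, end = match.span()
        line_start = text.rfind('\n', 0, start) + 1
        newline = text.find('\n', end)
        line_end = len(text) if newline == -1 else newline
        before = text[line_start:start]
        after = text[end:line_end]
        if (newline != -1
                and re.fullmatch(r'[*#:]* *', before)
                and re.fullmatch(r' *', after)):
            # Template alone on its line: drop the line and its newline.
            parts.append(text[pos:line_start])
            end = line_end + 1
        elif text[start - 1:start] == ' ' and text[end:end + 1] == ' ':
            # Template surrounded by spaces: collapse them to one space.
            while start > pos and text[start - 1] == ' ':
                start -= 1
            while end < len(text) and text[end] == ' ':
                end += 1
            parts.append(text[pos:start] + ' ')
        else:
            # Anything else: remove only the template itself.
            parts.append(text[pos:start])
        pos = end
    parts.append(text[pos:])
    return ''.join(parts)


# Simplified stand-in for builder.pattern('Foo'); "Foo" is made up.
regex = re.compile(r'\{\{ *Foo(?P<parameters>\s*\|.+?|) *}}', re.DOTALL)
sample = ('* {{Foo|a=1}}\n'
          'keep {{Foo}} this\n'
          '{{Foo|b=2}}<!-- c -->\n'
          '{| table\n|}\n'
          '{{Bar}}\n')
result = remove_template(sample, regex)
print(repr(result))
```

Because the template pattern is never asked to satisfy a trailing newline or space, the lazy quantifier has no reason to run past the first closing braces, and the comment, the table and {{Bar}} survive.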

Probably
\{\{ *(Vorlage:|Template:|[mM][sS][gG]:)?Flughafen-Verkehrsaufkommen(?P<parameters>\s*\|[^}]+?|) *}}
for the pattern, where . is replaced with [^}]?
Nested templates aren't supported there, see the TODO comment.

Maybe? Or better to use NESTED_TEMPLATE_REGEX as suggested?

Replacing . with [^}] causes template_bot_tests.py to fail. Seems there is not a very trivial solution.
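As a rough illustration of the trade-off, the sketch below compares the two variants using simplified patterns and made-up sample text (the template names Foo, Bar and Baz are hypothetical). It shows one case where the [^}] variant behaves differently; whether this is what template_bot_tests.py trips over has not been checked here.

```python
import re

# Simplified stand-ins for the textlib pattern; "Foo" is hypothetical.
DOT = r'\{\{ *Foo(?P<parameters>\s*\|.+?|) *}}'
NOT_BRACE = r'\{\{ *Foo(?P<parameters>\s*\|[^}]+?|) *}}'


def line_regex(template_pattern):
    """Extend a template pattern the way template.py does for full lines."""
    return re.compile(
        r'^[*#:]* *{0} *\n'.format(template_pattern),
        re.DOTALL | re.MULTILINE)


simple = '{{Foo|a=1}}<!-- c -->\n{{Bar}}\n'
nested = '{{Foo|a={{Baz}}|b=1}}\nnext line\n'

# Simple case: "." over-matches up to {{Bar}}, "[^}]" does not match at all
# because the template is not alone on its line.
dot_simple = line_regex(DOT).search(simple)
brace_simple = line_regex(NOT_BRACE).search(simple)

# Nested case: "." happens to match the whole template line, while "[^}]"
# cannot cross the inner "}}" and therefore finds nothing.
dot_nested = line_regex(DOT).search(nested)
brace_nested = line_regex(NOT_BRACE).search(nested)

print(dot_simple, brace_simple, dot_nested, brace_nested, sep='\n')
```

So [^}] avoids the over-match, but a template on its own line that contains a nested template is no longer matched by the line pattern.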