File regex gets stuck on a link in the caption
Closed, Resolved · Public

Description

Run this code:

import pywikibot
import re

from pywikibot import pagegenerators
from pywikibot import textlib

site = pywikibot.Site()

pattern = textlib.FILE_LINK_REGEX % '|'.join(site.namespaces[6])

regex = re.compile(pattern, re.VERBOSE)

pywikibot.output(regex.pattern)

def my_replace(match):
    pywikibot.output(match)

for page in pagegenerators.RandomPageGenerator(total=100, site=site, namespaces=[0]):
    page.get()
    pywikibot.output(page.title())
    regex.sub(my_replace, page.text)

When the script reaches an article containing a file link with a wikilink inside its caption (for example [[File:ABC.jpg|123px|text [[Lorem ipsum]] text]]), the bot stops printing and hangs.
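A hang of this kind is typical of exponential regex backtracking. The sketch below is only an illustration of that general effect, using a deliberately ambiguous toy pattern; it is not FILE_LINK_REGEX, and the helper name `time_failed_match` is made up for this example.

```python
import re
import time

# Toy pattern, chosen for illustration: the nested quantifiers let the
# engine split a run of "a" characters in exponentially many ways.
AMBIGUOUS = re.compile(r'(?:a+)+b')


def time_failed_match(n):
    """Return the seconds spent failing to match n 'a' characters."""
    subject = 'a' * n  # no trailing 'b', so every split is tried and fails
    start = time.perf_counter()
    result = AMBIGUOUS.match(subject)
    elapsed = time.perf_counter() - start
    assert result is None
    return elapsed


# Each extra character roughly doubles the work on a backtracking engine.
for n in (10, 14, 18):
    print(n, round(time_failed_match(n), 4))
```

Keeping `n` small is intentional: with a few dozen characters the same call would effectively never return.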

Event Timeline

Xqt triaged this task as High priority. Oct 29 2016, 2:24 PM
Xqt lowered the priority of this task from High to Low. Oct 29 2016, 3:29 PM
Xqt subscribed.
This comment was removed by Xqt.
Xqt changed the task status from Open to Stalled. Oct 29 2016, 3:36 PM

Works for me:

>>> import pwb, pywikibot as py
>>> from pywikibot import textlib
>>> import re
>>> site = py.Site()
>>> pattern = textlib.FILE_LINK_REGEX % '|'.join(site.namespaces[6])
>>> regex = re.compile(pattern, re.VERBOSE)
>>> text = 'This [[File:ABC.jpg|123px|text [[Lorem ipsum]] text]] bar'
>>> regex.sub('foo', text)
'This foo bar'
>>>

Could you please add the output of version.py here and name the page you were processing last?

version.py:

Pywikibot: pywikibot/__init__.py (70771dd, -1 (unknown), 2016/09/03, 14:34:12, n/a)
Release version: 3.0-dev
requests version: 2.11.1
  cacerts: C:\Users\Temp\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\cacert.pem
    certificate test: ok
Python: 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:01:18) [MSC v.1900 32 bit (Intel)]
PYWIKIBOT2_DIR: Not set
PYWIKIBOT2_DIR_PWB:
PYWIKIBOT2_NO_USER_CONFIG: Not set

I ran my example again, and the first page it got stuck on was Willibald Gebhardt:

import pywikibot
import re

from pywikibot import pagegenerators
from pywikibot import textlib

site = pywikibot.Site('cs', 'wikipedia')

pattern = textlib.FILE_LINK_REGEX % '|'.join(site.namespaces[6])

regex = re.compile(pattern, re.VERBOSE)

def my_replace(match):
    pywikibot.output(match)
    return 'foo'

for page in [pywikibot.Page(site, 'Willibald Gebhardt')]:
    page.get()
    pywikibot.output(page.title())
    regex.sub(my_replace, page.text)

Xqt changed the task status from Stalled to Open. Oct 29 2016, 3:55 PM
Xqt raised the priority of this task from Low to High. Oct 29 2016, 5:12 PM

I found this small piece of text on which the regex gets stuck:

[[Soubor:Gedenktafel_Olympischer_Platz_4_(West)_Willibald_Gebhardt.jpg|thumb|Pamětní deska na [[Olympiastadion Berlín|Olympijském staidonu]] v Berlíně]]

[[Demetrius Vikelas|Demetria Vikelase]]
[[Německo na Letních olympijských hrách 1896|Německý tým]] se skládal z 21 sportovců a 8 členů doprovodu.

Change 318874 had a related patch set uploaded (by Dalba):
textlib.py: Use atomic grouping in FILE_LINK_REGEX

https://gerrit.wikimedia.org/r/318874

Change 318874 abandoned by Dalba:
textlib.py: Use atomic grouping in FILE_LINK_REGEX

https://gerrit.wikimedia.org/r/318874

Dalba removed Dalba as the assignee of this task. Oct 31 2016, 3:46 AM
Dalba removed a project: Patch-For-Review.
Dalba subscribed.

Change 318884 had a related patch set uploaded (by Dalba):
textlib.py: Limit catastrophic backtracking in FILE_LINK_REGEX

https://gerrit.wikimedia.org/r/318884
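The patch itself is not reproduced in this task. As a hedged illustration of one general way to limit backtracking, the sketch below uses alternatives that start with different characters, so the engine has only one way to consume the caption. `SIMPLE_FILE_LINK` is a hypothetical, simplified pattern written for this example; it is not the FILE_LINK_REGEX from pywikibot.textlib, it hard-codes the namespace names, and it handles only one level of nested links.

```python
import re

# Hypothetical simplified pattern for illustration only.
SIMPLE_FILE_LINK = re.compile(r"""
    \[\[ \s* (?:file|Soubor) \s* :   # opening brackets and namespace prefix
    [^\[\]|]*                        # file name: no brackets and no pipe
    (?:
        \|                           # the first pipe starts the parameters
        (?:
            [^\[\]]                  # one ordinary character, or
          | \[\[ [^\[\]]* \]\]       # one complete nested wikilink
        )*
    )?
    \]\]                             # closing brackets
""", re.VERBOSE | re.IGNORECASE)

text = """[[Soubor:Gedenktafel_Olympischer_Platz_4_(West)_Willibald_Gebhardt.jpg|thumb|Pamětní deska na [[Olympiastadion Berlín|Olympijském staidonu]] v Berlíně]]

[[Demetrius Vikelas|Demetria Vikelase]]
[[Německo na Letních olympijských hrách 1896|Německý tým]] se skládal z 21 sportovců a 8 členů doprovodu."""

match = SIMPLE_FILE_LINK.search(text)
print(match.group())
```

Because an ordinary character and a nested link can never begin at the same position with the same character, a failed match backs off one step at a time instead of retrying every possible split. The trade-off is that captions containing single brackets, such as external links, are not matched by this sketch.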

Based on the discussion above, here is a shorter script that reproduces the problem:

import re

from pywikibot.textlib import FILE_LINK_REGEX


pattern = FILE_LINK_REGEX % 'file|Soubor'
regex = re.compile(pattern, re.VERBOSE)
text = """[[Soubor:Gedenktafel_Olympischer_Platz_4_(West)_Willibald_Gebhardt.jpg|thumb|Pamětní deska na [[Olympiastadion Berlín|Olympijském staidonu]] v Berlíně]]

[[Demetrius Vikelas|Demetria Vikelase]]
[[Německo na Letních olympijských hrách 1896|Německý tým]] se skládal z 21 sportovců a 8 členů doprovodu."""
regex.sub('', text)
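Since a script like this may never return on an affected version, it can help to run it in a child process with a timeout. The following is a minimal sketch using only the standard library; `finishes_within` is a made-up helper name, and the `SLOW` snippet is a toy stand-in for the reproduction script, not FILE_LINK_REGEX itself.

```python
import subprocess
import sys


def finishes_within(code, seconds):
    """Run `code` in a fresh interpreter; return False if it times out."""
    try:
        subprocess.run([sys.executable, '-c', code],
                       timeout=seconds, check=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing lingers.
        return False
    return True


# Toy stand-ins: one snippet that backtracks exponentially, one that is quick.
SLOW = "import re; re.match(r'(?:a+)+b', 'a' * 64)"
FAST = "import re; re.sub('a', 'b', 'aaa')"

if __name__ == '__main__':
    print(finishes_within(FAST, 30))
    print(finishes_within(SLOW, 2))
```

To check a real reproduction, the script's source could be passed as `code` in place of `SLOW`, with a timeout generous enough for a normal run.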

The patch looks good: the regex no longer gets stuck on the text above, but it doesn't match the string either. I guess there is another problem with some characters inside the string: after removing all occurrences of 'č', 'ě' and 'ů', it matches.

@Xqt: For me, it gives a match:

>>> regex.search(text).group()
'[[Soubor:Gedenktafel_Olympischer_Platz_4_(West)_Willibald_Gebhardt.jpg|thumb|Pamětní deska na [[Olympiastadion Berlín|Olympijském staidonu]] v Berlíně]]'

What is the code you are testing it with?

It seems that the pyregex environment has some encoding issues: even a single character like پ, used as a regex, won't match itself.
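A quick local check can separate an environment problem from a regex problem. The sketch below assumes Python 3 and str patterns; `matches_itself` is a made-up helper, and the mis-decoding shown at the end is only one assumed way such a mismatch could arise, not a diagnosis of pyregex.

```python
import re


def matches_itself(char):
    """Return True if `char`, used as a pattern, matches itself."""
    return re.fullmatch(char, char) is not None


for char in ('پ', 'č', 'ě', 'ů'):
    print(repr(char), matches_itself(char))

# Assumed failure mode, for illustration only: the pattern is decoded with
# the wrong codec on its way to the engine, so it no longer equals the text.
garbled = 'č'.encode('utf-8').decode('latin-1')
print(re.search(garbled, 'č'))
```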

Xqt claimed this task.

Change 318884 merged by jenkins-bot:
textlib.py: Limit catastrophic backtracking in FILE_LINK_REGEX

https://gerrit.wikimedia.org/r/318884