File regex gets stuck on a link in the caption
Closed, Resolved · Public
Assigned To

	Dalba

Authored By

	matej_suchanek
	Oct 24 2016, 1:13 PM

Description

Run this code:

import pywikibot
import re

from pywikibot import pagegenerators
from pywikibot import textlib

site = pywikibot.Site()

pattern = textlib.FILE_LINK_REGEX % '|'.join(site.namespaces[6])

regex = re.compile(pattern, re.VERBOSE)

pywikibot.output(regex.pattern)

def my_replace(match):
    pywikibot.output(match)

for page in pagegenerators.RandomPageGenerator(total=100, site=site, namespaces=[0]):
    page.get()
    pywikibot.output(page.title())
    regex.sub(my_replace, page.text)

When the script reaches an article that contains a file link with a wikilink inside its caption (like [[File:ABC.jpg|123px|text [[Lorem ipsum]] text]]), the bot stops printing and hangs.
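The related patches name catastrophic backtracking as the cause. The sketch below illustrates that effect with a deliberately simplified toy pattern. It is an assumption made for illustration, not the actual textlib.FILE_LINK_REGEX, and `TOY_PATTERN` and `seconds_to_fail` are hypothetical names. The caption group repeats an inner `+` run, so when the overall match cannot succeed, the engine may try many ways of splitting the caption before giving up.

```python
import re
import time

# Toy pattern, NOT pywikibot's FILE_LINK_REGEX: the caption group repeats
# an inner "+" run, so a run of n plain characters can be split between
# iterations in a very large number of ways.
TOY_PATTERN = re.compile(r'''
    \[\[File:[^\]|]+          # link opening and file name
    \|                        # first pipe
    (?:[^\[\]]+               # run of plain caption text (inner quantifier)
      |\[\[[^\[\]]*\]\]       # or one nested wikilink
    )*                        # outer quantifier over both alternatives
    \]\]                      # closing brackets
''', re.VERBOSE)


def seconds_to_fail(n):
    """Time a search that cannot succeed because the link is never closed."""
    text = '[[File:ABC.jpg|' + 'x' * n
    start = time.perf_counter()
    result = TOY_PATTERN.search(text)
    return result, time.perf_counter() - start


# A well-formed link from the description still matches.
well_formed = '[[File:ABC.jpg|123px|text [[Lorem ipsum]] text]]'
print(TOY_PATTERN.search(well_formed).group())

# Failing searches: the elapsed time is expected to climb steeply with n.
for n in (10, 14, 18):
    result, elapsed = seconds_to_fail(n)
    print(n, result, '%.4f s' % elapsed)
```

The values of n are kept small on purpose. With input the size of a real article, a pattern of this shape could plausibly run long enough to look like a hang.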

Details

	Subject	Repo	Branch	Lines +/-
	textlib.py: Limit catastrophic backtracking in FILE_LINK_REGEX	pywikibot/core	master	+1 -9
	textlib.py: Use atomic grouping in FILE_LINK_REGEX	pywikibot/core	master	+2 -2

Related Objects

Status	Assigned	Task
Resolved	None	T141024 Add Phabricator tasks to comments for buggy cosmetic changes
Resolved	matej_suchanek	T151107 Enable translateMagicWords
Invalid	None	T63996 textlib.replaceExcept() may hang or cause an infinite loop
Resolved	Dalba	T148959 File regex gets stuck on a link in the caption

Event Timeline

matej_suchanek created this task. Oct 24 2016, 1:13 PM

Restricted Application added subscribers: pywikibot-bugs-list, Aklapper. Oct 24 2016, 1:13 PM

matej_suchanek added a project: Pywikibot. Oct 24 2016, 1:13 PM

Xqt triaged this task as High priority. Oct 29 2016, 2:24 PM

Xqt added a subscriber: jayvdb. Oct 29 2016, 3:16 PM

Xqt lowered the priority of this task from High to Low. Oct 29 2016, 3:29 PM

Xqt subscribed.

This comment was removed by Xqt.

Works for me:

>>> import pwb, pywikibot as py
>>> from pywikibot import textlib
>>> import re
>>> site = py.Site()
>>> pattern = textlib.FILE_LINK_REGEX % '|'.join(site.namespaces[6])
>>> regex = re.compile(pattern, re.VERBOSE)
>>> text = 'This [[File:ABC.jpg|123px|text [[Lorem ipsum]] text]] bar'
>>> regex.sub('foo', text)
'This foo bar'
>>>

Could you please add the output of version.py here and point to the page you were processing last?

version.py:

Pywikibot: pywikibot/__init__.py (70771dd, -1 (unknown), 2016/09/03, 14:34:12, n/a)
Release version: 3.0-dev
requests version: 2.11.1
  cacerts: C:\Users\Temp\AppData\Local\Programs\Python\Python35-32\lib\site-packages\requests\cacert.pem
    certificate test: ok
Python: 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:01:18) [MSC v.1900 32 bit (Intel)]
PYWIKIBOT2_DIR: Not set
PYWIKIBOT2_DIR_PWB:
PYWIKIBOT2_NO_USER_CONFIG: Not set

I ran my example again and the first page it got stuck on was Willibald Gebhardt:

import pywikibot
import re

from pywikibot import pagegenerators
from pywikibot import textlib

site = pywikibot.Site('cs', 'wikipedia')

pattern = textlib.FILE_LINK_REGEX % '|'.join(site.namespaces[6])

regex = re.compile(pattern, re.VERBOSE)

def my_replace(match):
    pywikibot.output(match)
    return 'foo'

for page in [pywikibot.Page(site, 'Willibald Gebhardt')]:
    page.get()
    pywikibot.output(page.title())
    regex.sub(my_replace, page.text)

Xqt changed the task status from Stalled to Open. Oct 29 2016, 3:55 PM

I found this small piece of text on which the regex gets stuck:

[[Soubor:Gedenktafel_Olympischer_Platz_4_(West)_Willibald_Gebhardt.jpg|thumb|Pamětní deska na [[Olympiastadion Berlín|Olympijském staidonu]] v Berlíně]]

[[Demetrius Vikelas|Demetria Vikelase]]
[[Německo na Letních olympijských hrách 1896|Německý tým]] se skládal z 21 sportovců a 8 členů doprovodu.

Change 318874 had a related patch set uploaded (by Dalba):
textlib.py: Use atomic grouping in FILE_LINK_REGEX

https://gerrit.wikimedia.org/r/318874
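For context on the patch title: the `re` module in Python 3.5 (the version reported above) has no atomic-group syntax; `(?>...)` was only added in Python 3.11. A common workaround is a lookahead that captures the run, followed by a backreference to it, because the engine does not backtrack into a lookahead once it has matched. The sketch below applies that idea to a toy pattern. Whether the uploaded patch used this construct is not shown in this thread, and `ATOMIC_TOY` is a hypothetical name, not pywikibot code.

```python
import re

# Hypothetical toy pattern, not the real FILE_LINK_REGEX.
# (?=(?P<run>[^\[\]]+))(?P=run) emulates the atomic group (?>[^\[\]]+):
# the lookahead captures the longest run of plain text, and the
# backreference then consumes exactly that run with no shorter retries.
ATOMIC_TOY = re.compile(r'''
    \[\[File:[^\]|]+                   # link opening and file name
    \|                                 # first pipe
    (?:(?=(?P<run>[^\[\]]+))(?P=run)   # emulated atomic run of plain text
      |\[\[[^\[\]]*\]\]                # or one nested wikilink
    )*
    \]\]                               # closing brackets
''', re.VERBOSE)

closed = '[[File:ABC.jpg|123px|text [[Lorem ipsum]] text]]'
unclosed = '[[File:ABC.jpg|' + 'x' * 5000  # never closed, cannot match

print(ATOMIC_TOY.search(closed).group())
print(ATOMIC_TOY.search(unclosed))
```

The closed link should match in full, and the unclosed one should be rejected quickly even with thousands of caption characters, since each run of plain text can only be consumed one way.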

gerritbot added a project: Patch-For-Review. Oct 31 2016, 3:03 AM

Dalba claimed this task. Oct 31 2016, 3:04 AM

Change 318874 abandoned by Dalba:
textlib.py: Use atomic grouping in FILE_LINK_REGEX

https://gerrit.wikimedia.org/r/318874

Dalba removed Dalba as the assignee of this task. Oct 31 2016, 3:46 AM

Dalba removed a project: Patch-For-Review.

Dalba subscribed.

Change 318884 had a related patch set uploaded (by Dalba):
textlib.py: Limit catastrophic backtracking in FILE_LINK_REGEX

https://gerrit.wikimedia.org/r/318884
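The content of this change is not reproduced in the thread, so the following is only a generic sketch of one way to limit backtracking, not the merged patch. The idea is to make the repeated group unambiguous: each pass consumes exactly one plain character or one complete nested link, so a caption can be split in only one way. `TOY_TEMPLATE` and `LINEAR_TOY` are hypothetical names; the `%` substitution of namespace names mirrors how the scripts in this task build their pattern.

```python
import re

# Toy template for illustration only; it is not textlib.FILE_LINK_REGEX.
TOY_TEMPLATE = r'''
    \[\[(?:%s):[^\]|]+        # namespace alias and file name
    \|                        # first pipe
    (?:[^\[\]]                # exactly one plain character
      |\[\[[^\[\]]*\]\]       # or one complete nested wikilink
    )*
    \]\]                      # closing brackets
'''

LINEAR_TOY = re.compile(TOY_TEMPLATE % 'File|Soubor', re.VERBOSE)

# The file link reported in this task, followed by an ordinary wikilink.
file_link = ('[[Soubor:Gedenktafel_Olympischer_Platz_4_(West)_Willibald_Gebhardt.jpg'
             '|thumb|Pamětní deska na [[Olympiastadion Berlín|Olympijském staidonu]]'
             ' v Berlíně]]')
text = file_link + '\n\n[[Demetrius Vikelas|Demetria Vikelase]]'

match = LINEAR_TOY.search(text)
print(match.group() == file_link)

# An unclosed link is rejected without trying alternative splits.
print(LINEAR_TOY.search('[[Soubor:ABC.jpg|' + 'x' * 2000))
```

The trade-off is more loop iterations per caption (one per character) in exchange for a search whose cost stays proportional to the input length.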

gerritbot added a project: Patch-For-Review. Oct 31 2016, 5:32 AM

Here is a shorter script, based on the discussion above, that reproduces the problem:

import re

from pywikibot.textlib import FILE_LINK_REGEX


pattern = FILE_LINK_REGEX % 'file|Soubor'
regex = re.compile(pattern, re.VERBOSE)
text = """[[Soubor:Gedenktafel_Olympischer_Platz_4_(West)_Willibald_Gebhardt.jpg|thumb|Pamětní deska na [[Olympiastadion Berlín|Olympijském staidonu]] v Berlíně]]

[[Demetrius Vikelas|Demetria Vikelase]]
[[Německo na Letních olympijských hrách 1896|Německý tým]] se skládal z 21 sportovců a 8 členů doprovodu."""
regex.sub('', text)

The patch looks good and no longer gets stuck on the text above, but it doesn't match the string. I guess there is another problem with some characters inside the string: after removing all occurrences of 'č', 'ě' and 'ů', it matches.

@Xqt: For me, it gives a match:

>>> regex.search(text).group()
'[[Soubor:Gedenktafel_Olympischer_Platz_4_(West)_Willibald_Gebhardt.jpg|thumb|Pamětní deska na [[Olympiastadion Berlín|Olympijském staidonu]] v Berlíně]]'

What is the code you are testing it with?

@Dalba: I did some tests at http://www.pyregex.com/

It seems that the pyregex environment has some encoding issues. Even a single character like پ used as a regex won't match itself there.
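A quick local check can rule out the regex engine itself. The sketch below assumes Python 3 with a UTF-8 source file and uses a hypothetical helper name, `matches_itself`. It takes each non-ASCII character mentioned in this thread, uses it as its own pattern, and searches for it between two ASCII characters.

```python
import re


def matches_itself(char):
    """Return the span where `char`, used as a pattern, matches in 'x<char>y'."""
    match = re.search(char, 'x' + char + 'y')
    return match.span() if match else None


# Characters mentioned in the thread; in Python 3 each is one code point in a str.
for char in ('č', 'ě', 'ů', 'پ'):
    print(repr(char), matches_itself(char))
```

If every character reports a span here, a failure seen on a web tester more likely comes from how that site encodes its input than from the pattern.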

Xqt closed this task as Resolved. Jan 21 2017, 2:08 PM

Xqt claimed this task.

Change 318884 merged by jenkins-bot:
textlib.py: Limit catastrophic backtracking in FILE_LINK_REGEX

https://gerrit.wikimedia.org/r/318884

matej_suchanek mentioned this in T151107: Enable translateMagicWords. Mar 9 2017, 1:02 PM

Xqt reassigned this task from Xqt to Dalba. Mar 9 2017, 1:06 PM

matej_suchanek added a parent task: T151107: Enable translateMagicWords. Mar 9 2017, 5:28 PM

matej_suchanek added a parent task: T63996: textlib.replaceExcept() may hang or cause an infinite loop. Mar 13 2017, 5:23 PM