
textlib.extract_sections hangs on text containing too many continuous whitespace characters
Closed, Resolved, Public

Description

It looks like even

extract_sections('<!--                                         -->', site)

causes what looks like an infinite loop, and when I interrupt the program, the traceback looks like this:

File "/data/project/archiving/pkgsrc/core/scripts/archivebot.py", line 451, in load_page
  header, threads, footer = extract_sections(text, self.site)
File "/mnt/nfs/labstore-secondary-tools-project/archiving/pkgsrc/core/pywikibot/textlib.py", line 917, in extract_sections
  last_section_content).group().lstrip()
File "/mnt/nfs/labstore-secondary-tools-project/archiving/venv/lib/python3.5/re.py", line 173, in search
  return _compile(pattern, flags).search(string)

pointing to this code segment:

footer = re.search(
    r'(%s)*\Z' % r'|'.join((langlink_pattern, cat_regex.pattern, r'\s+')),
    last_section_content).group().lstrip()

The regex effectively contains '(\s+)*$', a nested quantifier that can cause catastrophic backtracking on a long run of whitespace that is not at the end of the text: https://www.regular-expressions.info/catastrophic.html.
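For illustration, here is a minimal sketch of the problem and one possible way around it. It is not necessarily what the merged patch does. LANGLINK and CATEGORY are simplified, hypothetical stand-ins for pywikibot's langlink_pattern and cat_regex.pattern, and footer() is a made-up helper that mirrors the code segment above. The idea is to replace the `\s+` alternative with `\s`, so the outer `*` is the only quantifier that repeats over whitespace.

```python
import re

# Hypothetical, simplified stand-ins for pywikibot's langlink_pattern
# and cat_regex.pattern; the real patterns are more involved.
LANGLINK = r'\[\[[a-z]{2,3}:[^\]\n]*\]\]'
CATEGORY = r'\[\[Category:[^\]\n]*\]\]'

# Nested quantifier: (\s+)* can split a run of n whitespace characters
# in exponentially many ways, all of which are tried when \Z fails.
RISKY = re.compile(r'(?:%s)*\Z' % '|'.join((LANGLINK, CATEGORY, r'\s+')))

# Single-character alternative: each whitespace character can only be
# consumed in one way, so a failed attempt backtracks linearly.
SAFER = re.compile(r'(?:%s)*\Z' % '|'.join((LANGLINK, CATEGORY, r'\s')))


def footer(pattern, text):
    """Return the trailing run of links and whitespace, left-stripped."""
    # The pattern can always match the empty string at the end of the
    # text, so search() never returns None here.
    return pattern.search(text).group().lstrip()


if __name__ == '__main__':
    page = 'Body text\n\n[[Category:Foo]]\n[[de:Bar]]\n'
    print(repr(footer(SAFER, page)))

    # The input from this report; returns promptly with SAFER.
    print(repr(footer(SAFER, '<!--' + ' ' * 41 + '-->')))
```

Both patterns should accept the same footers, since a run of whitespace matched by `\s+` can equally be matched one character at a time by the outer `*`. Only the amount of backtracking on a failed attempt differs.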

Originally found in https://commons.wikimedia.org/w/index.php?title=Commons:Bar&oldid=347447603 .

Event Timeline

whym created this task. May 6 2019, 11:46 PM
Restricted Application added a project: Pywikibot. May 6 2019, 11:46 PM
Restricted Application added subscribers: pywikibot-bugs-list, Aklapper.

Change 508473 had a related patch set uploaded (by Whym; owner: Whym):
[pywikibot/core@master] textlib: avoid infinite execution of regex

https://gerrit.wikimedia.org/r/508473

Change 508473 merged by jenkins-bot:
[pywikibot/core@master] textlib: avoid infinite execution of regex

https://gerrit.wikimedia.org/r/508473

Xqt closed this task as Resolved. May 7 2019, 7:28 AM