Page MenuHomePhabricator

In a compiled module, a sufficiently complex Regex.Replace can loop if \r characters are still in the article text
Open, Needs TriagePublicBUG REPORT

Description

Steps to replicate the issue

  • As reported on [[:en:w:Wikipedia talk:AutoWikiBrowser]], create a C# module with Regex.Replace, using an insanely complex RE (that happens to contain \r characters, although removing them does not help).
  • Start to load [[:en:w:German Empire]].

What happens?:
AWB hangs indefinitely (the green bar just animates continuously).

What should have happened instead?:
AWB completes the RE evaluation in 40-50 seconds on a typical laptop, as it does when the equivalent RE replacement is specified in the Advanced F/R instead..

Software version (skip for WMF-hosted wikis like Wikipedia):
AWB 6.2.1.1

The problem is circumvented by removing \r characters from ArticleText before calling the custom module. However, I haven't analyzed why that makes a huge difference, and whether that would introduce any regressions. Sample fix: in Article.cs:SendPageToCustomModule, replace

string strTemp = module.ProcessArticle(processArticleEventArgs.ArticleText,

with

string localText = Tools.ConvertFromLocalLineEndings(processArticleEventArgs.ArticleText);
string strTemp = module.ProcessArticle(localText,

and

_customModuleMadeChanges = !strTemp.Equals(processArticleEventArgs.ArticleText);

with

_customModuleMadeChanges = !strTemp.Equals(localText);

Event Timeline

DavidBrooks renamed this task from A sufficiently complex Regex.Replace can loop if \r characters are still in the article text to In a compiled module, a sufficiently complex Regex.Replace can loop if \r characters are still in the article text.Nov 6 2023, 10:54 PM

Thinking about it overnight, I realize the suggested fix could be a regression, for those module authors who assume \r\n sequences in the text. An alternative would be to set a timeout on all user-supplied REs (say 2 minutes) and catch the exception with a "please simplify" dialog. But that would be a more significant amount of coding. Given that it's a rare occurrence (i.e. reported once after all this time), Won't Fix would be fine with me.