
TextExtracts exception on very long repetitive content
Closed, Resolved | Public | 1 Story Points

Description

Here's an example query that failed: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exsentences=5&explaintext=&titles=User:Pchelolo/Long_Test

Output:

{
    "servedby": "mw1282",
    "error": {
        "code": "internal_api_error_Exception",
        "info": "[V9MPcApAAE0AAc-HGa4AAAAN] Exception Caught: TextExtracts\\ExtractFormatter::getFirstSentences() error compiling regular expression /^(.+?(?:[^\\p{Lu}]\\.(?:[ \\n]|$)|[\\!\\?](?:[ \\n]|$)|\u3002|\uff0e|\uff01|\uff1f|\uff61)+){1,5}/u"
    }
}

Details

Related Gerrit Patches:
mediawiki/extensions/TextExtracts : master : getFirstSentences(): don't use crazy regexes

Event Timeline

Pchelolo created this task. Sep 9 2016, 7:39 PM
Restricted Application added a subscriber: Aklapper. Sep 9 2016, 7:39 PM

@MaxSem I believe you're most familiar with this code; any insights?

phuedx added a subscriber: phuedx. Sep 14 2016, 5:15 PM
Jhernandez triaged this task as High priority. Sep 14 2016, 5:19 PM
Jhernandez moved this task from Incoming to Triaged but Future on the Readers-Web-Backlog board.

Try preg_split()?
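
To illustrate the general idea of finding sentence boundaries one at a time instead of matching the whole extract with a single quantified pattern, here is a rough sketch in Python. It is not the code from the eventual patch and not TextExtracts' implementation; `first_sentences` is a hypothetical helper, and the boundary rules only approximate the terminators in the failing regex above (Python's `re` has no `\p{Lu}`, so `str.isupper()` stands in for it).

```python
import re

# Candidate sentence ends: ASCII . ! ? followed by a space, newline, or end
# of text, or one of the CJK terminators from the original pattern.
_SENTENCE_END = re.compile(r"[.!?](?=[ \n]|$)|[\u3002\uff0e\uff01\uff1f\uff61]")


def first_sentences(text: str, count: int) -> str:
    """Return roughly the first `count` sentences of `text` (hypothetical helper)."""
    found = 0
    for match in _SENTENCE_END.finditer(text):
        pos = match.start()
        # Approximates [^\p{Lu}]\. : ignore a period that starts the text or
        # directly follows an uppercase letter (e.g. the initial in "J. Smith").
        if match.group() == "." and (pos == 0 or text[pos - 1].isupper()):
            continue
        found += 1
        if found == count:
            return text[: match.end()]
    # Fewer than `count` boundaries found: return the text unchanged.
    return text


print(first_sentences("One. Two! Three? Four.", 2))
```

Each candidate boundary is visited once in a single left-to-right scan, so a very long page with few or no sentence terminators simply falls through to the final `return` rather than forcing the regex engine to backtrack over the whole text.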

Change 331742 had a related patch set uploaded (by MaxSem):
getFirstSentences(): don't use crazy regexes

https://gerrit.wikimedia.org/r/331742

Change 331742 merged by jenkins-bot:
getFirstSentences(): don't use crazy regexes

https://gerrit.wikimedia.org/r/331742

phuedx closed this task as Resolved. Jan 27 2017, 10:17 AM

I copied @Pchelolo's test page to the Beta Cluster under User:Phuedx-test-2/T145231 and requested an extract with the following URL: https://en.wikipedia.beta.wmflabs.org/w/api.php?action=query&prop=extracts&exsentences=5&explaintext=&titles=User:Phuedx-test-2/T145231.

That the extract is a monstrosity is reflective of @Pchelolo's monstrous test case 😄 👍

phuedx set the point value for this task to 1. Jan 27 2017, 10:17 AM

^ 1 point for the review/testing on the Beta Cluster.