Page MenuHomePhabricator

Why don't timeouts work during long regular expression matching?
Closed, ResolvedPublic

Description

This regular expression exploded when run against this edit in Spanish Wikipedia. It ran for hours at 100% CPU and would not respond to a timeout. Why not? This is why we have timeouts.

This task is done when we can demonstrate a failure to timeout and explore alternative mechanisms for enforcing a timeout.

See also:

Event Timeline

Halfak created this task.Jun 27 2017, 3:31 PM
Restricted Application added a project: artificial-intelligence. · View Herald TranscriptJun 27 2017, 3:31 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

https://gist.github.com/halfak/b31b8ddc38ca701c4c964478a53da75f

I can't get the degenerate behavior in this simple example. The regex matches surprisingly fast.

awight added a subscriber: awight.Jul 6 2017, 4:02 AM

The gist looks perfectly "right", like you did it wrong in the right way. I only have really fringe guesses for now, like checking .so library versions, running this gist on the production box that was stuck, trying to lay hands on the strace output that unknown hero captured during the incident.

awight added a comment.Jul 6 2017, 5:40 AM

s/unknown hero/akosiaris/ -- Sorry I didn't read the incident report first; and thank you!

P5633 was also good reading, unfortunately I didn't find any stones left unturned. I'm happy to believe the regex theory for now, and don't need to ask for strace output.

akosiaris added a subscriber: akosiaris.EditedJul 6 2017, 11:28 AM

https://gist.github.com/halfak/b31b8ddc38ca701c4c964478a53da75f
I can't get the degenerate behavior in this simple example. The regex matches surprisingly fast.

I had the same problem, then I realized it's not just that part of the expression that's required. You need the compound one, that is the result of ORing (using the | operator) the entire set of regexps in that file as well as some swear word following the JAJAJAJA sample text. If you look at the edit it has some swear words embedded in-between. These turn out to be crucial to reproduce it. Some sample code is at https://gist.githubusercontent.com/subbuss/12746448368188ce5c3f72d62de61d24/raw/bc238c7f5e4f59d4aa654811b4aa5dd1f7805811/gistfile1.txt

s/unknown hero/akosiaris/ -- Sorry I didn't read the incident report first; and thank you!

You are most welcome :-).

P5633 was also good reading, unfortunately I didn't find any stones left unturned. I'm happy to believe the regex theory for now, and don't need to ask for strace output.

There is no strace output. The processes (the ones in Rl state) were stuck in an (effectively) busy loop in userspace so no syscalls were made. The other processes (the ones in S state) were stuck waiting to read from a pipe, presumably waiting for the former processes to send something down that pipe. I haven't looked into it more but it looked like it was some celery internal IPC that was being done over that pipe.

Halfak added a parent task: Restricted Task.Jul 6 2017, 4:50 PM
Halfak added a comment.Jul 6 2017, 7:31 PM

Just updated https://gist.github.com/halfak/b31b8ddc38ca701c4c964478a53da75f

And confirmed that ORES timeout does not work (after running for an hour when timeout was set to 15 seconds).

So next I want to try to see if there's a better way to time out.

awight added a comment.Jul 6 2017, 8:14 PM

I found that the regex can be reduced to this and still causes the (near-)infinite loop:

bad_re = re.compile('(j+[aeiou]*)*(\\b)', re.I)

Halfak added a comment.Jul 6 2017, 8:39 PM

OK! Figured it out. So a Threading-based timeout will fail to kill an issue like this because of python's GIL. When I switched to a signal-based timeout, things worked much better.

>>> import re
>>> import time
>>> import mwapi
>>> from ores.util import timeout
>>> 
>>> session = mwapi.Session("https://es.wikipedia.org")
Sending requests with default User-Agent.  Set 'user_agent' on mwapi.Session to quiet this message.
>>> doc = session.get(action='query', prop='revisions', revids=100032572, rvprop='content', formatversion=2)
>>> text = doc['query']['pages'][0]['revisions'][0]['content']
>>> print(text[0:100])
{MARICAAAAA CASAMELAAAA FUMA PORRO
JAJAJJAJAJAJAJAJAJJAJAJJAJAJAJAJAJAJJAJAJJAJAJAJAJAJAJJAJAJJAJAJA
>>> 
>>> bad_re = re.compile('(j+[aeiou]*)*(\\b)', re.I)
>>> 
>>> def apply_bad_re(text):
...     return list(bad_re.finditer(text))
... 
>>> timeout(apply_bad_re, text, seconds=2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/halfak/projects/ores/ores/util.py", line 35, in timeout
    raise TimeoutError("Timed out after {0} seconds.".format(duration))
ores.errors.TimeoutError: Timed out after 2.0002875328063965 seconds.
>>>

So then I submitted https://github.com/wiki-ai/ores/pull/215 but that causes problems in our test suite. It turns out that, in order to test celery, we use celerytest to set up a celery worker in a parallel thread. Unsurprisingly, we get the following erro:

celery.worker.job: ERROR: Task ores.scoring_systems.celery_queue._process_score_map[9999f061-faea-4269-84e0-a3d2afc52250] raised unexpected: ValueError('signal only works in main thread',)

But it seems that the signal strategy still works when a celery worker is run from the main thread. So! Now I think we'll need to update our testing strategy for ORES -- ala T168007: Add API tests to ORES CI

awight assigned this task to Halfak.Jul 6 2017, 10:46 PM
awight moved this task from Active to Review on the Scoring-platform-team (Current) board.

I found that the regex can be reduced to this and still causes the (near-)infinite loop:

bad_re = re.compile('(j+[aeiou]*)*(\\b)', re.I)

Indeed. My experience as well. https://wikitech.wikimedia.org/wiki/Talk:Incident_documentation/20170623-ORES