At 11:00 UTC on June 23rd, ORES started to struggle. Within half an hour, it had fully choked and could no longer respond to any requests. It took us 10 hours to diagnose the problem, fix it, and satisfy ourselves that it was truly resolved. We learned some valuable lessons while studying and addressing this issue.
You can't prevent bad things from happening. Something will always go wrong. So you do the best you can to handle bad things gracefully. In a distributed processing environment like ORES' cluster, the worst thing that can happen is a process that blocks forever. So preparing for bad things means using timeouts for just about everything. So far, this has been a great strategy: at worst, only a few requests out of many fail when something goes wrong. Regrettably, during this downtime event one of the worst bad things happened, and at the same time we discovered that our timeouts could not stop deep processes that go rogue in a specific way. In this blog post, I'll explain what happened.
Recursive backtracking in a regular expression
Many of the models deployed in ORES use regular expressions to extract signal about the quality of an edit. For example, we use them to match "badwords" (curse words, racial slurs, and other words that are commonly used to cause offense) and "informals" (linguistic colloquialisms like "haha", "lol", or "wtf"). One such regular expression, used to match informal laughing in Spanish, looked like this: /j+[eaiou]+(j+[aeiou]*)*/ It is intended to match strings like "jajajaja" or "jijiji".
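On short inputs, the pattern behaves exactly as intended. Here is a minimal sketch using Python's standard `re` module (the `LAUGH_RE` name is mine; `IGNORECASE` is assumed so it also matches upper-case laughter):

```python
import re

# The informal-laughter pattern from the post. IGNORECASE is an
# assumption here so that upper-case "JAJAJA" also matches.
LAUGH_RE = re.compile(r"j+[eaiou]+(j+[aeiou]*)*", re.IGNORECASE)

for text in ["jajajaja", "jijiji", "hello"]:
    m = LAUGH_RE.search(text)
    print(text, "->", m.group(0) if m else "no match")
```

Because the quantifiers are greedy, a successful match consumes the entire run of laughter in one pass.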
In this edit of Spanish Wikipedia, an IP editor added a very long string of repeated "JAJAJJAJAJAJAJAJAJ" to the article for "Terrain". This is exactly what the regular expression was designed to match. But there was a problem: the regular expression was poorly designed in that it allowed catastrophic backtracking. Every time it matched the entire sequence of "JAJAJJAJAJAJAJAJAJ" and then failed upon encountering "...JAJAJlentos...", it would re-attempt the entire match with just one "JA" dropped from the middle. This doesn't really matter for short sequences, but for a very long one (and this one was 4155 chars long == 230 repetitions of "JAJAJJAJAJAJAJAJAJ"), the match would have taken days to finish. The plot below demonstrates how badly things break down at only 14 repetitions.
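The blow-up is easy to reproduce. The pattern on its own can exit early on a partial match, so the sketch below uses `fullmatch` to force the engine to account for the trailing text; this is my stand-in for the surrounding context the pattern had in production, not the production harness itself:

```python
import re
import time

PATTERN = re.compile(r"j+[eaiou]+(j+[aeiou]*)*", re.IGNORECASE)

timings = []
for reps in (4, 8, 12, 14):
    text = "JAJAJJAJAJAJAJAJAJ" * reps + "lentos"
    start = time.perf_counter()
    # The double "JJ" in each repetition can be split across iterations of
    # (j+[aeiou]*)* in two ways, so a failing match explores roughly
    # 2**reps decompositions before giving up.
    assert PATTERN.fullmatch(text) is None
    timings.append((reps, time.perf_counter() - start))

for reps, elapsed in timings:
    print(f"{reps:2d} repetitions: {elapsed:.5f}s")
```

Each step in the loop multiplies the work, which is why 230 repetitions would have run for days.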
Where were the timeouts?
Things like this happen. When operating in a distributed processing environment, you should always have timeouts on everything so that when something goes haywire, it doesn't take everything down. Regrettably, regular expression matching turned out to be not just a special opportunity for pathological backtracking, but also an opportunity to learn hard lessons about safe timeouts.
We have timeouts in ORES in a few strategic places. E.g., if a single scoring job takes longer than 15 seconds (extracting informal "JAJAJA" is part of a scoring job), it is supposed to time out. But for some reason, we weren't timing out during regular expression matching. I started digging into the library we use to implement execution timeouts, and what I learned was horrifying.
Most timeouts in Python are implemented with "threads". I put "threads" in quotes because threads in Python are a convenient abstraction, not true concurrency. Python's Global Interpreter Lock (GIL) is an internal mutex that prevents truly concurrent threading. To get around this, Python programs use separate processes to achieve real concurrency. I'm not going to get into the details of the GIL or process-based concurrency, but suffice it to say: if you use an external C library to execute a regular expression match on a string, any thread that is trying to implement a timeout will be locked up and totally fail to do what it is supposed to do!
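You can watch this failure happen. In the sketch below (my own construction, not ORES code), a watchdog thread records a timestamp every few milliseconds while the main thread runs one long regular expression match. Because the whole match is a single C-level call that holds the GIL, the watchdog stops getting scheduled for the entire duration, so any timeout logic living in that thread would fire far too late:

```python
import re
import threading
import time

PATTERN = re.compile(r"j+[eaiou]+(j+[aeiou]*)*", re.IGNORECASE)

ticks = []
stop = threading.Event()

def watchdog():
    # Record the time every 5 ms; a thread-based timeout would live here.
    while not stop.is_set():
        ticks.append(time.monotonic())
        time.sleep(0.005)

thread = threading.Thread(target=watchdog)
thread.start()
time.sleep(0.05)  # let the watchdog establish a steady tick rate

start = time.monotonic()
# One long C call: the GIL is held for the whole pathological match.
PATTERN.fullmatch("JAJAJJAJAJAJAJAJAJ" * 18 + "lentos")
match_time = time.monotonic() - start

stop.set()
thread.join()

# The largest gap between consecutive ticks spans the whole match:
# the watchdog was frozen out while the C code held the GIL.
max_gap = max(b - a for a, b in zip(ticks, ticks[1:]))
print(f"match took {match_time:.2f}s; longest watchdog silence {max_gap:.2f}s")
```

The longest silence is at least as long as the match itself, which is exactly why our thread-based timeouts never fired.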
Because our threading-based timeouts were completely disabled by this long regular expression match, our "precaching" system (which makes sure we score every edit and put the score in the cache ASAP) was slowly taking us down. Every time the problematic diff was requested, it would render yet another worker unreachable. Because ORES would simply fail to respond, our precaching system registered a connection timeout and retried the request. Capacity slowly decayed as our ~200 workers locked up at 100% CPU, one by one.
Luckily, there's an easy solution to this problem: Unix signals. By having the operating system help us manage our timeouts, we no longer have to rely on Python threads behaving sanely in order to recover from future rogue processes.
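Here is a minimal sketch of the signal-based approach (not the exact ORES implementation): the kernel delivers `SIGALRM` regardless of what the process is busy doing, and CPython checks for pending signals even during a long regular expression match, so the handler can raise an exception and abort it. Note that this is Unix-only and the handler must be installed in the main thread:

```python
import re
import signal

PATTERN = re.compile(r"j+[eaiou]+(j+[aeiou]*)*", re.IGNORECASE)

def on_alarm(signum, frame):
    # Raised inside whatever Python code is running when the alarm fires.
    raise TimeoutError("scoring step exceeded its time budget")

signal.signal(signal.SIGALRM, on_alarm)

# 230 repetitions, as in the actual vandalism: unbounded, this match
# would run for days.
evil = "JAJAJJAJAJAJAJAJAJ" * 230 + "lentos"

signal.setitimer(signal.ITIMER_REAL, 0.1)  # arm a 100 ms timeout
try:
    PATTERN.fullmatch(evil)
    outcome = "matched"
except TimeoutError:
    outcome = "timed out"
finally:
    signal.setitimer(signal.ITIMER_REAL, 0)  # always disarm the timer

print(outcome)
```

The `finally` block matters: leaving a stale timer armed would deliver a surprise `SIGALRM` to whatever code runs next.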
So, you fixed it right?
First, I should thank @ssastry for his quick work identifying the pathological backtracking problem and submitting a fix. We also completed an emergency deployment of ORES that implemented the use of Unix signals and we've been humming along, scoring all of the things, ever since.
I think the real problem was the timeout. The regular expression issue was just one of many potential problems that can happen in a distributed processing environment.
If we can get similar performance out of re2, then it might be worthwhile.
Sorry, but yes…? ;)
This is how the system is supposed to work when you call a C library. It is also a known problem with regular expressions: you need to make sure they terminate properly.
In that case you set a maximum number of iterations, and the pattern is still found. Note: pattern detection, not pattern replacement. It is easy to avoid this situation if you only do pattern detection, because then you don't need the whole string. If you do pattern replacement, you easily get into this kind of trouble, because then you do need the whole string.
I'm confused. I thought ORES was some kind of AI or machine learning, but you're telling me it's just regular expressions that I can write myself using AbuseFilters? Who came up with the regex /j+[eaiou]+(j+[aeiou]*)*/, a human or a computer?
@Nirmos, @jeblad got it right. We use a lot of different feature extraction strategies. The AI learns the meaning in the signal for those features. The AIs aren't simply rule based. They learn rules. But it's helpful to provide them with a bit of useful human insight. See feature engineering for a discussion.
We had an interesting problem and we fixed it. This post summarizes what happened. It's good practice to report publicly about what went wrong. Maybe someone will read this post and learn something about regular expressions or effective timeouts in Python.