Page MenuHomePhabricator

[Investigate] Non-backtracking regex parsers
Open, LowestPublic

Description

For example, google's pyre2 package.

This task is done when performance of python's built-in regex lib is compared to at least one non-backtracking regex library.

Event Timeline

Halfak created this task.Aug 18 2017, 2:23 PM
Restricted Application added a project: artificial-intelligence. · View Herald TranscriptAug 18 2017, 2:23 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

I think our test should specifically target the badwords/informals processing. We could also test out tokenizing via the deltas library.

awight triaged this task as Low priority.Sep 6 2017, 5:18 PM
awight added a project: good first task.
Restricted Application added a subscriber: TerraCodes. · View Herald TranscriptSep 6 2017, 5:18 PM
awight lowered the priority of this task from Low to Lowest.Sep 6 2017, 5:19 PM
awight moved this task from Untriaged to Maintenance/cleanup on the Machine Learning Platform board.

Hi, I researched a few possible packages for accomplishing this. If anyone has a recommendation for which one to choose, I'd be happy to try the benchmarking.

Facebook’s pyre2, wrapping Google's RE2, is a BSD-licensed drop-in replacement for Python’s re module. The README mentions that getting RE2 to build requires some particular make steps, which could be make installation a bit annoying. In its tox build it’s tested on Python 2.7, 3.6 and 3.7. There is a disclaimer: “pyre2 has only received basic testing, and I am by no means a Python extension expert, so it is quite possible that it contains bugs. I'd guess the most likely are reference leaks in error cases.”

rure-python is a MIT-licensed drop-in replacement for Python’s re module based on Rust’s regex crate, which “guarantees linear time searching.” For building, the regex crate would need version 1.28 or newer of rustc+cargo. They mention: “One important note regarding this shim: the Rust engine operates on byte offsets in the given search text, while Python operates on Unicode code points.”

The hyperscan package is a CPython extension for Intel’s Hyperscan. It “Currently only supports manylinux-compatible Linux distributions,” which would be an issue if developer workstations use other operating systems, so it's probably not the right choice.

The re2 package is a fork of pyre2 that can fall back to the Python re module if RE2 doesn’t support a given feature. This probably isn’t the right choice because it wouldn’t prevent blow-ups.

The regex package is an alternate regex implementation in Python, but it probably isn’t the right choice because it supports backtracking.