
Include a RegEx library for Lua
Closed, Declined (Public)

Description

Lua ships without RegEx support by default in order to keep the Lua runtime small. However, as we now run Lua on servers, size is not critical for us.


Version: unspecified
Severity: normal
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=47512

Details

Reference
bz50454

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 22 2014, 1:53 AM
bzimport set Reference to bz50454.
bzimport added a subscriber: Unknown Object (MLST).

On the other hand, it's very easy to create pathological regular expressions. This is much less likely with Lua patterns, and we were able to easily add checkpoints into Lua's pattern processing to allow Scribunto's CPU limiting to continue to work even if someone does manage this. Doing the same for a regex engine is likely to be more difficult.
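To make the concern concrete, here is a minimal sketch in Python of how a backtracking matcher blows up on a pattern like `^(a+)+$`, and how a step-counting checkpoint can cut it off. This is a toy matcher written for illustration only; it is not Scribunto's or LuaSandbox's actual checkpoint code, and the names `match_nested_plus`, `StepLimitExceeded`, and `max_steps` are hypothetical.

```python
class StepLimitExceeded(Exception):
    """Raised when the matcher uses more steps than its budget allows."""


def match_nested_plus(text, max_steps):
    """Match ^(a+)+$ by naive backtracking, counting steps as we go.

    Returns (matched, steps). Raises StepLimitExceeded once more than
    max_steps checkpoints have been passed.
    """
    steps = 0

    def checkpoint():
        # Stand-in for a CPU-limit hook: called once per group attempt.
        nonlocal steps
        steps += 1
        if steps > max_steps:
            raise StepLimitExceeded(f"gave up after {max_steps} steps")

    def one_group(pos):
        # Try to match one (a+) group at pos, then either the end of the
        # text or another group.
        checkpoint()
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Greedy first: take the longest run of 'a', then back off.
        for stop in range(end, pos, -1):
            if stop == len(text) or one_group(stop):
                return True
        return False

    matched = one_group(0)
    return matched, steps


if __name__ == "__main__":
    for n in (5, 10, 15):
        # A trailing 'b' forces the matcher to try every split of the 'a' run.
        print(n, match_nested_plus("a" * n + "b", max_steps=10**6))
    try:
        match_nested_plus("a" * 30 + "b", max_steps=10_000)
    except StepLimitExceeded as err:
        print("aborted:", err)
```

In this toy, the step count on a failing input doubles with every additional `a`, so the budget is the only thing that keeps the 30-character input from running for a very long time. A real regex engine has many more internal loops than this one, which is where the difficulty of adding checkpoints comes from.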

I don't think this is going to happen.

I think web admins should have a way to dynamically link Lua libraries into both LuaSandbox and the Lua standalone engine.

@alex-mashin That's not really relevant to this bug. That's more what T63432 is asking for.

jayvdb added a subscriber: jayvdb. Dec 18 2015, 10:20 PM

As this Phabricator task doesn't show what happened: https://static-bugzilla.wikimedia.org/show_activity.cgi?id=50454 shows that @Anomie closed this as WONTFIX on 2013-12-10.

http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html looks nice, including lpeg.setmaxstack to cause crazy regex to fail.

alex-mashin added a comment. Edited Dec 19 2015, 7:42 AM

http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html looks nice, including lpeg.setmaxstack to cause crazy regex to fail.

Just like lrexlib, it won't work until you hack LuaSandbox (which I did).

He7d3r added a subscriber: He7d3r. Dec 19 2015, 7:19 PM
jeblad added a subscriber: jeblad. Edited Apr 3 2016, 2:12 AM

As Wikidata uses regexes, it has become important to make the same patterns work in Lua. That is, we must be able to reuse "format as a regular expression" (P1793), which is a PCRE pattern, and thus we need a Perl-type lib for Lua of some kind.

I think this is a blocker for the use of property P1793 in Lua, but we could perhaps hack together some kind of PCRE-ish, sub-par lib in pure Lua so that we can still use the claims.

thus we need a Perl-type lib for Lua of some kind.

Consider rrthomas.github.io/lrexlib.

Or better, make it possible to enable any external libraries for both the Lua standalone engine and the sandbox through MediaWiki settings.

Uanfala added a subscriber: Uanfala. Apr 4 2016, 7:04 PM

This should be fixed. Of course regexes that break things exist, but we face the same problem we did when we tried to deny template developers parser functions on the grounds that they were idiots and would break the whole wiki: namely, that someone will write Module:RegEx.

And here is some MIT-licensed glue for both POSIX and PCRE libraries: http://rrthomas.github.io/lrexlib/

This needs a solution!

LPeg would be great because it could make many string-related tasks easier, though it has a steep learning curve and I am not sure if it would use more or less memory and processing time than the less sophisticated methods that we use now. It might not be possible to cache LPeg patterns, because they can contain references to arbitrary Lua values, including functions (which are prohibited in modules loaded with mw.loadData because they can cause T67258: Information can be passed between #invoke's (tracking)), in which case patterns would have to be newly generated for each module invocation.

LPeg could be useful, for instance, in the more complex transliteration modules on the English Wiktionary.

LPeg doesn't have built-in functions to generate efficient patterns (see this discussion) to match the UTF-8 encodings of sets or ranges of code points. On the other hand, regex with Unicode support has such things built in.
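As a rough illustration of why this takes extra work, here is a sketch in Python of just the first step such a pattern generator needs: splitting a code point range into pieces whose UTF-8 encodings all have the same byte length. The function name `split_by_utf8_length` is hypothetical, the sketch ignores the surrogate gap (U+D800 to U+DFFF), and a real generator would go on to split each piece into byte-range sequences.

```python
# Largest code point encodable in 1, 2, 3 and 4 UTF-8 bytes.
LENGTH_LIMITS = [(0x7F, 1), (0x7FF, 2), (0xFFFF, 3), (0x10FFFF, 4)]


def split_by_utf8_length(lo, hi):
    """Split the code point range [lo, hi] by UTF-8 encoded length.

    Returns a list of (start, end, nbytes) tuples that together cover
    the range, in ascending order.
    """
    if not (0 <= lo <= hi <= 0x10FFFF):
        raise ValueError("expected 0 <= lo <= hi <= 0x10FFFF")
    pieces = []
    start = lo
    for limit, nbytes in LENGTH_LIMITS:
        if start > hi:
            break
        if start <= limit:
            end = min(hi, limit)
            pieces.append((start, end, nbytes))
            start = end + 1
    return pieces


if __name__ == "__main__":
    # A range that straddles the 1-, 2- and 3-byte boundaries.
    for start, end, nbytes in split_by_utf8_length(0x70, 0x900):
        print(f"U+{start:04X}..U+{end:04X}: {nbytes} byte(s), "
              f"{chr(start).encode('utf-8').hex()}..{chr(end).encode('utf-8').hex()}")
```

Even a range that looks contiguous in code points turns into several disjoint byte patterns, which is the part a regex engine with Unicode support handles internally.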

There are two regex libraries that would be better options than PCRE2, because they are designed to avoid pathological behavior: RE2 and Rust regex. In order to do that, they omit some common features like lookahead and lookbehind and backreferences. (The RE2 syntax reference lists many of the unsupported features.) Rust regex has the interesting feature of intersection, negation, and symmetric difference of character classes, which RE2 does not, but the C interface to Rust regex is incomplete, so adding a Scribunto extension would require finishing the Rust-to-C bindings and writing C-to-PHP and PHP-to-Lua bindings (or maybe bindings straight from C to Lua). RE2 on the other hand already has a PHP extension.

ahmad added a subscriber: ahmad. Nov 16 2019, 9:55 PM