
Consider creating a Wikibase datatype for patterns (regex)
Open, Needs Triage, Public

Description

According to the 2020 report on Property constraints (T244043):

It is suggested to create a RegularExpression (or Pattern, or similar) Property type to better edit, monitor, manage, check and process patterns represented by regular expressions. Some technical issues not addressed by the current study that are related to the lack of ability to manage regular expressions are described on phab:T176312, phab:T214378, phab:T236150 and phab:T240884.

Event Timeline

Some technical issues … that are related to the lack of ability to manage regular expressions are described on T176312, T214378, T236150 and T240884.

How would any of these be affected by a separate data type for regular expressions / patterns? The problem is the same: we still need to evaluate user-specified (untrusted) regular expressions safely; I don’t see what a separate data type would change here.
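To make "evaluating untrusted regular expressions safely" concrete, here is a minimal sketch, not what WikibaseQualityConstraints actually does. It assumes a hard time limit is enforced by running the match in a throwaway child process. The function name `match_with_timeout` and the use of Python's `re` dialect are assumptions made for illustration only.

```python
import subprocess
import sys

# Script run in the child process: exit code 0 = match, 1 = no match,
# 2 = the pattern did not compile.
_CHILD = """
import re, sys
try:
    matched = re.fullmatch(sys.argv[1], sys.argv[2]) is not None
except re.error:
    sys.exit(2)
sys.exit(0 if matched else 1)
"""


def match_with_timeout(pattern, text, timeout_s=2.0):
    """Return True/False for match/no match, or None if undecided.

    None covers both a timeout (e.g. catastrophic backtracking) and an
    invalid pattern. Hypothetical helper, for illustration only.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", _CHILD, pattern, text],
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so the runaway
        # match does not keep consuming CPU.
        return None
    if proc.returncode == 0:
        return True
    if proc.returncode == 1:
        return False
    return None
```

A pattern such as `(a+)+` checked against a long run of `a` followed by `b` backtracks exponentially in Python's `re`, so this helper would give up and return `None` instead of hanging. Nothing in the sketch depends on how the pattern is stored, which is the point being made above.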

A specific data/property type for regular expressions or patterns would:

  • ensure that the stored regular expressions or patterns are syntactically correct, much as quantity-type properties ensure that their values are numbers rather than paragraphs of text; all implementations, tools and reusers would benefit from this;
  • introduce a specific input (interface) control in Wikibase (not WikibaseQualityConstraints) that could make pattern editing more friendly: monospaced text, warnings, colored brackets, and all the amazing features that the developers want to implement. :-)

This does not solve the low-level security issues of each implementation (e.g., guarding against a malicious "regular expression" such as "'; DROP TABLE important_things;").
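The first point in the list above (rejecting syntactically broken patterns at save time) could look roughly like the sketch below. It is only an illustration: `pattern_syntax_error` is a hypothetical name, and Python's `re` stands in for whichever regex dialect such a data type would actually standardise on.

```python
import re
from typing import Optional


def pattern_syntax_error(value: str) -> Optional[str]:
    """Return None if `value` compiles as a regex, else a short message.

    Hypothetical save-time validator; checks syntax only, not safety.
    """
    try:
        re.compile(value)
    except re.error as exc:
        return f"invalid pattern: {exc}"
    return None
```

With this check, `[1-9]\d*` would be accepted and an unterminated `[1-9` would be rejected. The string `'; DROP TABLE important_things;` also compiles, since it contains only literal characters, which fits the remark that syntax validation does not address implementation-level security.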

Bugreporter renamed this task from Consider creating a Wikibase Property type for patterns to Consider creating a Wikibase datatype for patterns (regex). Feb 18 2020, 10:14 PM

Would it also allow for reusing a regex?
See https://w.wiki/RK2 for stats regarding multiple properties using the same regex:
[1-9]\d* is used 615 times
\d+ is used 542 times
[1-9][0-9]* is used 89 times

Also, having items for the regex patterns would mean that they would be better monitored by users and optimized ([1-9][0-9]* is the same as [1-9]\d*)
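As a rough illustration of the kind of check that would help when consolidating such patterns, the sketch below compares two patterns on a handful of sample strings. The helper `agree_on` and the sample list are made up for this example, and the behaviour shown is that of Python's `re`; other regex dialects may treat `\d` differently.

```python
import re

# Hand-picked samples; the last one is "1" followed by ARABIC-INDIC DIGIT TWO.
SAMPLES = ["1", "42", "007", "", "12a", "1\u0662"]


def agree_on(samples, pattern_a, pattern_b, flags=0):
    """True if both patterns accept exactly the same samples (full match)."""
    a = re.compile(pattern_a, flags)
    b = re.compile(pattern_b, flags)
    return all(bool(a.fullmatch(s)) == bool(b.fullmatch(s)) for s in samples)


# ASCII-only matching: \d means [0-9], so the two patterns agree.
ascii_equal = agree_on(SAMPLES, r"[1-9]\d*", r"[1-9][0-9]*", re.ASCII)

# Default (Unicode) matching in Python: \d also accepts non-ASCII digits,
# so the patterns disagree on the last sample.
unicode_equal = agree_on(SAMPLES, r"[1-9]\d*", r"[1-9][0-9]*")
```

So whether `[1-9][0-9]*` and `[1-9]\d*` are interchangeable depends on how the evaluating engine defines `\d`. Agreement on samples is only a smoke test, not a proof of equivalence.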

@DannyS712: I think I don't understand your question. What do you mean by reusing a regex? What would you expect could be done on the UI that is not possible now? On https://www.wikidata.org/wiki/Wikidata:2020_report_on_Property_constraints#format there are also some general stats about our regular expressions.