
Support regex queries in TextMatchEditCheck
Open, In Progress, Medium; Public; 3 Estimated Story Points

Description

Another improvement to TextMatchEditCheck: allow individual rules to have regex queries.

Stories

As someone who is authoring a new TextMatchEditCheck/Suggestion, I want to define the matching rule using regular expressions (regex), so that I can detect broader patterns of text (rather than needing to specify exact strings) and as a result, create more robust and flexible Edit Suggestions.

Requirements

First pass needed from Editing Engineering.

Done

  • Code is merged
  • Publish an update to Edit_check/TextMatch that includes the information other developers will need to use this new capability

Thank you to @Chaotic_Enby for raising this idea

Event Timeline

dchan triaged this task as High priority. Mar 31 2026, 5:18 PM
dchan lowered the priority of this task from High to Medium.

Complex searches may also lead to the need for complex replacements, e.g. (cat)s? -> dog/dogs - or replacing a list of characters as in T412445 T421445

Has anyone already mentioned languages with extensive use of suffixes?

replacing a list of characters as in T412445

You probably mean T421445, the parent task.

VPuffetMichel changed the point value for this task from 3 to 2. Apr 8 2026, 4:59 PM
VPuffetMichel changed the point value for this task from 2 to 3.

Documenting my query about regex flavors: Editors using AWB are used to the .NET flavor and those using Cirrus search are used to a MediaWiki-flavored version. Hopefully we can avoid introducing a third flavor that could cause confusion.

ppelberg raised the priority of this task from Low to Medium. Apr 10 2026, 11:46 PM
medelius changed the task status from Open to In Progress. Apr 15 2026, 2:42 PM

Documenting my query about regex flavors: Editors using AWB are used to the .NET flavor and those using Cirrus search are used to a MediaWiki-flavored version. Hopefully we can avoid introducing a third flavor that could cause confusion.

... while Lua module maintainers use Lua-flavored regexes, Pywikibot coders use Python-flavored regexes, and AbuseFilter maintainers use PCRE.

Documenting my query about regex flavors: Editors using AWB are used to the .NET flavor and those using Cirrus search are used to a MediaWiki-flavored version. Hopefully we can avoid introducing a third flavor that could cause confusion.

This is going to use the existing regex support in VE, so it'll be using JavaScript regex. (But not introducing anything new, just continuing to use something that has been present for a long time.)

Hi! It occurred to me that even once we have regexps, there will be no way to "simplify" replacements like "pattern": "replacement", "Pattern": "Replacement" (we can simplify the pattern using caseSensitive: false or regexps, but not the replacements). Does the dev team have any ideas about that?

In my opinion the best solution would be some kind of replacement syntax. Replacements can already use variables that depend on the regexp matches (that's standard), so maybe something similar can be done for case.

Here is an example of such a "complex" replacement. The pattern has two words, each of which can have arbitrary case, and the replacement words should have the same case.

"финалист суперкубка": "участник суперкубка",
"финалист Суперкубка": "участник Суперкубка",
"Финалист суперкубка": "Участник суперкубка",
"Финалист Суперкубка": "Участник Суперкубка"

Change #1277715 had a related patch set uploaded (by Medelius; author: Medelius):

[mediawiki/extensions/VisualEditor@master] TextMatch: support regex queries

https://gerrit.wikimedia.org/r/1277715

Change #1277715 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] TextMatch: support regular expresion queries

https://gerrit.wikimedia.org/r/1277715

We've implemented regex support! I've updated the TextMatch documentation, but below is the most relevant information you'd need to start using the regex configs. Any feedback on this is appreciated.

You can include your pattern(s) in the matchRule's query as normal. Then enable isRegExp, which dictates that TextMatch should treat this query/queries as a regex pattern. Same as with non-regex queries, the query property can contain a single pattern, an array of patterns, or a set of patterns and their replacements.

A few other TextMatch updates:

  • For clarity, we've changed the name matchItem to matchRule, though we still support the former inside on-wiki configs.
  • We've added a preserveCase property. If matchRule is a replace type and preserveCase is enabled, the suggested replacement term will be in the same case as the matched term (limited to title case, lowercase, and uppercase).

So @Wellverywell, your "финалист суперкубк" example could be achieved even without regex, just by using preserveCase.

You could also utilize regex + replace to do something like (as you may already be planning!):

"typos": {
	"query": {
		"Тайланд(а|е|ский)": "Таиланд$1"
	},
	"isRegExp": true,
	"preserveCase": true,
	"mode": "replace",
	...
}
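
For readers less familiar with capture groups, here is a minimal plain-JavaScript sketch of what the query entry above expresses. It only uses the standard String.prototype.replace; it is not TextMatch code, and the variable and function names are made up for illustration.

```javascript
// Plain-JavaScript illustration of the "query" entry above; not TextMatch code.
const typoPattern = /Тайланд(а|е|ский)/;
const typoReplacement = 'Таиланд$1'; // $1 is the text captured by the first group

// Hypothetical helper, named here only for the example.
function fixTypo(text) {
	return text.replace(typoPattern, typoReplacement);
}

console.log(fixTypo('Тайланда'));    // Таиланда
console.log(fixTypo('Тайландский')); // Таиландский
console.log(fixTypo('Тайланд'));     // unchanged: none of the listed endings follows
```

The captured ending is carried over unchanged, so one entry covers every ending listed in the group.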

Anyway, any feedback on this is appreciated!

Is it intentional that regex-based replacements are always case-insensitive? At least this is my impression from reading the code.

Also, I think it should be documented that all text rules (regex-based and regex-free) use the "whole word" flag, i.e., the match can start and end only at the word boundary (provided I'm reading the code right).

@medelius That's amazing, thank you so much!!

My first feedback is that, I think, preserveCase should be true by default for replace rules.

@medelius I tested RegExps for a bit. They are amazing! But I still found a few bugs, not sure whether in my edit checks or in the code.

  1. \w doesn't capture Cyrillic letters. I think this is a JavaScript limitation, but still -- what should we do for this? Would \p{Script=Cyrillic} work (I don't know if EditCheck's RegExps use u flag) or should we use [а-яё] as right now? Both are significantly more verbose than just \w.
  1. It seems \b also doesn't work very well. Take as an example this check which fixes old country names:
"в Киргизии": "в Кыргызстане",
"(?<!\\bв )Киргизии": "Кыргызстана",

Here, the old name and new name have slightly different declensions, so we need to check whether there was a preposition в before it.
So it should work like so:

удав Киргизии -> удав Кыргызстана (not a preposition)
в Киргизии -> в Кыргызстане (preposition)

But in the second case both checks are suggested, so seems like the negative lookbehind assertion didn't work in this case.

  1. preserveCase works well, but it's not a panacea. Take for example this check, which fixes prepositions for island names in Russian and (hopefully) ignores their case:
"query": {
	"в (Американских Виргинских|Бермудских|Галапагосских|Каймановых|Кокосовых|Маршалловых|Фарерских|Фолклендских|Соломоновых|Аландских|Багамских) островах": "на $1 островах",
	"в островах Кука": "на островах Кука",
	"в (Кубе|Кюрасао|Мальдивах|Фиджи|Ямайке|Мадагаскаре|Мальте|Филиппинах)": "на $1"
},
"isRegExp": true,
"preserveCase": true,

If we test it, it should work like so:

в Бермудских островах -> на Бермудских островах
в Бермудских Островах -> на Бермудских Островах
в островах Кука -> на островах Кука
в Островах Кука -> на Островах Кука
в Кубе -> на Кубе
В Бермудских островах -> На Бермудских островах
В Бермудских Островах -> На Бермудских Островах
В островах Кука -> На островах Кука
В Островах Кука -> На Островах Кука
В Кубе -> На Кубе

But this is what it actually does:

в Бермудских островах -> на Бермудских островах
в Бермудских Островах -> на Бермудских островах
в островах Кука -> на островах Кука
в Островах Кука -> на островах Кука
в Кубе -> на Кубе
В Бермудских островах -> на Бермудских островах
В Бермудских Островах -> На Бермудских Островах
В островах Кука -> на островах Кука
В Островах Кука -> На Островах Кука
В Кубе -> На Кубе

And also PLEASE make an option for case-sensitive regexps! Not sure whether this was intended, but currently caseSensitive: true doesn't work with RegExps, and this is probably the most annoying bug now.
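
On the \w question in item 1 of this comment, the difference between the three options can be checked in plain JavaScript. This is only a sketch of standard regex behaviour; whether EditCheck passes the u flag to its regexes is not established in this thread, so the \p{...} line may not apply there.

```javascript
// Standard JavaScript behaviour, independent of EditCheck.
const asciiOnlyWord = /^\w+$/.test('кот');                 // \w is [A-Za-z0-9_] only
const scriptProperty = /^\p{Script=Cyrillic}+$/u.test('кот'); // needs the u flag
const rangeWithoutYo = /^[а-я]+$/.test('ёлка');            // ё lies outside а-я
const rangeWithYo = /^[а-яё]+$/i.test('Ёлка');             // explicit ё plus the i flag

console.log(asciiOnlyWord);  // false
console.log(scriptProperty); // true
console.log(rangeWithoutYo); // false
console.log(rangeWithYo);    // true
```

Note that ё has to be listed explicitly, because it is not inside the а-я range.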

Is there a limit to the number of TextMatchEditCheck that we can add?
On frwiki, we have https://fr.wikipedia.org/w/index.php?title=Wikipédia:Liste_de_fautes_d%27orthographe_courantes&useparsoid=0 which is a list of common typos and https://fr.wikipedia.org/wiki/Wikipédia:AutoWikiBrowser/Typos which is the setup page for corrections to automatically apply when using AutoWikiBrowser.
Both these pages are pretty long (about a thousand lines each), so would it cause issues to make TextMatchEditCheck for everything mentioned in them?

Change #1285862 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/VisualEditor@master] TextMatchEditCheck: allow regex queries to be case-sensitive

https://gerrit.wikimedia.org/r/1285862

@Escargot_rouge There's a related task for investigating whether it's feasible to re-use the existing listings of AWB fixes (rather than duplicating them into TextMatch) at T423650: [Suggestion] Detect typo from existing lists. As DLynch notes in the comments there, yes, the performance and scaling implications of adding thousands of entries do need to be checked.

Change #1285862 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] TextMatchEditCheck: allow regex queries to be case-sensitive

https://gerrit.wikimedia.org/r/1285862

Change #1285908 had a related patch set uploaded (by Medelius; author: Medelius):

[mediawiki/extensions/VisualEditor@master] EditCheck: preserveCase config property should default to true

https://gerrit.wikimedia.org/r/1285908

My first feedback is that, I think, preserveCase should be true by default for replace rules.

@medelius It also just occurred to me that it should be false for case-sensitive rules -- I'm not sure whether anything would break otherwise, but I think there are almost no cases where preserveCase would be useful for case-sensitive patterns.

Thanks for all your feedback! Let me look into the bugs you've mentioned.

And you make a good point on case-sensitive rules and the preserveCase default. What I'm imagining is, say, a matchRule with "query": { "Новым Годом": "Новым годом" } and caseSensitive: true and preserveCase: true; it would keep replacing the original text with itself and so would never actually resolve the check. Is that the type of situation you're thinking of?

I agreed with your initial thinking on preserveCase defaulting to true before you helpfully pointed out that undesirable scenario. In that case, I wonder if it's safer to still require preserveCase to be explicitly set to true. Even though many rules might want preserveCase to be true, I'd feel better if we avoid implicit coupling between preserveCase and caseSensitive. Let me know your thoughts, and I'll ask the other engineers too.
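
The self-replacement scenario described above can be sketched in JavaScript. Everything here is hypothetical: detectCase and applyCase are made-up names, and "title case" is assumed to mean "every word capitalized", which is one reading that reproduces the scenario. The real preserveCase implementation may differ.

```javascript
// Hypothetical helpers: the names and the exact case rules are assumptions.
function detectCase(text) {
	if (text === text.toLowerCase()) return 'lower';
	if (text === text.toUpperCase()) return 'upper';
	const words = text.split(/\s+/).filter(Boolean);
	const everyWordCapitalized = words.every(
		(w) => w[0] === w[0].toUpperCase() && w.slice(1) === w.slice(1).toLowerCase()
	);
	return everyWordCapitalized ? 'title' : 'mixed';
}

// Re-case the replacement so that it follows the case of the matched text.
function applyCase(matched, replacement) {
	switch (detectCase(matched)) {
		case 'lower': return replacement.toLowerCase();
		case 'upper': return replacement.toUpperCase();
		case 'title':
			return replacement.replace(/\S+/g, (w) => w[0].toUpperCase() + w.slice(1).toLowerCase());
		default: return replacement;
	}
}

const matchedText = 'Новым Годом';
const suggested = applyCase(matchedText, 'Новым годом');
console.log(suggested === matchedText); // true: the suggestion equals the original text
```

Under these assumptions the rule's whole purpose (lowercasing the second word) is undone by the case transfer, which is why coupling the two options implicitly looks risky.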

Is it intentional that regex-based replacements are always case-insensitive? At least this is my impression from reading the code.

Also, I think it should be documented that all text rules (regex-based and regex-free) use the "whole word" flag, i.e., the match can start and end only at the word boundary (provided I'm reading the code right).

@DLynch added support for regex case sensitivity this morning, so that'll be live this week.

I've updated the documentation to include a note about the whole-word flag - thank you for pointing this out.

Is that the type of situation you're thinking of?

Yeah, something like this -- where the case is intentionally changed.

In that case, I wonder if it's safer to still require preserveCase to be explicitly set to true.

I'm still pretty sure that it would be a safe and reasonable default to make it true when we both have replace mode and caseSensitive being false (including when it defaults to false when undefined) -- at least I can't really think of many examples where it might break things, and it would definitely be helpful in most cases.

@medelius I tested RegExps for a bit. They are amazing! But I still found a few bugs, not sure whether in my edit checks or in the code.

  1. \w doesn't capture Cyrillic letters. I think this is a JavaScript limitation, but still -- what should we do for this? Would \p{Script=Cyrillic} work (I don't know if EditCheck's RegExps use u flag) or should we use [а-яё] as right now? Both are significantly more verbose than just \w.
  1. Yes, it's unfortunately a JavaScript limitation. I think you'd have to do something like [а-яё], but let me loop in the internationalization expert @dchan and see if he has any input.
  1. It seems \b also doesn't work very well. Take as an example this check which fixes old country names:
"в Киргизии": "в Кыргызстане",
"(?<!\\bв )Киргизии": "Кыргызстана",

Here, the old name and new name have slightly different declensions, so we need to check whether there was a preposition в before it.
So it should work like so:

удав Киргизии -> удав Кыргызстана (not a preposition)
в Киргизии -> в Кыргызстане (preposition)

But in the second case both checks are suggested, so seems like the negative lookbehind assertion didn't work in this case.

  1. I think this is because JavaScript \b is unreliable for Cyrillic. Might something like (?<!^в )(?<!\\sв )Киргизии work? Though that's pretty verbose.
  1. preserveCase works well, but it's not a panacea. Take for example this check, which fixes prepositions for island names in Russian and (hopefully) ignores their case:

I created T426123 for this.
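
The \b behaviour from item 2 can be reproduced in plain JavaScript, outside EditCheck. The sketch below compares the original pattern with the workaround suggested in the reply; the variable names are illustrative, and other TextMatch behaviour (such as the whole-word flag) is not modelled.

```javascript
// JavaScript's \b is defined via \w, which covers ASCII only, so there is
// never a "word boundary" between two Cyrillic letters or before one.
const originalPattern = /(?<!\bв )Киргизии/;
const workaroundPattern = /(?<!^в )(?<!\sв )Киргизии/;

console.log(originalPattern.test('в Киргизии'));        // true: lookbehind never blocks
console.log(originalPattern.test('удав Киргизии'));     // true
console.log(workaroundPattern.test('в Киргизии'));      // false
console.log(workaroundPattern.test('живу в Киргизии')); // false
console.log(workaroundPattern.test('удав Киргизии'));   // true
```

Because \bв can never match in front of a Cyrillic в, the negative lookbehind in the original pattern always succeeds, which would explain why both rules fire on "в Киргизии".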

  1. It seems \b also doesn't work very well. Take as an example this check which fixes old country names:
"в Киргизии": "в Кыргызстане",
"(?<!\\bв )Киргизии": "Кыргызстана",

Here, the old name and new name have slightly different declensions, so we need to check whether there was a preposition в before it.
So it should work like so:

удав Киргизии -> удав Кыргызстана (not a preposition)
в Киргизии -> в Кыргызстане (preposition)

But in the second case both checks are suggested, so seems like the negative lookbehind assertion didn't work in this case.

  1. I think this is because JavaScript \b is unreliable for Cyrillic. Might something like (?<!^в )(?<!\\sв )Киргизии work? Though that's pretty verbose.

I talked about this with Mitte27 (we discussed a few other possibilities when we need to limit what words couldn't go before Киргизии) and I got a new idea -- maybe it's possible to somehow create some kind of "pattern blocklist"?
To elaborate -- if I write in query parameter something like "!в Киргизии" it would mean the check shouldn't match "в Киргизии" (that's more "query-style" approach, I think, like queries for selecting files in e.g. .gitignore), or maybe some new parameter like ignore: ["в $1"] could be added (more "pattern-style" approach).

@medelius It seems I found another possible bug in regexps matching...
We in ruwiki have the following check:

			"punctuation-typos": {
				"query": {
					"(?\u003C=[\\wа-яё\\d]) ,": ",",
					"(?\u003C=[\\wа-яё\\d]) \\.": ".",
					"(?\u003C=[\\wа-яё\\d]) \\?": "?",
					"(?\u003C=[\\wа-яё\\d]) !": "!",
					"(?\u003C=[\\wа-яё\\d]) :": ":",
					"(?\u003C=[\\wа-яё\\d]) ;": ";",
					"(?\u003C=[\\wа-яё\\d]) »": "»",
					"« (?=[\\wа-яё\\d])": "«",
					"(?\u003C=[\\wа-яё\\d]) “": "“",
					"„ (?=[\\wа-яё\\d])": "„",
					"(?\u003C=[\\wа-яё\\d]) \\)": ")",
					"\\( (?=[\\wа-яё\\d])": "(",
					"(?\u003C=[\\wа-яё\\d]) \\]": "]",
					"\\[ (?=[\\wа-яё\\d])": "["
				},
				"isRegExp": true,
				"mode": "replace",
				"title": "Проверить пунктуацию",
				"message": "Возможно, здесь содержится пунктуационная ошибка, которая должна быть исправлена."
			}

(the patterns look scary because of JSON; they are meant to be (?<=[\wа-яё\d]) , -> , and so on)
This check is meant to remove wrong spaces before/after punctuation symbols, and it also checks that there is a Latin/Cyrillic letter or a digit next to the space (so that something like award — ? doesn't get matched). The check worked fine until I introduced this letter condition -- now something like word ? is correctly flagged as wrong, but the check suggests replacing " ?" with " ?", which doesn't make sense, and I don't understand where the space in the replacement comes from...
My guess is that the regexp works correctly for finding the match, but then, when the replacement is applied only to the matched text itself, it no longer works because the regexp no longer matches that text on its own... I think this is easy to circumvent locally just by adding ^| to the lookbehind, but it's probably worth fixing or at least documenting.
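
The guess above can be reproduced in plain JavaScript. This sketch only demonstrates the general effect of re-running a lookbehind pattern on the isolated match; it is an assumption that EditCheck does something equivalent internally, and the variable names are illustrative.

```javascript
const spaceBeforeQuestion = /(?<=[\wа-яё\d]) \?/;
const sample = 'word ?';

// The match itself is only " ?"; the letter before it belongs to the context.
const found = sample.match(spaceBeforeQuestion)[0];

const replacedInContext = sample.replace(spaceBeforeQuestion, '?');
const replacedInIsolation = found.replace(spaceBeforeQuestion, '?');

// The local workaround mentioned above: also allow start-of-string in the lookbehind.
const relaxedPattern = /(?<=^|[\wа-яё\d]) \?/;
const replacedWithRelaxed = found.replace(relaxedPattern, '?');

console.log(JSON.stringify(replacedInContext));   // "word?"
console.log(JSON.stringify(replacedInIsolation)); // " ?"  (lookbehind has nothing to look at)
console.log(JSON.stringify(replacedWithRelaxed)); // "?"
```

One caveat of the workaround: the relaxed pattern also matches a " ?" at the very start of a text, where no letter precedes the space.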

(Also I just found out that having even one broken regex kills EditCheck completely -- this is probably not good)

image.png (231×1 px, 50 KB)

the check suggests to replace ? with ? which doesn't make sense, and I don't understand from where the space in replacement comes from...

Reminds me of T251087: VisualEditor and 2017 wikitext editor: Find & Replace fails to replace when using positive lookaheads/lookbehinds.

Thanks for the link! Yes, seems like it's exactly the same issue.

@matej_suchanek + @Wellverywell + @Escargot_rouge: first and foremost, on behalf of the entire Editing Team, I'd like to share how inspired we've been by seeing you all experimenting with, extending, and as the latest comments here demonstrate, identifying ways TextMatch (and Edit Suggestions more broadly) are breaking!

Now, with regard to this ticket, I'd like to close it out as the work it was scoped for has now been delivered.

Although before doing so, a question for y'all: what (if any) outstanding bugs and/or ideas related to this work are not documented in the tickets below?

@ppelberg thank you so much! These ideas remain:

it would be a safe and reasonable default to make it [preserveCase] true when we both have replace mode and caseSensitive being false (including when it defaults to false when undefined)

maybe it's possible to somehow create some kind of "pattern blocklist"?
To elaborate -- if I write in query parameter something like "!в Киргизии" it would mean the check shouldn't match "в Киргизии" (that's more "query-style" approach, I think, like queries for selecting files in e.g. .gitignore), or maybe some new parameter like ignore: ["в $1"] could be added (more "pattern-style" approach).

and this bug:

(Also I just found out that having even one broken regex kills EditCheck completely -- this is probably not good)

image.png (231×1 px, 50 KB)

I can file tickets for them later too.

UPD: @medelius It seems like I found one more bug 🙂 The check for fixing punctuation problems with - and em dash:

					"--": "—",
					"\\s-\\s": " — ",
					"\\s-(\\d+)": " −$1",
					"\\s-([а-яё]+)": " —$1",
					"([^\\d\\s]+)(—|\\s—|—\\s)([^\\d\\s]+)": "$1 — $3"

Behaves weirdly when there are templates around an em dash. For example, see a section of article Краснодарский край:

image.png (479×1 px, 101 KB)

The edit check suggests something which looks like changing em dash to em dash (which already isn't very useful), but when I click it it just removes the templates... I have no idea why this happens and what I should fix so that it wouldn't.

A similar thing but for links instead of templates happens here.

And one more bug again -- this paragraph:

Параметризуя рациональные дроби полиномов различными числами, можно получать различные факторизации: при параметризации вещественным числом — расширенное поле вещественных, комплексным (не вещественным) — комплексных чисел. Число, используемое для параметризации, есть корень простого (над вещественным полем) полинома, отождествляемого с нулём, т. е. по модулю которого берутся числители и знаменатели (в случае вещественного числа — первой степени, комплексного — квадратный с отрицательным дискриминантом и, соответственно, двумя сопряжёнными комплексными корнями).

from page Комплексное число triggers this check multiple times (with NBSPs before dashes -- something that is meant to be already covered and not be replaced).
The regexps currently are (case sensitive):

					"--": "—",
					"\\s-\\s": " — ",
					"\\s-(\\d+)": " −$1",
					"\\s-([а-яёА-ЯЁ]+)": " —$1",
					"([^\\d\\s]+)(\\s—|—\\s)([^\\d\\s]+)": "$1 — $3",
					"([А-ЯЁA-Z][^\\d\\s]*)(?\u003C![XVI]+)—(?![XVI]+)([А-ЯЁA-Z][^\\d\\s]*)": "$1 — $2",
					"([^\\d\\s]+)(?\u003C![XVI]+)—(?![XVI]+)([^\\d\\s]+)": "$1-$2",
					"([XVI]+)(\\s*-\\s*|\\s—|—\\s|\\s—\\s)([XVI]+)": "$1—$3"

The weirdest thing is that I can't reproduce these matches when testing just in JavaScript.

Change #1295713 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/VisualEditor@master] TextMatchEditCheck: isolate errors to only disrupt one rule

https://gerrit.wikimedia.org/r/1295713

Change #1295713 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] TextMatchEditCheck: isolate errors to only disrupt one rule

https://gerrit.wikimedia.org/r/1295713
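
As a rough illustration of what "isolating errors to one rule" can mean, the sketch below compiles each pattern inside its own try/catch so that one invalid regex does not prevent the others from being used. This is not the merged patch; compileRules and its return shape are hypothetical.

```javascript
// Hypothetical helper: compile every query independently and collect failures.
function compileRules(queries) {
	const compiled = [];
	const errors = [];
	for (const [source, replacement] of Object.entries(queries)) {
		try {
			compiled.push({ regex: new RegExp(source, 'g'), replacement });
		} catch (e) {
			// An invalid pattern throws a SyntaxError; record it and keep going.
			errors.push({ source, message: e.message });
		}
	}
	return { compiled, errors };
}

const result = compileRules({
	'(cat)s?': 'dog',
	'(unclosed': 'never used' // deliberately broken pattern
});
console.log(result.compiled.length); // 1
console.log(result.errors.length);   // 1
```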

Also, seems like e.g. regexp \. matches the result of <code>.html</code> even if ignoreQuotedContent is set to true.

And one more thing -- I've wanted to introduce a pattern for finding missing spaces after/before punctuation, but it doesn't work because to circumvent T251087 I need to include (?=[\\wа-яё\\d]|$), and then for some reason (I too can't reproduce it in just JavaScript) it matches any ending punctuation in any paragraph. I tried to add (?!\\s) so that it wouldn't detect if there's a newline after, but to no avail.

It seems I found an even crazier bug now...
When there are multiple patterns in one check, \1 refers not to the first capturing group in the pattern, as expected, but to the first capturing group of the FIRST pattern (or something like that)!
That's why for example a pattern didn't work correctly until this edit (where I moved it to a separate check): https://ru.wikipedia.org/w/index.php?title=MediaWiki%3AEditcheck-config.json&diff=153363109&oldid=153362698
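
One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that the patterns of a check are combined into a single expression with alternation. In JavaScript, group numbers are then counted across the whole combined expression, so \1 in a later pattern points at a group of the first one:

```javascript
// Two independent patterns; the second uses \1 to mean "the same digit again".
const sources = ['(а)б', '(\\d)\\1'];

const standalone = new RegExp(sources[1]);
// Assumed combination strategy: wrap each source and join with "|".
const combined = new RegExp(sources.map((s) => `(?:${s})`).join('|'));

console.log(standalone.test('12'));  // false: "2" is not a repeat of "1"
console.log(combined.test('12'));    // true: \1 now refers to (а), which did not participate
console.log(combined.exec('12')[0]); // 1
```

In the combined regex, \1 refers to a group that took no part in the match, and JavaScript treats such a backreference as matching the empty string, so the second alternative matches far more than intended.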

And one more strange bug, possibly regex-unrelated (ok, I really should move these bugs to separate tickets, sorry):
With the check:

			"miscapitalization": {
				"query": {
					"Автономн([а-яё]+) республик([а-яё]+) Крым": "Автономн$1 Республик$2 Крым",
					"Велик([а-яё]+) Отечественн([а-яё]+) Войн([а-яё]+)": "Велик$1 Отечественн$2 войн$3",
					"Велик([а-яё]+) отечественн([а-яё]+) Войн([а-яё]+)": "Велик$1 Отечественн$2 войн$3",
					"Велик([а-яё]+) отечественн([а-яё]+) войн([а-яё]+)": "Велик$1 Отечественн$2 войн$3",
					"Верховн([а-яё]+) Рад([а-яё]+)": "Верховн$1 рад$2",
					"Вице-Президент([а-яё]*)": "Вице-президент$1",
					"Вооружённы([а-яё]+) Сил([а-яё]*)": "Вооружённы$1 сил$2",
					"Государственн([а-яё]+) Дум([а-яё]+)": "Государственн$1 дум$2",
					"(Перв[а-яё]+|Втор[а-яё]+|Трет[а-яё]+) Миров([а-яё]+) Войн([а-яё]+)": "$1 миров$2 войн$3",
					"(Перв[а-яё]+|Втор[а-яё]+|Трет[а-яё]+) миров([а-яё]+) Войн([а-яё]+)": "$1 миров$2 войн$3",
					"(Перв[а-яё]+|Втор[а-яё]+|Трет[а-яё]+) Миров([а-яё]+) войн([а-яё]+)": "$1 миров$2 войн$3",
					"Российск([а-яё]+) Импери([а-яё]+)": "Российск$1 импери$2",
					"Российск([а-яё]+) федераци([а-яё]+)": "Российск$1 Федераци$2",
					"Ростов([а-яё]*)-На-Дону": "Ростов$1-на-Дону"
				},
				"isRegExp": true,
				"title": "Проверить капитализацию",
				"mode": "replace",
				"message": "Написание должно соответствовать правилам русского языка и названию соответствующей статьи в Википедии. Исключение допускается для цитат или названий публикаций на внешних ресурсах.",
				"config": {
					"caseSensitive": true,
					"ignoreQuotedContent": true
				}
			},

it correctly detects miscapitalized phrases like "Великая Отечественная Война" and replaces with "Великая Отечественная война", but it is not shown in the suggestion itself. Note that e.g. "Первая Мировая Война" just below it works fine:

image.png (240×311 px, 11 KB)

image.png (307×310 px, 15 KB)

Also, seems like e.g. regexp \. matches the result of <code>.html</code> even if ignoreQuotedContent is set to true.

<code> doesn’t count as a quote. I’m not sure it should, either? We might need to talk about that.