
Support a "number of occurrences" config in TextMatch
Closed, Resolved (Public)

Description

In T409484, we put together some baseline configs for identifying potentially LLM-generated content using TextMatch. We realized that some of these queries are more prone to false positives than others. That is, some queries contain more common words that, on their own, might not indicate LLM-generated content; however, if a certain number of these are found within, say, a paragraph, that could warrant a flag.

To support a higher threshold, we want to allow individual rules to configure (a) how many occurrences of any of their queries, and (b) within which text space, should trigger a TextMatch EditCheck.

Story

As someone configuring TextMatch-based EditChecks to identify potentially LLM-generated content, I want to be able to set, for each rule (read: text pattern), how many occurrences of its queries within a defined text space (e.g. a paragraph, a section, or the entire edit) should trigger a flag, so that I can reduce false positives stemming from common words or phrases appearing in isolation.

Requirements

  • Support a new config property for TextMatch that allows a rule (matchItem) to specify the number of occurrences that should trigger a match, and in what range it should be looking (paragraph, sentence, etc).
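
To illustrate the requirement, here is a minimal Python sketch of occurrence counting scoped to a paragraph. It is not the VisualEditor implementation: the function name `flag_paragraphs`, the blank-line paragraph splitting, the whole-word case-insensitive matching, and the sample queries are all assumptions made for this example.

```python
import re


def flag_paragraphs(text, queries, min_occurrences=1):
    """Return the paragraphs where the queries occur at least min_occurrences times."""
    if not queries:
        return []
    # One case-insensitive, whole-word pattern covering every query.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(q) for q in queries) + r")\b",
        re.IGNORECASE,
    )
    flagged = []
    # Treat blank lines as paragraph boundaries (an assumption of this sketch).
    for paragraph in re.split(r"\n\s*\n", text):
        # Occurrences of any query count toward the same per-paragraph total.
        if len(pattern.findall(paragraph)) >= min_occurrences:
            flagged.append(paragraph)
    return flagged


sample = (
    "The city is a vibrant tapestry of cultures.\n\n"
    "It is a testament to resilience, a vibrant tapestry that continues to delve into its past."
)
queries = ["tapestry", "testament", "delve"]

# A threshold of 1 flags both paragraphs; a threshold of 3 flags only the second.
print(len(flag_paragraphs(sample, queries, 1)))
print(len(flag_paragraphs(sample, queries, 3)))
```

Raising the threshold is what reduces false positives here: a single common word no longer flags a paragraph on its own.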

Event Timeline

Change #1203591 had a related patch set uploaded (by Medelius; author: Medelius):

[mediawiki/extensions/VisualEditor@master] TextMatchEditCheck: support a # of occurrences config

https://gerrit.wikimedia.org/r/1203591

Based on our discussions, it seems like minOccurrences, and expand=paragraph generally, open up a world of new use cases that seem distinct from plain old text matching.

They all seem to involve inferring something about the "vibe" of the paragraph based on one or more matches.

We're wondering should this functionality be a separate edit check, in order to keep TextMatch simple (and to keep the new check simple). We will think more on this.

> Based on our discussions, it seems like minOccurrences, and expand=paragraph generally, open up a world of new use cases that seem distinct from plain old text matching.
>
> They all seem to involve inferring something about the "vibe" of the paragraph based on one or more matches.

Mmm. Might you be able to share a "sketch" of some of the use cases this functionality "loosed" in y'alls discussion?

> We're wondering should this functionality be a separate edit check, in order to keep TextMatch simple (and to keep the new check simple). We will think more on this.

Sounds great.

Yes, here are some differences. In all the use cases we can think of, TextMatch:

  1. finds an exact word or phrase,
  2. regardless of context,
  3. optionally offering one or more replacements;

whereas matching with expand=paragraph and minOccurrences > 1:

  1. finds a whole paragraph,
  2. heuristically, based on the presence (and quantity) of certain words / phrases,
  3. but the words / phrases are not individually important, other than as indicators,
  4. and no replacements are offered.

We think these cases are semantically different enough that we should consider having two different checks, as the code may be harder to comprehend and maintain if we satisfy both use cases in a single check. On the other hand, there is clearly some duplication between both use cases.

Change #1203591 merged by jenkins-bot:

[mediawiki/extensions/VisualEditor@master] TextMatchEditCheck: support a # of occurrences config

https://gerrit.wikimedia.org/r/1203591

@medelius: Could you share how the config file should be formatted to see this? I am not sure if I am setting it correctly here: https://en.wikipedia.beta.wmcloud.org/wiki/MediaWiki:Editcheck-config.json

Set a numeric "minOccurrences" value inside the item's config block. The item must also have an "expand" value (likely "paragraph"), set outside that config block. I've updated the documentation to reflect this. Thanks!
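
As a rough illustration of that shape, the Python sketch below parses a hypothetical rule and checks the two constraints mentioned in the reply. Only "minOccurrences" (inside "config") and "expand" (beside "config") come from this thread. The "query" key, its values, and the `check_rule` helper are assumptions, and the keys that wrap the rule in the real Editcheck-config.json are not shown.

```python
import json

# Hypothetical rule: "query" and its values are assumed for illustration.
RULE_JSON = """
{
  "query": ["tapestry", "testament", "delve"],
  "expand": "paragraph",
  "config": {
    "minOccurrences": 3
  }
}
"""


def check_rule(rule):
    """Return a list of problems with how minOccurrences is configured."""
    problems = []
    min_occurrences = rule.get("config", {}).get("minOccurrences")
    if min_occurrences is None:
        return problems  # minOccurrences is not set, so there is nothing to check
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(min_occurrences, bool) or not isinstance(min_occurrences, (int, float)):
        problems.append("minOccurrences must be numeric")
    # "expand" lives on the item itself, not inside its config block.
    if "expand" not in rule:
        problems.append("minOccurrences needs an expand value on the item itself")
    return problems


rule = json.loads(RULE_JSON)
print(check_rule(rule))
```

The documentation mentioned above remains the authoritative reference for the actual key names and nesting.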

Yeah, I figured it out later, thanks for updating the documentation.