In T409484, we put together some baseline configs for identifying potentially LLM-generated content using TextMatch. We realized that some of these queries are more prone to false positives than others: they contain common words that, on their own, might not indicate LLM-generated content. If several of these words are found within, say, a single paragraph, that could warrant a flag.
To support a higher threshold, we want to allow each rule to configure (a) how many occurrences of any of its queries, (b) within which text space, should trigger a TextMatch EditCheck.
Story
As someone configuring TextMatch-based EditChecks to identify potentially LLM-generated content, I want to be able to set, for each rule (read: text pattern), how many occurrences of its queries within a defined text space (e.g. a paragraph, a section, or the entire edit) should trigger a flag, so that I can reduce false positives stemming from common words or phrases appearing in isolation.
Requirements
- Support a new config property for TextMatch that allows a rule (matchItem) to specify the number of occurrences that should trigger a match, and the range within which those occurrences are counted (paragraph, sentence, etc.).
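
To make the requirement concrete, here is a minimal sketch of how such a threshold could be evaluated. It is not the TextMatch implementation: the property names `minOccurrences` and `scope`, the `MatchItem` shape, and the helper functions are all hypothetical, and only two scopes (paragraph and whole edit) are shown. Matching is plain case-insensitive substring search, so it does not respect word boundaries.

```typescript
// Hypothetical rule shape; only the idea of a matchItem with queries comes from the task.
type Scope = "paragraph" | "edit";

interface MatchItem {
  queries: string[];       // terms or phrases to look for
  minOccurrences?: number; // hypothetical: hits needed within one scope unit (default 1)
  scope?: Scope;           // hypothetical: range in which hits are counted (default "edit")
}

// Split the edited text into the units in which hits are counted.
function splitByScope(text: string, scope: Scope): string[] {
  // Paragraphs are assumed to be separated by a blank line.
  return scope === "paragraph" ? text.split(/\n\s*\n/) : [text];
}

// Count case-insensitive, non-overlapping hits of any query in one unit.
function countHits(unit: string, queries: string[]): number {
  const haystack = unit.toLowerCase();
  let hits = 0;
  for (const query of queries) {
    const needle = query.toLowerCase();
    if (needle === "") continue;
    let from = haystack.indexOf(needle);
    while (from !== -1) {
      hits++;
      from = haystack.indexOf(needle, from + needle.length);
    }
  }
  return hits;
}

// A rule flags the text if any single scope unit reaches the threshold.
function shouldFlag(text: string, item: MatchItem): boolean {
  const threshold = item.minOccurrences ?? 1;
  const units = splitByScope(text, item.scope ?? "edit");
  return units.some((unit) => countHits(unit, item.queries) >= threshold);
}

// Example rule: flag only when two or more hits share a paragraph.
const rule: MatchItem = {
  queries: ["delve", "tapestry", "testament"],
  minOccurrences: 2,
  scope: "paragraph",
};

const spread = "We delve into the history.\n\nThe city is a rich tapestry.";
const dense = "We delve into a rich tapestry of history.";

const spreadFlagged = shouldFlag(spread, rule); // one hit per paragraph
const denseFlagged = shouldFlag(dense, rule);   // two hits in one paragraph
```

With this rule, `spread` is not flagged because its two hits fall in different paragraphs, while `dense` is. Omitting both hypothetical properties falls back to flagging on a single hit anywhere in the edit, which is one way to keep rules without a threshold working as they do today.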