Reduce text returned by EntityContent::getTextForFilters
Open, Normal, Public

Description

T204109 profiles a 30-second API edit request and determines that roughly 10 seconds of that time is spent in the AbuseFilter edit filter hook.
T204109#4602542 looks at how many more elements could possibly be ignored when generating the text that is passed to AbuseFilter.
Ignoring more elements not only speeds up the collection of strings, but also has a knock-on effect on the time taken to process all of those strings.

We should investigate which strings are actually used by the community in AbuseFilter rules and then remove the elements that are not needed.

Event Timeline

Addshore triaged this task as Normal priority. Sep 24 2018, 8:59 AM
Addshore created this task.
Restricted Application added a project: User-Addshore. Sep 24 2018, 8:59 AM
Addshore changed the task status from Open to Stalled. Sep 24 2018, 9:00 AM

Stalled as it requires community investigation to be done first.

Addshore removed Addshore as the assignee of this task. Sep 24 2018, 9:00 AM
Addshore removed a project: User-Addshore.

Change 474135 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/WikibaseMediaInfo@master] Add MediaInfoContent::getIgnoreKeysForFilters

https://gerrit.wikimedia.org/r/474135

Change 474136 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/WikibaseLexeme@master] Add LexemeContent::getIgnoreKeysForFilters

https://gerrit.wikimedia.org/r/474136

Change 474137 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] Add Item/PropertyContent::getIgnoreKeysForFilters

https://gerrit.wikimedia.org/r/474137

The 3 patches above move the definition of keys to ignore out of EntityContent and into each Content object itself so that they can be defined per entity type.

We still need T205254 to be done before we can remove any keys from the text we pass to abuse filters.
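
The per-entity-type split described above can be illustrated with a minimal sketch. It is written in Python rather than PHP, and everything in it is an assumption made for illustration: the method names only mirror the patch titles, the traversal (walk the serialized entity and skip string values stored under an ignored key) is a guess at the general approach, and the ignored keys shown are examples, not the lists used in Wikibase.

```python
from typing import Any, Dict, FrozenSet, Iterator, Optional


def _collect_strings(node: Any, ignore: FrozenSet[str],
                     key: Optional[str] = None) -> Iterator[str]:
    """Yield every string leaf whose immediate key is not ignored (assumed behaviour)."""
    if isinstance(node, dict):
        for child_key, value in node.items():
            yield from _collect_strings(value, ignore, child_key)
    elif isinstance(node, list):
        for value in node:
            # List items have no string key, so they are never ignored by key.
            yield from _collect_strings(value, ignore, None)
    elif isinstance(node, str) and key not in ignore:
        yield node


class EntityContent:
    """Hypothetical base class: owns the traversal, not the list of ignored keys."""

    def __init__(self, serialization: Dict[str, Any]) -> None:
        self._serialization = serialization

    def get_ignore_keys_for_filters(self) -> FrozenSet[str]:
        # Each entity type overrides this to declare its own keys.
        return frozenset()

    def get_text_for_filters(self) -> str:
        ignore = self.get_ignore_keys_for_filters()
        return "\n".join(_collect_strings(self._serialization, ignore))


class ItemContent(EntityContent):
    def get_ignore_keys_for_filters(self) -> FrozenSet[str]:
        # Example keys only; the real lists live in the patches.
        return frozenset({"language", "site", "type"})


item = ItemContent({
    "id": "Q1",
    "type": "item",
    "labels": {"en": {"language": "en", "value": "Universe"}},
})
print(item.get_text_for_filters())
```

With this shape, a new entity type only has to override one small method to change which keys are skipped; the shared traversal stays in the base class.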

Change 474147 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/Wikibase@master] EntityContent::getTextForFilters tests

https://gerrit.wikimedia.org/r/474147

Change 474150 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/WikibaseMediaInfo@master] MediaInfoContent::getTextForFilters tests

https://gerrit.wikimedia.org/r/474150

Change 474155 had a related patch set uploaded (by Addshore; owner: Addshore):
[mediawiki/extensions/WikibaseLexeme@master] LexemeContent::getTextForFilters tests

https://gerrit.wikimedia.org/r/474155

Change 474150 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] MediaInfoContent::getTextForFilters tests

https://gerrit.wikimedia.org/r/474150

Change 474135 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Add MediaInfoContent::getIgnoreKeysForFilters

https://gerrit.wikimedia.org/r/474135

Change 474147 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] EntityContent::getTextForFilters tests

https://gerrit.wikimedia.org/r/474147

Change 474155 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] LexemeContent::getTextForFilters tests

https://gerrit.wikimedia.org/r/474155

Change 474136 merged by jenkins-bot:
[mediawiki/extensions/WikibaseLexeme@master] Add LexemeContent::getIgnoreKeysForFilters

https://gerrit.wikimedia.org/r/474136

Change 474137 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add Item/PropertyContent::getIgnoreKeysForFilters

https://gerrit.wikimedia.org/r/474137

We arrived at a list of things that could be removed in an initial iteration at T205254#5047549.

But...

Splitting out the GUID with the current code poses problems: the key for the field is "id", but simply removing that key at all levels is too heavy-handed, as it would also remove, for example, the IDs in statement values.
If we want to decrease the amount of text sent to AbuseFilter, we might have to slightly refactor the way the text is collected.
It might make sense to have a separate set of serializers / codecs for representing an entity for AbuseFilter, which could then simply be used here.
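
The "id" problem can be made concrete with a small sketch. It is in Python, not the PHP of the real code, and it does not implement the separate serializers suggested above; it swaps in a simpler technique, path-based ignoring, purely to show why a key-only ignore list is too coarse. The function names, the "*" wildcard convention and the trimmed-down entity structure are all hypothetical.

```python
from typing import Any, Iterable, List, Optional, Set, Tuple

Path = Tuple[Any, ...]


def collect_by_key(node: Any, ignore_keys: Set[str],
                   key: Optional[str] = None) -> List[str]:
    """Key-only ignoring: a key is skipped at every level of the structure."""
    if isinstance(node, dict):
        return [s for k, v in node.items() for s in collect_by_key(v, ignore_keys, k)]
    if isinstance(node, list):
        return [s for v in node for s in collect_by_key(v, ignore_keys, None)]
    if isinstance(node, str) and key not in ignore_keys:
        return [node]
    return []


def _matches(path: Path, pattern: Path) -> bool:
    # "*" in a pattern matches any single dict key or list index.
    return len(path) == len(pattern) and all(
        p == "*" or p == q for p, q in zip(pattern, path)
    )


def collect_by_path(node: Any, ignore_paths: Iterable[Path],
                    path: Path = ()) -> List[str]:
    """Path-based ignoring: only leaves at the listed positions are skipped."""
    ignore_paths = list(ignore_paths)
    if isinstance(node, dict):
        children = list(node.items())
    elif isinstance(node, list):
        children = list(enumerate(node))
    else:
        skip = any(_matches(path, p) for p in ignore_paths)
        return [node] if isinstance(node, str) and not skip else []
    out: List[str] = []
    for key, value in children:
        out.extend(collect_by_path(value, ignore_paths, path + (key,)))
    return out


# Trimmed-down, illustrative entity: an entity ID, a statement GUID, a value ID.
entity = {
    "id": "Q1",
    "claims": {"P31": [{
        "id": "Q1$abc",
        "mainsnak": {"datavalue": {"value": {"id": "Q5"}}},
    }]},
}

print(collect_by_key(entity, {"id"}))
print(collect_by_path(entity, [("claims", "*", "*", "id")]))
```

Ignoring the key "id" everywhere drops the entity ID and the statement value ID along with the GUID, whereas the path pattern drops only the GUID. Dedicated serializers would achieve the same selectivity by never emitting the GUID in the first place.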

Addshore changed the task status from Stalled to Open. Jun 20 2019, 9:53 PM

Going to create some sub tickets.

Announced on the admin noticeboard, pinging the main AbuseFilter maintainers. Suggested that they comment here; if nothing is blocking, we can merge on August 6th.