
Investigation into AntiSpoof maintenance [4H]
Closed, ResolvedPublic

Description

We've been asked to be code stewards for the AntiSpoof extension. Before we take on this responsibility, we should do research to understand what the extension does and assess how much investment it would take to maintain it.

We would also want to look into https://www.mediawiki.org/wiki/Equivset as part of this investigation because AntiSpoof heavily relies on it.

Event Timeline

Niharika triaged this task as Medium priority. Oct 13 2020, 6:54 PM
Niharika created this task.
ARamirez_WMF renamed this task from Investigation into AntiSpoof maintenance to Investigation into AntiSpoof maintenance [4H]. Oct 14 2020, 4:37 PM
Niharika renamed this task from Investigation into AntiSpoof maintenance [4H] to Investigation into AntiSpoof maintenance. Oct 14 2020, 4:38 PM
Niharika renamed this task from Investigation into AntiSpoof maintenance to Investigation into AntiSpoof maintenance [4H].
Niharika added a project: AntiSpoof.
Niharika updated the task description.

Just to give some context, AntiSpoof has two main use cases (which don't always align perfectly well):

  • It is used by AbuseFilter to help filter for vandalism in edits.
  • It is used to prevent the registration of usernames that closely match existing usernames.
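
To make the second use case concrete, here is a rough sketch of how a confusable-character check can work in principle. It is not AntiSpoof's or Equivset's actual code: the `EQUIV` table is a tiny hypothetical stand-in for the real equivalence data, and `normalize` and `conflicts` are names invented for this illustration.

```python
# Hypothetical, hand-picked mapping for illustration only; the real
# Equivset data is much larger and is not reproduced here.
EQUIV = {
    "0": "O",        # digit zero looks like capital O
    "1": "I",        # digit one looks like capital I
    "\u0430": "A",   # Cyrillic small a
    "\u043e": "O",   # Cyrillic small o
}


def normalize(name: str) -> str:
    """Replace each character with a canonical look-alike, else uppercase it."""
    return "".join(EQUIV.get(ch, ch.upper()) for ch in name)


def conflicts(new_name: str, existing: list[str]) -> list[str]:
    """Return existing usernames whose normalized form matches new_name's."""
    key = normalize(new_name)
    return [name for name in existing if normalize(name) == key]


if __name__ == "__main__":
    # "Nih\u0430rika" uses a Cyrillic "a" in place of the Latin one.
    print(conflicts("Nih\u0430rika", ["Niharika", "kaldari"]))
```

Under this sketch, both use cases share the normalization step and differ only in what they compare the result against, which is one reason changes to the equivalence table can affect them differently.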

I looked into both AntiSpoof and Equivset and, to a lesser degree, AbuseFilter (since it uses Equivset as well).

I looked at GitHub's metrics on the codebases (frequency of commits, lines +/- over time, and what the commits were).
For AntiSpoof:
https://github.com/wikimedia/mediawiki-extensions-AntiSpoof/graphs/commit-activity
https://github.com/wikimedia/mediawiki-extensions-AntiSpoof/graphs/code-frequency
https://github.com/wikimedia/mediawiki-extensions-AntiSpoof/commits/master

For Equivset:
https://github.com/wikimedia/Equivset/graphs/commit-activity
https://github.com/wikimedia/Equivset/graphs/code-frequency
https://github.com/wikimedia/Equivset/commits/master

Based on these charts, it seems like neither of these libraries sees much activity (possibly a result of not having code stewards?) but also that they seem to be, on a code level, doing mostly what they're expected to do. Before 2017, AntiSpoof and Equivset were the same thing (https://phabricator.wikimedia.org/T174197). AntiSpoof in its current iteration is a wrapper around Equivset functionality and as such is a fairly simple library. Equivset was split out of AntiSpoof and has apparently not seen any major changes in how it functions (the bump in 2018 is the addition of math character mappings: https://github.com/wikimedia/Equivset/commit/4464b4454b48fe2d79b6b84f0810394d6db6b776). If I look at the phab spaces for them, I can also see that (compared to, say, AbuseFilter) there aren't any critical/high priority issues with either library.

AntiSpoof phab project: https://phabricator.wikimedia.org/project/profile/257/
Equivset phab project: https://phabricator.wikimedia.org/project/profile/3068/

There was an earlier code stewardship review that covered AbuseFilter, AntiSpoof, and Equivset: https://phabricator.wikimedia.org/T185154. Everyone agreed that AbuseFilter was very critical, but there was not necessarily an opinion on AntiSpoof or Equivset.

Since then, the statuses for AntiSpoof/Equivset have not changed that much. There are 11 open tasks in Equivset (https://phabricator.wikimedia.org/maniphest/?project=PHID-PROJ-5i5bluezh4htet5xjkd4&statuses=open()&group=none&order=newest#R) and 28 open tasks (27 if you exclude this investigation) in AntiSpoof (https://phabricator.wikimedia.org/maniphest/?project=PHID-PROJ-sx5kds2srrtinusmybsj&statuses=open()&group=none&order=newest#R).

In AntiSpoof, there are 8 issues that need triage, 8 medium priority, and 11 low priority. 5 issues in AntiSpoof also have the Equivset tag and mostly relate to deciding whether or not to add new character equivalencies. AntiSpoof will (eventually?) be bundled with mediawiki (https://phabricator.wikimedia.org/T191736). In Equivset, 6 issues need triage, 2 are medium priority, and 3 are low priority.

Skimming both backlogs, it seems like there's discussion on whether or not AntiSpoof is calibrated correctly, what to add to Equivset, and whether or not Equivset is the correct system for AntiSpoof (I see confusables.txt floated around a lot and there is some interesting discussion here: https://phabricator.wikimedia.org/T65217). As far as I can tell, none of this discussion has led to any further investigation. It seems like we agree that it will not be a perfect system but we haven't agreed on what sort of imperfect we want. I'd like to bring up this point from Huji as a concern (https://phabricator.wikimedia.org/T246353#6215366): since we don't have robust tests, it would be difficult to make changes and know they didn't cause regressions (https://phabricator.wikimedia.org/T179834).

I was curious how (and if) changes to Equivset would affect AbuseFilter, since kaldari mentions that Equivset is more tailored toward AbuseFilter than AntiSpoof (https://phabricator.wikimedia.org/T246353#6533626). I tried searching by tags but that didn't surface much. However, reading through the comments on some of these issues, it seems like if we could improve the filtering system, we would be making moderators' lives easier (example from ptwiki: https://phabricator.wikimedia.org/T178010#3680739) and possibly improving performance (some discussion around that: https://phabricator.wikimedia.org/T185154#3928483).

Also potentially of interest: AbuseFilter is undergoing an overhaul: https://phabricator.wikimedia.org/project/view/4939/

I think there could be work, mostly focused on improving code health (tests, regressions, and performance), but it doesn't seem like these things have become critical pain points (yet?).

@kaldari We discussed this as a team today and agreed to take on code stewardship for AntiSpoof and Equivset. Equivset is also a dependency for AbuseFilter and there is a higher potential for bugs/feature requests with it. Flagging that in case there might be bugs/feature requests that we are unable to take on because of our existing workloads.