Split off AntiSpoof equivset generation and string normalization into its own library
Closed, Resolved · Public · 5 Estimated Story Points
Assigned To

Authored By

	dmaza
	Aug 25 2017, 7:09 PM

Description

It will be a good idea to split off normalizeString and equivset generation out of AntiSpoof and into a separate library. This way other components (or anyone) can make use of it without the dependency on the extension and it will leave AntiSpoof with only one responsibility, username spoofing.
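As a rough illustration of what such a standalone library could expose, here is a minimal Python sketch (the actual library is PHP). The table contents and the names `EQUIVSET`, `normalize_string`, and `is_equal` are assumptions made up for this example, not the library's API.

```python
# Hypothetical, tiny equivalence table: each character maps to the
# representative of its look-alike set. A real table would be far larger.
EQUIVSET = {
    "0": "O",       # DIGIT ZERO looks like capital O
    "1": "I",       # DIGIT ONE looks like capital I
    "l": "I",       # lowercase L looks like capital I
    "\u0430": "A",  # CYRILLIC SMALL LETTER A
    "\u0410": "A",  # CYRILLIC CAPITAL LETTER A
    "\u03bf": "O",  # GREEK SMALL LETTER OMICRON
}


def normalize_string(text: str) -> str:
    """Replace each character with its set representative, else uppercase it."""
    return "".join(EQUIVSET.get(ch, ch.upper()) for ch in text)


def is_equal(a: str, b: str) -> bool:
    """Treat two strings as confusable if they normalize to the same value."""
    return normalize_string(a) == normalize_string(b)
```

With this table, `is_equal("adm1n", "\u0430dmin")` is true because both strings normalize to `ADMIN`. Nothing in the sketch depends on an extension, which is the point of the split.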

Details

	Subject	Repo	Branch	Lines +/-
	Split off AntiSpoof equivset generation and string normalization into its own library	mediawiki/libs/Equivset	master	+15 K -0
	Split off AntiSpoof equivset generation and string normalization into its own library	mediawiki/libs/Equivset	master	+5 K -11 K

Related Objects

Status	Assigned	Task
Resolved	dbarratt	T178704 Enforce that Equivset Test Coverage Remains at 100%
Resolved	• TBolliger	T166816 Epic ⚡️ : Accuracy improvements to anti-spoof tools across multiple pertinent tools
Resolved	dbarratt	T174195 More extensive unit testing for AntiSpoof
Resolved	dbarratt	T177667 Get Equivset Test Coverage to 100%
Resolved	dbarratt	T175413 Update AbuseFilter to use new AntiSpoof library
Resolved	dbarratt	T177983 Update AntiSpoof to use new Equivset library
Resolved	dbarratt	T178537 Add wikimedia/equivset for AbuseFilter & AntiSpoof
Resolved	dbarratt	T174197 Split off AntiSpoof equivset generation and string normalization into its own library

Event Timeline

Restricted Application added a subscriber: Aklapper. · Aug 25 2017, 7:09 PM

Legoktm added a project: Librarization. · Aug 25 2017, 7:37 PM

• TBolliger added a parent task: T166816: Epic ⚡️ : Accuracy improvements to anti-spoof tools across multiple pertinent tools. · Aug 25 2017, 8:19 PM

• TBolliger moved this task from Untriaged to Cards ready for development on the Anti-Harassment board. · Aug 29 2017, 4:49 PM

• TBolliger moved this task from Cards ready for development to Triage/To be Estimated on the Anti-Harassment board.

• TBolliger mentioned this in T175413: Update AbuseFilter to use new AntiSpoof library. · Sep 8 2017, 7:34 PM

• TBolliger set the point value for this task to 5.

• TBolliger moved this task from Triage/To be Estimated to Cards ready for development on the Anti-Harassment board. · Sep 8 2017, 7:37 PM

Reedy added a parent task: T175413: Update AbuseFilter to use new AntiSpoof library. · Sep 9 2017, 3:50 PM

@dmaza Why do we generate the equivset rather than just modifying equivset.php?

I'm thinking we could just move to having a static file (either equivset.php or a language-agnostic equivset.json).

Alternatively, we could just move out the generation into the library and update it like we are doing now, but I don't see the point of generating the file (unless I'm missing something here).

dbarratt removed a parent task: T175413: Update AbuseFilter to use new AntiSpoof library. · Sep 12 2017, 3:21 PM

dbarratt added a subtask: T175413: Update AbuseFilter to use new AntiSpoof library.

dmaza moved this task from Cards ready for development to AHT Sprint 5 on the Anti-Harassment board. · Sep 12 2017, 6:46 PM

dmaza edited projects, added Anti-Harassment (AHT Sprint 5); removed Anti-Harassment.

• TBolliger moved this task from AHT Sprint 5 to Cards ready for development on the Anti-Harassment board. · Sep 12 2017, 7:06 PM

• TBolliger edited projects, added Anti-Harassment; removed Anti-Harassment (AHT Sprint 5).

@dbarratt Based on the current implementation, we generate that file because we keep it in sync with https://www.mediawiki.org/wiki/Extension:AntiSpoof/Equivalence_sets
We also generate equivset.txt, which is a more human-readable representation of what is being replaced by what, making it easier to catch errors.
Also, some characters look very similar, and it is easier to maintain the mapping using the Unicode code points (which we check before generating equivset.ser and equivset.php).
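To illustrate the idea of keeping the mapping as code points, validating it, and rendering a readable listing, here is a minimal Python sketch. It is not the actual generateEquivset.php logic; the sample pairs, the function names, and the output line format are all assumptions for this example.

```python
import unicodedata

# Hypothetical source pairs: (code point to replace, code point to use instead).
PAIRS = [
    (0x0030, 0x004F),  # DIGIT ZERO -> LATIN CAPITAL LETTER O
    (0x0430, 0x0041),  # CYRILLIC SMALL LETTER A -> LATIN CAPITAL LETTER A
    (0x03BF, 0x004F),  # GREEK SMALL LETTER OMICRON -> LATIN CAPITAL LETTER O
]


def build_mapping(pairs):
    """Validate the pairs and return a {char: replacement} dict."""
    mapping = {}
    for src, dst in pairs:
        char, target = chr(src), chr(dst)
        # Reject a character that is mapped to two different targets.
        if char in mapping and mapping[char] != target:
            raise ValueError(f"U+{src:04X} is mapped to two different targets")
        mapping[char] = target
    return mapping


def describe(pairs):
    """Render one readable line per pair, naming both characters."""
    def label(cp):
        ch = chr(cp)
        return f"U+{cp:04X} {ch} ({unicodedata.name(ch, 'UNNAMED')})"

    return "\n".join(f"{label(src)} => {label(dst)}" for src, dst in pairs)
```

Printing `describe(PAIRS)` gives lines that spell out the character names, which is where visually identical characters become distinguishable in review.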

I still think generating four files each time is redundant. Using one file in JSON format would be sufficient. It should include the Unicode code point and the actual character only. The .txt file is kind of excessive; we can have a maintenance script that would generate something like that for those interested, but it doesn't have to be part of the code base.
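One way to picture the single-file proposal is the Python sketch below. The JSON layout and its key names (`codepoint`, `char`, `to_codepoint`, `to_char`) are hypothetical and were not agreed anywhere in this task; the loader only checks that each stored character matches its code point.

```python
import json

# Hypothetical layout: one entry per character, holding the code point and
# the character itself for both the source and its replacement.
EQUIVSET_JSON = """
[
  {"codepoint": "0030", "char": "0", "to_codepoint": "004F", "to_char": "O"},
  {"codepoint": "0430", "char": "\\u0430", "to_codepoint": "0041", "to_char": "A"}
]
"""


def load_equivset(raw: str) -> dict:
    """Parse the JSON and return {char: replacement}, checking each entry."""
    mapping = {}
    for entry in json.loads(raw):
        for cp_key, ch_key in (("codepoint", "char"), ("to_codepoint", "to_char")):
            # The stored character must be the one the code point names.
            if chr(int(entry[cp_key], 16)) != entry[ch_key]:
                raise ValueError(f"{entry[cp_key]} does not match {entry[ch_key]!r}")
        mapping[entry["char"]] = entry["to_char"]
    return mapping
```

A readable listing like the .txt file could then be produced on demand from the same data by a separate script rather than being committed.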

• TBolliger changed the status of subtask T175413: Update AbuseFilter to use new AntiSpoof library from Open to Stalled. · Sep 13 2017, 4:28 PM

This is done and on my computer, just waiting for a new repo to be created so I can push it up. ;)

mediawiki/libs/[something] in Gerrit? Did you file a new repository creation request? We probably want to name the library something other than "AntiSpoof" though since that name is already taken.

In T174197#3609740, @Legoktm wrote:

mediawiki/libs/[something] in Gerrit? Did you file a new repository creation request? We probably want to name the library something other than "AntiSpoof" though since that name is already taken.

I just asked on IRC and haven't gotten a response yet.

I was going to put it on GitHub, something like:
https://github.com/wikimedia/equivset

but it can be in Gerrit too, just needs to end up on Packagist with the update hook. :)

For PHP libraries that are used by MediaWiki where we're the author, we highly recommend Gerrit. I can create mediawiki/libs/EquivSet for you? There will also be a GitHub mirror that will be used to trigger the packagist hook.

In T174197#3609752, @Legoktm wrote:

For PHP libraries that are used by MediaWiki where we're the author, we highly recommend Gerrit. I can create mediawiki/libs/EquivSet for you? There will also be a GitHub mirror that will be used to trigger the packagist hook.

Works for me. You can just create a clone of Abuse Filter (which is what I started from) and then I can push. I'm not sure if it ought to be EquivSet or Equivset (I went with the latter, but I can change it).

Also, is the "vendor" supposed to be "Wikimedia" or "MediaWiki"? (I used the former, but again I can change it).

Thanks!

In T174197#3609755, @dbarratt wrote:

In T174197#3609752, @Legoktm wrote:

For PHP libraries that are used by MediaWiki where we're the author, we highly recommend Gerrit. I can create mediawiki/libs/EquivSet for you? There will also be a GitHub mirror that will be used to trigger the packagist hook.

Works for me. you can just create a clone of Abuse Filter

Er, do you mean AntiSpoof?

In T174197#3609755, @dbarratt wrote:

I'm not sure if it ought to be EquivSet or Equivset (I went with the latter, but I can change it).

I think set is supposed to be a separate word, so EquivSet?

Also, is the "vendor" supposed to be "Wikimedia" or "MediaWiki"? (I used the former, but again I can change it).

For libraries we use Wikimedia.

In T174197#3609756, @Legoktm wrote:

Er, do you mean AntiSpoof?

ha! yes.

In T174197#3609757, @Legoktm wrote:

In T174197#3609755, @dbarratt wrote:

I'm not sure if it ought to be EquivSet or Equivset (I went with the latter, but I can change it).

I think set is supposed to be a separate word, so EquivSet?

It looks like it's always Equivset in AntiSpoof, so I think we should keep it that way:
https://phabricator.wikimedia.org/diffusion/EANS/browse/master/maintenance/generateEquivset.php

Also, is the "vendor" supposed to be "Wikimedia" or "MediaWiki"? (I used the former, but again I can change it).

For libraries we use Wikimedia.

Great! Thanks!

Repo created as mediawiki/libs/Equivset. Mirror on Github: https://github.com/wikimedia/mediawiki-libs-Equivset

Change 378206 had a related patch set uploaded (by Dbarratt; owner: Dbarratt):
[mediawiki/libs/Equivset@master] Split off AntiSpoof equivset generation and string normalization into its own library

https://gerrit.wikimedia.org/r/378206

gerritbot added a project: Patch-For-Review. · Sep 15 2017, 7:11 AM

@Legoktm Awesome! There is the change. Would you mind reviewing / merging? Then submit to Packagist and add hook on Github? :)

dbarratt claimed this task. · Sep 15 2017, 7:13 AM

dbarratt edited projects, added Anti-Harassment (AHT Sprint 5); removed Anti-Harassment.

dbarratt moved this task from Ready to Code Review on the Anti-Harassment (AHT Sprint 5) board.

In T174197#3602990, @Huji wrote:

I still think the fact that we generate four files each time is redundant. Using one file in json format would be sufficient. It should include the Unicode codepoint and the actual character only. The .txt file is kind of excessive; we can have a maintenance script that would generate something like that for those interested, but it doesn't have to be part of the code base.

I disagree; the .txt file helps to confirm changes to the sets. Regarding moving things over to a JSON file, json_decode() is slower than unserialize(). It might not be a significant difference with the file size we are dealing with now, but there is that. I do agree that we don't need multiple files (.ser, .php).

Also, as part of this task, it was my understanding that we are still gonna be syncing with https://www.mediawiki.org/wiki/Extension:AntiSpoof/Equivalence_sets. Are we not gonna do that anymore?

@kaldari any thoughts?

Also, as part of this task, it was my understanding that we are still gonna be syncing with https://www.mediawiki.org/wiki/Extension:AntiSpoof/Equivalence_sets. Are we not gonna do that anymore?

Personally, I think syncing with https://www.mediawiki.org/wiki/Extension:AntiSpoof/Equivalence_sets is a bit awkward and unintuitive, especially if we're intending this to be a more general-purpose library that might be useful to people outside of Wikimedia. I would be OK with discontinuing that practice.

Cool. Sounds good to me.

@Legoktm why do we start off from another repo? It clutters the history and makes the files that matter hard to review due to all the noise of deleted files.

In T174197#3616140, @dmaza wrote:

@Legoktm why do we start off from another repo? It clutters the history and makes the files that matter hard to review due to all the noise of deleted files.

We can do a fresh repo if you want, but I was under the impression that @dbarratt wanted to keep the previous history? Typically we do use fresh repositories though.

dbarratt moved this task from Code Review to In progress on the Anti-Harassment (AHT Sprint 5) board. · Sep 25 2017, 2:59 PM

In T174197#3616140, @dmaza wrote:

@Legoktm why do we start off from another repo? It clutters the history and makes the files that matter hard to review due to all the noise of deleted files.

In T174197#3616638, @Legoktm wrote:

In T174197#3616140, @dmaza wrote:

@Legoktm why do we start off from another repo? It clutters the history and makes the files that matter hard to review due to all the noise of deleted files.

We can do a fresh repo if you want, but I was under the impression that @dbarratt wanted to keep the previous history? Typically we do use fresh repositories though.

I started from AntiSpoof's repo since that's where the equivset is located. I'm not a git history/log neat freak and I'd like for the people who worked on this to maintain the credit. :)

However, if you would really like to start fresh, please clear out the repo and replace it with an empty one.

dbarratt moved this task from In progress to Code Review on the Anti-Harassment (AHT Sprint 5) board. · Sep 26 2017, 5:10 AM

@dbarratt I think since this is a new library that potentially could be used outside of MediaWiki, it makes sense to have a clean history. Also, it was my understanding that it will have basic string normalization like we do in AntiSpoof::normalizeString.

In T174197#3635569, @dmaza wrote:

@dbarratt I think since this is a new library that potentially could be used outside of MediaWiki, it makes sense to have a clean history. Also, it was my understanding that it will have basic string normalization like we do in AntiSpoof::normalizeString.

okie dokie. @Legoktm would you mind wiping out this repo and creating a clean one of the same name?

I'll add the normalization method.

dbarratt moved this task from Code Review to In progress on the Anti-Harassment (AHT Sprint 5) board. · Sep 26 2017, 2:38 PM

dbarratt reassigned this task from dbarratt to Legoktm. · Sep 26 2017, 7:46 PM

• TBolliger moved this task from AHT Sprint 5 to AHT Sprint 6 on the Anti-Harassment board. · Sep 27 2017, 6:24 PM

• TBolliger edited projects, added Anti-Harassment (AHT Sprint 6); removed Anti-Harassment (AHT Sprint 5).

dmaza moved this task from Ready to Code Review on the Anti-Harassment (AHT Sprint 6) board. · Sep 27 2017, 6:36 PM

dbarratt moved this task from Code Review to In progress on the Anti-Harassment (AHT Sprint 6) board. · Sep 27 2017, 6:42 PM

dbarratt mentioned this in T174195: More extensive unit testing for AntiSpoof. · Sep 28 2017, 1:48 PM

Huji mentioned this in T177024: Function to replace invisible characters with blank. · Sep 28 2017, 8:58 PM

Change 378206 abandoned by Legoktm:
Split off AntiSpoof equivset generation and string normalization into its own library

Reason:
Needs resubmit against clean version of repo

https://gerrit.wikimedia.org/r/378206

OK, done. There's now a repository with no history and just a .gitreview file.

Change 381792 had a related patch set uploaded (by Dbarratt; owner: Dbarratt):
[mediawiki/libs/Equivset@master] Split off AntiSpoof equivset generation and string normalization into its own library

https://gerrit.wikimedia.org/r/381792

dbarratt moved this task from In progress to Code Review on the Anti-Harassment (AHT Sprint 6) board. · Oct 2 2017, 3:50 PM

dbarratt moved this task from Code Review to Done on the Anti-Harassment (AHT Sprint 6) board. · Oct 2 2017, 6:07 PM

@Legoktm since I don't have access to https://github.com/wikimedia/mediawiki-libs-Equivset could you add it to Packagist with the commit hook?

@dbarratt: I added you and Dayllan as collaborators on the repo.

• SPoore subscribed. · Oct 2 2017, 8:35 PM

• SPoore unsubscribed.

dbarratt claimed this task. · Oct 6 2017, 3:31 PM

dbarratt moved this task from In progress to Code Review on the Anti-Harassment (AHT Sprint 6) board. · Oct 6 2017, 3:54 PM

Change 381792 merged by jenkins-bot:
[mediawiki/libs/Equivset@master] Split off AntiSpoof equivset generation and string normalization into its own library

https://gerrit.wikimedia.org/r/381792

dbarratt created subtask T177667: Get Equivset Test Coverage to 100%. · Oct 6 2017, 9:22 PM

dbarratt moved this task from Code Review to Done on the Anti-Harassment (AHT Sprint 6) board.

https://packagist.org/packages/wikimedia/equivset

Thanks @Legoktm!

@dbarratt, @Legoktm: Since there's nothing MediaWiki-specific in this library and we want to advertise it for general 3rd party use, would anyone object if I renamed the repo from wikimedia/mediawiki-libs-Equivset to wikimedia/Equivset (similar to wikimedia/DeadlinkChecker which is also a general-purpose library)?

That's fine - I think we did that with the other libraries too.

In T174197#3666636, @kaldari wrote:

@dbarratt, @Legoktm: Since there's nothing MediaWiki-specific in this library and we want to advertise it for general 3rd party use, would anyone object if I renamed the repo from wikimedia/mediawiki-libs-Equivset to wikimedia/Equivset (similar to wikimedia/DeadlinkChecker which is also a general-purpose library)?

No objections from me. I originally wanted the shorter name (T174197#3609751). :)

Done: https://github.com/wikimedia/Equivset

dbarratt mentioned this in T177731: Create a new Component Project for Equivset. · Oct 9 2017, 3:27 AM

dbarratt mentioned this in T177983: Update AntiSpoof to use new Equivset library. · Oct 11 2017, 7:02 PM

dbarratt added a subtask: T177983: Update AntiSpoof to use new Equivset library.

• TBolliger changed the status of subtask T175413: Update AbuseFilter to use new AntiSpoof library from Stalled to Open. · Oct 11 2017, 7:05 PM

dbarratt removed a subtask: T177667: Get Equivset Test Coverage to 100%. · Oct 20 2017, 11:36 PM

dbarratt added a parent task: T177667: Get Equivset Test Coverage to 100%.

dbarratt removed a subtask: T175413: Update AbuseFilter to use new AntiSpoof library. · Oct 20 2017, 11:38 PM

dbarratt added a parent task: T175413: Update AbuseFilter to use new AntiSpoof library.

dbarratt removed a subtask: T177983: Update AntiSpoof to use new Equivset library.

dbarratt added a parent task: T177983: Update AntiSpoof to use new Equivset library.