Page MenuHomePhabricator

Split off AntiSpoof equivset generation and string normalization into its own library
Closed, ResolvedPublic5 Estimated Story Points

Description

It will be a good idea to split off normalizeString and equivset generation out of AntiSpoof and into a separate library. This way other components (or anyone) can make use of it without the dependency on the extension and it will leave AntiSpoof with only one responsibility, username spoofing.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@dmaza Why do we generate equiveset rather than just modifying equiveset.php?

I'm thinking we could just move to having a static file (either equiveset.php or we could have a language-agnostic equiveset.json).

Alternatively, we could just move out the generation into the library and update it like we are doing now, but I don't see the point of generating the file (unless I'm missing something here).

@dbarratt Based on the current implementation, we generate that file because we keep it in sync with https://www.mediawiki.org/wiki/Extension:AntiSpoof/Equivalence_sets
We also generate equivset.txt which is a more human readable representation of what is being replaced by what, making it easier to catch errors.
Also, some characters look very similar and it is easier to maintain the mapping using the unicode codepoints (which we check before generating equivset.ser and equivset.php)

I still think the fact that we generate four files each time is redundant. Using one file in json format would be sufficient. It should include the Unicode codepoint and the actual character only. The .txt file is kind of excessive; we can have a maintenance script that would generate something like that for those interested, but it doesn't have to be part of the code base.

This is done and on my computer, just waiting for a new repo to be created so I can push it up. ;)

mediawiki/libs/[something] in Gerrit? Did you file a new repository creation request? We probably want to name the library something other than "AntiSpoof" though since that name is already taken.

mediawiki/libs/[something] in Gerrit? Did you file a new repository creation request? We probably want to name the library something other than "AntiSpoof" though since that name is already taken.

I just asked on IRC and haven't gotten a response yet.

I was going to put it on GitHub, something like:
https://github.com/wikimedia/equivset

but it can be in Gerrit too, just needs to end up on Packagist with the update hook. :)

For PHP libraries that are used by MediaWiki where we're the author, we highly recommend Gerrit. I can create mediawiki/libs/EquivSet for you? There will also be a GitHub mirror that will be used to trigger the packagist hook.

For PHP libraries that are used by MediaWiki where we're the author, we highly recommend Gerrit. I can create mediawiki/libs/EquivSet for you? There will also be a GitHub mirror that will be used to trigger the packagist hook.

Works for me. you can just create a clone of Abuse Filter (which is what I started from) and then I can push. I'm not sure if it ought to be EquivSet or Equivset (I went with the latter, but I can change it).

Also, is the "vendor" supposed to be "Wikimedia" or "MediaWiki"? (I used the former, but again I can change it).

Thanks!

For PHP libraries that are used by MediaWiki where we're the author, we highly recommend Gerrit. I can create mediawiki/libs/EquivSet for you? There will also be a GitHub mirror that will be used to trigger the packagist hook.

Works for me. you can just create a clone of Abuse Filter

Er, do you mean AntiSpoof?

I'm not sure if it ought to be EquivSet or Equivset (I went with the latter, but I can change it).

I think set is supposed to be a separate word, so EquivSet?

Also, is the "vendor" supposed to be "Wikimedia" or "MediaWiki"? (I used the former, but again I can change it).

For libraries we use Wikimedia.

Er, do you mean AntiSpoof?

ha! yes.

I'm not sure if it ought to be EquivSet or Equivset (I went with the latter, but I can change it).

I think set is supposed to be a separate word, so EquivSet?

It looks like it's always Equivset in AntiSpoof, so I think we should keep it that way:
https://phabricator.wikimedia.org/diffusion/EANS/browse/master/maintenance/generateEquivset.php

Also, is the "vendor" supposed to be "Wikimedia" or "MediaWiki"? (I used the former, but again I can change it).

For libraries we use Wikimedia.

Great! Thanks!

Change 378206 had a related patch set uploaded (by Dbarratt; owner: Dbarratt):
[mediawiki/libs/Equivset@master] Split off AntiSpoof equivset generation and string normalization into its own library

https://gerrit.wikimedia.org/r/378206

@Legoktm Awesome! There is the change. Would you mind reviewing / merging? Then submit to Packagist and add hook on Github? :)

I still think the fact that we generate four files each time is redundant. Using one file in json format would be sufficient. It should include the Unicode codepoint and the actual character only. The .txt file is kind of excessive; we can have a maintenance script that would generate something like that for those interested, but it doesn't have to be part of the code base.

I disagree, the .txt file helps to confirm changes on the sets. Regarding moving things over to a json file, json_decode() is slower than unserialize(), might not be a significant difference with the file size we are dealing with now but there is that. I do agree that we don't need multiple files (.ser, .php).

Also, as part of this task, it was my understanding that we are still gonna be syncing with https://www.mediawiki.org/wiki/Extension:AntiSpoof/Equivalence_sets. Are we not gonna do that anymore?

@kaldari any thoughts?

Also, as part of this task, it was my understanding that we are still gonna be syncing with https://www.mediawiki.org/wiki/Extension:AntiSpoof/Equivalence_sets. Are we not gonna do that anymore?

Personally, I think syncing with https://www.mediawiki.org/wiki/Extension:AntiSpoof/Equivalence_sets is a bit awkward and unintuitive, especially if we're intending this to be a more general-purpose library that might be useful to people outside of Wikimedia. I would be OK with discontinuing that practice.

@Legoktm why do we start of from another repo? It clutters the history and makes the files that matter hard to review due to all the noise of deleted files.

@Legoktm why do we start of from another repo? It clutters the history and makes the files that matter hard to review due to all the noise of deleted files.

We can do a fresh repo if you want, but I was under the impression that @dbarratt wanted to keep the previous history? Typically we do use fresh repositories though.

@Legoktm why do we start of from another repo? It clutters the history and makes the files that matter hard to review due to all the noise of deleted files.

@Legoktm why do we start of from another repo? It clutters the history and makes the files that matter hard to review due to all the noise of deleted files.

We can do a fresh repo if you want, but I was under the impression that @dbarratt wanted to keep the previous history? Typically we do use fresh repositories though.

I started from AntiSpoof's repo since that's where the equivset is located. I'm not a git history/log neat freak and I'd like for the people who worked on this to maintain the credit. :)

However, if you would really like to start fresh, please clear out the repo and replace it with any empty one.

@dbarratt I think since this is a new library that potentially could be used outside of mediawiki it makes sense to have a clean history. Also, it was my understanding that it will have a basic string normalization like we do on (AntiSpoof::normalizeString)

@dbarratt I think since this is a new library that potentially could be used outside of mediawiki it makes sense to have a clean history. Also, it was my understanding that it will have a basic string normalization like we do on (AntiSpoof::normalizeString)

okie dokie. @Legoktm would you mind wiping out this repo and creating a clean one of the same name?

I'll add the normalization method.

Change 378206 abandoned by Legoktm:
Split off AntiSpoof equivset generation and string normalization into its own library

Reason:
Needs resubmit against clean version of repo

https://gerrit.wikimedia.org/r/378206

OK, done. There's now a repository with no history and just a .gitreview file.

Change 381792 had a related patch set uploaded (by Dbarratt; owner: Dbarratt):
[mediawiki/libs/Equivset@master] Split off AntiSpoof equivset generation and string normalization into its own library

https://gerrit.wikimedia.org/r/381792

dbarratt moved this task from Done to In progress on the Anti-Harassment (AHT Sprint 6) board.

@Legoktm since I don't have access to https://github.com/wikimedia/mediawiki-libs-Equivset could you add it to Packagist with the commit hook?

@dbarratt: I added you and Dayllan as collaborators on the repo.

Change 381792 merged by jenkins-bot:
[mediawiki/libs/Equivset@master] Split off AntiSpoof equivset generation and string normalization into its own library

https://gerrit.wikimedia.org/r/381792

@dbarratt, @Legoktm: Since there's nothing MediaWiki-specific in this library and we want to advertise it for general 3rd party use, would anyone object if I renamed the repo from wikimedia/mediawiki-libs-Equivset to wikimedia/Equivset (similar to wikimedia/DeadlinkChecker which is also a general-purpose library)?

That's fine - I think we did that with the other libraries too.

@dbarratt, @Legoktm: Since there's nothing MediaWiki-specific in this library and we want to advertise it for general 3rd party use, would anyone object if I renamed the repo from wikimedia/mediawiki-libs-Equivset to wikimedia/Equivset (similar to wikimedia/DeadlinkChecker which is also a general-purpose library)?

No objections from me. I originally wanted the shorted name T174197#3609751. :)