Create a shell script that identifies filters on WMF wikis matching a regular expression pattern
Closed, ResolvedPublic

Description

Based on the thoughtful recommendation by @Bawolff in T191978#5704225, we want a shell script that can be executed by anyone with shell access to WMF production servers. The script should accept a single regular expression pattern as its input, and return the dbname and filter ID of each filter that matches that pattern.

This script will be used when Wikimedia-abusefilter-global-maintainers need to identify a list of filters that need to be worked on, in preparation for or in response to a change in AbuseFilter code that would impact those filters.

Event Timeline

@Bawolff do we have any similar maintenance scripts that we could model this after?

While this is doable, note that filters are not stored in ExternalStorage, so a simple query like:

SELECT af_id FROM abuse_filter WHERE af_pattern REGEXP 'my_regex'

would suffice.

@Daimona correct. But that query needs to be run against all WMF wiki databases, so having a CLI script that just loops through the databases, runs the query, cleans up the output, and returns it in a nice format (e.g. CSV) would be desirable.

An alternative approach is to use MediaWiki's API to fetch all filters for all projects, and search within them on the client side. @Nullzero alluded to it in T262047#6436843 and has done it in the past. The only user groups that can do it are abusefilter helpers and abusefilter maintainers (Nullzero is in the former group, I am in the latter).

This method is less efficient (it requires more computation than a series of direct DB queries, and it will certainly take longer because it runs over the web). However, its advantage is that abusefilter maintainers (whose purpose is to find and update filters that would be impacted by a recent/upcoming change in code) can do their job independently, without waiting for someone with shell access to run a script. From a data privacy perspective, it does not pose any risk; while it is true that a user will end up downloading the details of every filter on every project, the only users who can do that are the ones who are already trusted with this access.
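For illustration, a rough sketch of that client-side approach against a single project, using curl and jq (the project URL and the example pattern rmspecials are placeholders; an anonymous request like this only sees public filters, so private ones would need an authenticated session with the right to view them):

# Sketch: fetch one project's filters via the API and grep their patterns
# locally. The pattern field is absent for filters we cannot see, hence
# the null guard before test().
curl -s 'https://en.wikipedia.org/w/api.php?action=query&list=abusefilters&abfprop=id%7Cpattern&abflimit=max&format=json' \
  | jq -r '.query.abusefilters[] | select((.pattern // "") | test("rmspecials")) | .id'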

With those pros and cons in mind, does anyone have a problem if I write a JS gadget that would do this? And if there are no objections, do you have any recommendations for an easy and reliable way to get a list of all wikis' URLs (en.wikipedia.org, en.wiktionary.org, fr.wikipedia.org, etc.) so that it is not hard-coded in the gadget? Finally, where should we put the gadget code? I was thinking on Meta, though it might be a good idea to keep it somewhere non-public, just to make it that much less easy for an abuser to stumble on it and try to hack my account :)

I already have a script fragment like that, somewhere. I think I hacked the globaluserinfo api for the wiki urls, since I should have merged accounts on (almost) all wikis:

  1. https://meta.wikimedia.org/w/api.php?action=query&meta=globaluserinfo&guiuser=DannyS712&guiprop=merged
  2. query.globaluserinfo.merged
  3. map each site in the array to just its url

though there is probably a better way
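For example, something along these lines (a sketch with curl and jq; any account that is merged on every wiki would do):

# Sketch: list the URL of every wiki where the given global account has a
# merged local account.
curl -s 'https://meta.wikimedia.org/w/api.php?action=query&meta=globaluserinfo&guiuser=DannyS712&guiprop=merged&format=json' \
  | jq -r '.query.globaluserinfo.merged[].url'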

My immediate reaction was: "we should use the Abuse Filter account instead of Danny's account" because, after all, it would be merged on all wikis with AbuseFilter (and non-existent on all wikis without AbuseFilter, if we had any). But then I realized the account is not named "Abuse Filter" on every wiki; its name is localized.

The next one that comes to mind is "MediaWiki message delivery". This user is on all wikis, I think. See https://meta.wikimedia.org/w/api.php?action=query&meta=globaluserinfo&guiuser=MediaWiki%20message%20delivery&guiprop=merged

That would probably be a better idea. Or maybe I'll create a fake special page in JavaScript with an interface to search filter contents, and use the API to get the contents. Using it most effectively would require access to private filters globally (so I won't be able to fully test the script, since I don't have that access).

I used to use a shell for loop with all.dblist to do these sorts of things, very roughly along the lines of:

for i in $(cat all.dblist); do sql "$i" -e 'SELECT DATABASE(), foo...'; done > output.tsv

There are arguments to the mysql client that make it output a TSV without column headers, which makes this work better, but I don't remember what they are.
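For the record, they are most likely --batch (a.k.a. -B, which prints tab-separated rows instead of the ASCII table) and --skip-column-names (a.k.a. -N, which drops the header row):

# --batch prints tab-separated rows; --skip-column-names drops the header
mysql --batch --skip-column-names -e 'SELECT DATABASE(), foo...'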

@Bawolff you are referencing this all.dblist file, right? I'm not familiar with the file structure of prod servers. Where is the all.dblist file typically located? In the root of the web directory or somewhere else? Asking because your approach above means we need to run the shell script from the same directory.

I tended to do this more from Toolforge, so I would download the dblist file first from noc. On prod servers I think it's usually located in /srv/mediawiki/dblists, but I'm not 100% sure.

How is this as a starting point?

searchaf.sh
#!/bin/bash
#####################################
# Global abuse filter search script #
# Usage: ./searchaf.sh PATTERN      #
# Author: Huji                      #
# Released under MIT License        #
# See Wikimedia Phabricator T262052 #
#####################################

# sanitize user input to prevent accidental SQL injection
pattern=$(echo "$1" | sed "s/'/''/g")

# loop through DBs and run the query
# convert the result into comma-separated format
# drop the column headers
for i in $(cat all.dblist)
do
	sql "$i" -e "SELECT DATABASE(), af_id FROM abuse_filter WHERE af_pattern REGEXP '$pattern'" | tr '\t' ',' | tail -n +2
done
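Invocation would then look something like this (hypothetical; rmspecials is just an example pattern), producing one dbname,af_id pair per line:

./searchaf.sh 'rmspecials' > matches.csv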

@Huji If this is going to be run at Wikimedia servers, it would need to be a PHP script (or puppetized bash script, maybe, which would require coordinating with SREs).

There is indeed a better way to get the list of all wikis than the globaluserinfo hack: simply call https://meta.wikimedia.org/w/api.php?action=sitematrix&format=json&smtype=language :-).
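A sketch of flattening that response into a plain list of wiki URLs (the numeric keys of the response are the language entries; the count key is metadata and gets filtered out):

# Sketch: extract every wiki URL from the sitematrix response.
curl -s 'https://meta.wikimedia.org/w/api.php?action=sitematrix&format=json&smtype=language' \
  | jq -r '.sitematrix | to_entries[] | select(.key | test("^[0-9]+$")) | .value.site[].url'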

@Urbanecm ok. Then should it be part of the AbuseFilter code?

Yes, inside the maintenance directory.

Change 626022 had a related patch set uploaded (by Huji; owner: Huji):
[mediawiki/extensions/AbuseFilter@master] Introduce searchFilters.php

https://gerrit.wikimedia.org/r/626022

@Urbanecm how is this for a start? You can call it like this: php searchFilters.php --wiki=enwiki --pattern=rmspecials

It is not super sophisticated (haha!); it doesn't even check whether the provided pattern is a valid regexp (you will get a DB error if it isn't). Even though we expect those with shell access to be smart and capable people, it does use addQuotes() to avoid the possibility of accidental SQL injection.

This script assumes that the wiki farm uses $wgConf->wikis as an array of DB names, like WMF does.

Thanks for your offline help on this.

It's cool! I spent some time testing it with multiple cases, and it seems to work well! I've approved the patch, so we can test it on beta and, once the train deploys it, in production. Thanks @Huji!

Huji claimed this task.

Thanks @Urbanecm !

I am closing this task because now we have fulfilled its goal (i.e. to create a shell script). Creating a JS tool via APIs still sounds appealing, but since an in-browser gadget is the ideal incarnation of it and that won't be possible until CORS is enabled on APIs, I think we can table that for now.

Change 626022 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Introduce searchFilters.php

https://gerrit.wikimedia.org/r/626022