Determine the usage of rmspecials on WMF wikis
Closed, Resolved · Public · Sep 24 2020

Description

We would like to know how many filters on each wiki use the rmspecials() function, so that we can decide how to modify it to stop treating whitespace as a special character.

This would require running queries on the production servers, hence the Wikimedia-Site-requests tag.

Details

Due Date
Sep 24 2020, 10:00 PM

Event Timeline

Huji updated the task description.

@Daimona did we ever document the process for how to engage DBAs to run queries for Wikimedia-abusefilter-global-maintainers tasks? Is "site-requests" the right tag? Or is it DBA?

No, T191978 is still open. Unsure about what tag should be used.

Based on information I scraped last year, rmspecials is used 137 times across 28 wikis. I can rerun the query today if you want. Note that I don't have access to production servers, just regular API queries with the abusefilter helper right.

Right, right. T191978#5704225 is what I vaguely remembered. I'll make a task for writing that shell script.

Do you have access to private filters too?

Yes. Private filters are included.

Would you mind running an API-based query again? It'll take us a while to address T262052 and there is no reason to wait for it.

Additionally, would you mind sharing the program you use to run these queries? Perhaps we can create a nimble JS gadget instead of, or in addition to, the shell script described in T262052. That way, abusefilter helpers and abusefilter maintainers can run queries too, not just those with shell access.

I can run a query tonight (PT). What information do you need precisely? A list of sites and filter ids?

Sorry, the code is written in Racket. I don't think it's gonna be helpful to you. It also has a bunch of other stuff as a part of my research project that I want to keep private for now.

@Nullzero the reason I asked is that, to the best of my knowledge, AbuseFilter does not expose an API endpoint for its filter search functionality. So I am curious how you are searching the filters. If you provide some guidance, I should be able to write the JS gadget myself.

Oh, I simply downloaded everything. Here's an excerpt of a request I sent:

(define resp (query `([action . "query"]
                      [list . "abusefilters"]
                      [abfstartid . ,start-id]
                      [abflimit . "500"]
                      [abfshow . "!deleted|enabled"]
                      [abfprop . "id|description|pattern|private"])))

Then, you can either grep or parse+traverse the AST for a more advanced search.
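For readers who do not use Racket, the download-then-grep approach above might look roughly like this in Python. This is only a sketch: the helper names `build_list_params` and `filters_calling` are hypothetical, `format=json` is added on the assumption that JSON output is wanted, and each downloaded filter is assumed to be a dict carrying the `id` and `pattern` fields requested via `abfprop`.

```python
import re


def build_list_params(start_id):
    """Query parameters mirroring the Racket request above (helper name is hypothetical)."""
    return {
        "action": "query",
        "format": "json",  # assumption: JSON output is wanted
        "list": "abusefilters",
        "abfstartid": str(start_id),
        "abflimit": "500",
        "abfshow": "!deleted|enabled",
        "abfprop": "id|description|pattern|private",
    }


def filters_calling(filters, function_name):
    """Return the ids of filters whose pattern text appears to call function_name(...).

    This is the crude "grep" variant: it also matches calls inside string
    literals or comments, which a real parse of the filter would not.
    """
    call = re.compile(r"\b" + re.escape(function_name) + r"\s*\(")
    return [f["id"] for f in filters if call.search(f.get("pattern") or "")]


# Made-up filters, for illustration only.
sample_filters = [
    {"id": 1, "pattern": "rmspecials(added_lines) rlike 'foo'"},
    {"id": 2, "pattern": "rmwhitespace(added_lines) rlike 'foo'"},
    {"id": 3, "pattern": 'lcase( rmspecials (summary)) contains "x"'},
]
matching_ids = filters_calling(sample_filters, "rmspecials")
```

Matching on the function name followed by an opening parenthesis keeps a search for `rmspecials` from also reporting `rmwhitespace`, but it remains a text match, not a traversal of the AST.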

It would make sense to have a tool on Toolforge that runs this request from cron regularly (every day?) for a more responsive search. Note, however, that it needs some authentication, since private filter details should not be publicly viewable.

Got it. That should be doable. All I need to figure out is the best way to get a current list of all wikis' base URLs. The rest is a bunch of API calls.

That said, I'm not going to start working on it right away. I have a few reservations, which I will describe on T262052.

What about https://meta.wikimedia.org/w/api.php?action=sitematrix&format=json&smtype=language, @Huji?
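As a rough illustration of turning a sitematrix reply into a list of base URLs, here is a minimal Python sketch. The response shape is an assumption, not something shown in this thread: a top-level `sitematrix` object with a `count` entry, numbered language groups that each hold a `site` list, an optional `specials` list, and a `closed` key on closed wikis. The function name `wiki_base_urls` and the sample data are hypothetical.

```python
def wiki_base_urls(sitematrix_response):
    """Collect the base URL of every open wiki in an action=sitematrix reply.

    Assumes the response shape described above; adjust if the real reply differs.
    """
    matrix = sitematrix_response.get("sitematrix", {})
    urls = []
    for key, group in matrix.items():
        if key == "count":
            continue
        # Language groups are dicts with a "site" list; "specials" is a plain list.
        sites = group.get("site", []) if isinstance(group, dict) else group
        for site in sites:
            if "closed" not in site:
                urls.append(site["url"])
    return urls


# Made-up reply, for illustration only.
sample_reply = {
    "sitematrix": {
        "count": 3,
        "0": {
            "code": "xx",
            "site": [
                {"url": "https://xx.example.org", "dbname": "xxwiki"},
                {"url": "https://xx.example.net", "dbname": "xxwikibooks", "closed": ""},
            ],
        },
        "specials": [{"url": "https://special.example.org", "dbname": "specialwiki"}],
    }
}
open_wiki_urls = wiki_base_urls(sample_reply)
```

Each URL would then need the API path appended (the Meta URL above uses `/w/api.php`) before the per-wiki filter queries can be sent.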

That is great!

I started working on the script, and got the first part down: fetch sitematrix and loop through the wiki URLs. But querying the APIs of each wiki failed, because of CORS restrictions. @Nullzero how did you get around CORS?

@Huji Could you please share an example of your code (minimum failing example please) and the error message? I'll be happy to help!

The CORS problem, as I understand it, occurs because you are running the script from a site. I scraped the information with a script running directly on my computer, so there was no problem.

As stated above, I was trying to make it a gadget. So yes, I was running it from a site. First I tried to run it via Meta; that way the action=sitematrix part worked, but the calls to the individual wikis failed. Then I tried to run it from my own computer (a file on my drive, opened in the browser); this time, even the call to Meta failed. I guess I can run it from the command line with node or something, but I'm trying to make a tool that is easy for people without shell access to use, and most people know how to run a script from the browser, but not necessarily from the command line.

See P12512. For now, I'd rather keep it private.

That would be caused by T210790 (Allow cross-origin requests by default in the Action API). AFAICS, it is currently not possible to make a cross-origin request to the Action API. That's going to change, however; see the linked task :-). It should already be possible to make cross-origin requests to the REST API (T262425). It seems this can't be a gadget right now :/.

Now that T262052 has been fixed, I am going to assign this to Urbanecm who I think will be able to fulfill this task in ~2 weeks, when the maintenance script reaches WMF production servers.

Urbanecm changed the subtype of this task from "Task" to "Deadline". Sep 19 2020, 7:29 PM
Urbanecm set Due Date to Sep 24 2020, 10:00 PM.

@Huji Done, but I had to change the script locally to ignore labtestwiki, because the wikiadmin user didn't have rights on that database. That wiki is for tests only, so we can ignore it for now.

Results:

Script started on Wed 30 Sep 2020 12:01:35 PM UTC
urbanecm@mwmaint2001:~$ mwscript extensions/AbuseFilter/maintenance/searchFilters.php --wiki=enwiki --pattern 'rmspecial'
wiki filter
acewiki 6
azwiki 25
bnwiki 5
ckbwiki 19
ckbwiki 24
ckbwiki 25
commonswiki 10
commonswiki 21
commonswiki 37
commonswiki 44
cswiki 48
cswiki 77
cswiki 90
cswiki 108
dawiki 14
dewiki 66
dewiki 102
dewiki 144
dewiki 209
dewiki 213
dewiki 226
enwiki 4
enwiki 7
enwiki 17
enwiki 20
enwiki 21
enwiki 22
enwiki 23
enwiki 96
enwiki 119
enwiki 137
enwiki 148
enwiki 149
enwiki 154
enwiki 166
enwiki 170
enwiki 188
enwiki 294
enwiki 355
enwiki 414
enwiki 595
enwiki 676
enwiki 764
enwiki 793
enwiki 923
enwiki 941
enwiki 996
enwiki 1019
enwiki 1071
enwikibooks 31
eswiki 36
eswiki 47
eswiki 55
eswiktionary 8
fawiki 202
fiwiki 82
frwiki 122
frwiki 125
glwiki 11
hiwiki 82
huwiki 19
idwiki 15
itwiki 43
itwiki 556
jawiki 36
kkwiki 23
kkwiki 58
kowiki 4
kowiki 29
kowiki 90
kowiki 101
kowiki 106
kowiki 108
kowiki 123
kowikisource 6
metawiki 174
nlwiki 32
plwiktionary 2
ptwiktionary 2
ptwiktionary 6
rowiki 55
rowiki 65
ruwiki 62
ruwiki 96
ruwiki 104
ruwiki 118
scowiki 7
simplewiki 23
simplewiki 76
skwiki 29
svwiki 10
svwiki 28
tawiki 9
tawiki 10
testwiki 1
testwiki 6
testwiki 18
testwiki 19
testwiki 37
testwiki 75
testwiki 151
testwiki 153
testwiki 177
testwiki 200
testwiki 203
ukwiki 48
usabilitywiki 3
zhwiki 108
zhwiki 123
zhwiki 140
zhwiki 166
urbanecm@mwmaint2001:~$ exit

Script done on Wed 30 Sep 2020 12:01:47 PM UTC

Huji moved this task from Later to Backlog on the User-Urbanecm board.
Huji added a subscriber: Billinghurst.

@Urbanecm because it took me so long to review these, @Billinghurst has asked that we rerun the query above.

If it is possible to search by "contains rmspecials AND does not contain rmwhitespace", that would be ideal. Otherwise, you can rerun the exact search above, and I can review the results by hand. I have a few days of increased availability this week and next, and can do it quickly this time.

@Huji I've repeated the search above:

daimona@mwmaint1002:~$ mwscript extensions/AbuseFilter/maintenance/SearchFilters.php --wiki=enwiki --pattern 'rmspecial'
wiki filter
acewiki 6
azwiki 25
bnwiki 5
bnwikisource 4
ckbwiki 19
ckbwiki 24
ckbwiki 25
commonswiki 10
commonswiki 21
commonswiki 37
commonswiki 44
cswiki 48
cswiki 77
cswiki 90
cswiki 108
dawiki 14
dewiki 66
dewiki 102
dewiki 144
dewiki 209
dewiki 213
dewiki 226
enwiki 4
enwiki 7
enwiki 17
enwiki 20
enwiki 21
enwiki 22
enwiki 23
enwiki 96
enwiki 119
enwiki 137
enwiki 148
enwiki 149
enwiki 154
enwiki 166
enwiki 170
enwiki 188
enwiki 294
enwiki 355
enwiki 414
enwiki 595
enwiki 676
enwiki 764
enwiki 793
enwiki 923
enwiki 941
enwiki 996
enwiki 1019
enwiki 1071
enwiki 1112
enwiki 1113
enwiki 1114
enwiki 1115
enwiki 1165
enwikibooks 31
enwikinews 38
enwikiquote 32
enwikisource 43
enwiktionary 31
enwiktionary 48
eswiki 36
eswiki 47
eswiki 55
eswiki 114
eswiki 123
eswiki 125
eswiktionary 8
fawiki 202
fiwiki 82
frwiki 122
frwiki 125
frwiki 367
glwiki 11
hiwiki 82
hrwiki 20
huwiki 19
idwiki 12
idwiki 15
idwiki 77
incubatorwiki 29
itwiki 43
itwiki 556
itwiki 602
jawiki 36
jawiki 105
jawiki 151
jawiki 160
jawiki 173
kkwiki 23
kkwiki 58
kowiki 4
kowiki 29
kowiki 90
kowiki 101
kowiki 106
kowiki 108
kowiki 123
kowikisource 6
metawiki 174
metawiki 285
metawiki 288
nlwiki 32
plwiktionary 2
ptwiktionary 2
ptwiktionary 6
rowiki 55
rowiki 65
ruwiki 62
ruwiki 96
ruwiki 104
ruwiki 118
scowiki 7
simplewiki 23
simplewiktionary 19
skwiki 29
svwiki 10
svwiki 28
tawiki 9
tawiki 10
testwiki 1
testwiki 6
testwiki 18
testwiki 19
testwiki 37
testwiki 75
testwiki 151
testwiki 153
testwiki 177
testwiki 200
testwiki 203
trwiki 96
ukwiki 48
usabilitywiki 3
viwiki 45
viwiki 46
wikimaniawiki 6
zhwiki 108
zhwiki 123
zhwiki 140
zhwiki 166

Unfortunately, there isn't an option to exclude filters matching a given pattern. It should be possible to do a set difference with the results above, or run the query manually, but I guess it could have false positives due to filters that use rmwhitespace but not together with rmspecials.
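The set-difference idea could be sketched in Python along these lines, assuming a second search for rmwhitespace has been run and both outputs are available as text in the same "wiki filter" format as above. The function name `parse_hits` is hypothetical and the sample outputs are made up for illustration; they are not real search results.

```python
def parse_hits(output):
    """Turn 'wiki filter' lines into a set of (wiki, filter_id) pairs."""
    hits = set()
    for line in output.splitlines():
        parts = line.split()
        # Skip the header, shell prompts and anything that is not "<wiki> <id>".
        if len(parts) == 2 and parts[1].isdigit():
            hits.add((parts[0], int(parts[1])))
    return hits


# Made-up outputs, for illustration only.
rmspecials_output = "wiki filter\nexamplewiki 3\nexamplewiki 7\notherwiki 12\n"
rmwhitespace_output = "wiki filter\nexamplewiki 7\notherwiki 40\n"

rmspecials_hits = parse_hits(rmspecials_output)
rmwhitespace_hits = parse_hits(rmwhitespace_output)

# Filters that matched rmspecials but did not match rmwhitespace.
only_rmspecials = sorted(rmspecials_hits - rmwhitespace_hits)
```

As noted above, this works at the level of whole filters, so a filter that uses rmwhitespace somewhere unrelated to its rmspecials call would still be misclassified and may need a manual check.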

Daimona beat me to it; I was just running it :). You can also use https://search-filters.toolforge.org/ to self-serve those searches (and if you have spare time, improvements to make that tool more flexible with search queries would be appreciated).

That is a cool tool, @Daimona. I have added a review of it to my TODO list. For one thing, I already found a typo worth fixing.

Closing this task as complete again.