Page MenuHomePhabricator

Allow filtering based on tag on Special:NewFiles
Open, LowPublic

Description

As a user trying to find copyvios, the ability to filer images on https://commons.wikimedia.org/wiki/Special:NewFiles?hidepatrolled=1&limit=50&offset= to only display images with e.g. the tag "crosswiki-upload".

Related Objects

Event Timeline

Josve05a raised the priority of this task from to Needs Triage.
Josve05a updated the task description. (Show Details)
Josve05a moved this task to Tasks to follow on the User-Josve05a board.
Josve05a added subscribers: Josve05a, matmarex.
Josve05a set Security to None.

I'm going to work on this, I was first hoping to get some more tools to tag their uploads (T121864). Right now cross-wiki upload is pretty much the only one that does it, and OAuth-based tools since last week when we fixed T121875.

matmarex triaged this task as Medium priority.Jan 21 2016, 9:10 AM
Aklapper renamed this task from Allow filetering based on tag on Special:NewFIles to Allow filtering based on tag on Special:NewFIles.Jan 21 2016, 12:26 PM

This turned out to be a lot nastier than I expected because MySQL is astonishingly dumb, because our data is agonizingly inconsistent between wikis, and the fact that our schema is pretty dumb too definitely doesn't help. This is going to need somebody with more experience than me (tagging with DBA) and I'm not sure if it's really possible to enable on big wikis like Commons for performance reasons without putting unreasonable effort into it. And Special:RecentChanges gets you most of this functionality anyway…

The queries are running into the same issue as T124205 where the optimizer is choosing incredibly bad query plans unless forced otherwise. But this means that we lose the optimizer's flexibility to query differently depending on the data. In this case, the optimal query plan differs for tags with few occurrences (it's fastest to get all log events with the given tag, then filter them to the most recent ones and sort) and tags with many occurrences (it's fastest to get recent log events in the correct order, then filter them to ones with the given tag). So we have to implement that ourselves.

That makes it okay for most queries, but even then there remain cases where the query is just too slow, for tags that are in the middle and don't match either bucket well. I tried the queries this generates and, for example, getting the latest 50 uploads with the 'OAuth CID: 27' tag on Commons took around a minute. I'm not sure if a non-broken optimizer or clever indices could make them any faster.

Since the 'image' table has nothing that can be used to join with 'change_tag', I tried to join via 'logging' first. I got the results above and tried to join via 'recentchanges' instead, which did not help much. I guess both of the tables just have tons of rows which won't match and have to be skipped, and the complicated join makes it worse.

So… I'm going to push what I wrote for review (I spent too much time on it to just delete it :P), but don't hold your breath waiting for it to be merged and deployed. :/

Change 269583 had a related patch set uploaded (by Bartosz Dziewoński):
Special:NewFiles: Allow filtering by change tags and display them

https://gerrit.wikimedia.org/r/269583

Here are some queries this code generates:

Optimized for tags with few occurrences (try this query on enwiki), first tagfilter only, then tagfilter+hidepatrolled:

SELECT /*! STRAIGHT_JOIN */ img_name,img_user,img_timestamp,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`   WHERE ct_log_id=log_id  ) AS ts_tags  FROM `change_tag` INNER JOIN `logging` ON ((ct_log_id=log_id)) INNER JOIN `image` ON ((log_title = img_name) AND (log_user = img_user) AND (log_timestamp = img_timestamp)) LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user))  WHERE (ug_group IS NULL) AND log_namespace = '6' AND log_type = 'upload' AND ct_tag = 'cross-wiki-upload'  ORDER BY img_timestamp DESC LIMIT 51
SELECT /*! STRAIGHT_JOIN */ img_name,img_user,img_timestamp,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`   WHERE ct_log_id=log_id  ) AS ts_tags  FROM `change_tag` INNER JOIN `logging` ON ((ct_log_id=log_id)) INNER JOIN `image` ON ((log_title = img_name) AND (log_user = img_user) AND (log_timestamp = img_timestamp)) LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user)) INNER JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_user = img_user) AND (rc_timestamp = img_timestamp))  WHERE (ug_group IS NULL) AND log_namespace = '6' AND log_type = 'upload' AND ct_tag = 'cross-wiki-upload' AND rc_type = '3' AND rc_log_type = 'upload' AND rc_patrolled = '0' AND rc_namespace = '6'  ORDER BY img_timestamp DESC LIMIT 51

Optimized for tags with many occurrences (try this query on commonswiki), first tagfilter only, then tagfilter+hidepatrolled:

SELECT /*! STRAIGHT_JOIN */ img_name,img_user,img_timestamp,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`   WHERE ct_log_id=log_id  ) AS ts_tags  FROM `image` LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user)) INNER JOIN `logging` ON ((log_title = img_name) AND (log_user = img_user) AND (log_timestamp = img_timestamp)) INNER JOIN `change_tag` ON ((ct_log_id=log_id))  WHERE (ug_group IS NULL) AND log_namespace = '6' AND log_type = 'upload' AND ct_tag = 'cross-wiki-upload'  ORDER BY img_timestamp DESC LIMIT 51
SELECT /*! STRAIGHT_JOIN */ img_name,img_user,img_timestamp,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`   WHERE ct_log_id=log_id  ) AS ts_tags  FROM `image` LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user)) INNER JOIN `logging` ON ((log_title = img_name) AND (log_user = img_user) AND (log_timestamp = img_timestamp)) INNER JOIN `change_tag` ON ((ct_log_id=log_id)) INNER JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_user = img_user) AND (rc_timestamp = img_timestamp))  WHERE (ug_group IS NULL) AND log_namespace = '6' AND log_type = 'upload' AND ct_tag = 'cross-wiki-upload' AND rc_type = '3' AND rc_log_type = 'upload' AND rc_patrolled = '0' AND rc_namespace = '6'  ORDER BY img_timestamp DESC LIMIT 51

matmarex: Welcome to the club: T118186. As a general comment, I think we fall into the trap of giving too many options to the users, and that came back to bite us.

Change 269583 abandoned by Bartosz Dziewoński:
Special:NewFiles: Allow filtering by change tags and display them

Reason:
These queries are probably not going to be performant enough on our big wikis. I'm not working on this at the moment.

https://gerrit.wikimedia.org/r/269583

zhuyifei1999 renamed this task from Allow filtering based on tag on Special:NewFIles to Allow filtering based on tag on Special:NewFiles.May 13 2017, 5:55 PM
zhuyifei1999 subscribed.
MarkTraceur lowered the priority of this task from Medium to Low.Jun 5 2017, 3:11 PM

I wonder if this is still relevant after all the work that has been done with change_tag T185355: Normalize change tag schema

Yeah, it might be possible. It needs some updates to reflect the normalization and updates that happened to the table and also, join decomposition might help a lot here but it should be doable. Anyone to take stab at it?

Change 882158 had a related patch set uploaded (by Matěj Suchánek; author: Matěj Suchánek):

[mediawiki/core@master] Allow filtering by tag on Special:NewFiles

https://gerrit.wikimedia.org/r/882158

The above patch adds an optional single-tag filter (incl. inversion). It uses the recentchanges table (already joined when "Hide patrolled uploads" was selected), which limits the results to the last 30 days.
The following queries (filter for/out uploads with the "cross-wiki-upload" tag, all or unpatrolled only) can be tested for performance:

SELECT /*! STRAIGHT_JOIN */ img_name,img_timestamp,actor_user,actor_name FROM `image` JOIN `actor` ON ((actor_id=img_actor)) LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = actor_user) AND (ug_expiry IS NULL OR ug_expiry >= '20230124120909')) JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_actor = img_actor) AND (rc_timestamp = img_timestamp)) JOIN `change_tag` `changetagdisplay` ON ((changetagdisplay.ct_rc_id=rc_id)) WHERE (ug_group IS NULL) AND rc_type = 3 AND rc_log_type = 'upload' AND rc_patrolled = 0 AND rc_namespace = 6 AND changetagdisplay.ct_tag_id = 17 ORDER BY img_timestamp DESC,img_name DESC LIMIT 51
SELECT /*! STRAIGHT_JOIN */ img_name,img_timestamp,actor_user,actor_name FROM `image` JOIN `actor` ON ((actor_id=img_actor)) LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = actor_user) AND (ug_expiry IS NULL OR ug_expiry >= '20230124120909')) JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_actor = img_actor) AND (rc_timestamp = img_timestamp)) JOIN `change_tag` `changetagdisplay` ON ((changetagdisplay.ct_rc_id=rc_id)) WHERE (ug_group IS NULL) AND rc_type = 3 AND rc_log_type = 'upload' AND rc_namespace = 6 AND changetagdisplay.ct_tag_id = 17 ORDER BY img_timestamp DESC,img_name DESC LIMIT 51
SELECT /*! STRAIGHT_JOIN */ img_name,img_timestamp,actor_user,actor_name FROM `image` JOIN `actor` ON ((actor_id=img_actor)) LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = actor_user) AND (ug_expiry IS NULL OR ug_expiry >= '20230124120909')) JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_actor = img_actor) AND (rc_timestamp = img_timestamp)) LEFT JOIN `change_tag` `changetagdisplay` ON ((changetagdisplay.ct_rc_id=rc_id) AND changetagdisplay.ct_tag_id = 17) WHERE (ug_group IS NULL) AND rc_type = 3 AND rc_log_type = 'upload' AND rc_patrolled = 0 AND rc_namespace = 6 AND (changetagdisplay.ct_tag_id IS NULL) ORDER BY img_timestamp DESC,img_name DESC LIMIT 51
SELECT /*! STRAIGHT_JOIN */ img_name,img_timestamp,actor_user,actor_name FROM `image` JOIN `actor` ON ((actor_id=img_actor)) LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = actor_user) AND (ug_expiry IS NULL OR ug_expiry >= '20230124120909')) JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_actor = img_actor) AND (rc_timestamp = img_timestamp)) LEFT JOIN `change_tag` `changetagdisplay` ON ((changetagdisplay.ct_rc_id=rc_id) AND changetagdisplay.ct_tag_id = 17) WHERE (ug_group IS NULL) AND rc_type = 3 AND rc_log_type = 'upload' AND rc_namespace = 6 AND (changetagdisplay.ct_tag_id IS NULL) ORDER BY img_timestamp DESC,img_name DESC LIMIT 51

These queries seem to have the same problem as my attempt from 2016 had: if you query for a tag with very few occurrences, it's going to scan the whole image table trying to find files tagged with it. The example queries run very fast, but if you change ct_tag_id = 17 to ct_tag_id = 99999, they take forever.

Does STRAIGHT_JOIN have anything to do with that? Would something like the fix for T298225 (SpecialRecentChanges::isDenseTagFilter) help?

As a first step, since you're already limiting it to $wgRCMaxAge and sorting by timestamp, having a condition like img_timestamp > $cutOffDateForRecentchanges would probably be an improvement. [Whether or not it is enough of an improvement, I don't know. I also have not tested anything here].

Not having the straight join could mean for rarely used tags, it puts the change tags table first, however there are lots of potential query plans here, and it would probably require careful analysis to determine if removing the straight join would result in a bad one for some query that is practically in use (presumably it was added for a reason originally).

Does STRAIGHT_JOIN have anything to do with that? Would something like the fix for T298225 (SpecialRecentChanges::isDenseTagFilter) help?

Yes and maybe. I forgot about T298225, but that was a very similar problem (except with a tag that applies to almost all edits, rather than almost none), and the solution there is basically the same thing I proposed here in 2016. I guess people trust Tim more than me.

Why not doing a subquery here? something like:

SELECT /*! STRAIGHT_JOIN */ img_name,img_timestamp,actor_user,actor_name FROM `image` JOIN `actor` ON ((actor_id=img_actor)) LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = actor_user) AND (ug_expiry IS NULL OR ug_expiry >= '20230124120909')) WHERE (ug_group IS NULL) AND img_name in (select rc_title FROM `recentchanges` LEFT JOIN `change_tag` `changetagdisplay` ON ((changetagdisplay.ct_rc_id=rc_id) AND changetagdisplay.ct_tag_id = 17)  rc_type = 3 AND rc_log_type = 'upload' AND rc_namespace = 6 AND (changetagdisplay.ct_tag_id IS NULL) ORDER BY img_timestamp DESC,img_name DESC LIMIT 51)

(you probably need to move other conditions inside the subquery too.

The condition rc_timestamp = img_timestamp or log_timestamp = img_timestamp does not work always (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/59766)

The current Special:NewFiles is using the image table and every new upload is taken as a new file, while Special:NewPages only take the page creation as new page. Special:NewFiles also shows the latest file based of the current time the special pages is loaded, no "time travel" to take files as new for a specific time in the past if there was a reupload. Just some different jobs.

To get tagfilter the main query must be for the upload log events from recentchanges, after that LocalRepo::findFiles can load the image data.
But that brings in the normal problems of recentchanges: Deletion + undeletion does not restore rc events, revision deleted log entries needs to filter out, how to follow moves.

It would also change the job of the special page in case of reuploads, what should be shown in that case? The first file or the newest file? Or showing both correctly in the timeline, the old one with a note.

Just to bring in another solution: Special:NewPages could show the uploads when rc_new is set to 1 when the file page gets created, that could allow https://commons.wikimedia.org/wiki/Special:NewPages?namespace=6&tagfilter=cross-wiki-upload to be used, but no thumbnail is shown (and the formatted row needs some more love). That would limit it to the first version of a file, possible not the use case wanted. Special:RecentChanges with tagfilter finds also the upload log events, but also without thumbnails.

Some more details about the usage of the special pages needs to be defined: Is Special:NewFiles to show new file versions with thumbs or to show new file pages with thumbs

I will try to summarize what I have learned.
The query planner may start from the image table (iterating rows by img_timestamp in descending order), find the matching row in recentchanges (likely using rc_namespace_title_timestamp or rc_name_type_patrolled_timestamp index) and look up ct_rc_id = rc_id AND ct_tag_id = # in change_tag (also indexed). If the chosen tag is rare (or hasn't been used recently), the scan of the image table will take a lot of time, and the query will time out.

In fact, the STRAIGHT_JOIN option (T124205, T244533, T297731) seems to force the above query plan and causes timeout for rare tags.

In that case, the preferred query plan would be to start from change_tag, then recentchanges via rc_id = ct_rc_id and then lookup image using img_name, img_timestamp, or img_actor_timestamp.

As a first step, since you're already limiting it to $wgRCMaxAge and sorting by timestamp, having a condition like img_timestamp > $cutOffDateForRecentchanges would probably be an improvement. [Whether or not it is enough of an improvement, I don't know. I also have not tested anything here].

This is a good idea. I have tested this in replicas, and it helps the query complete within a reasonable time (a few seconds). Another guidance for the query could be ct_rc_id >= (SELECT MIN(rc_id) FROM recentchanges).
Still, when I remove the STRAIGHT_JOIN option, the query completes instantly. So we should first try to remove it (or bypass it conditionally).