Page MenuHomePhabricator

Allow filtering based on tag on Special:NewFiles
Open, LowPublic

Description

As a user trying to find copyvios, the ability to filer images on https://commons.wikimedia.org/wiki/Special:NewFiles?hidepatrolled=1&limit=50&offset= to only display images with e.g. the tag "crosswiki-upload".

Related Objects

Event Timeline

Josve05a raised the priority of this task from to Needs Triage.
Josve05a updated the task description. (Show Details)
Josve05a moved this task to Tasks to follow on the User-Josve05a board.
Josve05a added subscribers: Josve05a, matmarex.
Josve05a set Security to None.

I'm going to work on this, I was first hoping to get some more tools to tag their uploads (T121864). Right now cross-wiki upload is pretty much the only one that does it, and OAuth-based tools since last week when we fixed T121875.

matmarex triaged this task as Medium priority.Jan 21 2016, 9:10 AM
Aklapper renamed this task from Allow filetering based on tag on Special:NewFIles to Allow filtering based on tag on Special:NewFIles.Jan 21 2016, 12:26 PM

This turned out to be a lot nastier than I expected because MySQL is astonishingly dumb, because our data is agonizingly inconsistent between wikis, and the fact that our schema is pretty dumb too definitely doesn't help. This is going to need somebody with more experience than me (tagging with DBA) and I'm not sure if it's really possible to enable on big wikis like Commons for performance reasons without putting unreasonable effort into it. And Special:RecentChanges gets you most of this functionality anyway…

The queries are running into the same issue as T124205 where the optimizer is choosing incredibly bad query plans unless forced otherwise. But this means that we lose the optimizer's flexibility to query differently depending on the data. In this case, the optimal query plan differs for tags with few occurrences (it's fastest to get all log events with the given tag, then filter them to the most recent ones and sort) and tags with many occurrences (it's fastest to get recent log events in the correct order, then filter them to ones with the given tag). So we have to implement that ourselves.

That makes it okay for most queries, but even then there remain cases where the query is just too slow, for tags that are in the middle and don't match either bucket well. I tried the queries this generates and, for example, getting the latest 50 uploads with the 'OAuth CID: 27' tag on Commons took around a minute. I'm not sure if a non-broken optimizer or clever indices could make them any faster.

Since the 'image' table has nothing that can be used to join with 'change_tag', I tried to join via 'logging' first. I got the results above and tried to join via 'recentchanges' instead, which did not help much. I guess both of the tables just have tons of rows which won't match and have to be skipped, and the complicated join makes it worse.

So… I'm going to push what I wrote for review (I spent too much time on it to just delete it :P), but don't hold your breath waiting for it to be merged and deployed. :/

Change 269583 had a related patch set uploaded (by Bartosz Dziewoński):
Special:NewFiles: Allow filtering by change tags and display them

https://gerrit.wikimedia.org/r/269583

Here are some queries this code generates:

Optimized for tags with few occurrences (try this query on enwiki), first tagfilter only, then tagfilter+hidepatrolled:

SELECT /*! STRAIGHT_JOIN */ img_name,img_user,img_timestamp,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`   WHERE ct_log_id=log_id  ) AS ts_tags  FROM `change_tag` INNER JOIN `logging` ON ((ct_log_id=log_id)) INNER JOIN `image` ON ((log_title = img_name) AND (log_user = img_user) AND (log_timestamp = img_timestamp)) LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user))  WHERE (ug_group IS NULL) AND log_namespace = '6' AND log_type = 'upload' AND ct_tag = 'cross-wiki-upload'  ORDER BY img_timestamp DESC LIMIT 51
SELECT /*! STRAIGHT_JOIN */ img_name,img_user,img_timestamp,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`   WHERE ct_log_id=log_id  ) AS ts_tags  FROM `change_tag` INNER JOIN `logging` ON ((ct_log_id=log_id)) INNER JOIN `image` ON ((log_title = img_name) AND (log_user = img_user) AND (log_timestamp = img_timestamp)) LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user)) INNER JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_user = img_user) AND (rc_timestamp = img_timestamp))  WHERE (ug_group IS NULL) AND log_namespace = '6' AND log_type = 'upload' AND ct_tag = 'cross-wiki-upload' AND rc_type = '3' AND rc_log_type = 'upload' AND rc_patrolled = '0' AND rc_namespace = '6'  ORDER BY img_timestamp DESC LIMIT 51

Optimized for tags with many occurrences (try this query on commonswiki), first tagfilter only, then tagfilter+hidepatrolled:

SELECT /*! STRAIGHT_JOIN */ img_name,img_user,img_timestamp,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`   WHERE ct_log_id=log_id  ) AS ts_tags  FROM `image` LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user)) INNER JOIN `logging` ON ((log_title = img_name) AND (log_user = img_user) AND (log_timestamp = img_timestamp)) INNER JOIN `change_tag` ON ((ct_log_id=log_id))  WHERE (ug_group IS NULL) AND log_namespace = '6' AND log_type = 'upload' AND ct_tag = 'cross-wiki-upload'  ORDER BY img_timestamp DESC LIMIT 51
SELECT /*! STRAIGHT_JOIN */ img_name,img_user,img_timestamp,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`   WHERE ct_log_id=log_id  ) AS ts_tags  FROM `image` LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user)) INNER JOIN `logging` ON ((log_title = img_name) AND (log_user = img_user) AND (log_timestamp = img_timestamp)) INNER JOIN `change_tag` ON ((ct_log_id=log_id)) INNER JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_user = img_user) AND (rc_timestamp = img_timestamp))  WHERE (ug_group IS NULL) AND log_namespace = '6' AND log_type = 'upload' AND ct_tag = 'cross-wiki-upload' AND rc_type = '3' AND rc_log_type = 'upload' AND rc_patrolled = '0' AND rc_namespace = '6'  ORDER BY img_timestamp DESC LIMIT 51

matmarex: Welcome to the club: T118186. As a general comment, I think we fall into the trap of giving too many options to the users, and that came back to bite us.

Change 269583 abandoned by Bartosz Dziewoński:
Special:NewFiles: Allow filtering by change tags and display them

Reason:
These queries are probably not going to be performant enough on our big wikis. I'm not working on this at the moment.

https://gerrit.wikimedia.org/r/269583

zhuyifei1999 renamed this task from Allow filtering based on tag on Special:NewFIles to Allow filtering based on tag on Special:NewFiles.May 13 2017, 5:55 PM
zhuyifei1999 added a subscriber: zhuyifei1999.
MarkTraceur lowered the priority of this task from Medium to Low.Jun 5 2017, 3:11 PM
Marostegui added subscribers: Ladsgroup, Marostegui.

I wonder if this is still relevant after all the work that has been done with change_tag T185355: Normalize change tag schema

Yeah, it might be possible. It needs some updates to reflect the normalization and updates that happened to the table and also, join decomposition might help a lot here but it should be doable. Anyone to take stab at it?