Page MenuHomePhabricator

Allow filtering based on tag on Special:NewFiles
Open, LowPublic

Description

As a user trying to find copyvios, the ability to filer images on https://commons.wikimedia.org/wiki/Special:NewFiles?hidepatrolled=1&limit=50&offset= to only display images with e.g. the tag "crosswiki-upload".

Related Objects

Event Timeline

Josve05a created this task.Jan 20 2016, 8:02 PM
Josve05a raised the priority of this task from to Needs Triage.
Josve05a updated the task description. (Show Details)
Josve05a moved this task to Tasks to follow on the User-Josve05a board.
Josve05a added subscribers: Josve05a, matmarex.
Josve05a updated the task description. (Show Details)Jan 20 2016, 8:04 PM
Josve05a set Security to None.

I'm going to work on this, I was first hoping to get some more tools to tag their uploads (T121864). Right now cross-wiki upload is pretty much the only one that does it, and OAuth-based tools since last week when we fixed T121875.

matmarex triaged this task as Normal priority.Jan 21 2016, 9:10 AM
Aklapper renamed this task from Allow filetering based on tag on Special:NewFIles to Allow filtering based on tag on Special:NewFIles.Jan 21 2016, 12:26 PM
Steinsplitter moved this task from Incoming to Backlog on the Commons board.Jan 22 2016, 5:58 PM

This turned out to be a lot nastier than I expected because MySQL is astonishingly dumb, because our data is agonizingly inconsistent between wikis, and the fact that our schema is pretty dumb too definitely doesn't help. This is going to need somebody with more experience than me (tagging with DBA) and I'm not sure if it's really possible to enable on big wikis like Commons for performance reasons without putting unreasonable effort into it. And Special:RecentChanges gets you most of this functionality anyway…

The queries are running into the same issue as T124205 where the optimizer is choosing incredibly bad query plans unless forced otherwise. But this means that we lose the optimizer's flexibility to query differently depending on the data. In this case, the optimal query plan differs for tags with few occurrences (it's fastest to get all log events with the given tag, then filter them to the most recent ones and sort) and tags with many occurrences (it's fastest to get recent log events in the correct order, then filter them to ones with the given tag). So we have to implement that ourselves.

That makes it okay for most queries, but even then there remain cases where the query is just too slow, for tags that are in the middle and don't match either bucket well. I tried the queries this generates and, for example, getting the latest 50 uploads with the 'OAuth CID: 27' tag on Commons took around a minute. I'm not sure if a non-broken optimizer or clever indices could make them any faster.

Since the 'image' table has nothing that can be used to join with 'change_tag', I tried to join via 'logging' first. I got the results above and tried to join via 'recentchanges' instead, which did not help much. I guess both of the tables just have tons of rows which won't match and have to be skipped, and the complicated join makes it worse.

So… I'm going to push what I wrote for review (I spent too much time on it to just delete it :P), but don't hold your breath waiting for it to be merged and deployed. :/

Change 269583 had a related patch set uploaded (by Bartosz Dziewoński):
Special:NewFiles: Allow filtering by change tags and display them

https://gerrit.wikimedia.org/r/269583

matmarex added a comment.EditedFeb 10 2016, 1:52 AM

Here are some queries this code generates:

Optimized for tags with few occurrences (try this query on enwiki), first tagfilter only, then tagfilter+hidepatrolled:

SELECT /*! STRAIGHT_JOIN */ img_name,img_user,img_timestamp,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`   WHERE ct_log_id=log_id  ) AS ts_tags  FROM `change_tag` INNER JOIN `logging` ON ((ct_log_id=log_id)) INNER JOIN `image` ON ((log_title = img_name) AND (log_user = img_user) AND (log_timestamp = img_timestamp)) LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user))  WHERE (ug_group IS NULL) AND log_namespace = '6' AND log_type = 'upload' AND ct_tag = 'cross-wiki-upload'  ORDER BY img_timestamp DESC LIMIT 51
SELECT /*! STRAIGHT_JOIN */ img_name,img_user,img_timestamp,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`   WHERE ct_log_id=log_id  ) AS ts_tags  FROM `change_tag` INNER JOIN `logging` ON ((ct_log_id=log_id)) INNER JOIN `image` ON ((log_title = img_name) AND (log_user = img_user) AND (log_timestamp = img_timestamp)) LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user)) INNER JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_user = img_user) AND (rc_timestamp = img_timestamp))  WHERE (ug_group IS NULL) AND log_namespace = '6' AND log_type = 'upload' AND ct_tag = 'cross-wiki-upload' AND rc_type = '3' AND rc_log_type = 'upload' AND rc_patrolled = '0' AND rc_namespace = '6'  ORDER BY img_timestamp DESC LIMIT 51

Optimized for tags with many occurrences (try this query on commonswiki), first tagfilter only, then tagfilter+hidepatrolled:

SELECT /*! STRAIGHT_JOIN */ img_name,img_user,img_timestamp,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`   WHERE ct_log_id=log_id  ) AS ts_tags  FROM `image` LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user)) INNER JOIN `logging` ON ((log_title = img_name) AND (log_user = img_user) AND (log_timestamp = img_timestamp)) INNER JOIN `change_tag` ON ((ct_log_id=log_id))  WHERE (ug_group IS NULL) AND log_namespace = '6' AND log_type = 'upload' AND ct_tag = 'cross-wiki-upload'  ORDER BY img_timestamp DESC LIMIT 51
SELECT /*! STRAIGHT_JOIN */ img_name,img_user,img_timestamp,(SELECT  GROUP_CONCAT(ct_tag SEPARATOR ',')  FROM `change_tag`   WHERE ct_log_id=log_id  ) AS ts_tags  FROM `image` LEFT JOIN `user_groups` ON (ug_group = 'bot' AND (ug_user = img_user)) INNER JOIN `logging` ON ((log_title = img_name) AND (log_user = img_user) AND (log_timestamp = img_timestamp)) INNER JOIN `change_tag` ON ((ct_log_id=log_id)) INNER JOIN `recentchanges` ON ((rc_title = img_name) AND (rc_user = img_user) AND (rc_timestamp = img_timestamp))  WHERE (ug_group IS NULL) AND log_namespace = '6' AND log_type = 'upload' AND ct_tag = 'cross-wiki-upload' AND rc_type = '3' AND rc_log_type = 'upload' AND rc_patrolled = '0' AND rc_namespace = '6'  ORDER BY img_timestamp DESC LIMIT 51
matmarex removed matmarex as the assignee of this task.Feb 10 2016, 1:53 AM

matmarex: Welcome to the club: T118186. As a general comment, I think we fall into the trap of giving too many options to the users, and that came back to bite us.

Jdforrester-WMF moved this task from Untriaged to Doing on the Multimedia board.Jul 7 2016, 2:10 PM
Restricted Application added a subscriber: Poyekhali. · View Herald TranscriptJul 7 2016, 2:10 PM
dr0ptp4kt moved this task from Doing to Triaged on the Multimedia board.May 8 2017, 3:31 PM

Change 269583 abandoned by Bartosz Dziewoński:
Special:NewFiles: Allow filtering by change tags and display them

Reason:
These queries are probably not going to be performant enough on our big wikis. I'm not working on this at the moment.

https://gerrit.wikimedia.org/r/269583

matmarex updated the task description. (Show Details)
zhuyifei1999 renamed this task from Allow filtering based on tag on Special:NewFIles to Allow filtering based on tag on Special:NewFiles.May 13 2017, 5:55 PM
zhuyifei1999 added a subscriber: zhuyifei1999.
MarkTraceur lowered the priority of this task from Normal to Low.Jun 5 2017, 3:11 PM
Marostegui added subscribers: Ladsgroup, Marostegui.

I wonder if this is still relevant after all the work that has been done with change_tag T185355: Normalize change tag schema

Restricted Application added a project: Growth-Team. · View Herald TranscriptJan 18 2019, 11:29 AM

Yeah, it might be possible. It needs some updates to reflect the normalization and updates that happened to the table and also, join decomposition might help a lot here but it should be doable. Anyone to take stab at it?

Marostegui edited projects, added User-Marostegui; removed DBA.Jan 20 2019, 3:27 PM