Identify unhelpful file names on commons
Open, Low, Public

Description

In T177353, we were asked to get a count of files with unhelpful names. To identify unhelpful file names, we can extract the old and new file names from move log entries whose change reason is "meaningless or ambiguous", and then train a classification model on them.
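A minimal sketch of the labeling step described above, assuming the move log has already been exported as records with an old name, a new name, and a free-text reason. The `MoveLogEntry` type, its field names, the `reason_marker` default, and the renamed titles in the usage example are all hypothetical and only illustrate the idea; they are not the actual log schema.

```python
from typing import Iterable, List, NamedTuple, Tuple


class MoveLogEntry(NamedTuple):
    # Hypothetical record layout; the real move log schema may differ.
    old_name: str
    new_name: str
    reason: str


def build_training_pairs(
    entries: Iterable[MoveLogEntry],
    reason_marker: str = "meaningless or ambiguous",
) -> List[Tuple[str, int]]:
    """Turn matching move log entries into (file_name, label) examples.

    Label 1 marks the old (presumably unhelpful) name and label 0 the
    new (presumably helpful) name. Entries whose reason does not contain
    the marker are skipped.
    """
    examples: List[Tuple[str, int]] = []
    for entry in entries:
        if reason_marker in entry.reason.lower():
            examples.append((entry.old_name, 1))
            examples.append((entry.new_name, 0))
    return examples


if __name__ == "__main__":
    # Made-up sample rows for illustration only.
    sample = [
        MoveLogEntry(
            "File:Img-071129152243-0001.png",
            "File:Scanned letter page 1.png",
            "Meaningless or ambiguous name",
        ),
        MoveLogEntry("File:Old.jpg", "File:New.jpg", "Uploader requested"),
    ]
    for name, label in build_training_pairs(sample):
        print(label, name)
```

The resulting pairs could then be fed to whichever classifier is chosen; treating every new name as a positive example of a helpful name is itself an assumption that would need checking against real data.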

Putting this project in the backlog now. I will pick it up when we have some bandwidth.

Event Timeline

Restricted Application added a subscriber: Aklapper.

What is the exact meaning of "unhelpful" in the context of this ticket? What is the goal of this ticket and its parent (?) ticket T177353: Metrics for SDoC: look at search hits based on which element the search is hitting?

What the Wikimedia Commons community considers "unhelpful" file names can be seen in https://commons.wikimedia.org/wiki/MediaWiki:Titleblacklist, if this is what you are looking for.

Hello @thiemowmde! The purpose of T177353 and its parent ticket T174519: [epic] SDoC: Determine baseline for metrics is to figure out a baseline for metrics on Commons in order to measure future successes for the SDC General (SDoC) project. The SDoC team and we (Discovery-Analysis) came up with a list of things that would be interesting to measure, and created T177353 and other child tickets (see T174519 for more details). This work is exploratory in nature: some metrics in the list are clearly defined, while some -- for example, the exact meaning of "unhelpful" -- are not. Any ideas and comments are very welcome!

The Titleblacklist is used to block certain file names (generic, spam, etc.) through mw:Extension:Title blacklist when users try to upload files with these invalid names. However, regular expressions are not perfect, and some files with "unhelpful" names still get uploaded -- e.g. File:Img-071129152243-0001.png and those in the move log whose change reason is "meaningless or ambiguous" -- and these currently have to be identified by humans. That's why I'm thinking about using a machine learning model to help identify these files.

That sounds great. Thanks a lot for the additional insight!

Hi @chelsyx,

Check this notebook: apparently, the number of whitespace characters is a pretty good indicator of file name quality.
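For illustration, the whitespace signal mentioned above could be computed roughly as follows. This is a sketch and not the notebook's code: the function name `count_word_separators` is made up, and counting underscores alongside spaces is an assumption based on MediaWiki storing spaces in titles as underscores.

```python
def count_word_separators(file_name: str) -> int:
    """Count spaces and underscores in a file title.

    The optional "File:" prefix and the file extension are ignored so
    that only the descriptive part of the name is measured.
    """
    title = file_name
    if title.startswith("File:"):
        title = title[len("File:"):]
    # Drop the extension, if any (text after the last dot).
    if "." in title:
        title = title.rsplit(".", 1)[0]
    # Assumption: underscores are treated as spaces, as in MediaWiki titles.
    return sum(1 for ch in title if ch.isspace() or ch == "_")


if __name__ == "__main__":
    # The first name is from this ticket; the second is a made-up example.
    print(count_word_separators("File:Img-071129152243-0001.png"))
    print(count_word_separators("File:Eiffel Tower at night.jpg"))
```

A count of zero, as for the camera-style name, would be one hint that a title carries little description; any threshold would have to be picked from labeled data.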

kzimmerman subscribed.

Moving to icebox; Chelsy had thought about maybe continuing this in a volunteer capacity