We want to categorize the types of, and reasons for, deletions of uploaded media files, based on an analysis of a sample of filed deletion requests. Once we understand the main reasons and the rough proportion of each deletion type, we can identify the most problematic ones and prioritize improvements that minimize their inflow.

This is part of Design research on Commons. We would first do a programmatic analysis and then ask design research for a qualitative analysis on top.

Useful information about the baselines for uploads and some deletion request ratios can be found in the comments here.

Step 1: Preliminary analysis

  • Which data can we get about a deletion request? Before proceeding to the sampling and analyses, send an example with all data we can get to Sneha and Alexandra for review and discussion about which data to include in the analysis

Step 2: Analyze a sample
Retrieve a random sample of 1000 deletion requests over the last year and try to categorise based on the following parameters:

  • Type of deletion request (speedy or regular)
  • Time to resolve (less than 1 week, 1 week to 1 month, 1 month to 3 months, 3 months+, not yet resolved)
  • Reasons - see reasons in this write-up. Implementation note: reasons for deletion requests should have tags, so we can probably use those
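
The time-to-resolve classes above could be assigned with a small helper like the one below. This is a minimal sketch: the function name `resolution_bucket`, the 30-day and 90-day approximations of "1 month" and "3 months", and the treatment of boundary values are our assumptions, since the task does not specify them.

```python
from datetime import datetime
from typing import Optional


def resolution_bucket(opened: datetime, closed: Optional[datetime]) -> str:
    """Map a deletion request to one of the time-to-resolve classes."""
    if closed is None:
        return "not resolved"
    days = (closed - opened).total_seconds() / 86400
    if days < 7:
        return "less than 1 week"
    if days < 30:  # assumption: 1 month is approximated as 30 days
        return "1 week to 1 month"
    if days < 90:  # assumption: 3 months is approximated as 90 days
        return "1 month to 3 months"
    return "3 months+"
```

A request that is closed exactly 7 days after opening falls into "1 week to 1 month" under these assumptions.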

Questions we want to answer:
  • Share (%) of each deletion class
  • Which reasons are most commonly reported within each class?
  • Is there any correlation between, e.g., time to close and specific reasons?
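
One way to approach these questions is a plain contingency count of class versus reason. The sketch below uses made-up records and hypothetical names (`records`, `share_per_class`, `top_reasons_per_class`); it only illustrates the counting and is not a statistical test of correlation.

```python
from collections import Counter, defaultdict

# Hypothetical (class, reason) pairs; real ones would come from the sample.
records = [
    ("speedy", "scope"),
    ("speedy", "scope"),
    ("speedy", "copyright"),
    ("regular", "copyright"),
    ("regular", "copyright"),
    ("regular", "freedom of panorama"),
]


def share_per_class(rows):
    """Return the percentage of rows that fall into each class."""
    counts = Counter(cls for cls, _ in rows)
    total = sum(counts.values())
    return {cls: 100 * n / total for cls, n in counts.items()}


def top_reasons_per_class(rows, n=3):
    """Return the n most frequent reasons for each class."""
    by_class = defaultdict(Counter)
    for cls, reason in rows:
        by_class[cls][reason] += 1
    return {cls: counter.most_common(n) for cls, counter in by_class.items()}
```

Replacing the class with a time-to-close bucket gives the same table for the correlation question.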

Step 3: We would like to ensure that the analysis is representative and not biased towards the latest 1000 deletion requests. To that end, we would like to run the same analysis on several historical samples.
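
Drawing independent random samples from several historical windows could look like the sketch below. The names (`sample_per_window`, `requests_by_window`), the window keys, and the fixed seed are illustrative assumptions.

```python
import random


def sample_per_window(requests_by_window, k=1000, seed=42):
    """Draw up to k requests uniformly at random from each window."""
    rng = random.Random(seed)  # fixed seed so the samples are reproducible
    samples = {}
    # Sort the windows so the draw order, and hence the result, is stable.
    for window, requests in sorted(requests_by_window.items()):
        size = min(k, len(requests))
        samples[window] = rng.sample(requests, size)
    return samples
```

Each window maps to a list of requests; windows with fewer than `k` requests are returned in full, in shuffled order.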

Preliminary analysis

Here's a sample of 100 Commons pages that got deleted between 2022-05-01 and 2023-06-01 by non-bot users
The deletion event edit message (comment_text field) seems like a relevant piece of information that enables the analysis of coarse-grained deletion types/classes and fine-grained reasons.

  • interval: 13 months
  • start date: 2022-05-01
  • end date: 2023-06-01
  • total rows: 1.3 M (1,285,839)
  • total deleted revisions: 1.3 M (1,278,527)
  • total deleted pages (counted via page ID): 497 k (497,106)
  • total deleted pages (counted via page title): 489 k (488,890)
  • total distinct deletion edit messages: 154 k (154,017)
  • data lake query:
    SELECT page_id, revision_id, page_title, c.comment_text
    FROM wmf.mediawiki_history mh
    LEFT JOIN wmf_raw.mediawiki_logging l ON mh.page_id = l.log_page
    LEFT JOIN wmf_raw.mediawiki_private_comment c ON l.log_comment_id = c.comment_id
    WHERE mh.snapshot = '2023-05' AND c.snapshot = '2023-05' AND l.snapshot = '2023-05'
    AND event_timestamp >= '2022-05-01' AND event_timestamp < '2023-06-01'
    AND event_entity = 'revision'
    AND event_type = 'create'
    AND log_type = 'delete'
    AND log_action = 'delete'
    AND page_namespace = 6
    AND page_is_redirect IS NULL
    AND page_is_deleted
    AND mh.wiki_db = 'commonswiki' AND c.wiki_db = 'commonswiki' AND l.wiki_db = 'commonswiki'
    AND SIZE(event_user_is_bot_by) <= 0
    AND SIZE(event_user_is_bot_by_historical) <= 0
  • input: deletion requests archive
  • interval: 13 months
  • start date: 2022-05-01
  • end date: 2023-06-01
  • total requests closed with a deletion: 68 k (68,071)

First sample analysis

NOTE: according to the policy, deletion requests should not be filed for speedy deletions. However, a deletion request and a deletion event edit message can specify different reasons. For instance, compare this request with its deletion event, where a regular request is actually closed as a speedy one. This introduces a mix of deletion types, which contradicts the official procedures. Therefore, we limit the analysis to deletion requests and classify them as speedy or regular based solely on their resolution time.
  • input: deletion requests dataset as above
  • speedy deletion threshold: 7 days
  • % of each deletion class:
    • 38 % speedy (379)
    • 62 % regular (621), of which:
      • 62 % (384) 1 week to 1 month
      • 23 % (141) 1 to 3 months
      • 15 % (96) 3+ months
  • most commonly reported reasons:
    • the top speedy reasons seem related to the project scope, a very broad topic that encompasses more specific reasons
    • the top regular reasons seem related to copyright violation, which can break down into more specific ones, typically freedom of panorama in this case
  • correlation between time to close and reasons: TODO
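
As a sanity check, the rounded percentages above follow directly from the reported counts. A minimal sketch (the variable and function names are ours):

```python
# Counts as reported in the list above.
counts = {"speedy": 379, "regular": 621}
regular_breakdown = {
    "1 week to 1 month": 384,
    "1 to 3 months": 141,
    "3+ months": 96,
}


def shares(bucket_counts):
    """Return each bucket's share of the total, rounded to a whole percent."""
    total = sum(bucket_counts.values())
    return {name: round(100 * n / total) for name, n in bucket_counts.items()}


class_shares = shares(counts)               # share of the 1000 sampled requests
regular_shares = shares(regular_breakdown)  # share of the 621 regular requests
```

Note that the breakdown percentages are relative to the regular requests only, not to the whole sample.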

Speedy deletion requests

Analysis scale up

  • input: deletion requests dataset merged with deleted pages dataset
  • total requests: 53 k (53,021)
  • resolution time buckets:
    1. up to 1 week - 38 % (20,242)
    2. 1 week to 1 month - 37 % (19,777)
    3. 1 to 3 months - 15 % (7,936)
    4. 3+ months - 10 % (5,066)
  • top 10 wikilinks shared by all buckets:
  • top 10 wikilinks unique to each bucket:
    1. COM:NOTHOST - Commons is not a free Web host
    2. none
    3. COM:TOO UK - United Kingdom's threshold of originality
    4. COM:PCP - precautionary principle
  • top 10 words shared by all buckets:
    • copyright
    • uploader - typically related to either not own work or mistaken uploads
  • top 10 words unique to each bucket:
    1. educational, logo, personal, quality, uploaded
    2. possible
    3. free, see
    4. author, de, initially, tagged
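
The "shared by all buckets" and "unique to each bucket" lists can be derived with set operations over each bucket's top items. The sketch below assumes a hypothetical `top_by_bucket` mapping from bucket number to its top items; the toy values are not the real top-10 lists.

```python
def shared_and_unique(top_by_bucket):
    """Split top items into those shared by all buckets and those unique to one."""
    sets = {bucket: set(items) for bucket, items in top_by_bucket.items()}
    shared = set.intersection(*sets.values())
    unique = {}
    for bucket, items in sets.items():
        # Everything that appears in any other bucket.
        others = set().union(*(s for b, s in sets.items() if b != bucket))
        unique[bucket] = items - others
    return shared, unique


# Toy input for illustration only.
top_by_bucket = {
    1: ["copyright", "uploader", "educational"],
    2: ["copyright", "uploader", "possible"],
    3: ["copyright", "uploader", "free"],
}
shared, unique = shared_and_unique(top_by_bucket)
```

An item that appears in two or three buckets, but not all, ends up in neither list, which would explain a "none" entry for a bucket.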

Top reasons taxonomy

NOTE: this is a manually built attempt to classify top reasons as emerged from the analysis above.
  • copyright violation
    • derivative work
    • freedom of panorama
      • by country
    • threshold of originality
      • logo
      • Google maps
      • album cover
      • screenshot
      • poster
      • banner
      • book
    • not own work
    • non-free license
    • inquiry to volunteer response team
  • not suitable for work
    • not educational
    • nudity
    • penis
  • not a free Web host
    • personal use
    • unused file
    • selfie
    • low quality
  • deletion requested by the uploader
    • mistake
    • better version available
  • duplicate
    • down-scaled
    • lower quality

Viable reasons frequency

We count how many wikilinks or full opening reason messages contain given keywords that are likely to trigger the above reasons.
Focus is on those that can be implemented as viable targets for automatic classifiers.
The table is sorted in descending order of full message percentages.

NOTE: wikilink percentages are based on 20 k (20,294) wikilinks extracted from opening reasons, full message percentages are based on 53 k total opening reasons.
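| reason | wikilink % | wikilink total | full message % | full message total | contains |
| --- | --- | --- | --- | --- | --- |
| freedom of panorama | 20 | 3,992 | 9 | 4,866 | fop or freedom of panorama |
| album cover | ~0 | 1 | 1.6 | 841 | album |
| not suitable for work | 3 | 589 | 1.3 | 702 | penis or vulva or vagina or nudity |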

For the sake of completeness, we also report the following reasons:

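| reason | wikilink % | wikilink total | full message % | full message total | contains |
| --- | --- | --- | --- | --- | --- |
| derivative work | 3 | 697 | 2.5 | 1,324 | dw or derivative |
| not a free Web host | 1 | 264 | 1.4 | 738 | host |
| threshold of originality | 2 | 465 | 1.2 | 625 | too or threshold |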
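
The counting behind the two tables above can be approximated as follows. This is a sketch under assumptions: the `KEYWORDS` mapping mirrors the "contains" column, matching is a case-insensitive substring test, and the function name is invented for illustration. Whether the original analysis matched substrings or whole words is not stated; substring matching catches shortcuts such as COM:NOTHOST but also produces false positives for short keywords like "too".

```python
# Keywords per reason, taken from the "contains" column.
KEYWORDS = {
    "freedom of panorama": ["fop", "freedom of panorama"],
    "album cover": ["album"],
    "not suitable for work": ["penis", "vulva", "vagina", "nudity"],
    "derivative work": ["dw", "derivative"],
    "not a free Web host": ["host"],
    "threshold of originality": ["too", "threshold"],
}


def reason_frequency(messages, keywords=KEYWORDS):
    """Return {reason: (hits, percentage)} over the given messages."""
    total = len(messages)
    result = {}
    for reason, kws in keywords.items():
        # A message counts once per reason, however many keywords it contains.
        hits = sum(1 for m in messages if any(kw in m.lower() for kw in kws))
        result[reason] = (hits, 100 * hits / total if total else 0.0)
    return result
```

The same function can be run once over the extracted wikilinks and once over the full opening reasons to fill both pairs of columns.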

Deletion requests for multiple files

We run the analysis over an extended dataset that includes deletion requests for multiple files.

Top 10 opening reasons

1) wikilinks
    "Commons:Project scope": 10256,
    "COM:FOP": 5680,
    "COM:SCOPE": 3423,
    "COM:DW": 1792,
    "COM:NOTHOST": 1300,
    "COM:VRT": 1217,
    "COM:OTRS": 934,
    "COM:WEBHOST": 875,
    "COM:TOO": 708,
    "COM:TOYS": 661,
2) words
    "copyright": 22789,
    "scope": 18008,
    "unused": 17085,
    "educational": 11116,
    "logo": 10569,
    "uploader": 10218,
    "use": 9404,
    "license": 9256,
    "personal": 8915,
    "source": 8064,
3) clusters
    Cluster 0: uploader per unlikely exif small images data missing web resolutions
    Cluster 1: scope unused personal educational value project useful logo svg notability
    Cluster 2: copyrighted per logo banner still likely artwork copyright images book
    Cluster 3: copyright violation uploader see infringement still holder source status permission
    Cluster 4: license source free non cc evidence author permission pd website
    Cluster 5: used private album personal drawing page wikipedia article self wiki
    Cluster 6: de es la en trabajo propio wikipedia claimed logos needed
    Cluster 7: quality low resolution bad poor unused unlikely useful better scope
    Cluster 8: tagged initially uploaded use logo wikimedia permission educational pd fop
    Cluster 9: copyvio possible googlemaps cover album uploader logo picture screenshot author
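
Rankings like the wikilink and word counts above could be produced along these lines. This is a sketch only: the regular expressions, the tiny stop-word list, and the function names are assumptions. The clustering step is not covered, because the method used for it is not stated.

```python
import re
from collections import Counter

# Matches [[Target]], [[Target|label]] and [[:Target]]; group 1 is the target.
WIKILINK = re.compile(r"\[\[:?([^\]|#]+)[^\]]*\]\]")
WORD = re.compile(r"[a-z]+")
# Assumption: a minimal stop-word list, just enough for illustration.
STOP_WORDS = {"the", "of", "a", "is", "in", "and", "to", "no", "out"}


def top_wikilinks(messages, n=10):
    """Count wikilink targets across all opening reason messages."""
    links = Counter()
    for message in messages:
        links.update(target.strip() for target in WIKILINK.findall(message))
    return links.most_common(n)


def top_words(messages, n=10):
    """Count lower-cased words outside wikilinks, skipping stop words."""
    words = Counter()
    for message in messages:
        text = WIKILINK.sub(" ", message).lower()
        words.update(w for w in WORD.findall(text) if w not in STOP_WORDS)
    return words.most_common(n)
```

Different spellings of the same target, such as "Commons:Project scope" and "COM:SCOPE", are counted separately here, as they are in the listing above.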

Viable reasons frequency

  • total wikilinks: 73k (73,167)
  • total opening reason messages: 170k (170,448)
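| reason | wikilink % | wikilink total | full message % | full message total | contains |
| --- | --- | --- | --- | --- | --- |
| freedom of panorama | 14 | 10,405 | 8.1 | 13,846 | fop or freedom of panorama |
| album cover | ~0 | 30 | 1.9 | 3,182 | album |
| not suitable for work | 1.3 | 973 | 0.7 | 1,216 | penis or vulva or vagina or nudity |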


  • The analysis is quite consistent with the previous dataset:
    • freedom of panorama still ranks first, despite being a little less represented in full messages (-0.9 percentage points)
    • logo still ranks second and gains +2%
    • book now ranks third
  • deletion requests for multiple files may be very large, e.g., this one accounts for 57k files

