**Overview**
We are interested in categorizing the types of, and reasons for, deletions of uploaded media files, based on an analysis of a sample of filed deletion requests. Once we understand the main reasons and the rough proportion of each deletion type, we can identify the most problematic ones and prioritize improvements aimed at minimizing their in-flow.
This is part of [[ https://docs.google.com/document/d/1qBWhT5O-47P8xzRxoLpjZ9UaoUzSAt5ZAOnrZwX8nqs/edit | Design research on Commons ]]. We would first do a programmatic analysis and then ask design research for a qualitative analysis on top.
Useful information about the baselines for uploads and some deletion request ratios can be found in the comments at https://phabricator.wikimedia.org/T337466
**Requirements**
Step 1: Preliminary analysis
- Which data can we get about a deletion request? **Before proceeding to the sampling and analyses**, send an example with all data we can get to Sneha and Alexandra for review and discussion about which data to include in the analysis
Step 2: Analyse a sample
Retrieve a random sample of 1000 deletion requests over the last year and try to categorise based on the following parameters:
- Type of deletion request (speedy or regular)
- Time to resolve (less than 1 week, 1 week to 1 month, 1 month to 3 months, 3 months+, haven't been resolved)
- Reasons - see reasons in this [[ https://docs.google.com/document/d/1jxyyui4onla8cO0ub0Zrw9fKApHmulu_f0Tf8xENqlY/edit | write-up ]]. Implementation note: reasons for deletion requests should have tags, so we can probably use those
Questions we want to answer:
- Share (%) of each deletion class
- Which reasons are most commonly reported within each class?
- Is there any correlation between e.g. time to close and specific reasons?
Step 3: We would like to ensure that the analysis is representative and not biased to the latest 1000 deletion requests. As such, we would like to run the same analysis for several historical samples to minimize bias.
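
The sampling described in steps 2 and 3 could be sketched as follows. This is a minimal illustration, not the method actually used: the `filed` field, the `sample_requests` helper, the toy data and the two one-year windows are all assumptions made for the example.

```python
import random
from datetime import date

def sample_requests(requests, start, end, size=1000, seed=0):
    """Draw a reproducible random sample of requests filed in [start, end)."""
    in_window = [r for r in requests if start <= r["filed"] < end]
    rng = random.Random(seed)  # fixed seed so the sample can be re-drawn
    return rng.sample(in_window, min(size, len(in_window)))

# Hypothetical toy data: one request per day over two years.
first_day = date(2021, 6, 1).toordinal()
requests = [
    {"id": i, "filed": date.fromordinal(first_day + i)}
    for i in range(730)
]

# One sample per historical window, so results can be compared across periods
# instead of relying on the latest requests only.
windows = [
    (date(2021, 6, 1), date(2022, 6, 1)),
    (date(2022, 6, 1), date(2023, 6, 1)),
]
samples = [sample_requests(requests, s, e, size=100, seed=42) for s, e in windows]
```

Running the same classification on each element of `samples` would show whether the class shares are stable over time.
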
---
==Preliminary analysis==
Here's a sample of 100 Commons pages that got deleted between `2022-05-01` and `2023-06-01` by **non-bot** users, CC @AUgolnikova-WMF, @Sneha.
The deletion event **edit message** (`comment_text` field) seems like a relevant piece of information that enables the analysis of coarse-grained deletion types/classes and fine-grained reasons.
{P49526}
===Deleted pages dataset===
- interval: 13 months
- start date: `2022-05-01`
- end date: `2023-06-01`
- total rows: **1.3 M** (1,285,839)
- total deleted revisions: **1.3 M** (1,278,527)
- total deleted pages (counted via page **ID**): **497 k** (497,106)
- total deleted pages (counted via page **title**): **489 k** (488,890)
- total **distinct** deletion edit messages: **154 k** (154,017)
- data lake query: {P49529}
- most frequent (**> 1000 times**) edit messages: {P49530}
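
As a rough illustration of how the counts above can be derived once the query results are loaded as rows, here is a minimal sketch. Only `comment_text` comes from the data described here; the `page_id` and `page_title` field names, the helper functions and the toy rows are assumptions.

```python
from collections import Counter

def dataset_stats(rows):
    """Count rows, distinct pages by ID and by title, and distinct edit messages."""
    return {
        "rows": len(rows),
        "pages_by_id": len({row["page_id"] for row in rows}),
        "pages_by_title": len({row["page_title"] for row in rows}),
        "distinct_messages": len({row["comment_text"] for row in rows}),
    }

def frequent_messages(rows, min_count=1000):
    """Return (message, count) pairs seen more than min_count times, most frequent first."""
    counts = Counter(row["comment_text"] for row in rows)
    return [(msg, n) for msg, n in counts.most_common() if n > min_count]

# Made-up rows, for illustration only.
rows = [
    {"page_id": 1, "page_title": "File:A.jpg", "comment_text": "Copyright violation"},
    {"page_id": 1, "page_title": "File:A.jpg", "comment_text": "Copyright violation"},
    {"page_id": 2, "page_title": "File:B.jpg", "comment_text": "Copyright violation"},
    {"page_id": 3, "page_title": "File:B.jpg", "comment_text": "Out of project scope"},
    {"page_id": 4, "page_title": "File:C.jpg", "comment_text": "Out of project scope"},
    {"page_id": 5, "page_title": "File:D.jpg", "comment_text": "Duplicate"},
]
stats = dataset_stats(rows)
top = frequent_messages(rows, min_count=1)
```

Counting by ID and by title can differ, as in the figures above, because a title can be reused by pages with different IDs.
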
===Deletion requests dataset===
- input: [deletion requests archive](https://commons.wikimedia.org/wiki/Commons:Deletion_requests/Old)
- interval: 13 months
- start date: `2022-05-01`
- end date: `2023-06-01`
- total requests **closed with a deletion**: **68 k** (68,071)
---
==First sample analysis==
NOTE: according to the [policy](https://commons.wikimedia.org/wiki/Commons:Deletion_policy#Overview_of_procedures), deletion requests should **not** be filed for speedy deletions. However, a deletion request and a deletion event edit message can specify **different** reasons. For instance, see [this request](https://commons.wikimedia.org/wiki/Commons:Deletion_requests/File:ASDAWARRIORS.jpg) vs. [its deletion event](https://commons.wikimedia.org/wiki/File:ASDAWARRIORS.jpg), where a **regular** request is actually closed as a **speedy** one. This introduces a mix of deletion types, which contradicts the official procedures. Therefore, we limit the analysis to deletion requests and classify them as speedy or regular based solely on their resolution time.
- input: deletion requests dataset as above
- speedy deletion threshold: **7 days**
- % of each deletion class:
  - **38 %** speedy (379)
  - **62 %** regular (621), of which:
    - **62 %** (384) 1 week to 1 month
    - **23 %** (141) 1 to 3 months
    - **15 %** (96) 3+ months
- most commonly reported reasons:
  - the top **speedy** reasons seem related to the [project scope](https://commons.wikimedia.org/wiki/Commons:Project_scope), a very broad topic that encompasses more specific reasons
  - the top **regular** reasons seem related to **copyright violation**, which can break down into more specific ones, typically [freedom of panorama](https://commons.wikimedia.org/wiki/Commons:Freedom_of_panorama) in this case
- correlation between time to close and reasons: TODO
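
The time-based classification above can be sketched as follows. This is an illustration under stated assumptions rather than the code that produced the figures: the day limits for a month (30) and three months (90), the inclusive upper bounds, the function names and the toy dates are all assumptions.

```python
from collections import Counter
from datetime import date

# Bucket labels follow the text; the day limits are assumptions
# (1 month = 30 days, 3 months = 90 days, upper bounds inclusive).
BUCKETS = [
    ("up to 1 week", 7),
    ("1 week to 1 month", 30),
    ("1 to 3 months", 90),
    ("3+ months", None),
]

def resolution_bucket(opened, closed):
    """Map a request's open and close dates to a resolution-time bucket."""
    days = (closed - opened).days
    for label, limit in BUCKETS:
        if limit is None or days <= limit:
            return label

def deletion_class(opened, closed, speedy_days=7):
    """Label a request as speedy or regular from its resolution time alone."""
    return "speedy" if (closed - opened).days <= speedy_days else "regular"

def shares(labels):
    """Return the rounded percentage of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}

# Hypothetical (opened, closed) pairs, for illustration only.
requests = [
    (date(2023, 1, 1), date(2023, 1, 3)),
    (date(2023, 1, 1), date(2023, 1, 8)),
    (date(2023, 1, 1), date(2023, 1, 20)),
    (date(2023, 1, 1), date(2023, 3, 1)),
    (date(2023, 1, 1), date(2023, 6, 1)),
]
class_shares = shares(deletion_class(o, c) for o, c in requests)
bucket_shares = shares(resolution_bucket(o, c) for o, c in requests)
```

Pairing each request's bucket with its reason in a `Counter` would be one way to start on the open correlation question.
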
===Speedy deletion requests===
Dataset at https://docs.google.com/spreadsheets/d/1aajH1XI4Gd5HPjOTDBV3j6hYmJGQIJsz3zaGUqAngew/edit?usp=sharing
{P49611}
{P49612}
{P49615}
===Regular deletion requests===
Dataset at https://docs.google.com/spreadsheets/d/1BT7oFNUHPFrgr65Wo6ZHrcYYqHTL49fkBm6plnIVhNw/edit?usp=sharing
{P49613}
{P49614}
{P49616}
---
==Analysis scale up==
- input: deletion requests dataset **merged** with deleted pages dataset
- total requests: **53 k** (53,021)
- resolution time buckets:
  1. up to 1 week - **38 %** (20,242)
  2. 1 week to 1 month - **37 %** (19,777)
  3. 1 to 3 months - **15 %** (7,936)
  4. 3+ months - **10 %** (5,066)
- top 10 wikilinks **shared** by all buckets:
  - `COM:DW` - [derivative works](https://commons.wikimedia.org/wiki/Commons:Derivative_works)
  - `COM:FOP` - [freedom of panorama](https://commons.wikimedia.org/wiki/Commons:Freedom_of_panorama)
  - `COM:SCOPE` - [project scope](https://commons.wikimedia.org/wiki/Commons:Project_scope)
  - `COM:VRT` - [volunteer response team](https://commons.wikimedia.org/wiki/Commons:Volunteer_Response_Team)
- top 10 wikilinks **unique** to each bucket:
  1. `COM:NOTHOST` - [Commons is not a free Web host](https://commons.wikimedia.org/wiki/Commons:What_Commons_is_not#Wikimedia_Commons_is_not_your_personal_free_web_host)
  2. none
  3. `COM:TOO UK` - [United Kingdom's threshold of originality](https://commons.wikimedia.org/wiki/Commons:Copyright_rules_by_territory/United_Kingdom#Threshold_of_originality)
  4. `COM:PCP` - [precautionary principle](https://commons.wikimedia.org/wiki/Commons:Project_scope/Precautionary_principle)
- top 10 words **shared** by all buckets:
  - `copyright`
  - `uploader` - typically related to either **not own work** or **mistaken uploads**
- top 10 words **unique** to each bucket:
  1. `educational`, `logo`, `personal`, `quality`, `uploaded`
  2. `possible`
  3. `free`, `see`
  4. `author`, `de`, `initially`, `tagged`
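
One possible reading of "shared" and "unique" above is plain set algebra over each bucket's top-N terms: shared terms appear in every bucket's top list, unique terms in exactly one. The sketch below assumes that reading. The whitespace tokenisation, the function names and the toy term sets are illustrative assumptions, not the actual top-10 lists.

```python
from collections import Counter

def top_terms(messages, n=10):
    """Return the n most frequent lower-cased, whitespace-separated terms."""
    counts = Counter(word for m in messages for word in m.lower().split())
    return {word for word, _ in counts.most_common(n)}

def shared_and_unique(top_by_bucket):
    """Split top terms into those shared by all buckets and those unique to one."""
    shared = set.intersection(*top_by_bucket.values())
    unique = {}
    for name, terms in top_by_bucket.items():
        others = set().union(
            *(t for other, t in top_by_bucket.items() if other != name)
        )
        unique[name] = terms - others
    return shared, unique

# Toy top-term sets, for illustration only.
top_by_bucket = {
    "up to 1 week": {"copyright", "uploader", "logo"},
    "1 week to 1 month": {"copyright", "uploader", "possible"},
    "1 to 3 months": {"copyright", "uploader", "free"},
}
shared, unique = shared_and_unique(top_by_bucket)
```

The same two functions apply to wikilinks if each opening reason is first reduced to its list of link targets.
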
===Up to 1 week===
{P49794}
===1 week to 1 month===
{P49795}
===1 to 3 months===
{P49796}
===3+ months===
{P49797}
===Top reasons taxonomy===
NOTE: this is a manual attempt to classify the top reasons that emerged from the analysis above.
- copyright violation
- derivative work
- freedom of panorama
- by country
- threshold of originality
- logo
- Google maps
- album cover
- screenshot
- poster
- banner
- book
- not own work
- non-free license
- inquiry to volunteer response team
- not suitable for work
- not educational
- nudity
- penis
- not a free Web host
- personal use
- unused file
- selfie
- low quality
- deletion requested by the uploader
- mistake
- better version available
- duplicate
- down-scaled
- lower quality
===Viable reasons frequency===
We count how many wikilinks and full opening reason messages contain keywords that are likely to indicate the reasons above.
We focus on reasons that are viable targets for automatic classifiers.
The table is sorted in descending order of full message percentage.
NOTE: wikilink percentages are based on **20 k** (20,294) wikilinks extracted from opening reasons; full message percentages are based on **53 k** total opening reasons.
| **reason** | **wikilink %** | **wikilink total** | **full message %** | **full message total** | **contains** |
| freedom of panorama | 20 | 3,992 | 9 | 4,866 | `fop` or `freedom of panorama` |
| logo | 0.8 | 172 | 5 | 2,507 | `logo` |
| screenshot | 0.09 | 18 | 1.8 | 975 | `screenshot` |
| duplicate | N.A. | N.A. | 1.7 | 918 | `duplicate` |
| album cover | ~0 | 1 | 1.6 | 841 | `album` |
| not suitable for work | 3 | 589 | 1.3 | 702 | `penis` or `vulva` or `vagina` or `nudity` |
| poster | 1 | 216 | 1 | 571 | `poster` |
| book | ~0 | 7 | 0.9 | 475 | `book` |
| banner | ~0 | 2 | 0.3 | 188 | `banner` |
For the sake of completeness, we also report the following reasons:
| **reason** | **wikilink %** | **wikilink total** | **full message %** | **full message total** | **contains** |
| derivative work | 3 | 697 | 2.5 | 1,324 | `dw` or `derivative` |
| not a free Web host | 1 | 264 | 1.4 | 738 | `host` |
| threshold of originality | 2 | 465 | 1.2 | 625 | `too` or `threshold` |
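
The keyword counts in the tables above can be approximated with a sketch like the one below. It assumes case-insensitive substring matching, which is only a guess at what "contains" means here, and a simple regular expression for wikilink targets. The `KEYWORDS` subset, the helper names and the sample reasons are illustrative.

```python
import re

# Captures the target of [[Target]] or [[Target|label]]; an assumed pattern.
WIKILINK = re.compile(r"\[\[([^\]\|#]+)")

# A subset of the "contains" column, for illustration.
KEYWORDS = {
    "freedom of panorama": ["fop", "freedom of panorama"],
    "logo": ["logo"],
    "screenshot": ["screenshot"],
}

def extract_wikilinks(text):
    """Return the wikilink targets found in an opening reason."""
    return [target.strip() for target in WIKILINK.findall(text)]

def keyword_stats(texts, keywords=KEYWORDS):
    """Return {reason: (hits, percentage)} using case-insensitive substring matching."""
    total = len(texts)
    stats = {}
    for reason, needles in keywords.items():
        hits = sum(any(n in t.lower() for n in needles) for t in texts)
        stats[reason] = (hits, round(100 * hits / total, 1) if total else 0.0)
    return stats

# Made-up opening reasons, for illustration only.
reasons = [
    "No [[COM:FOP]] in France",
    "Logo above [[COM:TOO UK|threshold]]",
    "Screenshot of a website",
    "Out of [[COM:SCOPE]]",
]
wikilinks = [link for r in reasons for link in extract_wikilinks(r)]
message_stats = keyword_stats(reasons)
wikilink_stats = keyword_stats(wikilinks)
```

Substring matching is cheap but can overcount: a short keyword such as `too` also matches unrelated words like "cartoon", so word-boundary matching may be worth considering for the shorter keywords.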