[XL] Analysis of deletion requests on Commons
Closed, Resolved (Public)

Description

Overview

We want to categorize the reasons why uploaded media files get deleted, based on an analysis of a sample of filed deletion requests. Once we understand the main reasons and the rough proportion of each deletion type, we can identify the most problematic ones and prioritize improvements that minimize their inflow.

This is part of Design research on Commons. We would first do a programmatic analysis and then ask design research for a qualitative analysis on top.

Useful information about the baselines for uploads and some deletion request ratios can be found in the comments at https://phabricator.wikimedia.org/T337466

Requirements
Step 1: Preliminary analysis

  • Which data can we get about a deletion request? Before proceeding to the sampling and analyses, send an example with all data we can get to Sneha and Alexandra for review and discussion about which data to include in the analysis

Step 2: Analyse a sample
Retrieve a random sample of 1000 deletion requests over the last year and try to categorise based on the following parameters:

  • Type of deletion request (speedy or regular)
  • Time to resolve (less than 1 week, 1 week to 1 month, 1 month to 3 months, 3 months+, haven't been resolved)
  • Reasons - see reasons in this write-up. Implementation note: Reasons for deletion requests should have tags, so can probably use those

Questions we want to answer:
Share/% of each deletion class
What are the reasons most commonly reported within each class?
Is there any correlation between e.g. time to close and specific reasons?

Step 3: We would like to ensure that the analysis is representative and not biased to the latest 1000 deletion requests. As such, we would like to run the same analysis for several historical samples to minimize bias.
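
Steps 2 and 3 can be sketched as drawing one fixed-size random sample per time window. This is only a minimal sketch: the `sample_per_window` helper, the `closed` field name, the toy data, and the seed are hypothetical and not part of the task.

```python
import random
from datetime import date

def sample_per_window(requests, windows, size=1000, seed=0):
    """Draw up to `size` requests at random from each (start, end) window.

    `requests` is a list of dicts with a hypothetical `closed` date field;
    `windows` is a list of (start, end) date pairs, with `end` exclusive.
    """
    rng = random.Random(seed)  # fixed seed keeps the samples reproducible
    samples = {}
    for start, end in windows:
        in_window = [r for r in requests if start <= r["closed"] < end]
        k = min(size, len(in_window))
        samples[(start, end)] = rng.sample(in_window, k)
    return samples

# Toy data: one request closed per day over two years.
first_day = date(2021, 6, 1).toordinal()
requests = [{"id": i, "closed": date.fromordinal(first_day + i)} for i in range(730)]
windows = [
    (date(2021, 6, 1), date(2022, 6, 1)),
    (date(2022, 6, 1), date(2023, 6, 1)),
]
samples = sample_per_window(requests, windows, size=100)
```

Running the same categorisation on each window's sample then shows whether the class shares are stable over time or specific to the latest period.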


Preliminary analysis

Here's a sample of 100 Commons pages that got deleted between 2022-05-01 and 2023-06-01 by non-bot users, CC @AUgolnikova-WMF, @Sneha.
The deletion event edit message (comment_text field) seems like a relevant piece of information that enables the analysis of coarse-grained deletion types/classes and fine-grained reasons.

page_id,revision_id,page_title,comment_text
127315800,722243287,31225-Dresden-1993-Staatliche_Kunstsammlung_Der_Brückenbau-Brück_&_Sohn_Kunstverlag.jpg,No ticket permission since 2 January 2023
126726333,715789419,Naser_Khader_500.jpg,Copyright violation; see [[Commons:Licensing]] ([[COM:CSD#F1|F1]]): Promo/press photo
118374219,659222172,Smart_City_2019.jpg,Media without a license
132457086,768426239,Lakeside_(Standoff_2).png,per [[Commons:Deletion requests/File:Lakeside (Standoff 2).png]]
120636346,675298139,Lerkarskår_-OBM_FS10949-3-_(27628185184).jpg,"Exact or scaled-down duplicate: [[:File:Lerkar, terra sigillata -OBM FS10949-1- (27628177074).jpg]]"
121823206,685279519,Andrzej-konrad-piasecki.jpg,per [[Commons:Deletion requests/File:Andrzej-konrad-piasecki.jpg]]
128823104,733493626,Acoso.png,[[COM:SS|Screenshot]] of non-free content ([[COM:CSD#F3|F3]])
117574960,653073541,Popély_Gyula_választáis_plakát_2022_Podmaniczky_utca.jpg,per [[Commons:Deletion requests/File:Popély Gyula választáis plakát 2022 Podmaniczky utca.jpg]]
121378468,680108089,Kościół_św._Krzysztofa_w_Tuszynie-Lesie.jpg,"[[COM:L|copyright violation]], pictures from random sources on the web"
129408378,738849061,Muoivn.jpg,No permission since 19 March 2023
93604368,710741390,Sean_Reyes_220_8-18-14_(2).jpg,No permission since 29 November 2022
118620866,700667579,Fabien_Roussel_affiche_présidentielle_2022_IMG_6931.jpg,per [[Commons:Deletion requests/File:Fabien Roussel affiche présidentielle 2022 IMG 6931.jpg]]
132503704,768946277,Fort_McMurray_shipyard_(30661480992).jpg,On second thought... G7
128489869,730687678,European_Championships_2022-08-17_Junior_Men_Podium_training_Pommel_horse_(Norman_Seibert)_-_DSC_4110.jpg,per [[Commons:Deletion requests/Files in Category:Pommel horse at 2022 European Championships in Artistic Gymnastics]]
129682573,741611773,Obrázek_streamera_Ninja.jpg,[[Commons:CSD#F10|CSD F10]] ([[Commons:WEBHOST|personal photos]] out of [[COM:SCOPE]]
124570014,699014925,Bataanhermosa.jpg,per [[Commons:Deletion requests/Files uploaded by Ploreky]]
120950585,678360786,Nena_in_2019.jpg,No permission since 28 July 2022
113510820,663421053,COMC_V3_D276_Letter_T.jpg,Exact or scaled-down duplicate: [[:File:COMC V1 D063 Letter T.jpg]]
129441273,739048403,F20230214ES-0007_(52689497003).jpg,per [[Commons:Deletion requests/File:F20230214ES-0007 (52689497003).jpg]]
121485282,680893253,"Baer_Field_Airport,_Fort_Wayne,_Ind._--_Section_of_barracks,_1,000_man_mess_hall,_hanger_and_control_tower,_first_reception_building,_administration_building,_hospital_unit.jpg",per [[Commons:Deletion requests/Files uploaded by Red-tailed hawk]] speedy criteria F8
123728865,694586986,東尾修.jpg,[[Commons:Licensing]]: promo/press photo
131598469,760206703,Хризалид.jpg,[[COM:L|Copyright violation]]: album cover
127389421,722822130,Azure_Way_Museum_28.jpg,[[COM:VRT|No permission]] since 14 January 2023
127903416,726074994,Shizenkan_University.jpg,Copyright violation; see [[Commons:Licensing]] ([[COM:CSD#F1|F1]]): Non-trivial university logo
121321835,679632541,AuroraRF.jpg,"[[COM:NETCOPYRIGHT|Copyright violation]], found elsewhere on the web and unlikely to be own work ([[COM:CSD#F1|F1]])"
125867782,709507665,Philip_Austin_Brooks_Playing_Pro.jpg,No license since 26 November 2022
120173600,676711392,Mazhi_Girlfriend_album_song_jpg.jpg,Copyright violation; see [[Commons:Licensing]] ([[COM:CSD#F1|F1]]): Album cover
122359847,685091446,Salvador_Antonio_Valdés_Mesa.jpg,created by abuser
98420015,669296012,PM_084170_E_Girona.jpg,Exact or scaled-down duplicate: [[:File:Girona PM 084170 E.jpg]]
126381737,715390705,Prince_Luiz_of_Orléans_Braganza_(1938-2022).png,No permission since 16 December 2022
127758898,724997123,Janakinath_Das.jpg,[[COM:WEBHOST|Personal photo]] by non-contributors ([[COM:CSD#F10|F10]])
71800680,698476677,Symphonic_poem_for_clarinet_“harmonie-tellos”.ogg,per [[Commons:Deletion requests/Files of User:Oliver Castaño Mallorca]]
42410812,738712230,Abbildungen_der_in_Deutschland_und_den_angrenzenden_gebieten_vorkommenden_grundformen_der_orchideenarten_(1904)_(16500559517).jpg,Exact or scaled-down duplicate: [[:File:Abbildungen der in Deutschland und den angrenzenden gebieten vorkommenden grundformen der orchideenarten (Pl 32 Gymnadenia con) (6021567131).jpg]]
105101491,709939494,Vega_Schools.jpg,Spam. ([[COM:SPEEDY]])
127550494,734914802,SR13263_(50192919286).jpg,Commons:Deletion requests/Files in Category:West Virginia government's COVID-19 press conference (2020-08-28)
63288558,754449289,Badiyadka_Town.jpg,Exact or scaled-down duplicate: [[:File:Badiyadka Town Night View.jpg]]
118475064,660087398,"Амидов_Обид,_Мансуров_Джалолиддин,_&_Зарпуллаев_Джавохир.pdf",Failed [[COM:LR|license review]]; non-free license ([[COM:CSD#F4|F4]])
128007584,727069496,5-Mirage_F-1_fighter-F-1_جنگنده_میراژ.jpg,[[Commons:Deletion requests/Files uploaded by Seyed haji]]: Clearly not a work by Fars News
119643576,680842520,Mcdan.jpg,per [[COM:SPEEDY]]
127211936,721401348,Catholic_Archdiocese_of_Tabora_coat_of_arms.png,No license since 17 January 2023
128194770,730282795,Fotografia_De_Real_Vh.jpg,per [[Commons:Deletion requests/File:Fotografia De Real Vh.jpg]]
121265917,679092914,BNHR_logo.jpg,Media missing permission as of 30 July 2022
119150035,663549122,Wendell_P._Williams.webp,as per [[COM:SPEEDY]].
118369026,659176777,Chupim_bebybartho3.jpg,No permission since 21 June 2022
121539435,689208184,Erb_Přehořovští_z_Kvasejovic.png,No license since 16 September 2022
122945885,690128404,Mukhtar_musheer.jpg,[[COM:CSD#F10|CSD F10]] (personal photos by non-contributors)
128391395,729957887,"Stokely_""Kwame_Ture""_Carmichael.jpg",[[COM:L|Copyright violation]]: https://www.gettyimages.in/detail/news-photo/black-activist-stokely-carmichael-aka-kwame-ture-nr-his-news-photo/50477422
118953469,662499510,How_do_you_upgrade_your_Fiji_Airways_ticket_to_business_class_Complete_Process!.jpg,content created as advertisement/G10
119687077,668022913,Milanstor.jpg,Content intended as [[COM:V|vandalism]] ([[COM:CSD#G3|G3]]): Mass deletion of pages added by [[Special:Contributions/Piče Vyjebaný k0k0t skurvený mu najebem fakt|Piče Vyjebaný k0k0t skurvený mu najebem fakt]]
129590514,740802300,Greta_Thunberg_(Music_composed_by_elduendesuarez_AMRS_Sgae_60.142).wav,author's request
120333314,673455056,Rishton_pottery_product_13.jpg,per [[Commons:Deletion requests/File:Rishton pottery product 13.jpg]]
122390401,685475091,Eagle_Site.png,No license since 28 August 2022
113513317,693872244,"Briefkopf._""Der_Westfale,_Verlag_und_Druckerei_Kettler_und_Co."".png","per [[Commons:Deletion requests/File:Briefkopf. ""Der Westfale, Verlag und Druckerei Kettler und Co."".png]]"
122435822,685558822,Outdoor_2009_(3820483806).jpg,"Accidental upload. [[User:Tm|Tm]] ([[User talk:Tm|<span class=""signature-talk"">Diskussion</span>]]) 01:28, 29 August 2022 (UTC)"
121927281,683337360,Veroniqueestie.jpg,per [[Commons:Deletion requests/File:Veroniqueestie.jpg]]
121756012,682499362,Centro_Sambil_Maracaibo.jpg,"Copyright violation, see [[Commons:Licensing]]"
113974114,658299556,CEC_Bank.svg,[[Commons:Deletion requests/Files uploaded by Daniel.9]]: Logos that exceed [[:COM:TOO]]. Some are re-uploads (e.g. [[:Commons:Deletion requests/File:Sensiblu logo.svg]]).
120497287,683342226,ၶူးဝႃႈဝုၼ်းၸုမ်ႉလႄႈၶုၼ်ႁေႃၶမ်းမိူင်းပူႇတၢၼ်ႇ.jpg,per [[Commons:Deletion requests/Files uploaded by Seng Ku Hurng]]
117902647,658880277,Kasazi.jpg,[[Commons:CSD#F10|CSD F10]] ([[Commons:WEBHOST|personal photos]] out of [[COM:SCOPE]]
123854178,694862213,Pepsi_mf_burning.jpg,"Missing [[COM:EI|essential information]] such as [[COM:L|license]], [[COM:PERMISSION|permission]] or source ([[COM:CSD#F5|F5]])"
67165475,731241965,Yugoslavia-AirForce-OR-7a_(1943–1947).svg,"wrong insingia based on Yugoslav People Army military law from 1946--Snake bgd (talk) 01:00, 9 February 2023 (UTC)"
126186532,711539340,平林大翔.jpg,[[COM:L|Copyright violation]]: This file not CC. [https://hirabayashihiroto.amebaownd.com/pages/5759010/static Source]
123133165,689667204,محمود_عزت_في_شبابه.png,per [[Commons:Deletion requests/Files uploaded by Ahmed11224]]
117494953,659053741,Chansonnier_école_de_Jean_Pucelle_Manuscrit_Monpellier_JMR.pdf,per [[Commons:Deletion requests/Files uploaded by François Malo-Renault]]
125569856,707150688,Poster-CC0-00-Bible-kabbalah-Torah-love-food-light-work-happy-health-god-gift-magic-universal-World-spirituality-soul-israel-jerusalem-comics-Joy-Hebrew-211Cm-Good.png,per [[Commons:Deletion requests/Files uploaded by Bible-Torah-Kabbalah-Love-Light-World-Peace-Spirituality-Gifts-English-Hebrew-Art]]
114465880,710163611,Md._Rafiq_Azam.jpg,[[COM:WEBHOST|Personal photo]] by non-contributors ([[COM:CSD#F10|F10]])
125085350,704214491,Hanna_Brack.jpg,per [[Commons:Deletion requests/Files uploaded by PeHoGa]]
121640403,681861297,OUTSIDE_VIEW_OF_ANO_LIOSIA_ARENA.jpg,per [[COM:SPEEDY]]
120389737,673947662,玉置_恭一.jpg,[[COM:L|Copyright violation]]: https://www.facebook.com/452620951520344/photos/a.843034059145696/868805056568596/?__tn__=%2CO*F
109143098,675160524,Jonathan_global.jpg,"Missing [[COM:EI|essential information]] such as [[COM:L|license]], [[COM:PERMISSION|permission]] or source ([[COM:CSD#F5|F5]])"
107051247,728676573,Nuovo_Logo_AID.jpg,[[COM:L|Copyright violation]]: from https://www.agenziaindustriedifesa.it/ not in cc4
53641959,726379595,"General_view,_Holyhead,_Wales_LOC_3751638709.jpg","Exact or scaled-down duplicate: [[:File:General view, Holyhead, Wales-LCCN2001703488.jpg]]"
129136821,736757825,チャン.jpg,[[COM:L|Copyright violation]]: source isn't CC.
128867058,735166105,Caodaism_21_54_10_765000.jpeg,Copyright violation; see [[COM:Licensing|Commons:Licensing]] ([[COM:CSD#F1|F1]])
126561183,714548934,Karo_Mhammadi_02.jpg,[[COM:DW|Derivative work]] of non-free content ([[COM:CSD#F3|F3]]) - https://www.instagram.com/p/CXO5pkAs4hA
113949396,710689189,Fractal_wallpaper_3333333333333333332009-03-09.png,per [[Commons:Deletion requests/Files uploaded by Abyssal]]
30277285,743757563,Minuteman_III_ICBM_shoots_out_of_the_silo.jpg,Exact or scaled-down duplicate: [[:File:LGM-30G Minuteman III test launch.jpg]]
127222965,721495829,FB_IMG_16703600611611928Arabic_learning.jpg,No [[COM:Source|source]] specified since 1 January 2023
107691700,720950353,ChriTheGamer_Streamer.jpg,No permission since 29 December 2022
129440384,739095776,Fung_Chih_Chiang.jpg,[[COM:VRT|No permission]] since 9 March 2023
118775559,661699044,EPP_Congress_Rotterdam_-_Day_1_(52113123686).jpg,Exact or scaled-down duplicate: [[:File:EPP Congress Rotterdam - Day 1 (52113098446).jpg]]
129536046,739935038,Hardbody_(1407359965).jpg,Exact or scaled-down duplicate: [[:File:Young woman walking along Memorial Union terrace-19Sept2007.jpg]]
122646813,686610168,Hydra_2.png,DENY and https://commons.wikimedia.org/w/index.php?title=Commons:Administrators%27_noticeboard&oldid=686647697#User:Pedrorexspeculanary
119520631,666789810,Trubnaya_metro_station.jpg,[[COM:CSD#G7|CSD G7]] (author or uploader request deletion)
98446006,683423657,PM_049514_F_Douaumont.jpg,"Exact or scaled-down duplicate: [[:File:Douaumont, Le Memorial de la Tranchée des Baïonettes PM 49514.jpg]]"
119225437,664231574,Football_League_Centenary_Trophy.png,per [[Commons:Deletion requests/Files uploaded by RicardoSilvaRDM]]
119383086,665727077,Saint_Xenia_of_Saint_Peterberg.png,Exact or scaled-down duplicate: [[:File:Saint Xenia of St Petersburg by Alexander Prostev.jpg]]
128956733,734855768,Arona_(Music_composed_by_elduendesuarez_AMRS_Sgae_60.142).wav,author's request
45572282,746667717,Schoolentrance.jpg,"Exact or scaled-down duplicate: [[:File:Indian Language School main entrance, Nigeria.jpg]]"
118477026,660101257,-Friends_Forever_-Life_Long_-Friends_4_life_-Together_-Friends_-Reunion_-Memories.jpg,copyvio or advert or out of scope
124018544,733791595,Silvano_Giganti_1.jpg,per [[Commons:Deletion requests/Files uploaded by Silvano Giganti12]]
129093020,736173358,ELPF_2022_-_eFootball_23_-_São_Paulo_-_SP_-_Brasil.jpg,No permission since 27 February 2023
124845866,701545541,配音員李涵菲.jpg,No permission since 2 November 2022
121769854,696580297,Mitsubishi_Diesel_generator_mobile_emergency_electric_power.jpg,"Copyright violation, see [[Commons:Licensing]]"
131772192,761953189,Garh_Kundar_fort_1st_view.jpg,[[COM:CSD#G4|CSD G4]] (recreation of content previously deleted per community consensus)&#58; [[Commons:Deletion requests/Files uploaded by Aarya.xd]]
118088169,656819645,Kamen_Rider_ZI-O.jpeg,[[Commons:Licensing]]: poster
129139608,736778534,Neetho_Release_Poster.jpg,"Copyright violation, see [[Commons:Licensing]]"
51860483,744576464,P041414AL-0019_(14145646897).jpg,Exact or scaled-down duplicate: [[:File:First Lady hugged family pets.jpg]]
125632719,707578454,New_Arrivals_Booklet_(14051666721).jpg,[[COM:CSD#G7|CSD G7]] (author or uploader request deletion)
130052876,744257128,Topless_model_with_blanket.jpg,per [[Commons:Deletion requests/Files uploaded by 0okm9ijn0987]]
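
The edit messages in the sample above can be split coarsely by looking for two signals: a link to a deletion request page and a speedy deletion criterion code such as F1 or G7. The sketch below is one possible heuristic, not the method used for this task; the `parse_comment` helper and its three labels (`request`, `speedy`, `other`) are hypothetical.

```python
import re

# A deletion request page title, with or without [[...]] around it.
REQUEST = re.compile(r"Commons:Deletion requests/[^\]|]+")
# A speedy deletion criterion code such as F1, F10 or G7.
CSD_CODE = re.compile(r"\b[FG]\d{1,2}\b")

def parse_comment(comment_text):
    """Extract the deletion request title and criterion codes from a message."""
    match = REQUEST.search(comment_text)
    request = match.group(0).strip() if match else None
    # Look for codes outside the request title, which may itself contain
    # file names that resemble a code.
    rest = REQUEST.sub("", comment_text)
    codes = sorted(set(CSD_CODE.findall(rest)))
    if request:
        kind = "request"
    elif codes or "COM:SPEEDY" in comment_text:
        kind = "speedy"
    else:
        kind = "other"  # e.g. "No permission since ...", free-text reasons
    return {"kind": kind, "request": request, "csd": codes}

print(parse_comment("per [[Commons:Deletion requests/File:Acoso.png]]"))
```

Messages that carry both signals, such as the one ending in "speedy criteria F8" above, keep both fields so that mixed cases stay visible.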

Deleted pages dataset

  • interval: 13 months
  • start date: 2022-05-01
  • end date: 2023-06-01
  • total rows: 1.3 M (1,285,839)
  • total deleted revisions: 1.3 M (1,278,527)
  • total deleted pages (counted via page ID): 497 k (497,106)
  • total deleted pages (counted via page title): 489 k (488,890)
  • total distinct deletion edit messages: 154 k (154,017)
  • data lake query:
    SELECT page_id, revision_id, page_title, c.comment_text
    FROM wmf.mediawiki_history mh
    LEFT JOIN wmf_raw.mediawiki_logging l ON mh.page_id = l.log_page
    LEFT JOIN wmf_raw.mediawiki_private_comment c ON l.log_comment_id = c.comment_id
    WHERE mh.snapshot = '2023-05' AND c.snapshot = '2023-05' AND l.snapshot = '2023-05'
      AND event_timestamp >= '2022-05-01' AND event_timestamp < '2023-06-01'
      AND event_entity = 'revision'
      AND event_type = 'create'
      AND log_type = 'delete'
      AND log_action = 'delete'
      -- namespace 6 = File
      AND page_namespace = 6
      AND page_is_redirect IS NULL
      AND page_is_deleted
      AND mh.wiki_db = 'commonswiki' AND c.wiki_db = 'commonswiki' AND l.wiki_db = 'commonswiki'
      -- skip events attributed to bots (current or historical flag)
      AND SIZE(event_user_is_bot_by) <= 0
      AND SIZE(event_user_is_bot_by_historical) <= 0
  • most frequent (> 1000 times) edit messages:
    {
      "[[COM:WEBHOST|Personal photo]] by non-contributors ([[COM:CSD#F10|F10]])": 37774,
      "Copyright violation, see [[Commons:Licensing]]": 29742,
      "per [[Commons:Deletion requests/Files uploaded by Dendermonde vroeger & nu]]": 26670,
      "[[COM:CSD#F10|CSD F10]] (personal photos by non-contributors)": 23522,
      "Commons:Deletion requests/Files in Category:West Virginia government's COVID-19 press conference (2020-08-28)": 18590,
      "Content created as [[COM:ADVERT|advertisement]] ([[COM:CSD#G10|G10]])": 17553,
      "as per [[COM:SPEEDY]].": 15738,
      "[[Commons:CSD#F10|CSD F10]] ([[Commons:WEBHOST|personal photos]] out of [[COM:SCOPE]]": 15089,
      "Copyright violation; see [[COM:Licensing|Commons:Licensing]] ([[COM:CSD#F1|F1]])": 14735,
      "per [[COM:SPEEDY]]": 12302,
      "Missing [[COM:EI|essential information]] such as [[COM:L|license]], [[COM:PERMISSION|permission]] or source ([[COM:CSD#F5|F5]])": 11942,
      "[[COM:NETCOPYRIGHT|Copyright violation]], found elsewhere on the web and unlikely to be own work ([[COM:CSD#F1|F1]])": 11510,
      "Copyright violation; see [[Commons:Licensing]] ([[COM:CSD#F1|F1]])": 10411,
      "[[COM:CSD#G10|CSD G10]] (files and pages created as advertisements)": 8727,
      "[[Commons:Licensing]]: promo/press photo": 8267,
      "[[COM:NETCOPYRIGHT|Copyright violation]], no indication of a [[COM:L|free license]] on the source site ([[COM:CSD#F1|F1]])": 7852,
      "per [[Commons:Deletion requests/Files uploaded by Нео Фициал]]": 7581,
      "Copyright violation; see [[Commons:Licensing]] ([[COM:CSD#F1|F1]]): Non-free logo above [[Commons:Threshold of originality|threshold of originality]]": 6693,
      "Author or uploader requested deletion of recently created, unused content ([[COM:CSD#G7|G7]])": 5792,
      "as per [[COM:NETCOPYVIO]].": 5467,
      "created by abuser": 5399,
      "[[COM:CSD#G7|CSD G7]] (author or uploader request deletion)": 5065,
      "Copyright violation; see [[Commons:Licensing]] ([[COM:CSD#F1|F1]]): Promo/press photo": 4484,
      "per [[Commons:Deletion requests/Files uploaded by Jeromeyuchien]]": 4092,
      "[[Commons:Licensing]]: non-trivial logo": 4022,
      "Uploads from Pexels.com needing speedy deletion": 3596,
      "No license since 17 January 2023": 3536,
      "per [[Commons:Deletion requests/Files uploaded by Mariagata1959]]": 3490,
      "No license since 16 May 2023": 3351,
      "per [[Commons:Deletion requests/Files uploaded by שמחה קבלה]]": 3258,
      "per [[Commons:Deletion requests/Files uploaded by Abyssal]]": 3217,
      "No ticket permission achieved for more than 30 days": 2758,
      "No license since 2 November 2022": 2531,
      "[[COM:SS|Screenshot]] of non-free content ([[COM:CSD#F3|F3]])": 2523,
      "Copyright violation; see [[COM:Licensing|Commons:Licensing]]": 2178,
      "Failed [[COM:LR|license review]]; non-free license ([[COM:CSD#F4|F4]])": 2153,
      "Uploads by Fæ needing speedy deletion": 2150,
      "Nonsense ([[COM:CSD#G1|G1]])": 2086,
      "per [[Commons:Deletion requests/Files uploaded by Hyun616]]": 2051,
      "[[COM:DW|Derivative work]] of non-free content ([[COM:CSD#F3|F3]])": 2019,
      "Copyright violation; see [[Commons:Licensing]] ([[COM:CSD#F1|F1]]): Movie poster": 1960,
      "[[Commons:Licensing]]: poster": 1934,
      "Content intended as [[COM:V|vandalism]] ([[COM:CSD#G3|G3]])": 1861,
      "[[COM:L|Copyright violation]]:": 1790,
      "[[COM:CSD#G2|CSD G2]] (unused and implausible, or broken redirect)": 1770,
      "Media without a license": 1716,
      "[[Commons:Screenshot]]: game": 1664,
      "[[COM:FAIRUSE|Fair use]] material is not permitted on Wikimedia Commons ([[COM:CSD#F2]])": 1587,
      "per [[Commons:Deletion requests/Files uploaded by Free-New-Books]]": 1542,
      "Recreation of content deleted per community consensus ([[COM:CSD#G4|G4]])": 1542,
      "per [[Commons:Deletion requests/Files uploaded by Epicurasian]]": 1536,
      "per [[Commons:Deletion requests/Files uploaded by JND AMD]]": 1533,
      "Insufficient or doubtful author or [[COM:L|license]]; [[COM:VRTS|VRTS validation]] required ([[COM:CSD#F1|F1]])": 1472,
      "Copyright violation; see [[Commons:Licensing]] ([[COM:CSD#F1|F1]]): [[:File:Jakov Milatovic.jpg]]": 1466,
      "No license since 23 October 2022": 1441,
      "No permission since 23 October 2022": 1402,
      "No license since 7 May 2023": 1401,
      "Copyright violation; see [[Commons:Licensing]] ([[COM:CSD#F1|F1]]): Non-trivial logo": 1392,
      "per [[Commons:Deletion requests/Files in Category:Pommel horse at 2022 European Championships in Artistic Gymnastics]]": 1319,
      "[[Commons:Licensing]]: album cover": 1261,
      "Media uploaded without a license as of 2023-03": 1214,
      "per [[Commons:Deletion requests/Files uploaded by Red-tailed hawk]] speedy criteria F8": 1193,
      "author's request": 1168,
      "No permission since 22 October 2022": 1163,
      "[[COM:WEBHOST]]. Personal images by non contributors.": 1147,
      "copyvio or advert or out of scope": 1128,
      "No source since 22 February 2023": 1119,
      "[[COM:L|Copyright violation]]: [[COM:CSD#F1]], Possible copyright violation: No evidence of a free license at the claimed source.": 1117,
      "per [[COM:SPEEDY]].": 1079,
      "[[Commons:Licensing]]: advertisement": 1079,
      "per [[Commons:Deletion requests/Files in Category:Floor exercise at 2022 European Championships in Artistic Gymnastics]]": 1056,
      "per [[Commons:Deletion requests/Files in Category:Parallel bars at 2022 European Championships in Artistic Gymnastics]]": 1019,
      "Exact or scaled-down duplicate ([[COM:CSD#F8|F8]])": 1016,
      "No permission since 21 May 2023": 1008
    }
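
Several of the messages above differ only by the date they carry ("No license since 17 January 2023", "No license since 16 May 2023", and so on), which splits one reason across many keys. A possible normalisation step, sketched here with a hypothetical `normalise` helper and a subset of the counts above, replaces the date with a placeholder before counting.

```python
import re
from collections import Counter

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Dates written as "17 January 2023".
DATE = re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")

def normalise(message):
    """Replace a day-month-year date with a fixed placeholder."""
    return DATE.sub("<date>", message)

# Subset of the frequency table above.
counts = {
    "No license since 17 January 2023": 3536,
    "No license since 16 May 2023": 3351,
    "No license since 2 November 2022": 2531,
    "No license since 23 October 2022": 1441,
    "No license since 7 May 2023": 1401,
    "No permission since 23 October 2022": 1402,
    "No permission since 22 October 2022": 1163,
    "No permission since 21 May 2023": 1008,
    "No source since 22 February 2023": 1119,
}

merged = Counter()
for message, n in counts.items():
    merged[normalise(message)] += n

for message, n in merged.most_common():
    print(n, message)
```

On this subset alone, "No license since <date>" adds up to 12,260 deletions, which would rank it among the most frequent messages in the table.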

Deletion requests dataset

  • input: deletion requests archive
  • interval: 13 months
  • start date: 2022-05-01
  • end date: 2023-06-01
  • total requests closed with a deletion: 68 k (68,071)

First sample analysis

NOTE: according to the policy, deletion requests should not be filed for speedy deletions. However, a deletion request and a deletion event edit message can specify different reasons. For instance, compare this request with its deletion event, where a regular request is actually closed as a speedy one. This mixes deletion types, in contradiction with the official procedures. Therefore, we limit the analysis to deletion requests and classify them as speedy or regular based solely on their resolution time.
  • input: deletion requests dataset as above
  • speedy deletion threshold: 7 days
  • % of each deletion class:
    • 38 % speedy (379)
    • 62 % regular (621), of which:
      • 62 % (384) 1 week to 1 month
      • 23 % (141) 1 to 3 months
      • 15 % (96) 3+ months
  • most commonly reported reasons:
    • the top speedy reasons seem related to the project scope, a very broad topic that encompasses more specific reasons
    • the top regular reasons seem related to copyright violation, which can break down into more specific ones, typically freedom of panorama in this case
  • correlation between time to close and reasons: TODO
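
The time-based classification and the shares above can be sketched as follows. The 7-day threshold comes from this analysis; the 30-day and 90-day boundaries for "1 month" and "3 months" are assumptions, and the `bucket` and `shares` helpers are hypothetical names.

```python
def bucket(days):
    """Map a resolution time in days to a bucket (None means unresolved)."""
    if days is None:
        return "unresolved"
    if days < 7:       # speedy deletion threshold used in this analysis
        return "less than 1 week"
    if days < 30:      # assumed length of one month
        return "1 week to 1 month"
    if days < 90:      # assumed length of three months
        return "1 month to 3 months"
    return "3 months+"

def shares(counts):
    """Return the rounded percentage of each class."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}

# Counts reported for the first sample of 1,000 requests.
print(shares({"speedy": 379, "regular": 621}))
print(shares({"1 week to 1 month": 384, "1 to 3 months": 141, "3+ months": 96}))
```

The second call computes shares within the regular class only, which is why those percentages are relative to 621 rather than to 1,000.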

Speedy deletion requests

Dataset at https://docs.google.com/spreadsheets/d/1aajH1XI4Gd5HPjOTDBV3j6hYmJGQIJsz3zaGUqAngew/edit?usp=sharing

{
  "[[COM:FOP|FoP]]": 15,
  "[[COM:FOP|freedom of panorama]]": 13,
  "[[COM:SCOPE]]": 9,
  "[[Commons:Project scope]]": 8,
  "[[COM:PENIS]]": 6,
  "[[COM:WEBHOST]]": 5,
  "[[COM:DW|Derivative work]]": 4,
  "[[COM:SCOPE|project scope]]": 3,
  "[[Commons:Project scope|Out of scope]]": 3,
  "[[Commons:Nudity#New uploads]]": 3
}

{
  "scope": 54,
  "unused": 47,
  "copyright": 41,
  "copyvio": 25,
  "quality": 24,
  "personal": 24,
  "educational": 23,
  "uploaded": 22,
  "license": 21,
  "useful": 19
}

Cluster 0: license free commons per upload graphic deleted non united pd
Cluster 1: scope unused personal use diagram text project fantasy article flag
Cluster 2: uploaded mistake wrong better already version accidentally accident designer picture
Cluster 3: unused quality useful low personal unlikely svg logo promotional trivial
Cluster 4: copyvio possible googlemaps source nowiki googleearth screenshot photos stated cover
Cluster 5: copyright violation united states 2d uploader evidence free taken deleted
Cluster 6: educational value scope unused artwork use purpose personal without self
Cluster 7: copyrighted initially tagged ukraine france banner south korea painting license
Cluster 8: right publish bottom incorrect chemical structure left watermark us carbons
Cluster 9: facebook redirect broken claimed building uploader obsolete modern per incorrectly
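
The wikilink counts for this sample treat links with the same target but different labels as separate entries, for example `[[COM:FOP|FoP]]` and `[[COM:FOP|freedom of panorama]]`. The sketch below merges such entries by link target; the `link_target` helper is hypothetical and is not necessarily how the counts in this task were produced.

```python
import re
from collections import Counter

# [[target]] or [[target|label]], with an optional leading colon.
WIKILINK = re.compile(r"\[\[:?([^\]|]+)(?:\|[^\]]*)?\]\]")

def link_target(wikilink):
    """Return the target of a wikilink, dropping the display label."""
    match = WIKILINK.fullmatch(wikilink)
    return match.group(1).strip() if match else wikilink

# Wikilink counts reported for the speedy sample.
speedy_links = {
    "[[COM:FOP|FoP]]": 15,
    "[[COM:FOP|freedom of panorama]]": 13,
    "[[COM:SCOPE]]": 9,
    "[[Commons:Project scope]]": 8,
    "[[COM:PENIS]]": 6,
    "[[COM:WEBHOST]]": 5,
    "[[COM:DW|Derivative work]]": 4,
    "[[COM:SCOPE|project scope]]": 3,
    "[[Commons:Project scope|Out of scope]]": 3,
    "[[Commons:Nudity#New uploads]]": 3,
}

by_target = Counter()
for link, n in speedy_links.items():
    by_target[link_target(link)] += n

print(by_target.most_common(3))
```

Shortcut aliases such as COM:SCOPE and Commons:Project scope would still be counted separately; resolving them needs a redirect lookup, which this sketch leaves out.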

Regular deletion requests

Dataset at https://docs.google.com/spreadsheets/d/1BT7oFNUHPFrgr65Wo6ZHrcYYqHTL49fkBm6plnIVhNw/edit?usp=sharing

{
  "[[COM:FOP|freedom of panorama]]": 18,
  "[[COM:FOP|FoP]]": 10,
  "[[COM:SCOPE]]": 7,
  "[[COM:POSTER]]": 5,
  "[[Special:Contributions/61.120.241.1|61.120.241.1]]": 4,
  "[[COM:VRT]]": 4,
  "[[Commons:Nudity#New uploads]]": 4,
  "[[Commons:Project scope|project scope]]": 3,
  "[[COM:TOO]]": 3,
  "[[Commons:Project scope]]": 3
}

{
  "copyright": 101,
  "license": 43,
  "uploader": 41,
  "commons": 37,
  "tagged": 36,
  "source": 36,
  "initially": 35,
  "violation": 35,
  "scope": 35,
  "copyvio": 33
}

Cluster 0: license evidence cc free source logo sa released change copyright
Cluster 1: copyright violation still author died permission uploader domain artist public
Cluster 2: tagged initially mistaken apologies accept location clearly upload please mistake
Cluster 3: scope personal unused logo project value educational also used quality
Cluster 4: copyvio possible cover album artist copyviol logo according internet tineye
Cluster 5: quality low can high structure chemical poor bad replaced non
Cluster 6: wikipedia org title en index php article vandalism see oldid
Cluster 7: ist bild kein eigenes werk dieses aber sein es der
Cluster 8: copyrighted unused uploader commons source non uploaded permission use author
Cluster 9: united states 2d emirates arab fop 3d japan artwork graphic


Analysis scale up

  • input: deletion requests dataset merged with deleted pages dataset
  • total requests: 53 k (53,021)
  • resolution time buckets:
    1. up to 1 week - 38 % (20,242)
    2. 1 week to 1 month - 37 % (19,777)
    3. 1 to 3 months - 15 % (7,936)
    4. 3+ months - 10 % (5,066)
  • top 10 wikilinks shared by all buckets:
    • COM:FOP
    • Commons:Project scope
    • COM:SCOPE
    • COM:DW
    • COM:VRT
  • top 10 wikilinks unique to each bucket:
    1. COM:NOTHOST - Commons is not a free Web host
    2. none
    3. COM:TOO UK - United Kingdom's threshold of originality
    4. COM:PCP - precautionary principle
  • top 10 words shared by all buckets:
    • copyright
    • uploader - typically related to either not own work or mistaken uploads
  • top 10 words unique to each bucket:
    1. educational, logo, personal, quality, uploaded
    2. possible
    3. free, see
    4. author, de, initially, tagged

Up to 1 week

1) wikilinks
{
  "COM:FOP": 1184,
  "Commons:Project scope": 794,
  "COM:SCOPE": 481,
  "COM:DW": 234,
  "COM:PENIS": 181,
  "COM:VRT": 145,
  "COM:PS": 130,
  "Commons:Nudity#New uploads": 122,
  "COM:NOTHOST": 100,
  "COM:TOO": 93
}

2) words
{
  "scope": 2714,
  "copyright": 1948,
  "unused": 1862,
  "uploaded": 1263,
  "personal": 1213,
  "copyvio": 1114,
  "uploader": 1003,
  "logo": 983,
  "quality": 898,
  "educational": 897
}

3) clusters
Cluster 0: license source free cc evidence thus invalid copyright pd us
Cluster 1: scope project personal text educational used content value non map
Cluster 2: copyright violation uploader see still source metadata infringement holder permission
Cluster 3: unused personal scope educational value non logo notable private project
Cluster 4: uploaded mistake accidentally wrong version wikipedia already wikimedia better use
Cluster 5: copyvio possible googlemaps cover album uploader picture logo exif screenshot
Cluster 6: de united la 2d states que arab emirates por ser
Cluster 7: quality useful low unused unlikely svg logo trivial educationally poor
Cluster 8: uploader person per used permission use source non see upload
Cluster 9: copyrighted tagged initially banner logo artwork poster still screenshot probably

1 week to 1 month

1) wikilinks
{
  "COM:FOP": 931,
  "Commons:Project scope": 466,
  "COM:SCOPE": 410,
  "COM:DW": 225,
  "COM:VRT": 192,
  "COM:TOYS": 104,
  "Commons:Nudity#New uploads": 96,
  "COM:PS": 90,
  "COM:POSTER": 80,
  "COM:PENIS": 77
}

2) words
{
  "copyright": 2966,
  "scope": 1542,
  "copyvio": 1464,
  "unused": 1320,
  "license": 1282,
  "uploader": 1259,
  "violation": 1084,
  "source": 1033,
  "copyrighted": 1014,
  "possible": 923
}

3) clusters
Cluster 0: de uploaded per permission duplicate united see author use can
Cluster 1: copyvio possible cover album logo googlemaps picture uploader author artist
Cluster 2: unused personal scope educational logo value useful private svg trivial
Cluster 3: copyrighted banner logo still materials likely per usa screen poster
Cluster 4: scope project unused personal text educational content better self used
Cluster 5: uploader copyright author unlikely permission metadata photographer deleted created original
Cluster 6: license source cc free change evidence pd copyright invalid thus
Cluster 7: copyright violation tagged initially died still artist status source see
Cluster 8: quality low useful resolution unlikely poor better unused exif high
Cluster 9: non notable free unused person logo scope private project personal

1 to 3 months

1) wikilinks
{
  "COM:FOP": 708,
  "COM:VRT": 110,
  "Commons:Project scope": 95,
  "COM:SCOPE": 93,
  "COM:DW": 88,
  "COM:TOYS": 59,
  "COM:POSTER": 43,
  ":COM:VRT": 42,
  "COM:TOO": 26,
  "COM:TOO UK": 25
}

2) words
{
  "copyright": 1252,
  "license": 630,
  "uploader": 600,
  "source": 550,
  "permission": 500,
  "copyrighted": 460,
  "copyvio": 430,
  "violation": 400,
  "see": 388,
  "free": 387
}

3) clusters
Cluster 0: permission uploaded modern building mistake author source uae needed user
Cluster 1: uploader permission copyright photographer original subject author copyvio evidence license
Cluster 2: copyright violation see still holder permission source died years status
Cluster 3: per non free author fop freedom panorama delete description original
Cluster 4: tagged initially user procedural also permission br nomination nom span
Cluster 5: copyrighted united arab emirates disney banner book logo images derivative
Cluster 6: copyvio de 2d version united pd fop possible use graphic
Cluster 7: license source duplicate free evidence cc pd published author unknown
Cluster 8: scope quality low project useful used better poor resolution unlikely
Cluster 9: unused logo useful svg trivial personal scope non notable copyvio

3+ months

1) wikilinks
{
    "COM:VRT": 65,
    "COM:FOP": 41,
    "COM:DW": 38,
    "COM:TOO": 35,
    ":pt:Usuário:Érico/Commons": 26,
    "COM:PCP": 24,
    "COM:SCOPE": 24,
    "Special:Contributions/2A01:CB00:2A5:B300:C83A:A18E:DD31:7C94": 24,
    "Commons:Project scope": 18,
    "Special:Contributions/220.246.140.228": 17
}

2) words
{
    "copyright": 922,
    "tagged": 449,
    "initially": 432,
    "license": 422,
    "source": 401,
    "uploader": 356,
    "de": 346,
    "copyrighted": 293,
    "permission": 289,
    "author": 268
}

3) clusters
Cluster 0: scope quality unused low personal educational value poor resolution text
Cluster 1: copyvio logo possible copyrighted threshold free likely originality screenshot probably
Cluster 2: pd evidence us died publication source copyright years enough date
Cluster 3: copyrighted permission uploader uploaded per see use taken author fop
Cluster 4: tagged initially published user delete procedural nomination another please krg
Cluster 5: copyright violation still uploader source years author see holder artist
Cluster 6: es foto der keine nicht ist eigenes werk kein urheber
Cluster 7: de la en des le du auteur une est il
Cluster 8: duplicate quality jpg recording wikimedia wiki pronunciation low also version
Cluster 9: license source cc free author permission evidence uploader wrong sa
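
For reference, the time-to-resolve grouping used in the sections above could be sketched roughly as follows. This is not the script used for the analysis: the function name `resolution_bucket` is hypothetical, and the 7/30/90-day boundaries are an assumption about how "1 week", "1 month" and "3 months" are measured.

```python
from datetime import datetime, timedelta
from typing import Optional


def resolution_bucket(opened: datetime, closed: Optional[datetime]) -> str:
    """Map a deletion request to a time-to-resolve bucket.

    Assumption: a month is approximated as 30 days.
    """
    if closed is None:
        return "unresolved"
    elapsed = closed - opened
    if elapsed < timedelta(days=7):
        return "less than 1 week"
    if elapsed < timedelta(days=30):
        return "1 week to 1 month"
    if elapsed < timedelta(days=90):
        return "1 to 3 months"
    return "3+ months"
```

Each bucket's requests can then be fed separately to the wikilinks, words and clusters steps.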

Top reasons taxonomy

NOTE: this is a manually built attempt to classify top reasons as emerged from the analysis above.
  • copyright violation
    • derivative work
    • freedom of panorama
      • by country
    • threshold of originality
      • logo
      • Google maps
      • album cover
      • screenshot
      • poster
      • banner
      • book
    • not own work
    • non-free license
    • inquiry to volunteer response team
  • not suitable for work
    • not educational
    • nudity
    • penis
  • not a free Web host
    • personal use
    • unused file
    • selfie
    • low quality
  • deletion requested by the uploader
    • mistake
    • better version available
  • duplicate
    • down-scaled
    • lower quality

Viable reasons frequency

We count how many wikilinks and how many full opening reason messages contain keywords that are likely to signal the reasons above.
The focus is on reasons that are viable targets for automatic classifiers.
The table is sorted in descending order of full message percentage.

NOTE: wikilink percentages are based on 20k (20,294) wikilinks extracted from opening reasons; full message percentages are based on 53k total opening reasons.

| reason | wikilink % | wikilink total | full message % | full message total | contains |
| --- | --- | --- | --- | --- | --- |
| freedom of panorama | 20 | 3,992 | 9 | 4,866 | fop or freedom of panorama |
| logo | 0.8 | 172 | 5 | 2,507 | logo |
| screenshot | 0.09 | 18 | 1.8 | 975 | screenshot |
| duplicate | N.A. | N.A. | 1.7 | 918 | duplicate |
| album cover | ~0 | 1 | 1.6 | 841 | album |
| not suitable for work | 3 | 589 | 1.3 | 702 | penis or vulva or vagina or nudity |
| poster | 1 | 216 | 1 | 571 | poster |
| book | ~0 | 7 | 0.9 | 475 | book |
| banner | ~0 | 2 | 0.3 | 188 | banner |

For the sake of completeness, we also report the following reasons:

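| reason | wikilink % | wikilink total | full message % | full message total | contains |
| --- | --- | --- | --- | --- | --- |
| derivative work | 3 | 697 | 2.5 | 1,324 | dw or derivative |
| not a free Web host | 1 | 264 | 1.4 | 738 | host |
| threshold of originality | 2 | 465 | 1.2 | 625 | too or threshold |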
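
The keyword counting behind these tables could look roughly like the sketch below. It is an illustration only: `REASON_KEYWORDS`, `reason_frequencies` and the sample messages are made up, and it assumes "contains" means a case-insensitive substring match, which may not be how the original script matched (a plain substring such as "too" would also hit unrelated words like "tool").

```python
# Hypothetical keyword map, loosely following the "contains" column.
REASON_KEYWORDS = {
    "freedom of panorama": ["fop", "freedom of panorama"],
    "logo": ["logo"],
}


def reason_frequencies(messages, keywords_by_reason):
    """Return (reason, hits, percentage) rows, sorted by hits, descending.

    Assumption: a message counts for a reason when it contains any of
    the reason's keywords as a case-insensitive substring.
    """
    total = len(messages)
    rows = []
    for reason, keywords in keywords_by_reason.items():
        hits = sum(
            1 for message in messages
            if any(keyword in message.lower() for keyword in keywords)
        )
        percentage = round(100 * hits / total, 1) if total else 0.0
        rows.append((reason, hits, percentage))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Made-up opening reasons, for illustration only.
sample_messages = [
    "No FOP in the UAE",
    "Copyrighted logo, above COM:TOO",
    "Out of scope",
    "Logo of a non-notable company",
]
rows = reason_frequencies(sample_messages, REASON_KEYWORDS)
```

The same function works for wikilinks by passing the list of extracted link targets instead of full messages.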

Deletion requests for multiple files

We run the analysis over an extended dataset that includes deletion requests for multiple files.

Top 10 opening reasons

1) wikilinks
{
    "Commons:Project scope": 10256,
    "COM:FOP": 5680,
    "COM:SCOPE": 3423,
    "COM:DW": 1792,
    "COM:NOTHOST": 1300,
    "COM:VRT": 1217,
    "COM:OTRS": 934,
    "COM:WEBHOST": 875,
    "COM:TOO": 708,
    "COM:TOYS": 661,
}

2) words
{
    "copyright": 22789,
    "scope": 18008,
    "unused": 17085,
    "educational": 11116,
    "logo": 10569,
    "uploader": 10218,
    "use": 9404,
    "license": 9256,
    "personal": 8915,
    "source": 8064,
}

3) clusters
Cluster 0: uploader per unlikely exif small images data missing web resolutions
Cluster 1: scope unused personal educational value project useful logo svg notability
Cluster 2: copyrighted per logo banner still likely artwork copyright images book
Cluster 3: copyright violation uploader see infringement still holder source status permission
Cluster 4: license source free non cc evidence author permission pd website
Cluster 5: used private album personal drawing page wikipedia article self wiki
Cluster 6: de es la en trabajo propio wikipedia claimed logos needed
Cluster 7: quality low resolution bad poor unused unlikely useful better scope
Cluster 8: tagged initially uploaded use logo wikimedia permission educational pd fop
Cluster 9: copyvio possible googlemaps cover album uploader logo picture screenshot author
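
A "wikilinks" top 10 like the one above could be produced along these lines. This is a minimal sketch, not the original code: the regular expression and the `top_wikilinks` helper are assumptions, and the sample reasons are made up.

```python
import re
from collections import Counter

# Matches [[Target]] and [[Target|label]]; only the target is captured.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def top_wikilinks(reasons, n=10):
    """Count wikilink targets across opening reasons, most common first."""
    counts = Counter()
    for text in reasons:
        for target in WIKILINK.findall(text):
            counts[target.strip()] += 1
    return counts.most_common(n)


# Made-up opening reasons, for illustration only.
sample_reasons = [
    "No [[COM:FOP]] in the UAE, see [[COM:FOP UAE|here]]",
    "Out of [[Commons:Project scope]]",
    "[[COM:FOP]] again",
]
top = top_wikilinks(sample_reasons, n=1)
```

Targets are counted verbatim here, which matches the listings above, where "Commons:Project scope" and "COM:SCOPE" appear as separate entries. Merging such aliases would need an extra normalisation step.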

Viable reasons frequency

  • total wikilinks: 73k (73,167)
  • total opening reason messages: 170k (170,448)
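| reason | wikilink % | wikilink total | full message % | full message total | contains |
| --- | --- | --- | --- | --- | --- |
| freedom of panorama | 14 | 10,405 | 8.1 | 13,846 | fop or freedom of panorama |
| logo | 1 | 732 | 7.2 | 12,399 | logo |
| book | ~0 | 63 | 2.4 | 4,008 | book |
| screenshot | ~0 | 74 | 1.9 | 3,303 | screenshot |
| album cover | ~0 | 30 | 1.9 | 3,182 | album |
| duplicate | ~0 | 2 | 1.5 | 2,512 | duplicate |
| poster | 0.4 | 293 | 0.8 | 1,327 | poster |
| not suitable for work | 1.3 | 973 | 0.7 | 1,216 | penis or vulva or vagina or nudity |
| banner | ~0 | 9 | 0.2 | 332 | banner |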

Conclusion

  • The analysis is quite consistent with the previous dataset:
    • freedom of panorama still ranks first, despite being slightly less represented (-0.9 percentage points of full messages)
    • logo still ranks second and gains about 2 percentage points
    • book now ranks third
  • deletion requests for multiple files may be very large, e.g., this one accounts for 57k files
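
The rank and share changes summarised in this conclusion can be re-derived from the two "full message %" columns reported earlier. The sketch below only copies those published percentages; the `rank` and `compare` helpers are hypothetical names, not part of the original analysis.

```python
# "full message %" values copied from the two tables reported earlier.
single_file = {
    "freedom of panorama": 9.0, "logo": 5.0, "screenshot": 1.8,
    "duplicate": 1.7, "album cover": 1.6, "not suitable for work": 1.3,
    "poster": 1.0, "book": 0.9, "banner": 0.3,
}
multiple_files = {
    "freedom of panorama": 8.1, "logo": 7.2, "book": 2.4,
    "screenshot": 1.9, "album cover": 1.9, "duplicate": 1.5,
    "poster": 0.8, "not suitable for work": 0.7, "banner": 0.2,
}


def rank(table):
    """Rank reasons by percentage, 1 = most frequent (ties keep table order)."""
    ordered = sorted(table, key=table.get, reverse=True)
    return {reason: position + 1 for position, reason in enumerate(ordered)}


def compare(before, after):
    """Return {reason: (rank before, rank after, change in percentage points)}."""
    rank_before, rank_after = rank(before), rank(after)
    return {
        reason: (
            rank_before[reason],
            rank_after[reason],
            round(after[reason] - before[reason], 1),
        )
        for reason in before
    }


changes = compare(single_file, multiple_files)
```

Because the inputs are already rounded, the differences are only indicative.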

Event Timeline

Hi @AUgolnikova-WMF, can you please associate one or more active project tags with this task (via the Add Action...Change Project Tags dropdown)? That will allow others to see the task when looking at project workboards or searching for tasks in certain projects, and to get notified about it when watching a related project tag. Thanks!

Cparle renamed this task from Analysis of deletion requests on Commons to [XL] Analysis of deletion requests on Commons. Jun 28 2023, 4:37 PM
Cparle updated the task description.
mfossati changed the task status from Open to In Progress. Jul 4 2023, 8:32 AM
mfossati claimed this task.

Hi, I am answering this because I believe a workmate of mine was pinged about it, but I was out of office. Deletion information is sadly in a really bad state due to a big pending refactoring of file metadata (T28741). It is very prone to bugs, lost metadata and other issues, usually noted by Commons users.

Because of work done on backing up media files, I did a bit of archaeology on the code and database structure, but it is too complex to describe here; I can provide documents or join a call with you for details. Other than the logs (which won't be complete), the data of deleted images is distributed between the filearchive and oldimage tables, and won't contain the full lifecycle. In addition, hard-deleted files weren't tracked in the backups metadata until 2022. Even in those cases, image metadata in general has thousands of incorrect or never-existing references to images (around 20-30K of the 100 million).

Regarding access tips, I will let the DBAs comment on that, but feel free to contact me by email or phabricator if you want more info.

Hi, For doing complex and long-running db queries in production, you can use analytics replicas: https://wikitech.wikimedia.org/wiki/Analytics/Systems/MariaDB

Thanks @jcrespo and @Ladsgroup for your feedback.

> Other than the logs (which won't be complete), the data of deleted images is distributed between the filearchive and oldimage tables

I'm looking for actual files, not metadata. Those tables only contain metadata, right? Or am I missing something?
For instance, given the deleted fa_name = Fic!geno.jpg (fa_sha1 = 0c73qbwufaiywkelxhlkdtryq0t166k1) from filearchive, the goal is to retrieve the actual file, I guess from the Swift object storage.

> Hi, For doing complex and long-running db queries in production, you can use analytics replicas: https://wikitech.wikimedia.org/wiki/Analytics/Systems/MariaDB

Sure thing, that's what I'm using.

> I'm looking for actual files, not metadata. Those tables only contain metadata, right? Or am I missing something?

That's right, but based on the text of this task I understood you were primarily interested in their properties and deletion reasons. In any case, to retrieve the files you will need to know their container and path, which are in the metadata. Just having the title at deletion time is not enough, as files can be restored, moved and deleted multiple times (there is no single unique identifier for each file's data). As for downloading: although those files are not private data, they are considered non-public and should never leave the production network (and be careful to delete any local copies after analysis). You will have to use MediaWiki for the download.

Alternatively, we can lend you resources from the backup cluster, as it has saner and less broken metadata and storage logic, if that simplifies things (e.g. you can search files by sha1 hash, title, etc.).

> I'm looking for actual files, not metadata. Those tables only contain metadata, right? Or am I missing something?

> That's right, but I understood you were primarily interested in its properties and deletion reasons based on the text in this task.

Yeah, sorry, this ticket is quite big. I should have referenced T339224: Retrieve actual data from Commons deletion requests instead.

> Alternatively, we can lend you resources from the backup cluster, as it has saner and less broken metadata and storage logic, if that simplifies things (e.g. you can search files by sha1 hash, title, etc.).

That would be great, indeed. Please let me know the next steps I have to take. Maybe create a request ticket?

> That would be great, indeed. Please let me know the next steps I have to take. Maybe create a request ticket?

Please do. On the ticket you last referenced, you mention: "The deletion requests sample analysis would greatly benefit from actual data, not just deletion logs". That would not be enough to approve an SRE-Access-Requests ticket, which this would be equivalent to. Please note you don't have to explain the analysis you are doing, only the technical details (e.g. a completely made-up example: I will be downloading 2 million files with a concurrency of 4, run a script that counts the number of times the red color is used in an image, and then delete them). In the case of deleted files, someone (not sure who: security, wiki admins, T&S) should also have the final say on the actual limits of what exactly can be made public about them.

Apologies that the process is not smooth or easy, but backup access in particular is not something we have done before: it is the most inaccessible part of the cluster, with no existing automation or procedures beyond pure root/recovery access. So, depending on what you need, there may be some things that we either have to do for you (for example, the easiest way for you would be: you give us a list of files, we set up a host, disk or VM where you can analyze them on your own, we recover the files for you, and then you do your analysis on that host) or implement special access for.

So the more info you can give us, the better we can serve you. 0:-) Please create such a ticket (for transparency purposes), add me as CC and say we are in contact to see how to proceed (sometimes others can also chime in on the best ways to solve a need).

I forgot: if you can also provide a schedule for when that access is needed, that would be ideal (feel free to be generous in your timeline, e.g. plan for delays by making it longer than expected).

Moving to design QA for further discussion before closing.

Pasting the latest analysis here for convenience.

Viable reasons frequency

  • total wikilinks: 73k (73,167)
  • total opening reason messages: 170k (170,448)
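| reason | wikilink % | wikilink total | full message % | full message total | contains |
| --- | --- | --- | --- | --- | --- |
| freedom of panorama | 14 | 10,405 | 8.1 | 13,846 | fop or freedom of panorama |
| logo | 1 | 732 | 7.2 | 12,399 | logo |
| book | ~0 | 63 | 2.4 | 4,008 | book |
| screenshot | ~0 | 74 | 1.9 | 3,303 | screenshot |
| album cover | ~0 | 30 | 1.9 | 3,182 | album |
| duplicate | ~0 | 2 | 1.5 | 2,512 | duplicate |
| poster | 0.4 | 293 | 0.8 | 1,327 | poster |
| not suitable for work | 1.3 | 973 | 0.7 | 1,216 | penis or vulva or vagina or nudity |
| banner | ~0 | 9 | 0.2 | 332 | banner |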

Conclusion

  • The analysis is quite consistent with the previous dataset:
    • freedom of panorama still ranks first, despite being slightly less represented (-0.9 percentage points of full messages)
    • logo still ranks second and gains about 2 percentage points
    • book now ranks third
  • deletion requests for multiple files may be very large, e.g., this one accounts for 57k files