
[Spike] Evaluate text in title that is likely to result in a deletion request
Closed, Resolved (Public)

Description

Certain text, when used in a filename or description, suggests that the uploaded file is more likely to be deleted.

Evaluate whether searching for such text is a useful way to flag uploads.

Suggested text to search for

Lego

Penis
Scrotum
Erection
Buttocks
Cock ring
Coitus
Penile
Nude
Vagina
Masturbation
Vulva
Testicle

logo

Selfie

Louvre pyramid
Burj khalifa
Eiffel Tower night
Atomium
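
To make the idea concrete, here is a minimal sketch of what flagging by filename could look like. It is not taken from the notebook: the group names, the `normalise` step (lowercasing, underscores treated as spaces) and the function names are all assumptions for illustration, and only a few of the suggested terms are included.

```python
# Hypothetical term groups built from part of the suggested list above.
# The group names are illustrative, not from the task or the notebook.
SUGGESTED_TERMS = {
    "lego": ["lego"],
    "logo": ["logo"],
    "selfie": ["selfie"],
    "famous_buildings": [
        "louvre pyramid",
        "burj khalifa",
        "eiffel tower night",
        "atomium",
    ],
}


def normalise(filename: str) -> str:
    """Lowercase and treat underscores as spaces (an assumed normalisation)."""
    return filename.lower().replace("_", " ")


def matching_groups(filename: str) -> list[str]:
    """Return the term groups with at least one term contained in the filename."""
    name = normalise(filename)
    return [
        group
        for group, terms in SUGGESTED_TERMS.items()
        if any(term in name for term in terms)
    ]


print(matching_groups("Burj_Khalifa_at_dusk.jpg"))
print(matching_groups("Company_Logo.png"))
```

Plain substring matching is the simplest reading of "filename containing", but it also matches unrelated words (for example "catalogo" contains "logo"), and a multi-word term such as "eiffel tower night" only matches when the words appear in exactly that order.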

Event Timeline

I did some preliminary work on this while finishing up T344060. Results below. Note this is only for filenames; it's a bit more work to extract descriptions.

Uploads in 2022

379015 unique filenames uploaded in 2022 have been deleted

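| | Total | Deleted | % deletion rate | % of total deletions |
| --- | --- | --- | --- | --- |
| Filename containing "logo" | 45105 | 10033 | 22 | 2.6 |
| Filename containing "lego" | 202 | 50 | 25 | <0.1 |
| Filename containing "selfie" | 279 | 103 | 37 | <0.1 |
| Filename containing "nude" | 747 | 179 | 24 | <0.1 |
| Filename containing sexual words | 1667 | 685 | 41 | <0.1 |
| Filename containing famous buildings above | 22100 (digits fused in source, column split unclear) | | | <0.1 |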
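
For reference, the two percentage columns can be reproduced from the counts. This is a minimal sketch, assuming "% deletion rate" is deleted divided by total for that term and "% of total deletions" is deleted divided by the 379015 deleted filenames from 2022; the function name is hypothetical and not from the notebook.

```python
TOTAL_DELETED_2022 = 379015  # unique filenames uploaded in 2022 that were deleted


def deletion_stats(total: int, deleted: int) -> tuple[float, float]:
    """Return (% deletion rate, % of total deletions) for one search term."""
    rate = 100 * deleted / total
    share = 100 * deleted / TOTAL_DELETED_2022
    return rate, share


# Counts for filenames containing "logo", taken from the table above.
rate, share = deletion_stats(total=45105, deleted=10033)
print(f"{rate:.0f}% deletion rate, {share:.1f}% of total deletions")
# prints "22% deletion rate, 2.6% of total deletions"
```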

The notebook used to gather these numbers is here: https://gitlab.wikimedia.org/cparle/notebooks/-/blob/main/T347963.ipynb

The table below is also from the notebook. There are 46106 deletions (out of 379015, so 12%) where the filename contains a word that

  • has a >20% deletion rate
  • is contained in the filename of >1000 deleted files

... which suggests to me this is worth investigating more deeply

| word | total_filenames_containing_word | deleted_filenames_containing_word | deletion_rate | proportion_of_deletions |
| --- | --- | --- | --- | --- |
| png | 380054 | 62864 | 0.16540807358954254 | 0.1658615094389404 |
| colorado | 36529 | 11291 | 0.3090968819294259 | 0.029790377689537354 |
| logo | 45105 | 10033 | 0.22243653696929386 | 0.026471247839795257 |
| webp | 14778 | 7803 | 0.5280146163215591 | 0.020587575689616507 |
| report | 26742 | 5901 | 0.2206641238501234 | 0.015569304644934896 |
| annual | 22890 | 4915 | 0.2147225862822193 | 0.012967824492434336 |
| work | 14291 | 3320 | 0.23231404380379259 | 0.008759547775154017 |
| flag | 21988 | 3038 | 0.13816627251227942 | 0.008015513897866838 |
| twemoji13 | 3580 | 3014 | 0.8418994413407821 | 0.007952191865757291 |
| gif | 15361 | 2676 | 0.17420740837185078 | 0.007060406580214503 |
| leaves | 8593 | 2496 | 0.29046898638426627 | 0.0065854913393929 |
| photo | 18750 | 2185 | 0.11653333333333334 | 0.005764943339973352 |
| nr | 10034 | 2123 | 0.21158062587203508 | 0.005601361423690355 |
| альбом | 1984 | 1948 | 0.9818548387096774 | 0.005139638272891574 |
| agricultural | 7675 | 1872 | 0.24390879478827363 | 0.004939118504544675 |
| extension | 6288 | 1740 | 0.2767175572519084 | 0.004590847327942166 |
| церковь | 5015 | 1631 | 0.32522432701894316 | 0.004303259765444639 |
| poster | 3584 | 1602 | 0.44698660714285715 | 0.00422674564331227 |
| agent | 5727 | 1573 | 0.27466387288283567 | 0.0041502315211799 |
| union | 14437 | 1559 | 0.10798642377225186 | 0.004113293669115998 |
| cover | 5708 | 1501 | 0.26296426068675544 | 0.003960265424851259 |
| performance | 5587 | 1480 | 0.26490066225165565 | 0.0039048586467554055 |
| trials | 3779 | 1262 | 0.3339507806297962 | 0.0033296835217603523 |
| dr | 9739 | 1204 | 0.12362665571413903 | 0.0031766552774956137 |
| larimer | 2912 | 1189 | 0.40831043956043955 | 0.0031370790074271467 |
| file | 4470 | 1166 | 0.2608501118568233 | 0.003076395393322164 |
| pm | 5933 | 1162 | 0.19585369964604754 | 0.003065841721303906 |
| reclame | 1163 | 1138 | 0.9785038693035254 | 0.003002519689194359 |
| making | 4303 | 1121 | 0.26051591912619104 | 0.002957666583116763 |
| better | 3468 | 1114 | 0.32122260668973474 | 0.002939197657084812 |
| 200d | 5801 | 1107 | 0.19082916738493363 | 0.0029207287310528606 |
| decisions | 2867 | 1062 | 0.3704220439483781 | 0.00280199992084746 |
| america | 8683 | 1049 | 0.12081077968444086 | 0.0027677004867881216 |
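
The two thresholds mentioned earlier (>20% deletion rate, >1000 deleted files) can be applied to rows like these as a simple filter. The sketch below is an illustration only, not the notebook's code: the `WordStats` type and `passes` helper are hypothetical names, and only a handful of rows from the table are included. Note that the table itself contains words below the 20% rate (e.g. "png", "flag"), so the filter selects a subset of it.

```python
from typing import NamedTuple


class WordStats(NamedTuple):
    """Per-word counts, mirroring the first three table columns."""
    word: str
    total: int    # total_filenames_containing_word
    deleted: int  # deleted_filenames_containing_word

    @property
    def deletion_rate(self) -> float:
        return self.deleted / self.total


# A few rows copied from the table above.
ROWS = [
    WordStats("png", 380054, 62864),
    WordStats("colorado", 36529, 11291),
    WordStats("logo", 45105, 10033),
    WordStats("flag", 21988, 3038),
    WordStats("twemoji13", 3580, 3014),
]


def passes(row: WordStats, min_rate: float = 0.20, min_deleted: int = 1000) -> bool:
    """True when the word exceeds both thresholds."""
    return row.deletion_rate > min_rate and row.deleted > min_deleted


flagged = [row.word for row in ROWS if passes(row)]
print(flagged)  # ['colorado', 'logo', 'twemoji13']
```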
AUgolnikova-WMF renamed this task from "Evaluate text in title and description that is likely to result in a deletion request" to "[Spike] Evaluate text in title and description that is likely to result in a deletion request". Oct 24 2023, 3:31 PM
Cparle renamed this task from "[Spike] Evaluate text in title and description that is likely to result in a deletion request" to "[Spike] Evaluate text in title that is likely to result in a deletion request". Dec 5 2023, 9:44 AM

OK, having done quite a bit of futile digging, I've discovered that neither template links nor wikitext for deleted revisions is stored in the data lake (or dumps). If we want to find patterns in descriptions/templates related to deletion requests, we'll need to get the wikitext from external storage.

So ... closing this ticket, and I will add another for getting the wikitext for deleted revisions/pages.