[L] Metrics for UW improvements
Closed, ResolvedPublic

Description

What we need as a metric is:

  • proportion of uploads via UW that are included in a deletion request created within 30 days of upload, by month
  • proportion of uploads via UW that are included in a deletion request that mentions copyright and was created within 30 days of upload, by month
  • MVP would just be a percentage calculated at the end of the following month (so we'd have March's data at the end of April) that's easily available to the team
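
For illustration only, the first metric could be computed along these lines. This is a minimal sketch, not the notebook's actual code: the function name `monthly_dr_proportion` and the input shapes (a list of upload records and a dict of DR creation times per file) are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def monthly_dr_proportion(uploads, dr_created, window_days=30):
    """Proportion of uploads per month with a DR created within the window.

    uploads:    iterable of (file_name, upload_datetime) pairs (assumed shape)
    dr_created: dict mapping file_name -> list of DR creation datetimes (assumed shape)
    Returns a dict keyed by "YYYY-MM" of the upload month.
    """
    window = timedelta(days=window_days)
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for name, uploaded in uploads:
        month = uploaded.strftime("%Y-%m")
        totals[month] += 1
        # Count the upload once if any DR was created 0..window_days after upload.
        if any(
            timedelta(0) <= created - uploaded <= window
            for created in dr_created.get(name, [])
        ):
            flagged[month] += 1
    return {month: flagged[month] / totals[month] for month in sorted(totals)}


uploads = [
    ("A.jpg", datetime(2023, 3, 1)),
    ("B.jpg", datetime(2023, 3, 15)),
    ("C.jpg", datetime(2023, 4, 2)),
]
drs = {"A.jpg": [datetime(2023, 3, 10)], "B.jpg": [datetime(2023, 5, 20)]}
result = monthly_dr_proportion(uploads, drs)
```

The copyright variant would be the same calculation with `dr_created` restricted to DRs that match the copyright predicate. Because the window is 30 days, a month's figure is only complete once the following month has ended, which is why March's data would be ready at the end of April.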

I'd suggest we do this via a cronjob that runs the extract_deletion_requests.py script, followed by a second script that combines the extracted DR data with extra data from the data lake. The scripts should send an alert to sd-alerts@lists.wikimedia.org with the number we want.

We'll need to do an initial generation of data over at least the last year, so we can compare year-on-year variation by month as well as month-to-month variation

Once that's up and running we can then (in another ticket) build on it to gather other data (like post-30-days DRs, comparisons with other upload methods, etc)

Relevant tickets
Note that DR ratio baselines without the copyright reason were calculated in https://phabricator.wikimedia.org/T337466
Reasons for deletion requests were calculated in https://phabricator.wikimedia.org/T340546
Baseline data gathering: T349380

Event Timeline

MarkTraceur renamed this task from Metrics for UW improvements to [L] Metrics for UW improvements. Oct 18 2023, 4:17 PM

I'm using the following to match copyright violations

opening_reason LIKE '%copyright violation%' OR
    opening_reason LIKE '%copyvio%' OR
    opening_reason like '%logo%' OR
    opening_reason like '%no license%' OR
    opening_reason like '%no permission%' OR
    opening_reason like '%Commons:Copyright_rules_by_subject_matter%' OR
    opening_reason like '%Commons:Licensing%' OR
    opening_reason like '%COM:BOOK%' OR
    opening_reason like '%COM:CSD#F1%' OR
    opening_reason like '%COM:CSD#F2%' OR
    opening_reason like '%COM:CSD#F3%' OR
    opening_reason like '%COM:CSD#F4%' OR
    opening_reason like '%COM:CSD#F5%' OR
    opening_reason like '%COM:CSD#F6%' OR
    opening_reason like '%COM:DW%' OR
    opening_reason like '%COM:EI%' OR
    opening_reason like '%COM:FAIRUSE%' OR
    opening_reason like '%COM:L%' OR
    opening_reason like '%COM:PERMISSION%' OR
    opening_reason like '%COM:NETCOPYVIO%' OR
    opening_reason like '%COM:PCP%' OR
    opening_reason like '%COM:POSTER%' OR
    opening_reason like '%COM:TOO%' OR
    opening_reason like '%COM:TOYS%' OR
    closing_reason LIKE '%copyright violation%' OR
    closing_reason LIKE '%copyvio%' OR
    closing_reason like '%logo%' OR
    closing_reason like '%no license%' OR
    closing_reason like '%no permission%' OR
    closing_reason like '%Commons:Copyright_rules_by_subject_matter%' OR
    closing_reason like '%Commons:Licensing%' OR
    closing_reason like '%COM:BOOK%' OR
    closing_reason like '%COM:CSD#F1%' OR
    closing_reason like '%COM:CSD#F2%' OR
    closing_reason like '%COM:CSD#F3%' OR
    closing_reason like '%COM:CSD#F4%' OR
    closing_reason like '%COM:CSD#F5%' OR
    closing_reason like '%COM:CSD#F6%' OR
    closing_reason like '%COM:DW%' OR
    closing_reason like '%COM:EI%' OR
    closing_reason like '%COM:FAIRUSE%' OR
    closing_reason like '%COM:L%' OR
    closing_reason like '%COM:PERMISSION%' OR
    closing_reason like '%COM:NETCOPYVIO%' OR
    closing_reason like '%COM:PCP%' OR
    closing_reason like '%COM:POSTER%' OR
    closing_reason like '%COM:TOO%' OR
    closing_reason like '%COM:TOYS%'

@mfossati does this seem reasonable?

Here are my suggestions:

  1. match against deletion request opening reasons, closing reasons, and file deletion reasons (AKA edit messages or revision comments). See data lake query to extract file deletion reasons
  2. expand all wikilink shorthands, such as COM:DW = COM:DERIV = Commons:Derivative_works. Note that COM prefixes always expand to Commons
  3. compile the list by looking at wikilink and word frequencies, plus word clusters from T340546: [XL] Analysis of deletion requests on Commons. Most frequent are:
    • COM:FOP = COM:PANO = Commons:Freedom_of_panorama
    • COM:SS = COM:SCREENSHOT[S] = Commons:Screenshots
    • COM:ALBUM = Commons:ALBUM

You can find attached the latest analysis result.

Ok so I've been through the analysis, and tbh it's not always clear when something is a copyvio. Even when you have the text "copyright" in a DR it doesn't necessarily mean that the file is being deleted because of copyright issues - sometimes all of a user's uploads will get deleted and copyright will be mentioned in the DR along with COM:SCOPE, so we don't know that individual images are copyvios.

What we need for the metric is a consistent (rather than exact) model of a copyvio so that we can measure how copyvio numbers change over time, so maybe this is good enough?

Query for copyvio is below.

Here's where opening_reason and closing_reason come from

Screenshot 2023-10-20 at 12-45-48 Commons Deletion requests_Archive_2023_09_02 - Wikimedia Commons.png (160×960 px, 62 KB)

And here's where deletion_comment comes from

Screenshot 2023-10-20 at 12-52-30 Commons Deletion requests_File Leodie Joséphine Estelle BRAND.jpg Revision history - Wikimedia Commons.png (116×1 px, 43 KB)

Note that I've deliberately left out freedom of panorama, because our current UW changes aren't aimed at that.

opening_reason LIKE '%copyright%' OR
    opening_reason LIKE '%copyvio%' OR
    opening_reason like '%logo%' OR
    opening_reason like '%license%' OR
    opening_reason like '%permission%' OR
    
    opening_reason like '%COM:ALBUM%' OR
    opening_reason like '%Commons:ALBUM%' OR
    opening_reason like '%COM:BOOK%' OR
    opening_reason like '%Commons:BOOK%' OR
    opening_reason like '%Commons:Copyright_rules_by_subject_matter%' OR
    (opening_reason like '%COM:CSD#F1%' AND opening_reason not like '%COM:CSD#F10%') OR
    opening_reason like '%COM:CSD#F2%' OR
    opening_reason like '%COM:CSD#F3%' OR
    opening_reason like '%COM:CSD#F4%' OR
    opening_reason like '%COM:CSD#F5%' OR
    opening_reason like '%COM:CSD#F6%' OR
    opening_reason like '%COM:DW%' OR
    opening_reason like '%Commons:DW%' OR
    opening_reason like '%Commons:Derivative_works%' OR
    opening_reason like '%COM:EI%' OR
    opening_reason like '%Commons:Essential_information%' OR
    opening_reason like '%COM:FAIRUSE%' OR
    opening_reason like '%Commons:FAIRUSE%' OR
    opening_reason like '%Commons:Fair_use%' OR
    opening_reason like '%COM:L%' OR
    opening_reason like '%Commons:Licensing%' OR
    opening_reason like '%COM:PERMISSION%' OR
    opening_reason like '%Commons:PERMISSION%' OR
    opening_reason like '%Commons:Permission%' OR
    opening_reason like '%COM:NETCOPYVIO%' OR
    opening_reason like '%Commons:NETCOPYVIO%' OR
    opening_reason like '%COM:PCP%' OR
    opening_reason like '%Commons:PCP%' OR
    opening_reason like '%Commons:Project_scope/Precautionary_principle%' OR    
    opening_reason like '%COM:POSTER%' OR
    opening_reason like '%Commons:POSTER%' OR
    opening_reason like '%COM:SS%' OR
    opening_reason like '%Commons:SS%' OR
    opening_reason like '%COM:SCREENSHOT%' OR
    opening_reason like '%Commons:SCREENSHOT%' OR
    opening_reason like '%COM:Screenshot%' OR
    opening_reason like '%Commons:Screenshots%' OR
    opening_reason like '%COM:TOO%' OR
    opening_reason like '%Commons:TOO%' OR
    opening_reason like '%Commons:Threshold_of_originality%' OR
    opening_reason like '%COM:TOYS%' OR
    opening_reason like '%Commons:TOYS%' OR
    
    
    closing_reason LIKE '%copyright%' OR
    closing_reason LIKE '%copyvio%' OR
    closing_reason like '%logo%' OR
    closing_reason like '%license%' OR
    closing_reason like '%permission%' OR
    
    closing_reason like '%COM:ALBUM%' OR
    closing_reason like '%Commons:ALBUM%' OR
    closing_reason like '%COM:BOOK%' OR
    closing_reason like '%Commons:BOOK%' OR
    closing_reason like '%Commons:Copyright_rules_by_subject_matter%' OR
    (closing_reason like '%COM:CSD#F1%' AND closing_reason not like '%COM:CSD#F10%') OR    
    closing_reason like '%COM:CSD#F2%' OR
    closing_reason like '%COM:CSD#F3%' OR
    closing_reason like '%COM:CSD#F4%' OR
    closing_reason like '%COM:CSD#F5%' OR
    closing_reason like '%COM:CSD#F6%' OR
    closing_reason like '%COM:DW%' OR
    closing_reason like '%Commons:DW%' OR
    closing_reason like '%Commons:Derivative_works%' OR
    closing_reason like '%COM:EI%' OR
    closing_reason like '%Commons:Essential_information%' OR
    closing_reason like '%COM:FAIRUSE%' OR
    closing_reason like '%Commons:FAIRUSE%' OR
    closing_reason like '%Commons:Fair_use%' OR
    closing_reason like '%COM:L%' OR
    closing_reason like '%Commons:Licensing%' OR
    closing_reason like '%COM:PERMISSION%' OR
    closing_reason like '%Commons:PERMISSION%' OR
    closing_reason like '%Commons:Permission%' OR
    closing_reason like '%COM:NETCOPYVIO%' OR
    closing_reason like '%Commons:NETCOPYVIO%' OR
    closing_reason like '%COM:PCP%' OR
    closing_reason like '%Commons:PCP%' OR
    closing_reason like '%Commons:Project_scope/Precautionary_principle%' OR    
    closing_reason like '%COM:POSTER%' OR
    closing_reason like '%Commons:POSTER%' OR
    closing_reason like '%COM:SS%' OR
    closing_reason like '%Commons:SS%' OR
    closing_reason like '%COM:SCREENSHOT%' OR
    closing_reason like '%Commons:SCREENSHOT%' OR
    closing_reason like '%COM:Screenshot%' OR
    closing_reason like '%Commons:Screenshots%' OR
    closing_reason like '%COM:TOO%' OR
    closing_reason like '%Commons:TOO%' OR
    closing_reason like '%Commons:Threshold_of_originality%' OR
    closing_reason like '%COM:TOYS%' OR
    closing_reason like '%Commons:TOYS%' OR
    
    
    deletion_comment LIKE '%copyright%' OR
    deletion_comment LIKE '%copyvio%' OR
    deletion_comment like '%logo%' OR
    deletion_comment like '%license%' OR
    deletion_comment like '%permission%' OR
    
    deletion_comment like '%COM:ALBUM%' OR
    deletion_comment like '%Commons:ALBUM%' OR
    deletion_comment like '%COM:BOOK%' OR
    deletion_comment like '%Commons:BOOK%' OR
    deletion_comment like '%Commons:Copyright_rules_by_subject_matter%' OR
    (deletion_comment like '%COM:CSD#F1%' AND deletion_comment not like '%COM:CSD#F10%') OR
    deletion_comment like '%COM:CSD#F2%' OR
    deletion_comment like '%COM:CSD#F3%' OR
    deletion_comment like '%COM:CSD#F4%' OR
    deletion_comment like '%COM:CSD#F5%' OR
    deletion_comment like '%COM:CSD#F6%' OR
    deletion_comment like '%COM:DW%' OR
    deletion_comment like '%Commons:DW%' OR
    deletion_comment like '%Commons:Derivative_works%' OR
    deletion_comment like '%COM:EI%' OR
    deletion_comment like '%Commons:Essential_information%' OR
    deletion_comment like '%COM:FAIRUSE%' OR
    deletion_comment like '%Commons:FAIRUSE%' OR
    deletion_comment like '%Commons:Fair_use%' OR
    deletion_comment like '%COM:L%' OR
    deletion_comment like '%Commons:Licensing%' OR
    deletion_comment like '%COM:PERMISSION%' OR
    deletion_comment like '%Commons:PERMISSION%' OR
    deletion_comment like '%Commons:Permission%' OR
    deletion_comment like '%COM:NETCOPYVIO%' OR
    deletion_comment like '%Commons:NETCOPYVIO%' OR
    deletion_comment like '%COM:PCP%' OR
    deletion_comment like '%Commons:PCP%' OR
    deletion_comment like '%Commons:Project_scope/Precautionary_principle%' OR    
    deletion_comment like '%COM:POSTER%' OR
    deletion_comment like '%Commons:POSTER%' OR
    deletion_comment like '%COM:SS%' OR
    deletion_comment like '%Commons:SS%' OR
    deletion_comment like '%COM:SCREENSHOT%' OR
    deletion_comment like '%Commons:SCREENSHOT%' OR
    deletion_comment like '%COM:Screenshot%' OR
    deletion_comment like '%Commons:Screenshots%' OR
    deletion_comment like '%COM:TOO%' OR
    deletion_comment like '%Commons:TOO%' OR
    deletion_comment like '%Commons:Threshold_of_originality%' OR
    deletion_comment like '%COM:TOYS%' OR
    deletion_comment like '%Commons:TOYS%'
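
Since the same pattern list is repeated by hand for three columns, it is easy for the blocks to drift apart. One option, sketched below under assumptions, is to generate the predicate from a single list. The `like_predicate` helper is hypothetical and not part of the notebook; the pattern list shown is truncated for brevity, and patterns are assumed not to contain single quotes.

```python
COLUMNS = ["opening_reason", "closing_reason", "deletion_comment"]

# Truncated example list; the real one would hold every pattern from the query above.
PATTERNS = ["copyright", "copyvio", "COM:CSD#F1", "COM:CSD#F2", "Commons:Licensing"]

# Patterns that need a NOT LIKE guard so that F1 does not also match F10.
EXCLUSIONS = {"COM:CSD#F1": "COM:CSD#F10"}


def like_predicate(columns, patterns, exclusions=None):
    """Build 'col LIKE ... OR col LIKE ...' for every column/pattern pair."""
    exclusions = exclusions or {}
    clauses = []
    for col in columns:
        for pat in patterns:
            clause = f"{col} LIKE '%{pat}%'"
            if pat in exclusions:
                clause = f"({clause} AND {col} NOT LIKE '%{exclusions[pat]}%')"
            clauses.append(clause)
    return " OR\n    ".join(clauses)


predicate = like_predicate(COLUMNS, PATTERNS, EXCLUSIONS)
```

Adding or removing a pattern then changes all three columns at once, which helps keep the copyvio model consistent from month to month.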

Ok so we decided not to automate this seeing as it's easy just to run the notebook

Spreadsheet updated https://docs.google.com/spreadsheets/d/1_baSI_SO4GeCA8U1qDkR32ayptVPzQB_6IWbxYpKLMw/edit?usp=sharing

Also notebook updated to be easier to run https://gitlab.wikimedia.org/cparle/notebooks/-/blob/main/T348845.ipynb

Closing ticket