
[Spike] Check if we can use metadata for freedom of panorama flagging
Closed, Resolved (Public)

Description

As per https://phabricator.wikimedia.org/T340546, one of the main reasons for deletions seems to be freedom of panorama (FOP). This spike explores a potential low-hanging fruit: checking the EXIF location of an upload against a known DB of locations where FOP violations are likely, so that we can later show warnings. The question is whether this is possible with existing metadata.

Check:

  • whether enough FOP deletions are actually of the nature that it is feasible to maintain a blocklist (i.e. most locations are not unique)
  • whether enough uploads have the relevant EXIF data

Event Timeline

When proposing this, I was thinking along the lines of:

  • build an on-wiki list, maintained by Commons admins, of GPS coordinates where violations are likely
  • compare against EXIF data from new uploads:
    • with GPSLatitude & GPSLongitude (or GPSPosition) we know the camera position, and with GPSDestBearing (or GPSImgDirection) alongside FOV we can determine whether one of those coordinates is potentially in sight
  • warn if match is found
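
The in-sight check in the second bullet could be sketched roughly as follows. This is a minimal illustration, not anything that exists on Commons: the function names, the 1 km range cut-off and the use of the 35mm-equivalent focal length to estimate FOV are all assumptions, and the bearing is treated as relative to true north (EXIF also allows magnetic north via GPSImgDirectionRef).

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Bearing from point 1 to point 2, in degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360.0


def horizontal_fov_deg(focal_length_35mm):
    """Horizontal FOV from a 35mm-equivalent focal length (frame width 36 mm)."""
    return math.degrees(2 * math.atan(36.0 / (2 * focal_length_35mm)))


def potentially_in_sight(cam_lat, cam_lon, cam_bearing_deg, fov_deg,
                         target_lat, target_lon, max_distance_m=1000.0):
    """True if the target is within range and inside the camera's horizontal FOV.

    max_distance_m is a hypothetical cut-off; occlusion and elevation are ignored.
    """
    if haversine_m(cam_lat, cam_lon, target_lat, target_lon) > max_distance_m:
        return False
    bearing = initial_bearing_deg(cam_lat, cam_lon, target_lat, target_lon)
    # Smallest angular difference between the two bearings, in [0, 180].
    offset = abs((bearing - cam_bearing_deg + 180.0) % 360.0 - 180.0)
    return offset <= fov_deg / 2
```

Files with only GPSLatitude and GPSLongitude (no bearing) would have to fall back to a plain distance check, which would flag far more files.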

Note: before we actually build this, it would be worth double-checking (I guess that would be this spike!):

  • whether enough FOP deletions are actually of the nature that it is feasible to maintain a blocklist (i.e. most locations are not unique)
  • whether enough uploads have the relevant EXIF data

Preliminary findings from the spike (building on T340546)

  • between Jan 2021 and Sept 2023 there were 22k files deleted that had GPS coords in their metadata
  • these can be clustered into just under 500 geographical clusters with epsilon = 1km
  • around 15k of the files deleted in that time are in clusters that are associated with at least one deletion request that mentions freedom of panorama
  • the total number of files deleted in that time is ~600k, so these ~15k files are around 2.5% of deletions
    • note that only a small proportion of files in a cluster will be FoP violations
    • note also that the deletion requests dataset covers DRs for single files, and we don't currently have a way to intersect files-with-coords with DRs that cover multiple files (e.g. when a whole category is deleted)

(See here for the code https://gitlab.wikimedia.org/cparle/notebooks/-/blob/main/T344060.ipynb)
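
For readers without access to the notebook, here is a minimal sketch of what clustering with epsilon = 1km can look like. It is not the notebook's code: it assumes a DBSCAN-style algorithm (suggested by the epsilon parameter), and the `min_samples` value and function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
NOISE = -1


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) tuples in degrees."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dphi = phi2 - phi1
    dlon = math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def dbscan(points, eps_km=1.0, min_samples=2):
    """Return one cluster label per point; NOISE (-1) marks unclustered points.

    min_samples is an assumed value; the task only states epsilon.
    Brute-force O(n^2) neighbour search, fine for a sketch only.
    """
    labels = [None] * len(points)

    def neighbours(i):
        # Includes i itself, as in the usual DBSCAN definition.
        return [j for j, q in enumerate(points) if haversine_km(points[i], q) <= eps_km]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nearby = neighbours(j)
            if len(nearby) >= min_samples:
                queue.extend(nearby)  # core point: keep growing the cluster
    return labels


points = [
    (48.8584, 2.2945), (48.8600, 2.2950),    # two files ~200 m apart
    (51.5007, -0.1246), (51.5010, -0.1240),  # another nearby pair
    (0.0, 0.0),                              # isolated file
]
labels = dbscan(points)
```

At the scale of ~22k files a spatial index would be needed; scikit-learn's `DBSCAN` with `metric="haversine"` (coordinates and eps in radians) is one commonly used option.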


Next step

The proportion of files with GPS coords seems pretty low, so the GPS approach doesn't look very promising. However, we probably still have less data than we need. Here is what else we should gather:

  • all files subject to deletion requests - including page title (and possibly captions and descriptions), opening/closing reason, DR inception date, deletion date, GPS coords (if available), and whether they were uploaded via VE (if possible)
  • centroids of all the clusters
  • numbers of non-deleted files uploaded in time period X that are within Y distance of the centroid of a cluster associated with a lot of deletions, and proportions of deleted/non-deleted
  • an indication of how many files would be flagged if we implemented this for each cluster (or particular clusters) over time period X, and how many of those have actually been deleted
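
The last three bullets could be computed along these lines. This is only a sketch under assumptions: the centroid is a plain mean of latitude and longitude (reasonable for ~1 km clusters away from the poles and the antimeridian), and the function names and the `(lat, lon, was_deleted)` record shape are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) tuples in degrees."""
    phi1, phi2 = math.radians(a[0]), math.radians(b[0])
    dphi = phi2 - phi1
    dlon = math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def centroid(points):
    """Arithmetic mean of (lat, lon); an approximation valid for small clusters."""
    lats = [p[0] for p in points]
    lons = [p[1] for p in points]
    return (sum(lats) / len(lats), sum(lons) / len(lons))


def flag_stats(cluster_centroid, radius_km, files):
    """Count files within radius_km of a centroid and how many were deleted.

    files: iterable of (lat, lon, was_deleted) for uploads in time period X.
    """
    flagged = [f for f in files if haversine_km(cluster_centroid, (f[0], f[1])) <= radius_km]
    deleted = sum(1 for f in flagged if f[2])
    return {
        "flagged": len(flagged),
        "deleted": deleted,
        "deleted_share": deleted / len(flagged) if flagged else 0.0,
    }


c = centroid([(10.0, 20.0), (10.002, 20.002)])
uploads = [
    (10.001, 20.001, True),    # at the centroid, deleted
    (10.0015, 20.001, False),  # ~55 m away, still live
    (11.0, 20.0, True),        # ~111 km away, deleted but never flagged
]
stats = flag_stats(c, 1.0, uploads)
```

A low `deleted_share` for a cluster would mean that warnings there would mostly hit files that were never deleted.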

Possible next steps

Having gathered the dataset in the first bullet point above, we might be able to classify DRs and build a model from other data (like the filename) to predict how likely an image is to result in different kinds of DRs.

Some extra incidental data about deletions:

All time

97784975 live files, 7030603 deleted files
Deletion rate ~6.7%
On Commons there are 77183 live files uploaded via VE and 19359 deleted files that were uploaded via VE, so the deletion rate is ~20%
On Commons there are 1654705 File pages tagged with 'crosswiki-upload', and 643733 of those have been deleted, so the deletion rate is ~39%
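
The rates above use different denominators, which the following arithmetic makes explicit. The denominators are inferred from the quoted percentages (live + deleted for the first two, total tagged pages for crosswiki) rather than stated in the data itself.

```python
def deletion_rate(deleted, live):
    """Share of all files (live + deleted) that have been deleted."""
    return deleted / (deleted + live)


# All files: live and deleted counts are given separately.
all_files_rate = deletion_rate(deleted=7_030_603, live=97_784_975)

# VE uploads: live and deleted counts are given separately as well.
ve_rate = deletion_rate(deleted=19_359, live=77_183)

# Crosswiki: the 1654705 figure is read as already including deleted pages
# ("of those"), so it is used directly as the denominator.
crosswiki_rate = 643_733 / 1_654_705

print(f"all files: {all_files_rate:.1%}")
print(f"VE:        {ve_rate:.1%}")
print(f"crosswiki: {crosswiki_rate:.1%}")
```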

Speedy deletions
Note that an admin does not need to add a deletion request to a file that obviously needs deletion. In the period Sept 1 2022 - Sept 1 2023, 650521 files were deleted, 414671 (~64%) of them without deletion requests

Sept 1 2022 - Sept 1 2023

11213503 unique filenames uploaded, 334226 deleted
Deletion rate ~3%

Visual editor
In the period Sept 1 2022 - Sept 1 2023 there were 9808 unique filenames uploaded via VE, and 3296 of those have been deleted
The current deletion rate for VE uploads is ~25%, substantially higher than the ~3% background deletion rate, but <1% of deleted files were uploaded with VE

Cross-wiki uploads
In the period Sept 1 2022 - Sept 1 2023 there were 218959 unique filenames uploaded crosswiki and 48897 of those have been deleted
So around 22% of crosswiki uploads are getting deleted
Crosswiki uploads are 2% of total uploads in this period, but account for 15% of total deletions

Conclusion

  • between Jan 1 2021 and Sept 1 2023 there were 1394606 unique filenames deleted
  • ~22k of these had GPS metadata
  • these can be clustered into ~500 clusters with epsilon = 1km
  • only 335 of the clustered deleted files had "freedom of panorama" or "FoP" in their deletion request text

... which suggests that flagging files for possible FoP violations based on GPS metadata is very unlikely to be useful for Commons moderators


The notebook used to gather the deletion data is here https://gitlab.wikimedia.org/cparle/notebooks/-/blob/main/T344060.ipynb
The output, containing the deleted files, their related deletion requests, GPS coords and cluster ids, and flags indicating whether a file was uploaded via UW, VE or crosswiki, is on HDFS at /user/cparle/deleted_image_data.parquet