
Metrics for SDoC: look at querying databases
Closed, Resolved | Public | 6 Story Points

Description

Querying databases

  • How many files of each type: MPEG, PNG, Ogg, etc.
    • Benefits of being able to answer questions about file types (examples):
      • Validate future tools (currently focused on images)
      • Push to focus more on other media types (e.g. audio, video)
      • Would be good to show “here’s how many files are images, and why it’s okay for us to focus on those for now”
  • After MP3 upload is implemented
    • Track usage
      • Tell the story on how it went / how it’s going
    • Track organic growth rate of uploads (historical trends)
      • By file type (including 3D (STL), vector formats, etc)
      • Use case: Is there a desire to add / create more functionality for certain file types based on the growth of those file types?
      • Don’t include files uploaded and then deleted before a certain threshold or uploaded by an account that was deleted (e.g. spam bots)
        • Some organizations look at how many of their files were deleted - due to image quality
        • What is the minimal viable upload quality?
          • “Statements” (metadata) that were added after upload - compare that with deletions??
          • Would be great to be able to say “this is the metadata you should have if you want to ensure a good outcome”
          • Might be best shown with queries - rather than a dashboard (ongoing metric tracking)
          • What is the ‘median’ survival time of metadata? (What’s the number used for new articles?)
    • How many files are getting deleted? F10148687
      • copyright violations (Use case: creation of auto-copyright violation tools)
      • OTRS
    • Average time to deletion?
    • How many people are involved in flagging for deletion/deleting files

Event Timeline

debt created this task. Oct 3 2017, 11:41 PM
debt updated the task description.
mpopov claimed this task. Oct 11 2017, 3:59 PM
mpopov moved this task from Needs triage to Current work on the Discovery-Analysis board.
mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.
mpopov set the point value for this task to 6.

Reasons for files deleted in 2017:

debt updated the task description. Oct 11 2017, 9:42 PM
mpopov updated the task description. Oct 11 2017, 9:57 PM

Time-to-deletion:

  • Most copyright-related deletions happen within 1 day of upload across almost all media types, with the exception of 'drawing' (SVGs)
  • A lot of audio files are deleted within 1 minute or 1 week of upload
  • Half of all images and PDFs deleted were deleted within 1 month of upload for non-copyright reasons
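
A minimal sketch of how per-media-type summaries like the ones above could be computed from upload and deletion timestamps. The record layout, the function name, and the sample rows are assumptions made for illustration; they are not the queries used for this ticket. Note that "half were deleted within 1 month" is the same statement as "the median time-to-deletion is at most 1 month".

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median


def summarize_time_to_deletion(records, threshold=timedelta(days=1)):
    """Summarize time-to-deletion per media type.

    records: iterable of (media_type, uploaded_at, deleted_at) tuples.
    This layout is hypothetical, assumed only for this sketch.
    Returns {media_type: (median_days, share_deleted_within_threshold)}.
    """
    deltas = defaultdict(list)
    for media_type, uploaded_at, deleted_at in records:
        deltas[media_type].append(deleted_at - uploaded_at)

    summary = {}
    for media_type, values in deltas.items():
        days = [d.total_seconds() / 86400 for d in values]
        within = sum(1 for d in values if d <= threshold) / len(values)
        summary[media_type] = (median(days), within)
    return summary


# Made-up sample rows, only to show the call shape.
sample = [
    ("image", datetime(2017, 1, 1), datetime(2017, 1, 2)),
    ("image", datetime(2017, 1, 1), datetime(2017, 1, 11)),
    ("image", datetime(2017, 1, 1), datetime(2017, 3, 2)),
    ("audio", datetime(2017, 1, 1), datetime(2017, 1, 8)),
]
print(summarize_time_to_deletion(sample))
```

Splitting the records by deletion reason (copyright vs. non-copyright) before calling the function would give the kind of breakdown described in the bullets.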

mpopov updated the task description. Oct 11 2017, 11:28 PM
debt added a comment. Edited Oct 12 2017, 5:48 PM

Something to look at in the future, @Ramsey-WMF: maybe we should add a dropdown (or some other method) so that users who delete media use the same terms for the deletion reason ("copyright" vs "copyright vio" vs "copyright violation" vs "____"). That would let us track and measure deletions in a better, more automated fashion.
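
Until something like that exists, free-text deletion comments have to be bucketed after the fact. A rough sketch of that idea follows; the category names and regular expressions are hypothetical, chosen only to cover the variants quoted above, and are not the rules used in this ticket's analysis.

```python
import re
from collections import Counter

# Hypothetical categories and patterns, assumed for illustration only.
REASON_PATTERNS = [
    ("copyright", re.compile(r"copy\s*(?:right|vio)", re.IGNORECASE)),
    ("otrs", re.compile(r"\botrs\b", re.IGNORECASE)),
]


def categorize_reason(comment):
    """Map a free-text deletion comment to a coarse category."""
    for label, pattern in REASON_PATTERNS:
        if pattern.search(comment):
            return label
    return "other"


def tally_reasons(comments):
    """Count deletion comments per category."""
    return Counter(categorize_reason(c) for c in comments)


# Made-up comments, only to show the call shape.
sample_comments = [
    "copyright",
    "Copyright vio",
    "copyright violation",
    "Copyvio",
    "No OTRS permission",
    "Out of scope",
]
print(tally_reasons(sample_comments))
```

Pattern order matters here: the first matching category wins, so a comment mentioning both copyright and OTRS would be counted as "copyright".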

mpopov added a comment. Edited Oct 13 2017, 6:18 PM

Historical trends

Monthly upload counts

Fixed grammar:


Previous:

Cumulative upload counts

Distribution of file formats

Treemap (sans jpg/jpegs because holy moley there's 37M of those and that's more than all the others combined):

Fixed grammar:


Previous:

Total files uploaded to Commons (as of right now) by extension:

media  extension   uploads
audio  ogg          773305
audio  oga            6180
audio  flac           6140
audio  mid            4993
audio  wav            3512
audio  opus            410
docs   pdf          354765
docs   djvu          60524
image  jpg/jpeg   36918799
image  png         2268026
image  svg         1176530
image  tif/tiff     807921
image  gif          153959
image  xcf            1008
image  webp             95
video  ogv           66610
video  webm          41161
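
The counts above can be rolled up by media type to answer the "how many files are images" question from the task description. The sketch below copies the numbers from the table; the variable and function names are illustrative assumptions.

```python
from collections import defaultdict

# (media, extension) -> uploads, copied from the table in this comment.
UPLOADS = {
    ("audio", "ogg"): 773305,
    ("audio", "oga"): 6180,
    ("audio", "flac"): 6140,
    ("audio", "mid"): 4993,
    ("audio", "wav"): 3512,
    ("audio", "opus"): 410,
    ("docs", "pdf"): 354765,
    ("docs", "djvu"): 60524,
    ("image", "jpg/jpeg"): 36918799,
    ("image", "png"): 2268026,
    ("image", "svg"): 1176530,
    ("image", "tif/tiff"): 807921,
    ("image", "gif"): 153959,
    ("image", "xcf"): 1008,
    ("image", "webp"): 95,
    ("video", "ogv"): 66610,
    ("video", "webm"): 41161,
}


def totals_by_media(uploads):
    """Sum upload counts per media type."""
    totals = defaultdict(int)
    for (media, _extension), count in uploads.items():
        totals[media] += count
    return dict(totals)


by_media = totals_by_media(UPLOADS)
grand_total = sum(by_media.values())

# Print each media type with its count and share of all listed uploads.
for media, count in sorted(by_media.items(), key=lambda kv: -kv[1]):
    print(f"{media:6s} {count:>10,d} {count / grand_total:6.1%}")

jpeg = UPLOADS[("image", "jpg/jpeg")]
print("JPEG exceeds all other formats combined:", jpeg > grand_total - jpeg)
```

With these numbers, images account for roughly 97% of the listed uploads, and JPEG alone outnumbers every other listed format combined.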

mpopov updated the task description. Oct 13 2017, 6:18 PM

Growth of number of deleters over time:

How many users deleted N-many files:
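
One way such a distribution could be tabulated from a deletion log, as a sketch only: the (user, file) pair layout and the function name are assumptions, not the schema queried for this ticket.

```python
from collections import Counter


def deleters_by_volume(deletion_log):
    """Return {N: number of users who deleted exactly N files}.

    deletion_log: iterable of (user, file) pairs, one per deletion.
    This layout is hypothetical, assumed only for this sketch.
    """
    per_user = Counter(user for user, _file in deletion_log)
    return Counter(per_user.values())


# Made-up log entries, only to show the call shape.
sample_log = [
    ("UserA", "File1.jpg"),
    ("UserA", "File2.jpg"),
    ("UserB", "File3.ogg"),
    ("UserC", "File4.svg"),
]
print(deleters_by_volume(sample_log))
```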

mpopov updated the task description. Edited Oct 13 2017, 7:42 PM
mpopov moved this task from In progress to Done on the Discovery-Analysis (Current work) board.

Queries & data uploaded to https://github.com/wikimedia-research/SDoC-Initial-Metrics

All figures from this ticket: https://github.com/wikimedia-research/SDoC-Initial-Metrics/tree/master/T177356

Moving this into 'Done' as I don't think there's anything left to do on this one.

debt updated the task description. Oct 16 2017, 6:33 PM

@Ramsey-WMF, @Abit, and @Capt_Swing - can you take a look at the findings in this ticket and let us know whether they address your concerns, or whether you need / want more clarification?

@debt I am quite satisfied :). I'll need to go over some of this with the Multimedia team to verify, but so far this looks solid. Thanks to @mpopov for the speedy work. Very enlightening.

debt awarded a token. Oct 17 2017, 9:11 PM
debt moved this task from Needs review to Done on the Discovery-Analysis (Current work) board.
debt closed this task as Resolved. Oct 19 2017, 8:00 PM