
Metrics for SDoC: look at querying databases
Closed, Resolved | Public | 6 Estimated Story Points

Assigned To
Authored By
debt
Oct 3 2017, 11:41 PM
Referenced Files
F10188503: deleter_activity.png
Oct 13 2017, 7:31 PM
F10188497: cumulative_deleters.png
Oct 13 2017, 7:31 PM
F10187368: monthly_uploads.png
Oct 13 2017, 6:22 PM
F10187371: treemap_uploads.png
Oct 13 2017, 6:22 PM
F10187336: monthly_uploads.png
Oct 13 2017, 6:18 PM
F10187334: treemap_uploads.png
Oct 13 2017, 6:18 PM
F10187339: cumulative_uploads.png
Oct 13 2017, 6:18 PM
F10165222: licensing.png
Oct 12 2017, 6:55 PM
Tokens
"Party Time" token, awarded by debt.

Description

Querying databases

  • How many: mpegs, pngs, ogg, etc
    • Benefits to being able to answer question re: filetypes (example)
      • Validate future tools (currently focused on images)
      • Push to focus more on other media types (e.g. audio, video)
      • Would be good to show “here’s how many files are images and why it’s okay for us to focus on those for now”
  • After MP3 upload is implemented
    • Track usage
      • Tell the story on how it went / how it’s going
    • Track organic growth rate of uploads (historical trends)
      • By file type (including 3D (STL), vector formats, etc)
      • Use case: Is there a desire to add / create more functionality for certain file types based on the growth of those file types?
      • Don’t include files uploaded and then deleted before a certain threshold or uploaded by an account that was deleted (e.g. spam bots)
        • Some organizations look at how many of their files were deleted - due to image quality
        • What is the minimal viable upload quality
          • “Statements” (metadata) that were added after upload - compare that with deletions??
          • Would be great to be able to say “this is the metadata you should have if you want to ensure a good outcome”
          • Might be best shown with queries - rather than a dashboard (ongoing metric tracking)
          • What is the ‘median’ survival time of metadata? (What’s the number used for new articles?)
    • How many files are getting deleted? F10148687
      • copyright violations (Use case: creation of auto-copyright violation tools)
      • OTRS
    • Average time to deletion?
    • How many people are involved in flagging for deletion/deleting files

Event Timeline

debt updated the task description.
mpopov moved this task from Needs triage to Current work on the Discovery-Analysis board.
mpopov moved this task from Backlog to In progress on the Discovery-Analysis (Current work) board.
mpopov set the point value for this task to 6.

Reasons for files deleted in 2017:

deletion_reasons.png (1×2 px, 607 KB)

Time-to-deletion:

time-to-deletion.png (1×2 px, 347 KB)

  • Most copyright-related deletions happen within 1 day of upload across almost all media types, with the exception of 'drawing' (SVGs)
  • A lot of audio files are deleted within 1 minute or 1 week of upload
  • Half of all images and PDFs deleted were deleted within 1 month of upload for non-copyright reasons
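For anyone wanting to reproduce a time-to-deletion breakdown like the one above, here is a minimal sketch. It is not the query used for the figures. It assumes upload and deletion timestamps arrive as 14-digit MediaWiki-style strings (YYYYMMDDHHMMSS), treats "1 month" as 30 days, and uses made-up sample pairs; the function names are hypothetical.

```python
from datetime import datetime
from statistics import median

# Assumption: timestamps are 14-digit MediaWiki-style strings (YYYYMMDDHHMMSS).
MW_FORMAT = "%Y%m%d%H%M%S"

def time_to_deletion_days(uploaded: str, deleted: str) -> float:
    """Days elapsed between an upload timestamp and a deletion timestamp."""
    delta = datetime.strptime(deleted, MW_FORMAT) - datetime.strptime(uploaded, MW_FORMAT)
    return delta.total_seconds() / 86400

def bucket(days: float) -> str:
    """Coarse bucket for a time-to-deletion value (30-day month is an assumption)."""
    if days <= 1:
        return "within 1 day"
    if days <= 7:
        return "within 1 week"
    if days <= 30:
        return "within 1 month"
    return "over 1 month"

# Made-up (upload, deletion) pairs, for illustration only.
sample = [
    ("20170101000000", "20170101120000"),
    ("20170101000000", "20170104000000"),
    ("20170101000000", "20170121000000"),
]

days = [time_to_deletion_days(u, d) for u, d in sample]
median_days = median(days)
buckets = [bucket(x) for x in days]
```

Grouping the same values by media type and by deletion reason before taking the median would give the per-type comparison described in the bullets.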

Something to look at in the future, @Ramsey-WMF: maybe we should add a dropdown or some other method that gets users who delete media to use the same words/terms for the deletion reason ("copyright" vs "copyright vio" vs "copyright violation" vs "____"), so that we can track/measure it in a better, more automated fashion.
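Until something like that exists, free-text deletion reasons have to be bucketed after the fact. A rough sketch of that kind of normalization follows. The patterns and category labels are hypothetical and would need to be built from a review of real deletion log comments; the sample comments are made up.

```python
import re
from collections import Counter

# Hypothetical patterns; checked in order, first match wins.
CATEGORY_PATTERNS = [
    ("copyright", re.compile(r"\bcopy\s*(right|vio)", re.IGNORECASE)),
    ("otrs", re.compile(r"\botrs\b", re.IGNORECASE)),
]

def categorize(comment: str) -> str:
    """Map a free-text deletion comment to a coarse category label."""
    for label, pattern in CATEGORY_PATTERNS:
        if pattern.search(comment):
            return label
    return "other"

# Made-up comments showing the spelling variants mentioned above.
comments = [
    "copyright",
    "Copyright vio",
    "copyright violation",
    "No OTRS permission",
    "duplicate",
]

counts = Counter(categorize(c) for c in comments)
```

Pattern matching like this is brittle, which is the argument for a fixed set of reasons at deletion time.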

copyright.png (899×455 px, 146 KB)

licensing.png (793×456 px, 123 KB)

Historical trends

Monthly upload counts

Fixed grammar:

monthly_uploads.png (1×2 px, 444 KB)

Previous:
monthly_uploads.png (1×2 px, 445 KB)

Cumulative upload counts

cumulative_uploads.png (1×2 px, 333 KB)

Distribution of file formats

Treemap (sans jpg/jpegs because holy moley there's 37M of those and that's more than all the others combined):

Fixed grammar:

treemap_uploads.png (1×1 px, 159 KB)

Previous:
treemap_uploads.png (1×1 px, 161 KB)

Total files uploaded to Commons (as of right now) by extension:

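| media | extension | uploads |
| --- | --- | --- |
| audio | ogg | 773305 |
| audio | oga | 6180 |
| audio | flac | 6140 |
| audio | mid | 4993 |
| audio | wav | 3512 |
| audio | opus | 410 |
| docs | pdf | 354765 |
| docs | djvu | 60524 |
| image | jpg/jpeg | 36918799 |
| image | png | 2268026 |
| image | svg | 1176530 |
| image | tif/tiff | 807921 |
| image | gif | 153959 |
| image | xcf | 1008 |
| image | webp | 95 |
| video | ogv | 66610 |
| video | webm | 41161 |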
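
As a quick sanity check on the counts listed above, the following sketch totals uploads per media type and compares jpg/jpeg against everything else. The numbers are copied from this comment; the variable names are just for illustration.

```python
from collections import defaultdict

# (media, extension, uploads), copied from the counts listed above.
UPLOADS = [
    ("audio", "ogg", 773305), ("audio", "oga", 6180), ("audio", "flac", 6140),
    ("audio", "mid", 4993), ("audio", "wav", 3512), ("audio", "opus", 410),
    ("docs", "pdf", 354765), ("docs", "djvu", 60524),
    ("image", "jpg/jpeg", 36918799), ("image", "png", 2268026),
    ("image", "svg", 1176530), ("image", "tif/tiff", 807921),
    ("image", "gif", 153959), ("image", "xcf", 1008), ("image", "webp", 95),
    ("video", "ogv", 66610), ("video", "webm", 41161),
]

# Sum uploads per media type.
totals = defaultdict(int)
for media, _ext, n in UPLOADS:
    totals[media] += n

grand_total = sum(totals.values())
jpeg = next(n for _m, ext, n in UPLOADS if ext == "jpg/jpeg")
others = grand_total - jpeg   # every non-jpg/jpeg file combined
jpeg_share = jpeg / grand_total
```

With these figures, `others` comes to 5,725,139, so jpg/jpeg alone is indeed more than all the other extensions combined.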

Growth of number of deleters over time:

cumulative_deleters.png (450×900 px, 49 KB)
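
One way a cumulative deleter count like this could be derived is to find the first month each user deleted anything, then take a running total. This is a sketch on made-up events, not the query behind the figure; user names and months are hypothetical.

```python
from itertools import accumulate

# Hypothetical (user, month of deletion) events, for illustration only.
events = [
    ("alice", "2017-01"),
    ("bob", "2017-01"),
    ("alice", "2017-02"),
    ("carol", "2017-03"),
]

# First month in which each user deleted a file.
first_month = {}
for user, month in events:
    if user not in first_month or month < first_month[user]:
        first_month[user] = month

months = sorted({m for _user, m in events})
# Number of users whose first deletion falls in each month.
new_per_month = [sum(1 for fm in first_month.values() if fm == m) for m in months]
# Running total of distinct deleters up to and including each month.
cumulative = dict(zip(months, accumulate(new_per_month)))
```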

How many users deleted N-many files:

deleter_activity.png (900×1 px, 98 KB)
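
The "how many users deleted N-many files" distribution is a count of counts. A minimal sketch, assuming a deletion log with one entry per deleted file naming the deleting user (the log below is made up):

```python
from collections import Counter

# Hypothetical deletion log: one entry per deleted file, naming the deleting user.
deletion_log = ["alice", "bob", "alice", "carol", "alice", "bob", "dave"]

# user -> number of files that user deleted
files_per_deleter = Counter(deletion_log)

# N -> number of users who deleted exactly N files
deleters_per_count = Counter(files_per_deleter.values())
```

Here two users deleted one file each, one deleted two, and one deleted three.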

mpopov moved this task from In progress to Done on the Discovery-Analysis (Current work) board.

Queries & data uploaded to https://github.com/wikimedia-research/SDoC-Initial-Metrics

All figures from this ticket: https://github.com/wikimedia-research/SDoC-Initial-Metrics/tree/master/T177356

Moving this into 'Done' as I don't think there's anything left to do on this one.

@Ramsey-WMF, @Abit, and @Capt_Swing - can you take a look at the findings in this ticket and let us know whether this satisfies the concerns, or whether you need / want more clarification?

@debt I am quite satisfied :). I'll need to go over some of this with the Multimedia team to verify, but so far this looks solid. Thanks to @mpopov for the speedy work. Very enlightening.