Background
In T177356, we computed several metrics related to file types and deletion behavior on Wikimedia Commons as of October 2017, using the filearchive and image tables in the commonswiki database. The codebase for this analysis is on GitHub.
Objective
Your task is to answer the following questions:
- The distribution of file types and extensions. (We have discussed that the image_media_type variable is not very reliable; is there a workaround?)
- Cumulative upload counts and new uploads per month, by file extension. (The definition of "new uploads" is an open question. We could count files that have never been deleted as of today, files that were not deleted within a month of upload, or use another definition that makes sense.)
- The proportion of files deleted within a month of being uploaded. How does this deletion rate change over time?
- Number of deleters (users who have deleted at least one file) over time
- How many files has each user deleted?
- Time to deletion, broken down by file type and reason for deletion (copyright violation vs. other)
- [Optional] Reasons for deletion
Feel free to use different data or presentation to answer these questions, or to uncover additional insights that were not in the original report.
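As a starting point, two of the questions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the original report: it assumes file names and timestamps have already been pulled out of the database as plain strings, that timestamps use the 14-digit MediaWiki format (YYYYMMDDHHMMSS), and that "within a month" means within 30 days. Classifying files by the extension in the file name is one possible workaround for the unreliable media type field. All function names and sample rows below are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

TS_FORMAT = "%Y%m%d%H%M%S"  # assumed 14-digit MediaWiki timestamp format


def file_extension(name):
    """Return the lower-cased extension of a file name, or "" if it has none."""
    stem, dot, ext = name.rpartition(".")
    return ext.lower() if dot and stem else ""


def extension_counts(names):
    """Count files per extension."""
    return Counter(file_extension(n) for n in names)


def deleted_within(uploaded, deleted, days=30):
    """True if the file was deleted no more than `days` days after upload."""
    if deleted is None:  # file was never deleted
        return False
    delta = datetime.strptime(deleted, TS_FORMAT) - datetime.strptime(uploaded, TS_FORMAT)
    return timedelta(0) <= delta <= timedelta(days=days)


def monthly_deletion_rate(records, days=30):
    """Map each upload month ("YYYY-MM") to the share of its uploads deleted within `days` days.

    `records` is an iterable of (upload_timestamp, deletion_timestamp_or_None).
    """
    uploads, deleted = Counter(), Counter()
    for uploaded, deleted_ts in records:
        month = f"{uploaded[:4]}-{uploaded[4:6]}"
        uploads[month] += 1
        if deleted_within(uploaded, deleted_ts, days):
            deleted[month] += 1
    return {m: deleted[m] / uploads[m] for m in sorted(uploads)}


# Made-up sample rows, for illustration only.
sample_names = ["Sunset.JPG", "Map_of_Europe.svg", "Talk.ogg", "Scan.page.1.jpg", "README"]
sample_records = [
    ("20170901120000", "20170915120000"),  # deleted after 14 days
    ("20170910000000", None),              # never deleted
    ("20170920000000", "20171231000000"),  # deleted, but after more than 30 days
    ("20171001000000", "20171002000000"),  # deleted after 1 day
]

print(extension_counts(sample_names))
print(monthly_deletion_rate(sample_records))
```

Grouping by upload month keeps the denominator fixed for each cohort, so the most recent month should be read with care: files uploaded less than 30 days before the data snapshot have not yet had a full month in which to be deleted.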