
File type and deletion metrics on Wikimedia Commons (Redux)
Closed, Resolved, Public

Description

Background

In T177356, we computed several metrics related to file types and deletion behavior on Wikimedia Commons as of Oct 2017, using the filearchive and image tables in the commonswiki database. The codebase for this analysis is on GitHub.

Objective

Your task is to answer the following questions:

  • The distribution of file types and extensions. (We've discussed that the image_media_type variable is not very reliable; is there a workaround?)
  • Cumulative upload counts and new uploads per month by file extension. (The definition of "new uploads" is an open question. We could count files that have never been deleted as of today, files that were not deleted within a month of upload, or use another definition that makes sense.)
  • The proportion of files deleted within a month of upload. How has this deletion rate changed over time?
  • Number of deleters (users who have deleted at least one file) over time
  • How many files has each user deleted?
  • Time to deletion, broken up by file type and reason for deletion (copyright violation vs other)
  • [Optional] Reasons for deletion
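For the first question, one possible workaround is to derive the extension from the file name instead of relying on the media type column. The sketch below uses made-up file names shaped like values of img_name; treating "jpeg" and "jpg" as the same extension is an assumption for illustration, not something decided in this task.

```r
# Hypothetical file names, shaped like values of img_name in the image table.
img_name <- c("Sunset.JPG", "Map_of_Europe.svg", "Sunrise.jpeg",
              "Speech.ogg", "Portrait.jpg")

# Extract the extension from the file name and normalize its case.
ext <- tolower(tools::file_ext(img_name))

# Assumption: fold spelling variants of the same format together.
ext[ext == "jpeg"] <- "jpg"

# Distribution of extensions, most common first.
ext_counts <- sort(table(ext), decreasing = TRUE)
print(ext_counts)
```

The same normalization would need to be applied to both current and archived files so the two are comparable.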

Feel free to use different data or presentation to answer the questions, or to surface additional insights that were not in the original report.
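To make the deletion-rate question concrete, here is a minimal base R sketch on toy data. The column names (upload_ts, delete_ts) and the 30-day cutoff for "within a month" are assumptions chosen for illustration; they are not the schema of the image or filearchive tables.

```r
# Toy data: hypothetical upload and deletion dates (NA = never deleted).
files <- data.frame(
  upload_ts = as.Date(c("2017-01-05", "2017-01-20", "2017-02-03", "2017-02-10")),
  delete_ts = as.Date(c("2017-01-10", NA, "2017-06-01", "2017-02-15"))
)

# Days between upload and deletion; NA for files that still exist.
files$days_to_deletion <- as.numeric(files$delete_ts - files$upload_ts)

# Assumption: "within a month" means within 30 days of upload.
files$deleted_in_month <- !is.na(files$days_to_deletion) &
  files$days_to_deletion <= 30

# Share of each month's uploads that were deleted within 30 days.
files$upload_month <- format(files$upload_ts, "%Y-%m")
deletion_rate <- aggregate(deleted_in_month ~ upload_month, data = files, FUN = mean)
print(deletion_rate)
```

Files uploaded less than 30 days before the data snapshot have not yet had a full month in which to be deleted, so the most recent month would probably need to be excluded or flagged.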

Event Timeline

@chelsyx First draft complete. See initial codebase and output: https://github.com/MeganNeisler/SDoC-Baseline-Metrics-Redux/tree/master/T186575. I'm still working on adding an analysis of the reasons for file deletion but I've finished addressing the primary questions. Let me know if you have any feedback or suggestions.

@MNeisler good job! I like how you wrote your findings. Some initial feedback about the visualizations:

  • density plot (as is) is not an appropriate visualization type for the variables in the second-to-last figure ("Monthly deletions of newly uploaded files")
    • specifically, the density in that plot represents the probability of a file (that was deleted within 1 month, for example) to be in a specific month
    • so the two types of deletions ("within 1 month" vs "after 1 month") aren't comparable in the way you want
    • you can show what you're trying to say by stacking the two in a way that fills in the area (see examples below)
  • avoid having raw labels like "TRUE"/"FALSE" in legends – one way to fix it is something like this:
library(magrittr) # provides the %<>% compound assignment pipe
monthly_deletions$delete_in_month %<>% factor(
  levels = c(TRUE, FALSE),
  labels = c("Within 1 month of upload", "After 1 month since upload")
)
# then change the fill label to "Deleted"

Notice how the graph is easier to read.

I recommend one of the following possible alternatives that show how the proportion of files deleted within 1 month vs after 1 month changes over time:

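library(dplyr)
library(ggplot2)

# Option 1: stacked area chart of monthly counts, normalized to proportions
monthly_deletions %>%
  group_by(upload_month, delete_in_month) %>%
  tally() %>%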
  ggplot(aes(x = upload_month, y = n, fill = delete_in_month)) +
  geom_area(position = "fill", color = "black") +
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_x_date(date_labels = "%Y", date_breaks = "1 year", date_minor_breaks = "1 year") +
  wmf::theme_min() +
  labs(
    fill = "Deleted", x = "Date", y = "Proportion of monthly deletions",
    title = "Monthly deletions of newly uploaded files",
    subtitle = "Includes only files deleted as of 2018-02-01"
  )

# Option 2: histogram with one bin per upload month, normalized to proportions
monthly_deletions %>%
  ggplot(aes(x = upload_month, fill = delete_in_month)) +
  geom_histogram(
    bins = length(unique(monthly_deletions$upload_month)),
    position = "fill"
  ) +
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_x_date(date_labels = "%Y", date_breaks = "1 year", date_minor_breaks = "1 year") +
  wmf::theme_min() +
  labs(
    fill = "Deleted", x = "Date", y = "Proportion of monthly deletions",
    title = "Monthly deletions of newly uploaded files",
    subtitle = "Includes only files deleted as of 2018-02-01"
  )

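# Option 3: conditional density estimate; after_stat(count) weights each group by its size
monthly_deletions %>%
  ggplot(aes(x = upload_month, y = after_stat(count), fill = delete_in_month)) +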
  geom_density(position = "fill") +
  geom_hline(yintercept = 0.5, linetype = "dashed") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_x_date(date_labels = "%Y", date_breaks = "1 year", date_minor_breaks = "1 year") +
  wmf::theme_min() +
  labs(
    fill = "Deleted", x = "Date", y = "Proportion of monthly deletions",
    title = "Monthly deletions of newly uploaded files",
    subtitle = "Includes only files deleted as of 2018-02-01"
  )

Seeing data files uploaded together with code always makes me worried so I checked that the RData files in the data subdirectories did not contain PII. No problems there! :)
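A check like this can be done without touching the current workspace by loading each file into a fresh environment and inspecting column names. The sketch below is only an illustration: the temporary file, the example data frame, and the list of "suspicious" column names are all hypothetical, and it does not describe how the check above was actually performed.

```r
# Create a throwaway .RData file so the sketch is self-contained.
tmp <- tempfile(fileext = ".RData")
example_counts <- data.frame(extension = c("jpg", "png"), n = c(10L, 3L))
save(example_counts, file = tmp)

# Load into a fresh environment so nothing in the workspace is overwritten.
env <- new.env()
loaded <- load(tmp, envir = env)

# For each stored object, list its column names (or its class if not a data frame).
contents <- lapply(mget(loaded, envir = env), function(obj) {
  if (is.data.frame(obj)) names(obj) else class(obj)
})
print(contents)

# Hypothetical column names that would warrant a closer look.
suspicious <- c("user_name", "user_id", "ip", "email")
flagged <- lapply(contents, intersect, suspicious)
print(flagged)
```

Column names are only a first pass; free-text fields such as deletion comments could still contain user names and would need to be read directly.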

Thanks @mpopov for the review!

@MNeisler Good job with the update! I've left some comments on the PR: https://github.com/MeganNeisler/SDoC-Baseline-Metrics-Redux/pull/1.

Thanks @mpopov and @chelsyx for the review and feedback! See current updated codebase and output: https://github.com/MeganNeisler/SDoC-Baseline-Metrics-Redux/tree/master/T186575.

There are a couple of additional next steps I could work on (e.g. calculating and plotting the CI (confidence interval or credible interval) for plots of the proportion of files deleted within a month), but I'm going to focus on T187827 for now. Let me know if you have other changes or comments.
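For reference, a confidence interval for each monthly proportion could be computed along these lines. This is a minimal sketch using base R's binom.test (an exact Clopper-Pearson interval); the monthly counts and column names are invented for illustration, and a credible interval from a Beta posterior would be an equally reasonable choice.

```r
# Hypothetical monthly counts: uploads, and how many were deleted within a month.
monthly <- data.frame(
  upload_month = as.Date(c("2017-10-01", "2017-11-01", "2017-12-01")),
  uploads = c(1000, 1200, 900),
  deleted_in_month = c(50, 84, 27)
)

# Exact (Clopper-Pearson) 95% confidence interval for each month's proportion.
# mapply returns a 2 x n matrix (lower, upper); transpose to one row per month.
ci <- t(mapply(
  function(x, n) binom.test(x, n, conf.level = 0.95)$conf.int,
  monthly$deleted_in_month, monthly$uploads
))

monthly$proportion <- monthly$deleted_in_month / monthly$uploads
monthly$lower <- ci[, 1]
monthly$upper <- ci[, 2]
print(monthly)
```

The lower and upper columns could then be drawn around the proportion line with geom_ribbon or geom_errorbar.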

Looks good! I don't think CI stuff is that important here, so shall we move it into Done?