Phabricator

Write function for extracting image changes from diffs on the cluster
Closed, Resolved · Public

Description


Goal: Python UDF that could be called from a Jupyter notebook (via wmfdata) with the following details:

  • Input: two wikitext strings from the wikitext_history Hive table.
  • Output: list of media files that were added/removed/altered (alternatively, just a simple count of media changes). This will include all media types -- not just images but also gifs, videos, etc.
  • Likely depends on mwparserfromhell but possibly just uses custom regexes if mwparserfromhell is too bulky for our purposes. I have a more general-purpose library for analyzing diffs but that's much bulkier than what we need for this task.
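As a rough sketch of what such a function might look like, here is a regex-based version (per the note above that custom regexes may suffice instead of mwparserfromhell). The prefix list covers only the enwiki defaults and the function names are illustrative assumptions, not the final implementation:

```python
import re

# Media namespace prefixes on enwiki; other wikis also have
# language-specific aliases that would need to be added.
MEDIA_LINK = re.compile(
    r'\[\[\s*(?:File|Image|Media)\s*:\s*([^\]|]+)',
    flags=re.IGNORECASE)

def extract_media(wikitext):
    """Set of media filenames referenced via [[File:...]]-style links."""
    return {m.strip() for m in MEDIA_LINK.findall(wikitext)}

def media_changes(old_wikitext, new_wikitext):
    """(added, removed) lists of media files between two revisions."""
    old, new = extract_media(old_wikitext), extract_media(new_wikitext)
    return sorted(new - old), sorted(old - new)
```

A simple set difference gives both the file lists and, via their lengths, the plain count of media changes mentioned as an alternative output.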

Types of Images in Wikitext

Using https://en.wikipedia.org/wiki/People_of_the_Dominican_Republic as an example, here are the different ways in which images are added to wikitext and how to handle them:

In-line file (include)

Standard file syntax placed in-line in the wikitext to insert an image into a section (along with a caption and parameters that control size/position). We want to detect these, and they are easy to detect: they take the form of wikilinks with a small set of possible prefixes and can be reliably extracted.

=== Independence ===
[[File:Dominican Republic Physiography.jpg|thumb|Map of the [[Dominican Republic]]]]
Santo Domingo attained independence as the Dominican Republic in 1844.
...

Main infobox media (include)

Images within an infobox template often lack the starting/closing brackets and are instead just filenames attached to the appropriate parameter name. While these are harder to detect (they aren't wikilinks), we could add special code that looks for a set of known file extensions within templates, so they are likely detectable with high accuracy.

{{Infobox ethnic group
|group   = Dominicans<br /><small>''Dominicanos''</small>
|flag = Flag of the Dominican Republic.svg
...
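One possible way to catch these bare filenames is to scan for known media extensions. A minimal sketch, where both the extension subset and the regex are assumptions rather than the final implementation:

```python
import re

# Assumed subset of known media extensions
# (see https://commons.wikimedia.org/wiki/Commons:File_types).
KNOWN_EXTENSIONS = ['.jpg', '.jpeg', '.png', '.svg', '.gif', '.webm', '.ogv']

# A candidate filename: a run of plain text (no brackets, braces, pipes,
# equals signs, or newlines) ending in a known extension.
BARE_FILENAME = re.compile(
    r'[^\[\]{}|=\n]+(?:'
    + '|'.join(re.escape(e) for e in KNOWN_EXTENSIONS) + r')',
    flags=re.IGNORECASE)

def extract_bare_filenames(wikitext):
    """Bare filenames (e.g. infobox parameter values) found by extension."""
    return {m.strip() for m in BARE_FILENAME.findall(wikitext)}
```

On the infobox above, this picks out `Flag of the Dominican Republic.svg` even though it is never written as a wikilink.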

Content-specific Icons (exclude)

Little icons -- most commonly country flags -- can pepper an article, but thankfully they are generally transcluded via templates without explicit filenames. For example, the same infobox above also includes a bunch of country flags via the flagcountry template, like this:

...
|region1      = {{flagcountry|Dominican Republic}}
|pop1         = 9,341,916 {{small|(2017)}}<ref name=ENI-2017/>
...

This means that searching for images in the wikitext will generally miss these icons, which I think is desired behavior in this case. If there are any particularly common icon templates that we want to track, though, this would be possible without much effort.
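If tracking a few specific icon templates did become desirable, a small allowlist of template names could be matched directly. A hypothetical sketch (the template names and regex are chosen for illustration):

```python
import re

# Hypothetical allowlist of icon templates worth tracking explicitly.
TRACKED_ICON_TEMPLATES = re.compile(
    r'\{\{\s*(flagcountry|flagicon)\s*\|([^}|]+)',
    flags=re.IGNORECASE)

def tracked_icons(wikitext):
    """(template, argument) pairs for tracked icon templates."""
    return [(name.lower(), arg.strip())
            for name, arg in TRACKED_ICON_TEMPLATES.findall(wikitext)]
```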

Incidental Icons (exclude)

Images marking, e.g., PDF files, or indicating that an article is a stub, can also pepper an article. These are generally added via templates as well (though less directly) and so won't be included in this detection approach, which I think is the desired behavior. For example, the cite web template below inserts a PDF icon due to the PDF external link.

{{cite web |url=https://www.oecd-ilibrary.org/the-dominican-republic-s-migration-landscape_5jft8jw5w7wc.pdf?itemId=%2Fcontent%2Fcomponent%2F9789264276826-6-en |title= The Dominican Republic's migration landscape|date=2017 |website=www.oecd-ilibrary.org |format=PDF|access-date=2020-11-10}}

Exceptions

We will likely want some exceptions to the above categories -- e.g., sometimes standard icons are included in-line, as when providing contextual information for a table (example). To sort these and other potential edge cases out, we'll likely also want to exclude images that appear in over k articles (where we might want to look at the data to tune k to something reasonable).
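One way to implement that exclusion, sketched here with a hypothetical usage_counts mapping (in practice the counts would come from something like the imagelinks table, and the default k is just a placeholder):

```python
def filter_common_images(files, usage_counts, k=10000):
    """Drop files used in more than k articles.

    `usage_counts` maps filename -> number of articles using it;
    files absent from the mapping are treated as rare and kept.
    """
    return {f for f in files if usage_counts.get(f, 0) <= k}
```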

Event Timeline

Note that there are many images-to-be-excluded that are used directly rather than in templates. E.g. the checkmarks in the first paragraph in this article. I think most of those could be filtered out by excluding images used across a certain # or % of pages.

@Isaac will this include media types other than images? Our notifications will include all media types (images, video, etc.), though we expect the vast majority of suggestions to be images.

Note that there are many images-to-be-excluded that are used directly rather than in templates. E.g. the checkmarks in the first paragraph in this article. I think most of those could be filtered out by excluding images used across a certain # or % of pages.

Good point and thanks for the example -- I'll add that to the description!

will this include media types other than images?

Yes. On enwiki, the prefix list is File, Image, and Media, which should cover all mediatypes. When it comes to detecting media added to infoboxes, we'll just have to be careful when building the list of file extensions -- i.e. not just .png and .jpg but also .gif and .webm etc.

Ok -- first pass at the code for doing this work. @cchen there's an example notebook that analyzes Russian Wikipedia edits from December 2021. I filtered out IP editors and bots. It runs in under an hour but I think much of that is the cost of scanning mediawiki_history and not so much the actual processing of the wikitext so I expect (hope) that scaling it up to more months / wikis would be straightforward. Notebook: https://github.com/geohci/miscellaneous-wikimedia/blob/master/media-changes/media_changes.ipynb

Summary:

+-------+------------+---------+---------+-----------+-----------+------------+--------------+
|wiki_db|was_reverted|num_edits|pct_edits|num_editors|pct_editors|images_added|images_removed|
+-------+------------+---------+---------+-----------+-----------+------------+--------------+
|ruwiki |false       |12042    |4.032    |2025       |20.034     |18124       |8199          |
|ruwiki |true        |611      |0.205    |344        |3.403      |977         |706           |
+-------+------------+---------+---------+-----------+-----------+------------+--------------+
For December 2021 user edits to Russian Wikipedia articles (bots and anons excluded):
* 4.24% of all edits that month changed images
* 4.8% of edits that changed images were reverted
* ~20% of editors who edited that month had at least one edit that affected images in the article
* In non-reverted edits, 18124 images were added and 8199 were removed. In practice, some of these were likely minor changes like this: https://wiki-topic.toolforge.org/media-changes?lang=ru&revid=118246882

@CBogen @matthiasmullie @SWakiyama @Miriam and any others: if you want to test out the code for identifying media changes, you can use this interface to test the approach used in the notebook: https://wiki-topic.toolforge.org/media-changes?lang=en&revid=160195773
If you just want to see what file types it identifies in the current revision of an article, there's no fancy interface but this API will spit that out: https://media-changes.wmcloud.org/api/v1/media-list?lang=en&title=People_of_the_Dominican_Republic

Specifically, for the types listed in the description:

  • In-line file: I identify all links in the wikitext (brackets with text in between) and then filter them down to ones that start with any media prefixes (Media:, Image:, File:, plus any language-specific aliases)
  • Main infobox media: I also do a pass through the article to identify any likely file names based on known media extensions. These are unioned with the in-line file links found. If we're worried about false positives, it might be possible to restrict the search to just, e.g., templates and <gallery> tags, but that would add some more overhead, and I haven't seen any false positives yet, so I assume they are rare enough not to affect statistics. Importantly, the extensions that I'm checking for are (case-insensitive; feel free to nominate others to add if I missed common ones):
# https://commons.wikimedia.org/wiki/Commons:File_types
IMAGE_EXTENSIONS = ['.jpg', '.png', '.svg', '.gif']
VIDEO_EXTENSIONS = ['.ogv', '.webm', '.mpg', '.mpeg']
AUDIO_EXTENSIONS = ['.ogg', '.mp3', '.mid', '.webm', '.flac', '.wav']
  • Icons: as promised, icons like checkmarks/flags/PDF symbols that are inserted via templates are not captured unless someone specifically adds the filename -- i.e. I catch Flag of the Dominican Republic.svg but not {{flagcountry|Dominican Republic}}
  • Exceptions: the code doesn't implement this check yet but it should just be a matter of scanning the imagelinks table for a given wiki and identifying common filenames to filter out.

Thank you @Isaac! I will test it out on target wikis.

@Isaac I ran the code on Russian and Portuguese Wikipedia and was able to get the baseline stats for media additions. I was also able to use this on Cebuano Wikipedia to count the images added by Lsj-bot, where the images were all added to the image gallery.

Also for media extensions, I pulled the following from the mediawiki_image table and added some to the list:

IMAGE_EXTENSIONS = ['.jpg', '.png', '.svg', '.gif', '.jpeg', '.tif', '.bmp', '.webp', '.xcf']
VIDEO_EXTENSIONS = ['.ogv', '.webm', '.mpg', '.mpeg']
AUDIO_EXTENSIONS = ['.ogg', '.mp3', '.mid', '.webm', '.flac', '.wav', '.oga']
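For checking candidate strings against these lists, a simple case-insensitive suffix test should suffice; a sketch (the helper name is mine, not from the notebook):

```python
IMAGE_EXTENSIONS = ['.jpg', '.png', '.svg', '.gif', '.jpeg', '.tif',
                    '.bmp', '.webp', '.xcf']
VIDEO_EXTENSIONS = ['.ogv', '.webm', '.mpg', '.mpeg']
AUDIO_EXTENSIONS = ['.ogg', '.mp3', '.mid', '.webm', '.flac', '.wav', '.oga']

# .webm appears in both the video and audio lists, so dedupe.
ALL_EXTENSIONS = tuple(set(IMAGE_EXTENSIONS + VIDEO_EXTENSIONS + AUDIO_EXTENSIONS))

def is_media_filename(name):
    """True if the string ends with a known media extension."""
    return name.lower().endswith(ALL_EXTENSIONS)
```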

I ran the code on Russian and Portuguese Wikipedia and was able to get the baseline stats for media additions. I was also able to use this on Cebuano Wikipedia to count the images added by Lsj-bot, where the images were all added to the image gallery.

That's awesome! Did you stick to a single month's worth of data or scale up? If so, how long did it take (not a big deal if you don't remember)?

Also for media extensions, I pulled the following from the mediawiki_image table and added some to the list:

Thanks for sharing this back -- I'll update my code as well.

I ran the code on Russian and Portuguese Wikipedia and was able to get the baseline stats for media additions. I was also able to use this on Cebuano Wikipedia to count the images added by Lsj-bot, where the images were all added to the image gallery.

That's awesome! Did you stick to a single month's worth of data or scale up? If so, how long did it take (not a big deal if you don't remember)?

I ran a single month's worth of data, but for both wikis at the same time. It took a little less than an hour.

I ran a single month's worth of data, but for both wikis at the same time. It took a little less than an hour.

@cchen Awesome, that's good to hear! I'm resolving this task then but don't hesitate to reach out if you notice any bugs / edge cases or have any other questions. I'm more than happy to do any code review etc. too for the analysis as needed / desired.