Details
Goal: Python UDF that could be called from a Jupyter notebook (via wmfdata) with the following details:
- Input: two wikitext strings from the wikitext_history Hive table.
- Output: list of media files that were added/removed/altered (alternatively, just a simple count of media changes). This will include all media types -- not just still images but also GIFs, videos, etc.
- Likely depends on mwparserfromhell but possibly just uses custom regexes if mwparserfromhell is too bulky for our purposes. I have a more general-purpose library for analyzing diffs but that's much bulkier than what we need for this task.
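As a rough sketch of the intended input/output contract (not a final implementation), the comparison could be a multiset difference over the filenames extracted from each revision. The names `extract_media` and `media_changes` are hypothetical, the extractor is deliberately minimal (regex only, English `File:`/`Image:` prefixes assumed), and "altered" media such as caption changes are not handled here. Registering the function as a UDF for use from the notebook is left out.

```python
import re
from collections import Counter

# Deliberately minimal, hypothetical extractor: only [[File:...]] / [[Image:...]] links.
_FILE_LINK = re.compile(r"\[\[\s*(?:File|Image)\s*:\s*([^|\]\n]+)", re.IGNORECASE)


def extract_media(wikitext):
    """Return the media filenames found in `wikitext`, keeping duplicates."""
    return [m.strip().replace("_", " ") for m in _FILE_LINK.findall(wikitext or "")]


def media_changes(old_wikitext, new_wikitext):
    """Compare two revisions and report which media files were added or removed."""
    old = Counter(extract_media(old_wikitext))
    new = Counter(extract_media(new_wikitext))
    return {
        # Counter subtraction keeps only positive counts, so each side is one-directional.
        "added": sorted((new - old).elements()),
        "removed": sorted((old - new).elements()),
    }
```

Using counters rather than sets means that adding a second copy of an already-present file still registers as a change; a simple count of media changes would be the combined length of the two lists.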
Types of Images in Wikitext
Using https://en.wikipedia.org/wiki/People_of_the_Dominican_Republic as an example, here are the different ways in which images are added to wikitext and how to handle each:
In-line file (include)
Standard file syntax in-line in the wikitext that inserts an image in a section (along with a caption and parameters that control size/position). We want to detect these, and they are easy to detect: they take the form of wikilinks with a fixed set of possible prefixes, so they can be reliably extracted.
=== Independence === [[File:Dominican Republic Physiography.jpg|thumb|Map of the [[Dominican Republic]]]] Santo Domingo attained independence as the Dominican Republic in 1844. ...
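A regex-only sketch of how these in-line links might be extracted, assuming the English `File:` and `Image:` prefixes (other wikis have localized aliases that would need to be added). `MEDIA_PREFIXES`, `normalize_title`, and `inline_media` are illustrative names, not an existing API. Only the filename is captured, so nested wikilinks in the caption do not get in the way.

```python
import re

# Assumed English-language prefixes; localized aliases would need to be added per wiki.
MEDIA_PREFIXES = ("File", "Image")

# Capture the filename up to the first "|" or the closing "]]".
INLINE_MEDIA = re.compile(
    r"\[\[\s*(?:" + "|".join(MEDIA_PREFIXES) + r")\s*:\s*([^|\[\]\n]+?)\s*(?:\||\]\])",
    re.IGNORECASE,
)


def normalize_title(name):
    """Underscores to spaces, collapse whitespace, uppercase the first letter."""
    name = " ".join(name.replace("_", " ").split())
    return name[:1].upper() + name[1:]


def inline_media(wikitext):
    """Return normalized filenames of media embedded with [[File:...]] syntax."""
    return [normalize_title(m) for m in INLINE_MEDIA.findall(wikitext)]
```

Links written with a leading colon (`[[:File:...]]`) point to the file page rather than embedding the file, and this pattern skips them.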
Main infobox media (include)
Images within an infobox template often lack the opening/closing brackets and are instead just filenames assigned to the appropriate parameter. While these are harder to detect (they aren't wikilinks), we could add some special code to look for a set of known file extensions within templates, so they are likely detectable with high accuracy.
{{Infobox ethnic group |group = Dominicans<br /><small>''Dominicanos''</small> |flag = Flag of the Dominican Republic.svg ...
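One way the extension-based check might look, as a heuristic sketch only: scan `|param = value` pairs and keep values that end in a known media extension. The extension list is an assumed, incomplete sample, `template_media` is a hypothetical name, and values containing `/` are skipped as a crude way of ignoring URLs in citation templates.

```python
import re

# Assumed sample of media extensions; the real list would need to be reviewed.
MEDIA_EXTENSIONS = (
    "svg", "png", "jpg", "jpeg", "gif", "tif", "tiff", "webp",
    "ogg", "oga", "ogv", "webm", "wav", "mp3",
)

TEMPLATE_MEDIA = re.compile(
    r"\|\s*[^=|{}\n]+?\s*=\s*"                       # "|param ="
    r"([^|{}\[\]<>=/\n]+?\.(?:" + "|".join(MEDIA_EXTENSIONS) + r"))"
    r"\s*(?=\||\}\}|\n|$)",                          # value must end the parameter
    re.IGNORECASE,
)

# Some parameters include the namespace prefix (e.g. "File:Foo.jpg"); strip it.
_PREFIX = re.compile(r"^\s*(?:File|Image)\s*:\s*", re.IGNORECASE)


def template_media(wikitext):
    """Return bare filenames that appear as template parameter values."""
    return [_PREFIX.sub("", m).strip() for m in TEMPLATE_MEDIA.findall(wikitext)]
```

Because nothing in this pattern checks which template a parameter belongs to, it picks up filenames in any template, not just infoboxes; restricting it to specific templates would probably need a real parser such as mwparserfromhell.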
Content-specific Icons (exclude)
Little icons -- most commonly country flags -- can pepper an article, but thankfully they are generally transcluded via templates without explicit filenames. For example, the same infobox above also has a bunch of country flags that are included via the flagcountry template like this:
... |region1 = {{flagcountry|Dominican Republic}} |pop1 = 9,341,916 {{small|(2017)}}<ref name=ENI-2017/> ...
This means that searching for images in the wikitext will generally miss these icons, which I think is desired behavior in this case. If there are any particular common icon templates though that we want to track, this would be possible without much effort.
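If we did decide to track a few icon templates, a minimal sketch could simply count calls to a configured list of template names. The list below is illustrative only (`flagcountry` comes from the example above; `flagicon` is an assumed addition), and `icon_template_counts` is a hypothetical name.

```python
import re
from collections import Counter

# Illustrative list; which templates to track would be decided from the data.
TRACKED_ICON_TEMPLATES = ("flagcountry", "flagicon")


def icon_template_counts(wikitext, templates=TRACKED_ICON_TEMPLATES):
    """Count calls to each tracked icon template, ignoring case."""
    pattern = re.compile(
        r"\{\{\s*(" + "|".join(re.escape(t) for t in templates) + r")\s*[|}]",
        re.IGNORECASE,
    )
    return Counter(name.lower() for name in pattern.findall(wikitext))
```

Comparing these counts between two revisions would give the number of icon changes without having to resolve which file each template call renders.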
Incidental Icons (exclude)
Images that, e.g., mark PDF files or indicate that the article is a stub can also pepper an article. These are generally also added via templates (though less directly) and so won't be included in this detection approach, which I think is the desired behavior. For example, the cite web template below inserts a PDF image due to the PDF external link.
{{cite web |url=https://www.oecd-ilibrary.org/the-dominican-republic-s-migration-landscape_5jft8jw5w7wc.pdf?itemId=%2Fcontent%2Fcomponent%2F9789264276826-6-en |title= The Dominican Republic's migration landscape|date=2017 |website=www.oecd-ilibrary.org |format=PDF|access-date=2020-11-10}}</ref>
Exceptions
We will likely want some exceptions to the above categories -- e.g., sometimes standard icons are included in-line, as when providing contextual information for a table (example). To sort out these and other potential edge cases, we'll likely also want to exclude images that appear in over k articles (where we might want to look at the data to tune k to be reasonable).
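The k-article cutoff could be sketched as below, assuming we already have a mapping from article title to the media files it uses. The in-memory dict and the names `widely_used_media` and `filter_media` are hypothetical; in practice the usage counts would more likely be computed in a query over the whole wiki.

```python
from collections import Counter


def widely_used_media(media_by_article, k):
    """Return the set of files that appear in more than `k` distinct articles."""
    counts = Counter()
    for files in media_by_article.values():
        counts.update(set(files))  # count each file at most once per article
    return {name for name, n in counts.items() if n > k}


def filter_media(files, excluded):
    """Drop files that are on the exclusion list, preserving order."""
    return [f for f in files if f not in excluded]
```

Counting each file once per article keeps a single article that repeats an icon many times from pushing it over the threshold on its own.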