Page MenuHomePhabricator

[XL] mw-add-media and mw-remove-media tags are added to edits without changes in media
Closed, ResolvedPublicBUG REPORT

Description

Issue
After T266067 added the mw-add-media and mw-remove-media tags, we see quite a few edits tagged even when there are no changes in media. Examples as follows.

This affects T265771 as it means the statistics from these tags are inaccurate.

What should have happened instead:
The edits mentioned above should not be tagged.

Event Timeline

CBogen renamed this task from mw-add-media and mw-remove-media tags are added to edits without changes in media to [XL] mw-add-media and mw-remove-media tags are added to edits without changes in media.Jul 14 2021, 4:39 PM
CBogen assigned this task to matthiasmullie.
CBogen moved this task from To Do to In Progress on the Image-Suggestions board.

It looks like these problems are mostly caused by the use of templates/modules that include images.

Imagine a page that includes a template that includes a file. The file in template is then changed.

  1. it changes on all pages that transclude this template, but we can't track those uses: there are no corresponding edits on all of those pages to tag
  2. the next edit on a page that transcludes said template (regardless of which kind of edit it is) will likely be tagged with one of these change tags, since that's when we discover that the images are different from what they used to be

That said, I suspect that, for most usages of files that are transcluded in such way on a lot of pages, we don't really even care about their usage (they're most likely icons or generic filler, not "content")
If that assumption is correct, #1 is not much of an issue (though it could theoretically become one if modules were to start automatically including relevant images, e.g. based on interwiki/wikibase relationships)

Which leaves us with #2: edits incorrectly being tagged as media changes, when the changes actually happened inside a template.
Sadly, we can't simply ignore all content from templates: e.g. enwiki makes heavy use of {{Infobox Artwork}}, where the infobox template is used to render the image (valuable content media that I assume we're interesting in tracking) by passing the image file as a parameter.

The only solutions I can think of ATM, are:

  • re-parse the content of the old revision: this should generate the current list of media used on the page (with old content, but new media via templates) that would be a better (or additional) source to compare the new media with. Downside: this is a pretty expensive computation (and was already implemented & reverted previously)
  • compare only old & new media where the filename is also present in the old & new wikitext (that essentially assumes all relevant images will explicitly be typed out in wikitext in some form (template param, file link, ...)). Downside: might have a small amount of false positives & negatives.

@CBogen before I proceed, can you confirm my assumptions that:

  • we don't care to track transcluded media changes beyond the template page (i.e. changes show up for the template edit, but not for any page that includes such template)
  • it's acceptable that we may not be able to pick up certain relevant media usages that don't require an edit (e.g. some scripted, automatically retrieved image that could change under certain conditions)
  • a minor amount of false positives/negatives is acceptable if the other options ends up not being feasible

@CBogen before I proceed, can you confirm my assumptions that:

  • we don't care to track transcluded media changes beyond the template page (i.e. changes show up for the template edit, but not for any page that includes such template)

My understanding based on your comment is that this means that we won't track any changes to images in info boxes (like the heavily used {{Infobox Artwork}} on enwiki). Is that true? That may still be okay, but we need to clearly acknowledge it in our dashboards and anytime we use this data.

  • it's acceptable that we may not be able to pick up certain relevant media usages that don't require an edit (e.g. some scripted, automatically retrieved image that could change under certain conditions)

Same answer as above - I think it's okay, but we need to acknowledge it.

  • a minor amount of false positives/negatives is acceptable if the other options ends up not being feasible

Can you give an example of what types of false positive/negatives we'd see?

cc @nettrom_WMF and @cchen because we chatted about this at our last analytics meeting and they might have some thoughts.

@CBogen before I proceed, can you confirm my assumptions that:

  • we don't care to track transcluded media changes beyond the template page (i.e. changes show up for the template edit, but not for any page that includes such template)

My understanding based on your comment is that this means that we won't track any changes to images in info boxes (like the heavily used {{Infobox Artwork}} on enwiki). Is that true? That may still be okay, but we need to clearly acknowledge it in our dashboards and anytime we use this data.

No, this would still track those, since those would be changes in the page that shows the infobox; not the template (see additional clarification below)

  • it's acceptable that we may not be able to pick up certain relevant media usages that don't require an edit (e.g. some scripted, automatically retrieved image that could change under certain conditions)

Same answer as above - I think it's okay, but we need to acknowledge it.

  • a minor amount of false positives/negatives is acceptable if the other options ends up not being feasible

Can you give an example of what types of false positive/negatives we'd see?

See clarification below.

cc @nettrom_WMF and @cchen because we chatted about this at our last analytics meeting and they might have some thoughts.

It all depends.

  • A template can simply include a static file (e.g. some icon). If they were to swap that for another file, that's probably be an edit of (only) the template page. This would impact (and change the icon on) all pages that transclude the template, but they will not require an edit in any of those pages.
  • A template can include a variable file (e.g. the infobox template accepts a parameter for the image). If one wants to change that file, that would not require an edit to the template, but it'd be an edit to the page that transcludes the template (changing the parameter to another filename)
  • Through modules, templates can also include some logic to figure out which file to display when. This could range from "if some parameter matches 'USA', show this flag icon" to "if today is Monday, show this image" to "show the image of the wikidata item linked to this page". (though I don't think this complex stuff currently happens much), We have no way of telling these apart, and given that it's scripted and doesn't always require an edit before it changes, we can't really track these anyway.

I assume we're not interested in tracking mass media usage changes like swapping out an icon that is used on thousands of pages. Right?
Template usage seems like a good enough proxy, I suspect. I think we'll want to ignore mass changes, and only include media changes triggered by changes in the content of the parent page.

False positives/negatives could be variable file usages changes caused by changes in the content. E.g.:

  • there's a list of countries, they all have flag icons (which are automatically added through some templates)
  • if a new country is added to the list, that would probably be tracked as "add-media"
  • if some existing country's flag ends up being changed in the template, it should no (longer) be tracked

Does this all make sense?

It all depends.

  • A template can simply include a static file (e.g. some icon). If they were to swap that for another file, that's probably be an edit of (only) the template page. This would impact (and change the icon on) all pages that transclude the template, but they will not require an edit in any of those pages.
  • A template can include a variable file (e.g. the infobox template accepts a parameter for the image). If one wants to change that file, that would not require an edit to the template, but it'd be an edit to the page that transcludes the template (changing the parameter to another filename)
  • Through modules, templates can also include some logic to figure out which file to display when. This could range from "if some parameter matches 'USA', show this flag icon" to "if today is Monday, show this image" to "show the image of the wikidata item linked to this page". (though I don't think this complex stuff currently happens much), We have no way of telling these apart, and given that it's scripted and doesn't always require an edit before it changes, we can't really track these anyway.

Got it!

I assume we're not interested in tracking mass media usage changes like swapping out an icon that is used on thousands of pages. Right?
Template usage seems like a good enough proxy, I suspect. I think we'll want to ignore mass changes, and only include media changes triggered by changes in the content of the parent page.

Yes, agreed.

False positives/negatives could be variable file usages changes caused by changes in the content. E.g.:

  • there's a list of countries, they all have flag icons (which are automatically added through some templates)
  • if a new country is added to the list, that would probably be tracked as "add-media"
  • if some existing country's flag ends up being changed in the template, it should no (longer) be tracked

Does this all make sense?

It makes sense to me and seems like something we can acknowledge and accept as a shortcoming of this tool (and something to note in the dashboards when we use it).

Change 722849 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[mediawiki/core@master] More secure comparison for media change tags

https://gerrit.wikimedia.org/r/722849

I had a quick look at the patch, and I have a number of questions. Quoting from the commit message:

The previous code works mostly fine, except that changes in transcluded templates don't invoke another LinksUpdate.php run for all callers, and the imagelinks entries that are used to compare media usage, may drift out of sync.

Editing a template causes all pages that use the template to be re-parsed. If that doesn't happen, that's a pretty massive bug.

However, in some cases, re-parsing all callers of the template make take a long time (for templates with millions of callers, this can take hours). So if there is an edit in the meantime, it may be mis-tagged as described here. It should be possible to detect whether the page is "dirty" by checking page_touched against page_links_updated I think.

Can you confirm that in the cases where this happened, the edit to the template was shortly (minutes) before the edit to the page?

To fix this, we'll figure out whether we're in a scenario where this could occur (is there a previous revision, are templates in use, are there potential changes in media usage)

Not sure this optimization will help much. Nearly all "real" pages where media changes happen will have previous revisions and templates.

It would be better to detect whether the page in question was "dirty" when the edit happened by checking page_touched/page_links_updated.

and verify by parsing the old edit once more.

LinksUpdate is triggered during the edit process. At this time, we should be able to get hold of a previously cached ParserOutput for the old revision of the page, without the need to re-parse.

Cparle added subscribers: matthiasmullie, Cparle.
  • confirmed that (at least locally) LinksUpdate.php is run on pages containing the template when a template is updated
  • if it hasn't run for some reason (e.g. the example @daniel 's example above) then we will get the problem described in the ticket
  • @matthiasmullie 's patch fixes the problem

working on updating the patch based on Daniel's suggestions ...

Editing a template causes all pages that use the template to be re-parsed. If that doesn't happen, that's a pretty massive bug.

That wasn't the case on my local setup, but I may simply not have waited long enough or had config issues. It looks like @Cparle already confirmed that this is indeed usually working just fine.

Not sure this optimization will help much. Nearly all "real" pages where media changes happen will have previous revisions and templates.

Comparing previous revisions & templates will indeed only have a minor impact, but checking for "media usage" (based on imagelinks) changes should make a massive change compared to all edits.
And thanks for the suggestion to check for a dirty state - that should narrow it down even more.

Ok so quick update. The approach in this patch

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/722849/7/

of creating/adding the tags in LinksUpdate.php is incorrect, because LinksUpdate is called after the page has been saved but before the links have been updated, so the links are by definition out of date and we can't tell if they're out of date because of this edit or another edit - page_touched will always be >= page_links_updated

Probably would be better to decide whether to tag the edit before it gets saved. SpamBlacklist and AbuseFilter looks at diffs in external links - check how they do it

Here's some technical thoughts on the current implementation we've been exploring. I have additional concerns that may make these irrelevant anyway. See next comment.


Ok so quick update. The approach in this patch

https://gerrit.wikimedia.org/r/c/mediawiki/core/+/722849/7/

of creating/adding the tags in LinksUpdate.php is incorrect, because LinksUpdate is called after the page has been saved but before the links have been updated, so the links are by definition out of date and we can't tell if they're out of date because of this edit or another edit - page_touched will always be >= page_links_updated

How is this incorrect? LinksUpdate.php indeed runs before updating the links, but it is the thing updating those links (and generating them), so it knows the latest links.

I assume the "incorrect" part refers to the ability to compare page_touched vs page_links_updated?
I don't think that really matters all that much, though: at that point, we've already established that the old media & new media are different. All we learn from it is whether the old data is valid or not.
That old data may have been stale, but the new data surely isn't (just got parsed)
So possible scenarios:

  1. changes detected in new imagelinks vs. valid old imagelinks: we parse the old version to reconfirm changes
  2. changes detected in new imagelinks vs. stale old imagelinks: we parse the old version to reconfirm changes
  3. no changes detected in new imagelinks vs. valid old imagelinks: we don't do anything, but that's ok because there were no actual changes
  4. no changes detected in new imagelinks vs. stale old imagelinks: we don't do anything, but that's *likely* fine; this would also happen when there's a significantly delayed imagelinks update (probably in a template) whose changes are "reverted" (because new imagelinks match the stale version) in a regular page edit. Even if this were to happen, it may even be debatable whether this should register as image change or not.

Knowing for sure whether the old imagelinks (in db) are fully up to date matters to the 1st and 4th scenario:

  • 4th because we could rule out the chance of not recording a change in media when in face there is one, in this unlikely scenario. This is something we can live with, IMO.
  • 1st because we can simply skip an extra parse if we know we don't need to do it again. It would be great it we could optimize this. Unfortunately, I don't think we can, and with the other checks in place, this already only happens when there are actual changes in media usage (i.e. nowhere near the order of magnitude of *all edits*)

IMO, what we had originally works well enough, and isn't too expensive. It's only 1 additional (old edit) parse for every (likely) media-changing edit, and that parse might even be in cache anyway.

That said, we may want to consider dropping the idea of parsing altogether - see next comment.

Probably would be better to decide whether to tag the edit before it gets saved. SpamBlacklist and AbuseFilter looks at diffs in external links - check how they do it

SpamBlacklist (via onEditFilter hook) simply runs regular expressions against the full wikitext contents of a page.
AbuseFilter is more convoluted, though:

Basically, AbuseFilter does a "simpler version" of what we need here, it doesn't even check or care whether old externallinks are still valid. But that's ok in their case.
It doesn't usually have to care much about whether or not old external links are stale, because it doesn't usually interfere with how it works:

  • if something were to be added via a template (where such external links update propagation may be severely lagging), it would likely already prevent the template edit from going through in the first place
  • and even if it passed that stage, Abusefilter would still find that an offending new link was added because the only way it could get it wrong is if the lagging externallinks in DB already matched that link (which is also extremely unlikely to be a case that happens, because Abusefilter likely would've prevented that earlier edit as well)

While we've been working on this task, we've come to realise that "tracking media changes in articles" is not as clear-cut a definition as we'd hoped.

Ignoring the technical difficulties ATM; here are a few conceptual questions about what we even want to track, because I'm afraid we may not be able to get what we really want. Here are a few things to consider.

  1. We are *not* able to track *every* change in media via edit tags, because sometimes media changes in transcluded templates, which leaves no edit on the transcluding page. We may only be able to track additions/deletions of transcluded media when their templates get added/removed from the page, not when the contents of the template change (but we're probably not really interested in those for the most part)
  2. While we can track additions/deletions of transcluded media, I'm not sure we even want to track those. In the majority of cases, they're stuff like this, where we're simply gaining an icon, UI element, or other gratuitous image use that's mostly not very relevant for the article.
  3. That said, templates often also render relevant stuff - e.g. whatever is in infobox (or any other standardised interface) is parsed in there via templates...
  4. And while templates render both relevant & irrelevant images, the same is true for the individual pages themselves - especially more obscure pages/namespaces, where fewer people have had a chance to go over the contents and "standardise away" media used as interface elements. But even on high-profile pages, some irrelevant stuff remains...

So...

  • we can continue to pursue the existing approach of using parsed data, which includes additions/deletions of transcluded media (many of which will be irrelevant, e.g. icons)
  • we can implement a rough image-extraction thing on only the page's wikitext, ignoring any image in templates (many of which are useful, e.g. things in infobox)
  • and regardless of how we chose to deal with templates (and manage to get that implementation right), we're unlikely to come up with data that accurately represents meaningful image use, without applying a ton of filters (e.g. removing images used on many pages, or below certain dimensions, or only within certain namespaces) that's probably too complex to live in an edit tag.

So, questions:

  • do we want to continue to attempt to track media changes via edit tags? (= are the limitations/caveats/dirty data described above acceptable)
  • if yes, do we want to include (insofar we can) or ignore media additions/deletions in templates?

FYI here are a few random examples of icons or UI elements (both via template transclusion, or directly used on-page)

Screenshot 2021-12-23 at 14.07.00.png (169×224 px, 22 KB)

Screenshot 2021-12-23 at 13.54.21.png (218×167 px, 23 KB)

Screenshot 2021-12-23 at 14.01.54.png (150×246 px, 25 KB)

Screenshot 2021-12-23 at 13.57.47.png (47×457 px, 16 KB)

Screenshot 2021-12-23 at 13.56.16.png (22×57 px, 4 KB)

Screenshot 2021-12-23 at 13.54.35.png (26×1 px, 21 KB)

Screenshot 2021-12-23 at 14.03.23.png (577×142 px, 41 KB)

Screenshot 2021-12-23 at 13.56.41.png (122×280 px, 20 KB)

Thanks for this analysis, @matthiasmullie ! You're right that this is problematic. We don't want to track icons and UI elements, but we definitely do want to track image additions in infoboxes - and it sounds like we can't have it both ways with this approach. But at the same time it's critical that we implement a way to track these image additions both for monitoring the success of our image suggestions approaches and for reporting back to the SDAW grant.

@Miriam or @cchen, do you have any ideas of other approaches we could use to track image additions? @Miriam I know especially that you and your team have thought about this when creating the image suggestions algorithm.

@MMiller_WMF , how is the Growth team tracking additions of images in articles via the newcomer task?

but we definitely do want to track image additions in infoboxes

This rules out the option of a simple wikitext parse - we definitely do need the parsed imagelinks.


Note that we can still capture these edit tags (well, there are some technical hurdles as well, but I believe we'll overcome them) - but they're extremely noisy and will require at least another step to filter out all icons etc. (and I'm not sure that the way that data is collected via edit tags is one that would allow such filtering: since it doesn't contain detail about the actual images that were changed, it may be hard to figure out whether the image(s) was noise...)

Depending on the exact data that we're after, it may make more sense for someone to write a script that runs periodically that does something along these lines:

  • looks at all data in the imagelinks tables
  • filters out image that are used more than a certain threshold (this would filter out most generic things, e.g. flags, barnstars, ui icons, ...)
  • maybe filter out images by dimensions (to weed out most remaining icons), though that might not be feasible (would probably take awhile to fetch dimensions for all files)
  • store that data somewhere
  • and compare it over time

I'm not familiar enough with analytics infrastructure to determine how feasible this even is, or whether it'd even give us the data we desire - it's just another thought...

Ugh. I just remembered that the reason we wanted these edit tags, was not (or not only) to track media usage per se, but (at least partly) to figure out where people are making these changes (VE, bots, mobile, ...)
We're probably stuck with edit tags, then, since part of that question (*where*) requires cross-referencing with other edit tags...
I'm not sure whether we can even clean up that data, though.

Unless we figure out a way to filter these edit tags afterwards, I guess that leaves us with:

  • capture all media additions/removals via templates (which includes many, many icons etc, but also valuable stuff like infobox images) and accept the noise
  • capture only images rendered on that specific page (probably less noisy in terms of iconography, but will also lose some valuable content) and accept the remaining noise & gaps

And... I'm not sure we can simply "accept the noise", since it's probably not evenly distributed across edit tools and may skew the data we're hoping to get in the first place: I suspect that certain tools lend themselves more to introducing such noise than others (e.g. bots/wikitext edits will probably add more templates than VE/mobile edits)

I guess it might be possible to create a bot to add tags later on, via the api, that would work a little similar to the script I just described in an earlier comment:

I guess that could work, but unless we have significant pre-existing infrastructure to accomplish parts of this, I'm not sure that this is feasible: it's both a lot of work, and resource-intensive to run & maintain.

Hi @matthiasmullie @CBogen,

Just chiming in to share previous work on detecting icons.

We did a bit of research in T268350 to understand what the best way to detect icons from the tables in the Data Lake was. Image size is not ideal, as the image table describes the original resolution of the file on Commons (not the resolution in the Wikipedia article), and for icons or logos this can be as big as normal article illustrations.

The best solution we found was to filter out images based on how often they appear in a Wiki, based on the imagelinks table.

For the image suggestion task, we had a function to automatically detect icons as images that are used in more than X pages in a given Wiki, where X depends on a formula implemented as the get_threshold function in the original code: https://github.com/mirrys/ImageMatching/blob/main/algorithm.ipynb.

It has been working quite well so far, and we could easily generate lists of candidate icons for each Wiki, based on that function and a Hive query!

@CBogen @matthiasmullie -- for the Growth team's "add an image" task, we are using an edit tag that explicitly tracks whether an edit was done through our workflow: "#Suggested: add images". In other words, I don't think our tag is trying to detect whether an image was added, but rather whether any edit was done while the user was in the structured task. And in general, because of the way the workflow is constructed, the only edit the user can do is to add an image. That said, there is an exception that sometimes causes us trouble: users in our workflow have a way to escape the workflow to the visual editor to make an arbitrary edit. That also causes the edit tag to get placed, but we can't tell it's not an image being added. If this were solved, it would be beneficial for the Growth team's work.

Here's the task where we implemented our tag: T293166: Add an image: edit tag

CC @Tgr in case I described this incorrectly.

Yeah, we added a mechanism to VisualEditor where the client can send additional data and extensions can react to it via a hook; the custom client that handles structured edits sets a flag indicating the task type, and the backend adds the corresponding tag. (This is pretty robust; we had a previous mechanism which resulted in the bugs @MMiller_WMF describes.) We don't try to detect anything about the edit; the user is basically restricted to clicking on buttons, so the client-side logic can easily keep track of what kind of changes are done to the article.

Re: the original topic, I don't think you have any other option than re-rendering the old revision and comparing. Both the links table and the cached parser output will reflect the state right after the previous edit, not the state right before the current edit, so they will still be vulnerable to this issue. (You will still get false positives for templates which randomize what image to show, sometimes used in portal pages, but that's an edge case and probably won't matter.) Putting it in LinksUpdate is probably not ideal as it would slow a bunch of other more important updates down, but you could just schedule your follow-up job to do it - it would increase the number of parses done by jobrunners a bit, but probably not by much.

Closing in favor of T299667. Please re-open if there's a reason we still need this.

Change 722849 abandoned by Matthias Mullie:

[mediawiki/core@master] More accurate comparison for media change tags

Reason:

Looks like we will not need this after all

https://gerrit.wikimedia.org/r/722849

Change 756014 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[operations/mediawiki-config@master] Stop capturing media change tags

https://gerrit.wikimedia.org/r/756014

Re-opening until the existing incomplete implementation is removed (patches in CR)

Change 755996 had a related patch set uploaded (by Matthias Mullie; author: Matthias Mullie):

[mediawiki/core@master] Remove change tags for media additions/removals

https://gerrit.wikimedia.org/r/755996

Change 756014 merged by jenkins-bot:

[operations/mediawiki-config@master] Stop capturing media change tags

https://gerrit.wikimedia.org/r/756014

Mentioned in SAL (#wikimedia-operations) [2022-02-07T12:08:09Z] <taavi@deploy1002> Synchronized wmf-config/CommonSettings.php: Config: [[gerrit:756014|Stop capturing media change tags (T286362)]] (1/2) (duration: 00m 50s)

Mentioned in SAL (#wikimedia-operations) [2022-02-07T12:09:06Z] <taavi@deploy1002> Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:756014|Stop capturing media change tags (T286362)]] (2/2) (duration: 00m 50s)

Mentioned in SAL (#wikimedia-operations) [2022-02-07T12:09:50Z] <taavi> taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:756014|Stop capturing media change tags (T286362)]] (2/2) (duration: 00m 50s)

Change 755996 merged by jenkins-bot:

[mediawiki/core@master] Remove change tags for media additions/removals

https://gerrit.wikimedia.org/r/755996

Code has all been removed & I've just completed cleaning up all previously applied tags to all wikis.