
bot causes duplicate reference definitions
Closed, Invalid · Public

Description

The InternetArchiveBot performed this edit: https://en.wikipedia.org/w/index.php?title=Aprimo&type=revision&diff=898631049&oldid=894626155&diffmode=source

Which caused a duplicate reference definition. The bot should detect this situation and either not make the edit or correct the duplicate definition itself.

Event Timeline

Cirdan added a subscriber: Cirdan. · May 26 2019, 9:08 AM

Can you clarify what you mean by "duplicate reference definitions"? In the first diff you linked, IABot expanded shortened URLs to their full form, and in the second diff, it added {{webarchive}} to two links.

References in wikipedia can be reused. We can say <ref name="AnchorName">The Pittsburgh Press</ref> to define a reference named "AnchorName", then use only <ref name="AnchorName"/> when we want to repeat that same reference elsewhere in the article.

It turns out that we can't redefine a reference name, because doing so causes an error. And rightfully so: how could the rendering of the article possibly know which reference, by name, was really being invoked when there are multiple definitions with the same name? If we code <ref name="AnchorName">The Pittsburgh Press</ref> and also have <ref name="AnchorName">The Pittsburgh Post-Gazette</ref>, we end up with an error message in the {{reflist}} output in the article. One of the two (or more!) references is masked, and doesn't actually appear in the rendered article.

There's something of an exception. We can define a reference with the same name twice (or more) if the content of the reference is exactly the same -- even including whitespace. <ref name="AnchorName">The Pittsburgh Press</ref> twice, exactly, is just fine. Adding <ref name="AnchorName">The  Pittsburgh Press</ref> (with an extra space) gets an error, because the definitions aren't exactly the same.
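The exact-match rule described above can be sketched in a few lines. This is a hypothetical, simplified scan (not MediaWiki's actual parser): it only handles double-quoted names and ignores self-closing usages, `group=` attributes, and nested tags, but it shows why identical duplicates fold cleanly while a single extra space triggers the error.

```python
import re
from collections import defaultdict

def find_conflicting_refs(wikitext):
    # Collect every *definition* of each named reference.
    pattern = re.compile(
        r'<ref\s+name\s*=\s*"(?P<name>[^"]+)"\s*>(?P<body>.*?)</ref>',
        re.DOTALL,
    )
    defs = defaultdict(set)
    for m in pattern.finditer(wikitext):
        defs[m.group("name")].add(m.group("body"))
    # A name is only an error if its definitions are not byte-for-byte equal.
    return {name for name, bodies in defs.items() if len(bodies) > 1}

ok = ('<ref name="AnchorName">The Pittsburgh Press</ref> and again '
      '<ref name="AnchorName">The Pittsburgh Press</ref>')
bad = ('<ref name="AnchorName">The Pittsburgh Press</ref> and '
       '<ref name="AnchorName">The  Pittsburgh Press</ref>')  # extra space

print(find_conflicting_refs(ok))   # set() -- identical content folds cleanly
print(find_conflicting_refs(bad))  # {'AnchorName'} -- whitespace differs
```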

When an article on Wikipedia has duplicate references, it ends up in this category so it can be found and fixed: https://en.wikipedia.org/wiki/Category:Pages_with_duplicate_reference_names

Before InternetArchiveBot edited the Aprimo article, it looked like this: https://en.wikipedia.org/w/index.php?title=Aprimo&oldid=894626155 There were no referencing errors.

After InternetArchiveBot edited that same article, it looked like this: https://en.wikipedia.org/w/index.php?title=Aprimo&oldid=898631049. The referencing errors are very visible: "Cite error: Invalid <ref> tag; name "NadiaCameron" defined multiple times with different content (see the help page)."

The edit performed by InternetArchiveBot caused this error and made the page worse. Pages shouldn't render with errors, and references should be correctly defined, with no duplicates. InternetArchiveBot changed one definition of <ref name="NadiaCameron"> to a different URL, changed a date, and added a "dead-url=no" parameter. But it left the other definition of <ref name="NadiaCameron"> alone. After the edit, then, those two definitions didn't match.

I think that InternetArchiveBot should modify both definitions so that they match. Or, it should notice that there are two definitions and not make any changes at all. Or, it should make the change it wants to one, and remove the duplicate definition from the other. Any of these choices would be better than the outcome we have now, which is the bot adding errors to the article and causing it to render with a red error message in the references list.

I hope that helps!

Thanks for the explanation!

From what you write, the fact that two references with the same name are still rendered when they have identical content is a nice feature of the parser, but it's still invalid wikitext, and therefore should be fixed.

IABot does its best to not mess up the wikitext even in the case of invalid syntax, but it is generally out of scope for this project to deal with invalid wikitext. Unless these cases arise very frequently (I cannot recall encountering this issue in 15+ years -- but that's obviously not a reliable sample size), the errors made visible by the IABot edit should be fixed. Can you give an estimate of the frequency at which these problems arise?

I can't understand why you're saying that the article has invalid wikitext. The page rendered correctly and without error before InternetArchiveBot made its edits. To be clear, it was only after InternetArchiveBot made its edit that the page rendered with an error.

InternetArchiveBot might be doing its best, but for sure it can do better: robots, after all, should do no harm. All that's needed is a test to find another definition (not usage) of a named anchor. That test should be performed before modifying any named reference. If another definition exists, the bot should either not make the modification or it should take corrective action.

I expect that you don't recall this issue happening before because you don't know what a duplicate reference is and therefore haven't been checking that InternetArchiveBot is not creating them. Incomplete testing doesn't mean there's not a problem. But I think that implementing the test I suggest would allow you to monitor the bot to see how often it is causing these errors, and that would automate measurement at the source.
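The pre-edit test suggested above could be sketched as follows. This is a hypothetical guard, not IABot's actual code: before rewriting one definition of a named reference, count how many definitions (not self-closing usages) of that name exist, and only proceed when the definition is unique.

```python
import re

def count_definitions(wikitext, name):
    # Match full definitions <ref name="X">...</ref>; the trailing ">" (rather
    # than "/>") excludes self-closing usages like <ref name="X"/>.
    pattern = re.compile(
        r'<ref\s+name\s*=\s*"%s"\s*>.*?</ref>' % re.escape(name),
        re.DOTALL,
    )
    return len(pattern.findall(wikitext))

def safe_to_edit_one_definition(wikitext, name):
    # Editing a single definition is only safe when it is the only one;
    # otherwise the bot should skip the edit or fix every copy.
    return count_definitions(wikitext, name) == 1

article = (
    '<ref name="NadiaCameron">old citation</ref> text '
    '<ref name="NadiaCameron">old citation</ref> more '
    '<ref name="NadiaCameron"/>'
)
print(count_definitions(article, "NadiaCameron"))        # 2
print(safe_to_edit_one_definition(article, "NadiaCameron"))  # False
```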

> I can't understand why you're saying that the article has invalid wikitext. The page rendered correctly and without error before InternetArchiveBot made its edits. To be clear, it was only after InternetArchiveBot made its edit that the page rendered with an error.

Just because the page renders without visible error messages does not mean its wikitext is valid. The MediaWiki parser is quite generous, similar to how web browsers will tolerate invalid HTML. In addition to the source-code-level problem, it does not make any sense semantically to define the very same reference twice.

> InternetArchiveBot might be doing its best, but for sure it can do better: robots, after all, should do no harm.

IABot does not break valid wikitext. If it does, that's an error.

> All that's needed is a test to find another definition (not usage) of a named anchor. That test should be performed before modifying any named reference. If another definition exists, the bot should either not make the modification or it should take corrective action.

The problem is that there are hundreds of similar cases, all of which could be fixed by "just adding some test". IABot's developer (who is not me; I'm just helping out now and then with the bug reports) is understandably very reluctant to add lots of extra features. It's hard enough to deal with valid wikitext and template syntax; we are at the 15th beta version and there are still many unsolved issues.

> I expect that you don't recall this issue happening before because you don't know what a duplicate reference is and therefore haven't been checking that InternetArchiveBot is not creating them. Incomplete testing doesn't mean there's not a problem. But I think that implementing the test I suggest would allow you to monitor the bot to see how often it is causing these errors, and that would automate measurement at the source.

I meant that independent of the bot, I do not recall ever coming across the situation where there are two references with exactly the same content, and hence was not aware that this special case is rendered without any visible error message. Instead of trying to patch IABot to deal with these cases, I suggest running a bot on your wiki to clean up these cases. It's much easier, prevents issues for users who try to alter the reference in the future, and prevents issues if MediaWiki ever stops ignoring these cases.

> The MediaWiki parser is quite generous,

I suppose that's true, but if the parser renders the page without complaint, how can we decide if it's invalid? Is there some other authority that tells us what is "valid" or "invalid" wiki text, if not the parser itself?

While defining the same reference twice doesn't make sense when the definitions are different, defining precisely the same definition multiple times works fine.

> The problem is that there are hundreds of similar cases, all of which could be fixed by "just adding some test".

I'm sure there are many issues in IABot that should be addressed. I'm hoping for a fix for only this one.

> I meant that independent of the bot, I do not recall ever coming across the situation where there are two references with exactly the same content,

Oh, I see. Multiple references with the same content occur in the wild with high frequency. A common source is templates. In fact, I don't think it would be possible for MediaWiki to stop ignoring cases where a reference is redefined with the same text because it would break the complicated (Byzantine!) network of templated reference generation mechanisms all over the public wikis.

Meanwhile, here's another disruptive edit in this same pattern: https://en.wikipedia.org/w/index.php?title=Batman_vs._Two-Face&type=revision&diff=900226786&oldid=897580275&diffmode=source

Cirdan closed this task as Invalid. (Edited) · Jun 7 2019, 9:35 AM

> The MediaWiki parser is quite generous,

> I suppose that's true, but if the parser renders the page without complaint, how can we decide if it's invalid? Is there some other authority that tells us what is "valid" or "invalid" wikitext, if not the parser itself?

As I said above, just because the parser ignores it does not mean it's valid wikitext. It makes no semantic sense to have two references with the same name; it just so happens that if they have exactly the same content, there is a clear way to deal with them. For example, all web browsers render a page where multiple elements share the same ID just fine, but it's still invalid HTML, because an ID must be unique (it's not an identifier otherwise).

These issues need to be fixed on-wiki, just as other syntax issues are fixed every day. The problem IABot is running into is the same that any user editing such a reference will run into: They edit a named reference and a visible error message appears even though they made a perfectly fine edit. I fully agree that this is a problem, but it's not a problem for IABot to solve. Please find a bot operator on your wiki to regularly clean up duplicate references so that they don't become an issue once one of them is changed by a user or a bot.

> The problem is that there are hundreds of similar cases, all of which could be fixed by "just adding some test".

> I'm sure there are many issues in IABot that should be addressed. I'm hoping for a fix for only this one.

This is not an issue in IABot, it's an issue with the article's source code. IABot will not clean up syntax errors, this is beyond the scope of the project. Just like a human editor, IABot looks at one reference at a time and (rightfully) assumes that making a syntactically correct edit to it does not cause errors elsewhere.

> does not mean it's valid wikitext

What does make it invalid, then? I don't know of any documentation that specifies the syntax of wikitext, so as far as I can tell, the parser is the only test of validity that we have. Do you know of such documentation? Can you share a pointer to it, please?

> It makes no semantic sense to have two references with the same name,

I think that's debatable. You seem to have missed my point that there are many templates that would break if duplicate folding didn't work as it does. As a result, the reference-rendering code actively handles this case.

> IABot looks at one reference at a time and (rightfully) assumes that making a syntactically correct edit to it does not cause errors elsewhere.

Human editors manage to avoid this problem by looking at the rest of the article to see if the definition they're editing (or creating) is a duplicate. If a bot is going to make similar edits, I think it only makes sense that we hold it to the same standard -- and the general standard of "do no harm".

I don't think we're going to reach an agreement here. IABot will not add workarounds and checks for wikitext which is invalid, which would greatly inflate the code base and therefore result in many more errors.

Just FYI, the VisualEditor cannot deal with duplicate references either (search for spiegel-chebli) and treats them just as IABot does.

Another faulty edit is here: https://en.wikipedia.org/w/index.php?title=British_Rail_Class_314&type=revision&diff=902025852&oldid=900760109&diffmode=source

This case is typical: there are two references with shortened URLs in the archive-url parameter, but InternetArchiveBot chose to expand only one. If this edit is necessary, why perform it on only one of the two locations where the target issue occurs?

It's fine if you don't want to fix the bot, but I had just hoped you'd be able to answer my questions about it.

I may actually have a different solution here that should be fairly easy to implement: I can simply delete the second reference and replace it with a self-closing one. That seems like the best solution.

That seems like a fine solution to me. If the bot can't handle badly formatted input, it should do what it can to detect that condition before it performs an edit. I think your proposal would do that.
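The fix proposed above could be sketched like this. It is an assumed behavior, not IABot's actual code: keep the first definition of each named reference, replace any later byte-for-byte identical definition with a self-closing <ref name="..."/>, and leave differing duplicates untouched for a human to resolve.

```python
import re

def collapse_duplicate_refs(wikitext):
    pattern = re.compile(
        r'<ref\s+name\s*=\s*"(?P<name>[^"]+)"\s*>(?P<body>.*?)</ref>',
        re.DOTALL,
    )
    seen = {}

    def replace(m):
        name, body = m.group("name"), m.group("body")
        if name not in seen:
            seen[name] = body
            return m.group(0)                 # first definition: keep as-is
        if seen[name] == body:
            return '<ref name="%s"/>' % name  # identical duplicate: fold
        return m.group(0)                     # differing duplicate: leave for a human

    return pattern.sub(replace, wikitext)

before = '<ref name="A">cite</ref> text <ref name="A">cite</ref>'
print(collapse_duplicate_refs(before))
# <ref name="A">cite</ref> text <ref name="A"/>
```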