
IABot makes edits that cause new errors in edited articles
Open, High priority, Public

Description

IABot has a pattern of modifying references in articles that cause a duplicate reference error. The error manifests itself in the rendered article by displaying text like " Cite error: The named reference "ARIA News 28 Oct" was defined multiple times with different content (see the help page)." in red, in the {{reflist}} output of the article.

The problem seems to be that IABot is reacting to pre-existing duplicate reference definitions in articles by modifying one, and making the other a self-closing
<ref name="here"/> tag. This mostly works, except when the self-closing tag appears as a value in the refs= parameter to the {{reflist}} template.
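
To make the pattern concrete, here is a minimal, hypothetical sketch (the reference name "here" is taken from the example above; the citation content is a placeholder) of the state that triggers the error:

    <!-- body of the article: the full definition is kept here -->
    Some sentence.<ref name="here">{{cite web |url=... |title=...}}</ref>

    <!-- references section: the second copy reduced to a self-closing tag -->
    {{reflist|refs=
    <ref name="here" />
    }}

With the self-closing tag sitting inside refs=, the rendered reference list shows the "was defined multiple times with different content" error described above.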

Here are about a dozen examples of this problem that I've recently found and manually corrected.

I hope that IABot can be repaired to avoid making new errors in articles while still achieving its editing goals.

Event Timeline

Mikeblas created this task. Sep 14 2019, 3:33 PM
Restricted Application added a subscriber: Liuxinyu970226. Sep 14 2019, 3:33 PM

Just looking at the first example, I'm already confused. It deduped a named reference but a different reference lit up with a red error. Can you help me understand what is going on here?

Looking more closely at the example, it looks like "ARIA News 28 Oct" was defined 3 times, where one had vastly different content, and the other was an identical duplicate. I don't see an error here other than IABot exposing that there was a ref error not being caught earlier.

Considering the "ARIA News 28 Oct" reference in the "ARIA Music Awards of 2014" article, there was previously no duplicate reference error. This can be seen by viewing the article revision before InternetArchiveBot made its edits: no error is listed in the "References" section of that version. In that revision, "ARIA News 28 Oct" is defined twice. One definition is in the body of the article; the other is given as a value to the refs= parameter of the {{reflist}} template. These references aren't duplicates as far as the rendering engine is concerned because, even though they have the same name, their content is identical, character for character, including case and whitespace.

In the revision IABot created, we now see red text in the references section that says "Cite error: The named reference "ARIA News 28 Oct" was defined multiple times with different content (see the help page)." This makes it clear that IABot's edit introduced the duplicate reference error.

IABot added archive-url=, archive-date=, and url-status= parameters to the first "ARIA News 28 Oct" definition, the one in the body of the article.

The second definition of "ARIA News 28 Oct" is in the refs= parameter of {{reflist}}. This one was changed to be a self-closing ref. If it were in the body of the article, that would be fine. But here in the refs= list, it ends up causing a duplicate reference definition error. I'm not sure I can explain *why* that's true, but it's certainly the case.
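
A rough before/after reconstruction of what is being described (the citation fields are elided with "..." since their exact contents aren't reproduced here):

    <!-- Before IABot's edit: two character-for-character identical definitions, no error -->
    ...<ref name="ARIA News 28 Oct">{{cite web |url=... |title=...}}</ref>

    {{reflist|refs=
    <ref name="ARIA News 28 Oct">{{cite web |url=... |title=...}}</ref>
    }}

    <!-- After IABot's edit: archive parameters added to the body copy, refs= copy reduced to self-closing -->
    ...<ref name="ARIA News 28 Oct">{{cite web |url=... |archive-url=... |archive-date=... |url-status=... |title=...}}</ref>

    {{reflist|refs=
    <ref name="ARIA News 28 Oct" />
    }}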

I think that IABot should handle these cases by removing repeated references from refs= entirely instead of making them self-closing. Or, ideally, it should not buck the status quo of the article: leave the actual definition of the reference in the refs= list and put the self-closing reference in the body of the article.

In the fix I manually made, the self-closed reference definition is deleted.
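
Sketches of the two fixes mentioned here, with citation content elided and a hypothetical "other" reference standing in for any other list-defined references:

    <!-- Fix applied manually: the self-closed copy is deleted from refs=; the full definition stays in the body -->
    ...<ref name="ARIA News 28 Oct">{{cite web |url=... |archive-url=... |archive-date=... |url-status=... |title=...}}</ref>

    {{reflist|refs=
    <ref name="other">...</ref>
    }}

    <!-- Alternative suggested above: keep the full definition in refs= and use the self-closing form in the body -->
    ...<ref name="ARIA News 28 Oct" />

    {{reflist|refs=
    <ref name="ARIA News 28 Oct">{{cite web |url=... |archive-url=... |archive-date=... |url-status=... |title=...}}</ref>
    <ref name="other">...</ref>
    }}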

Cirdan added a subscriber: Cirdan. Sep 15 2019, 5:43 AM

This is now at least the third bug introduced by the "fix" of T224344. Can we please revert that change and let IABot deal with dead links only? Fixing strange parser behavior is not within scope.

This is breaking articles all over the place.

The deduping has been disabled.

@Mikeblas I honestly don't know if this is worth fixing in IABot now. I don't even know what reflist is doing to trigger this error. And I can't program IABot to predict what references inside templates will do on other wikis either. This will need a different solution.

This IABot edit to the Abasy article shows a slightly different pattern. Duplicate (but identical and safe) definitions were present in the refs= list, but IABot decided to shorten one to a self-closing tag, and that caused a duplicate ref def error.
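
Roughly, that Abasy pattern looks like this (the reference name and citation content are placeholders):

    {{reflist|refs=
    <ref name="example">{{cite web |url=... |title=...}}</ref>
    <!-- previously an identical full definition; IABot shortened it to: -->
    <ref name="example" />
    }}

Both copies live inside refs= here, and per the edit linked above, the shortened copy again produces a duplicate reference definition error.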

Indeed, it might be best if IABot didn't edit articles where it can't predict the outcome of its own edits.

That's also not feasible. There's no way for IABot to even know that the edit is unpredictable to begin with. The solution is to try and establish why the bot only touches one ref and ignores the other.

Mikeblas added a comment (edited). Sep 15 2019, 2:56 PM

Why can't IABot know that there are multiple definitions of the reference it's about to change?

It also seems possible for IABot to check its own work:

  1. Read the article, uniquely identifying existing error messages
  2. Edit the article as desired
  3. preview (or submit then read) the article, again uniquely identifying error messages
  4. discard (or revert) the edits if the number of error messages has increased
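
As a rough illustration only (not how IABot is actually implemented), steps 1, 3, and 4 could be approximated with the standard MediaWiki action=parse API; counting the literal "Cite error" marker in the rendered HTML is an assumed heuristic, and rendering the page twice per edit is exactly the kind of extra server load discussed below:

    # Minimal sketch of the proposed self-check, assuming the MediaWiki
    # action=parse API and the "Cite error" marker text; placeholder code,
    # not IABot's actual implementation.
    import requests

    API = "https://en.wikipedia.org/w/api.php"

    def count_cite_errors(wikitext, title):
        """Render wikitext in the context of `title` and count visible Cite errors."""
        resp = requests.post(API, data={
            "action": "parse",
            "format": "json",
            "title": title,          # lets {{reflist}} and refs resolve as on the real page
            "text": wikitext,
            "contentmodel": "wikitext",
            "prop": "text",
        })
        return resp.json()["parse"]["text"]["*"].count("Cite error")

    def safe_to_save(old_wikitext, new_wikitext, title):
        """Steps 1, 3, and 4: only save if the edit introduces no new error messages."""
        return count_cite_errors(new_wikitext, title) <= count_cite_errors(old_wikitext, title)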

I don't think the bot should be making articles worse, but that's what it's doing with these edits (and the edits that T224344 tried to fix). If its behaviour isn't deterministic, then I don't think it should be trusted to run autonomously.

"Indeed, it might be best if IABot didn't edit articles where it can't predict the outcome of its own edits." <-- I'm referring to that. IABot can't know when an edit it makes might have an unpredictable outcome. That would require serious Machine Learning. Something that's not slated to be implemented until further down the road (IABot v3). That's still some years off at least.

As for changing multiple definitions, it should do that by default but for whatever reason it doesn't. I still haven't ascertained why it modifies one and not the other.

Why can't IABot know that there are multiple definitions of the reference it's about to change?
It also seems possible for IABot to check its own work:

  1. Read the article, uniquely identifying existing error messages
  2. Edit the article as desired
  3. preview (or submit then read) the article, again uniquely identifying error messages
  4. discard (or revert) the edits if the number of error messages has increased

I don't think the bot should be making articles worse, but that's what it's doing with these edits (and the edits that T224344 tried to fix). If its behaviour isn't deterministic, then I don't think it should be trusted to run autonomously.

IABot parses Wikitext only. I'm not sure you understand how the API works. Another feature of IABot is the low impact it has on the MW servers and the external links it queries. Adding all of that would radically decrease efficiency and radically increase load on the servers. IABot is also working on 25 other wikis.

If its behavior can't be fixed, then it shouldn't exercise that behavior -- that is, it should no longer make changes to any reference, since it can't know whether it is making the article (or the reference) better or worse. The automated, unpredictable breaking of articles isn't desirable. I don't think machine learning is necessary; just defensive programming in the face of dirty input. Efficiency isn't a feature if it's spent breaking articles; a slower bot that doesn't break them would be preferable.

The bot already has code to detect multiple definitions; otherwise, the behaviour it exhibits wouldn't be possible. If any multiple definition is detected, the article should be left alone because the bot doesn't know if its changes will make the article worse or not.

The error rate is still very, very low from what I'm seeing. I'm not using that as a reason to dismiss your concerns, I just need to look at what's happening. I'm not fond of patch jobs. I want to fix this at the root.

Unfortunately, no metrics of the bot's activity are visible to me. Since we know that it's not checking its own work, how are we measuring its error rate?

https://en.wikipedia.org/wiki/Category:Pages_with_duplicate_reference_names

Errors are populated here. Given the edit rate of the bot, and based on a random sample of what IABot broke versus what humans broke, we see that of the millions of pages it edited, this is all it broke. We should certainly commission a bot to clean up the mess, however.

This comment was removed by Cyberpower678.
Cyberpower678 triaged this task as High priority. Sep 15 2019, 4:14 PM