IABot v2.0 merges separate templates, invalidating the reference
Closed, Resolved, Public

Event Timeline

More cases:

There seems to be an issue with the regular expression (?) used to detect the templates. I couldn't figure out where it lives in the source code (my best guess is somewhere here?). I'm happy to take a look at it and try to fix it.

I'm not seeing any issues here. IABot is correctly merging the two templates.

It is not merging templates correctly. Take for example the first diff:

{{Internetquelle |url=https://www.sn.at/politik/weltpolitik/tumulte-in-brasilia-agrarministerium-angezuendet-11372917 |titel=Tumulte in Brasilia: Agrarministerium angezündet |werk=sn.at |hrsg=[[Salzburger Nachrichten]] |datum=2017-05-24 |abruf=2019-05-01 |kommentar=Quelle: Apa/Dpa}}; {{Webarchiv|url=http://www.vol.at/berater-von-brasiliens-staatschef-festgenommen/apa-1436207740 |wayback=20170524011439 |text=''Berater von Brasiliens Staatschef festgenommen''}}

The two templates contain completely different URLs and are unrelated. IABot turns them into a single merged template:

{{Internetquelle |url=https://www.sn.at/politik/weltpolitik/tumulte-in-brasilia-agrarministerium-angezuendet-11372917 |titel=Tumulte in Brasilia: Agrarministerium angezündet |werk=sn.at |hrsg=[[Salzburger Nachrichten]] |datum=2017-05-24 |abruf=2019-05-01 |kommentar=Quelle: Apa/Dpa |zugriff=2019-05-01 |archiv-url=https://web.archive.org/web/20170524011439/http://www.vol.at/berater-von-brasiliens-staatschef-festgenommen/apa-1436207740 |archiv-datum=2017-05-24 |offline=ja |archiv-bot=2019-09-22 16:31:44 InternetArchiveBot }}

Clearly, https://web.archive.org/web/20170524011439/http://www.vol.at/berater-von-brasiliens-staatschef-festgenommen/apa-1436207740 is not the archive url for https://www.sn.at/politik/weltpolitik/tumulte-in-brasilia-agrarministerium-angezuendet-11372917 .

The problem here is that the standard behavior is for archive templates to sit right behind the original source. The bot was coded around this expectation. On dewiki, archive templates outright replace the original link, which effectively makes them standalone templates. So this will not be a simple fix.

As far as I'm aware, this kind of merging is a new issue. I cannot remember it being reported before.

If it is, then a different bug, which I managed to fix in the past, had been keeping the bot from doing what it was supposed to be doing.

Perhaps that's something to consider for an initiative like T251966: Migrate IABot parsing code to Parsoid? As that ticket points out, there are quite a few issues that could benefit from a more robust and clearly separated syntax parser.

E.g., in the case we're discussing here, it would be great to be able to write a rule for the bot that states: "If a cite template and an archive template share the same URL, merge the archive into the cite template; otherwise, treat them separately."
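
A rule like this could be sketched roughly as follows. This is a hypothetical Python illustration, not IABot's actual (PHP) code: the helper names `split_top_level`, `parse_template`, and `should_merge` are made up for the example, the parameter splitting is deliberately naive, and it assumes both templates store the original link in a `url` parameter, as in the diff above.

```python
def split_top_level(body: str) -> list[str]:
    """Split a template body on '|' that is not nested in {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_template(wikitext: str) -> tuple[str, dict[str, str]]:
    """Return (template name, named parameters) for one {{...}} template."""
    inner = wikitext.strip()[2:-2]  # drop the outer braces
    name, *params = split_top_level(inner)
    named = {}
    for param in params:
        key, sep, value = param.partition("=")  # split on the first '=' only
        if sep:
            named[key.strip()] = value.strip()
    return name.strip(), named


def should_merge(cite: str, archive: str) -> bool:
    """Merge only when both templates refer to the same original URL."""
    cite_url = parse_template(cite)[1].get("url")
    archive_url = parse_template(archive)[1].get("url")
    return cite_url is not None and cite_url == archive_url


# The pair from the first diff, shortened to the parameters that matter here.
cite = ("{{Internetquelle |url=https://www.sn.at/politik/weltpolitik/"
        "tumulte-in-brasilia-agrarministerium-angezuendet-11372917 "
        "|hrsg=[[Salzburger Nachrichten]] |abruf=2019-05-01}}")
archive = ("{{Webarchiv|url=http://www.vol.at/berater-von-brasiliens-"
           "staatschef-festgenommen/apa-1436207740 |wayback=20170524011439}}")

print(should_merge(cite, archive))  # False: unrelated URLs, keep them separate
```

A real implementation would probably also need to normalize URLs (scheme, trailing slashes) before comparing them; the exact string comparison here is only the simplest possible check.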

I'm honestly not sure whether Parsoid has what IABot needs. What features does it offer that IABot needs?

From my limited understanding, I also don't think that Parsoid is the way to go. As I mentioned in that other ticket, in preparation for cleaning up the {{Literatur}} issues on dewp, I've written a small template lexer/parser that is "roundtrip safe", i.e. it preserves all whitespace etc. and can handle arbitrarily nested templates. It creates a representation in which each template is an instance of an object, so it's easy to check and modify (e.g., continuing the example from above, one can just check whether cite_template.url == archive_template.url).

(I've sent you a GitHub invite. A similar parser/editor should be straightforward to write in PHP as well.)
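
To illustrate what "roundtrip safe" means here, the following is a minimal Python sketch written for this discussion. It is not the parser mentioned above; `Segment`, `segment`, and `render` are hypothetical names. It only separates plain text from top-level templates (nested templates stay inside their parent's raw text) and does not handle triple-brace parameters, `<nowiki>`, or comments. The point is that every character of the source, whitespace included, ends up in exactly one segment.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str  # "text" or "template"
    raw: str   # exact source text, whitespace included


def segment(wikitext: str) -> list[Segment]:
    """Split wikitext into plain text and top-level templates, losing nothing."""
    segments, depth, start, i = [], 0, 0, 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            if depth == 0:
                # Flush the plain text that precedes this top-level template.
                if i > start:
                    segments.append(Segment("text", wikitext[start:i]))
                start = i
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
            if depth == 0:
                segments.append(Segment("template", wikitext[start:i]))
                start = i
        else:
            i += 1
    # Trailing text (or an unclosed template) is kept verbatim as text.
    if start < len(wikitext):
        segments.append(Segment("text", wikitext[start:]))
    return segments


def render(segments: list[Segment]) -> str:
    """Convert the segments back to source code."""
    return "".join(s.raw for s in segments)


source = ("Text  {{Internetquelle |url=http://example.org "
          "|titel={{lang|de|Titel}} }};\n{{Webarchiv|url=http://example.org}} end")
segments = segment(source)
print(render(segments) == source)  # True: the roundtrip is lossless
```

An editing step would then modify individual `Segment` objects and call `render`, so untouched parts of the page are guaranteed to stay byte-for-byte identical.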

IABot already has a template handler that breaks them down correctly and handles nested templates very well. I don't think it's worth loading up an entire engine when IABot has its own engine that can get the specific job done. Comparing what's in each template is not the issue. The issue is the default assumption that archive templates directly behind links or cite templates belong to the latter. That's an assumption of the parser.

Why does the parser make any assumptions regarding the content of what it parses? In my understanding, it should produce some abstract representation, which the bot then modifies and which is then converted back to source code. The parser has no need to know anything about the difference between a cite template and an archive template.

Based on my limited understanding from observing the bot's editing and the issues that we've seen over the years, fully separating lexing, parsing, and editing should solve a great number of them.

It's literally the only assumption it makes. Aside from that, the parser does exactly what you say it does. The assumption exists so that the parser can group the primary part of the reference and accurately capture the rest of the reference in what is called a "remainder". The remainder holds additional templates the bot considers useful for properly processing the reference, including dead link and archive templates. The primary part is called the "link_string" and contains only the cite template or the URL. The original design, which was centered around enwiki's use of archive templates and subsequently other wikis' archive templates, assumed that the archive template is always appended.
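
The grouping described above can be pictured with a small Python sketch. It is only an illustration of the idea, not IABot's parser: `group_reference` and both regular expressions are assumptions made for this example, the template pattern matches flat (non-nested) templates only, and the first template in a reference is simply taken to be the cite template.

```python
import re

# Assumptions for this sketch: flat templates and simple bare URLs only.
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")
BARE_URL = re.compile(r"https?://[^\s|\]}]+")


def group_reference(ref: str) -> dict[str, str]:
    """Group a reference into its primary part and everything after it."""
    candidates = [m for m in (TEMPLATE.search(ref), BARE_URL.search(ref)) if m]
    if not candidates:
        return {"link_string": "", "remainder": ref}
    # The primary part is whichever starts first: a template or a bare URL.
    primary = min(candidates, key=lambda m: m.start())
    return {"link_string": primary.group(0), "remainder": ref[primary.end():]}


# The reference from the first diff, shortened to the parameters that matter.
ref = ("{{Internetquelle |url=https://www.sn.at/politik/weltpolitik/"
       "tumulte-in-brasilia-agrarministerium-angezuendet-11372917 "
       "|abruf=2019-05-01}}; {{Webarchiv|url=http://www.vol.at/berater-von-"
       "brasiliens-staatschef-festgenommen/apa-1436207740 "
       "|wayback=20170524011439}}")

grouped = group_reference(ref)
print(grouped["link_string"][:16])  # {{Internetquelle
print(grouped["remainder"][:13])    # ; {{Webarchiv
```

With grouping by position alone, the `Webarchiv` template lands in the remainder of the `Internetquelle` template even though the two URLs differ, which is how the unrelated archive ended up merged into the cite template.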

Cyberpower678 claimed this task.

This should be fixed in v2.0.5.

Thanks, I ran the bot on some of the articles mentioned above and it was fine. So far, there is only one new case where v2.0.5 messed up a reference: diff

However, that seems to be an edge case (URL is almost identical) and I believe the reference is confusing to readers anyway. Just wanted to report it in case it's an easy fix.