[[Mediawiki:Badtitletext]] being added to articles
Closed, Resolved
Public
1 Estimated Story Points

Description

Search for "badtitletext" in this (very messy) diff: https://fr.wikipedia.org/w/index.php?title=Zach_Galifianakis&diff=prev&oldid=112810599

Based on where it appears in the diff, the message text is probably replacing internal links (maybe redlinks?).

Event Timeline

Whatamidoing-WMF raised the priority of this task from to Needs Triage.
Whatamidoing-WMF updated the task description.

Were you able to reproduce this with an edit of your own? I suspect the huge number of links being added is the real problem here, and MediaWiki:Badtitletext showing up is just a symptom of it attempting to link all the things (including something invalid).

Jdforrester-WMF set Security to None.
Jdforrester-WMF edited a custom field.

I have not tried to reproduce it. However, I point out the remarkable
similarity between the new content and the English Wikipedia article on the
same subject. It is possible that a massive copy-paste operation happened
here.

@Catrope has got the Parsoid team to add debugging to try to isolate this.

Specifically, https://gerrit.wikimedia.org/r/#/c/197985/

Logs from Kibana:

  • Bad title text <a href="Phil%20Donahue" rel="mw:WikiLink">Phil [Donahue]</a>
  • Bad title text <a href="https://uk.wikipedia.org/wiki/%D8%E0%EF%EE%E2%E0%EB_%CC%E8%EA%E8%F2%E0_%DE%F5%E8%EC%EE%E2%E8%F7"; rel="mw:ExtLink" data-parsoid-diff="{&quot;id&quot;:1714866,&quot;diff&quot;:[&quot;modified&quot;,&quot;inserted&quot;]}">Микитою Шаповалом.</a>
  • Bad title text <a href="http://fr.wikisource.org/wiki/La_Soci%C3%A9t%C3%A9_industrielle_et_son_avenir"; rel="mw:ExtLink" data-parsoid-diff="{&quot;id&quot;:8970773,&quot;diff&quot;:[&quot;modified&quot;,&quot;inserted&quot;]}">.</a>
  • Bad title text <a href="Convenzione%20quadro%20delle%20Nazioni%20Unite%20sui%20cambiamenti%20climatici" rel="mw:WikiLink" data-parsoid-diff="{&quot;id&quot;:1981995,&quot;diff&quot;:[&quot;modified&quot;,&quot;inserted&quot;]}">Il 16 febbraio 2007 si è celebrato l'anniversario del secondo anno di adesione al protocollo di Kyōto, e lo stesso anno ricorre il decennale dalla sua stesura. Con l'accordo Doha l'estensione del protocollo si è prolungata fino al 2020 anziché alla fine del 2012.</a>
  • Bad title text <a title="Aide:Par qui" href="./Aide:Par_qui" rel="mw:WikiLink" id="mw0Q" data-parsoid="{&quot;stx&quot;:&quot;piped&quot;,&quot;a&quot;:{&quot;href&quot;:&quot;./Aide:Par_qui&quot;},&quot;sa&quot;:{&quot;href&quot;:&quot;Aide:Par qui&quot;}}" data-parsoid-diff="{&quot;id&quot;:8841292,&quot;diff&quot;:[&quot;inserted&quot;]}">[Par qui&nbsp;?]</a>

I can add more (21 found in the last 24 hours) if required. Throw anything Parsoid-specific over to us.

Scratch all that. There are false positives: I just confirmed this by trying to serialize some of these reports. They serialize just fine, but the log output is generated anyway. So our logging is not precise enough. I will fix that, and we can revisit new reports once that is in place. In the meantime, I can take a look at the 36 instances we have in Kibana and sift out the false positives.

https://gerrit.wikimedia.org/r/#/c/199800/ is the patch to remove false positives from the logs. Subbu says it'll be deployed on Monday.

Okay, I slurped the relevant log entries from Kibana via curl, extracted the HTML snippets, ran them through Parsoid's html2wt, and found a few valid instances of bad title text. I'm going to post one relevant entry (full log):

{"host":"wtp1018","level":3,"version":"1.0","@version":"1","@timestamp":"2015-03-25T10:08:48.539Z","source_host":"10.64.32.91","pid":14556,"logType":"error","wiki":"frwiki","title":"Journal_de_la_psychanalyse_de_l'enfant","oldId":113242101,"longMsg":"Bad title text\n<a rel=\"mw:WikiLink\" href=\"%09http%3A%2F%2Fbsf.spp.asso.fr%2Findex.php%3Flvl%3Dnotice_display%26id%3D290\" data-parsoid-diff=\"{&quot;id&quot;:8472291,&quot;diff&quot;:[&quot;modified&quot;,&quot;inserted&quot;]}\">Indexation complète des articles parus à la Bibliothèque Sigmund Freud</a>","type":"parsoid","tags":["es","gelf","normalized_message_trimmed"],"message":"Bad title text <a rel=\"mw:WikiLink\" href=\"%09http%3A%2F%2Fbsf.spp.asso.fr%2Findex.php%3Flvl%3Dnotice_display%26id%3D290\" data-parsoid-diff=\"{&quot;id&quot;:8472291,&quot;diff&quot;:[&quot;modified&quot;,&quot;inserted&quot;]}\">Indexation complète des articles parus à la Bibliothèque Sigmund Freud</a>","normalized_message":"Bad title text <a rel=\"mw:WikiLink\" href=\"%09http%3A%2F%2Fbsf.spp.asso.fr%2Findex.php%3Flvl%3Dnotice_display%26id%3D290\" data-parsoid-diff=\"{&quot;id&quot;:8472291,&quot;diff&quot;:[&quot;modified&quot;,&quot;inserted&quot;]}\">Indexation complète des arti"}

So, something happened in the editor. In case it matters, note that the link in the old version is a mw:ExtLink, but in the log entry above, Parsoid got the same link with a mw:WikiLink type.
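
The extraction step (pulling the HTML snippet out of a logged entry so it can be fed to html2wt) could be sketched roughly as follows. This is only an illustration: `extract_snippet` is a hypothetical helper, not the script that was actually used, and the sample entry is an abbreviated copy of the log line above (most fields and the data-parsoid-diff attribute are omitted).

```python
import json

# Prefix that Parsoid puts in front of the HTML in the "longMsg" field,
# as seen in the log entry above.
PREFIX = "Bad title text\n"


def extract_snippet(log_line):
    """Return wiki, title, oldId and the logged HTML, or None if the
    entry is not a "Bad title text" report. Hypothetical helper."""
    entry = json.loads(log_line)
    long_msg = entry.get("longMsg", "")
    if not long_msg.startswith(PREFIX):
        return None
    return {
        "wiki": entry.get("wiki"),
        "title": entry.get("title"),
        "oldId": entry.get("oldId"),
        "html": long_msg[len(PREFIX):],
    }


# Abbreviated version of the frwiki entry quoted above.
sample = json.dumps({
    "wiki": "frwiki",
    "title": "Journal_de_la_psychanalyse_de_l'enfant",
    "oldId": 113242101,
    "longMsg": PREFIX
    + '<a rel="mw:WikiLink" href="%09http%3A%2F%2Fbsf.spp.asso.fr'
    + '%2Findex.php%3Flvl%3Dnotice_display%26id%3D290">Indexation '
    + "complète des articles parus à la Bibliothèque Sigmund Freud</a>",
})

report = extract_snippet(sample)
print(report["wiki"], report["oldId"])
print(report["html"])
```

The `html` value is what would then be passed to html2wt to check whether the report is a real serialization failure or a false positive.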

After the Parsoid deploy, I found one instance of this error in Kibana:

Bad title text <a href="https://ru.wikipedia.org/wiki/%D3%ED%E8%E2%E5%F0._%CD%EE%E2%E0%FF_%EE%E1%F9%E0%E3%E0"; rel="mw:ExtLink" data-parsoid-diff="{&quot;id&quot;:1662279,&quot;diff&quot;:[&quot;inserted&quot;]}">Универ. Новая общага</a>

which corresponds to the following diff: https://ru.wikipedia.org/w/index.php?title=%D0%9C%D0%BE%D0%BB%D0%BE%D1%85%D0%BE%D0%B2%D1%81%D0%BA%D0%B0%D1%8F,_%D0%95%D0%BA%D0%B0%D1%82%D0%B5%D1%80%D0%B8%D0%BD%D0%B0_%D0%92%D0%B8%D0%BA%D1%82%D0%BE%D1%80%D0%BE%D0%B2%D0%BD%D0%B0&diff=next&oldid=69680636

It looks like the link href was percent-encoded using windows-1251 rather than UTF-8. The percent-escaped bytes are expected to be valid UTF-8 when decoded, so URL-decoding fails.
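
A minimal sketch of the encoding mismatch, using Python's standard `urllib.parse.unquote` rather than Parsoid's own decoder (which is not shown in this task). `try_decode` is a hypothetical helper; the href is copied from the log entry above.

```python
from typing import Optional
from urllib.parse import unquote

# href copied from the Kibana log entry above
HREF = (
    "https://ru.wikipedia.org/wiki/"
    "%D3%ED%E8%E2%E5%F0._%CD%EE%E2%E0%FF_%EE%E1%F9%E0%E3%E0"
)


def try_decode(href: str, encoding: str) -> Optional[str]:
    """Percent-decode href, or return None if the escaped bytes are
    not valid in the given encoding."""
    try:
        return unquote(href, encoding=encoding, errors="strict")
    except UnicodeDecodeError:
        return None


as_utf8 = try_decode(HREF, "utf-8")     # the bytes are not valid UTF-8
as_cp1251 = try_decode(HREF, "cp1251")  # windows-1251 decodes cleanly

print(as_utf8)
print(as_cp1251)
```

Decoded as windows-1251, the path matches the link text in the log entry (with underscores in place of spaces), which supports the theory that the href was encoded with the wrong charset.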

The frwiki entry (Journal_de_la_psychanalyse_de_l'enfant) appears to be due to a tab character having been added before the URL: note the leading %09 in the href. You can't do this with the tab key, but you can paste a tab character (or a URL preceded by a tab) into the link interface.
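
To illustrate how a leading tab could turn an external link into something treated as a wiki link, here is a small sketch. `classify_link_target` is a hypothetical classifier invented for this example; it is an assumption about the kind of check involved, not VisualEditor's or Parsoid's actual logic.

```python
from urllib.parse import unquote

# href from the frwiki log entry; %09 is a percent-encoded tab
HREF = (
    "%09http%3A%2F%2Fbsf.spp.asso.fr%2Findex.php"
    "%3Flvl%3Dnotice_display%26id%3D290"
)


def classify_link_target(target: str) -> str:
    """Hypothetical check: treat the target as external only if it
    starts with an http(s) scheme, otherwise as a wiki page title."""
    if target.startswith(("http://", "https://")):
        return "ext"
    return "wiki"


pasted = unquote(HREF)  # what was pasted into the link interface

print(repr(pasted))
print(classify_link_target(pasted))          # tab hides the scheme
print(classify_link_target(pasted.strip()))  # trimming whitespace first
```

Under this assumption, trimming whitespace from the pasted target before classifying it would keep the link external.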

Is there anything left to address here?