
Some Flow content contains control characters (e.g. \b (backspace))
Closed, ResolvedPublic

Description

I've turned up a flow revision for which the text in external store contains a ^H (\b) embedded in it.

Details on the revision:

page: https://www.mediawiki.org/wiki/Extension_talk:LinkedWiki
topic: Notice: Undefined index: Beschrijving in /var/www/wikifarm-mw1.19/extensions/LinkedWiki/LinkedWiki.php on line 283
post: https://www.mediawiki.org/w/index.php?title=Topic:Ret7qp83fy2cwmjd&topic_showPostId=rfb0t2cr56qwgrp5#flow-post-rfb0t2cr56qwgrp5
rev id (alnum): rfb0t2cr56qwgrp5
flags: utf-8,gzip,html,external
url: DB://cluster25/650451

Content after decompression has a ^H in the line


&lt;binding name="Beschrijving">&lt;literal>Het product Liaan e-<span typeof="mw:Entity" data-parsoid='{"src":"&amp;#8;","srcContent":"\b","dsr":[694,698,null,null]}'></span>Dienstverlening is ontwikkeld om uw organisatie uitgebreid te ondersteunen bij het implementeren van digitale dienstverlening / e-Formulieren.

between


null]}'> and </span>

I verified this by pulling the specific blob_text from external store and decompressing it, then running it through od -c.
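The same check can be sketched in a few lines of Python. This is only an illustration, not the script used here: it assumes the `gzip` flag means a raw DEFLATE stream (as PHP's `gzdeflate()` produces), and the helper names `inflate_blob` and `find_control_chars` are made up for the example. The blob is a stand-in built in memory, not data from external store.

```python
import zlib


def inflate_blob(blob):
    # Assumption: the stored text is a raw DEFLATE stream (no zlib/gzip header),
    # which is why wbits is negative.
    return zlib.decompress(blob, -15).decode("utf-8")


def find_control_chars(text):
    """Return (offset, codepoint) pairs for C0 controls other than tab, LF and CR."""
    allowed = {0x09, 0x0A, 0x0D}
    return [
        (offset, ord(ch))
        for offset, ch in enumerate(text)
        if ord(ch) < 0x20 and ord(ch) not in allowed
    ]


# Stand-in for a blob fetched from external store (hypothetical content).
sample = "Liaan e-\x08Dienstverlening"
compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
blob = compressor.compress(sample.encode("utf-8")) + compressor.flush()

hits = find_control_chars(inflate_blob(blob))
print(hits)  # [(8, 8)]: offset 8 holds U+0008 (backspace)
```

This reports the position of each offending character, which is roughly what scanning the `od -c` output by eye gives you.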

This bad character is duly written out in the flow xml dumps, which breaks XMLReader() when we try to re-use these dumps for prefetch.
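The failure is not specific to XMLReader(): XML 1.0 does not allow U+0008 in character data, so any conforming parser should reject it. A minimal sketch using Python's standard library parser rather than PHP (the `is_well_formed` helper and the sample strings are made up for illustration):

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = '<revision id="x">Liaan e-Dienstverlening</revision>'
bad = '<revision id="x">Liaan e-\x08Dienstverlening</revision>'

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```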

So there are two problems: 1) the ^H in the revision, and 2) bad character data (PCDATA) isn't stripped out before the revision content is written to the dump file.

Event Timeline


I used this script to grab, uncompress and dump out the rev content: https://gerrit.wikimedia.org/r/#/c/357873/. Example run (after placing it in the core maintenance directory):

php5 multiversion/MWScript.php maintenance/examineFlowRevisions.php --wiki=mediawikiwiki --flowrevid=rfb0t2cr56qwgrp5 --silent > /path/to/badrev.txt

I have checked all flow history files from the May 20th run. Of those, only mediawikiwiki and testwiki have issues. For mediawikiwiki there are 5 revisions in total, all with ^H in the middle; I have not yet checked the details for testwiki. I will post information on all problematic revision texts later tonight, or more likely, tomorrow.

The testwiki issue is the known failure to complete the run; it's been broken for a long time, but I can't find that ticket right now. This leaves just the 5 mediawikiwiki revisions.

The good news is that it's all the same revision. I've been grepping out the bad lines and running the resulting file(s) back through a short wrapper with XMLReader() and while there are 5 bad lines they are all in the same revision described above. For completeness' sake, here's the full revision text piped through cat -vte, so that the ^H are made visible:


<revision id="rfb0t2cr56qwgrp5" userid="290446" userwiki="mediawikiwiki" changetype="reply" type="post" typeid="rfb0t2cr56qwgrp5" flags="utf-8,html" modstate="" contentlength="1732" previouscontentlength="0" treeparentid="ret7qplnemlfjdtc" treedescendantid="rfb0t2cr56qwgrp5" treerevid="rfb0t2cr56qwgrp5" treeoriguserid="290446" treeoriguserwiki="mediawikiwiki" globaluserid="7576794" globaltreeoriguserid="7576794">&lt;body data-parsoid='{&quot;dsr&quot;:[0,1732,0,0]}' lang=&quot;en&quot; class=&quot;mw-content-ltr sitedir-ltr ltr mw-body mw-body-content mediawiki&quot; dir=&quot;ltr&quot;&gt;&lt;p data-parsoid='{&quot;dsr&quot;:[0,61,0,0]}'&gt;The parser PHP (xml_parse function) said&t;span typeof=&quot;mw:DisplaySpace mw:Placeholder&quot;data-parsoid='{&quot;src&quot;:&quot; &quot;,&quot;isDisplayHack&quot;:true,&quot;dsr&quot;:[40,41,null,0]}'&gt;M-BM- &lt;/span&gt;: Invalid character.&lt;/p&gt;$
$
&lt;p data-parsoid='{&quot;dsr&quot;:[63,138,0,0]}'&gt;In the sparql result, there are strange characters in XML (start with e-)&lt;span typeof=&quot;mw:DisplaySpace mw:Placeholder&quot; data-parsoid='{&quot;src&quot;:&quot; &quot;,&quot;isDisplayHack&quot;:true,&quot;dsr&quot;:[136,137,null,0]}'&gt;M-BM- &lt;/span&gt;:&lt;/p&gt;$
&lt;pre data-parsoid='{&quot;stx&quot;:&quot;html&quot;,&quot;strippedNL&quot;:true,&quot;dsr&quot;:[139,1677,5,6]}'&gt; &amp;lt;result&gt;$
&amp;lt;binding name=&quot;Softwareproduct&quot;&gt;&amp;lt;uri&gt;http://model.i-catalogus.nl/gemma/SoftwareProduct/Liaan_e-Dienstverlening&amp;lt;/uri&gt;&amp;lt;/binding&gt;$
&amp;lt;binding name=&quot;Naam&quot;&gt;&amp;lt;literal&gt;Liaan e-Dienstverlening&amp;lt;/literal&gt;&amp;lt;/binding&gt;$
&amp;lt;binding name=&quot;Homepage&quot;&gt;&amp;lt;literal&gt;http://www.liaan.nl/produkten/produkt_info.php?id=22&amp;lt;/literal&gt;&amp;lt;/binding&gt;$
&amp;lt;binding name=&quot;WordtGeleverdDoor&quot;&gt;&amp;lt;literal&gt;Liaan&amp;lt;/literal&gt;&amp;lt;/binding&gt;$
&amp;lt;binding name=&quot;HomePageLeverancier&quot;&gt;&amp;lt;literal&gt;http://www.liaan.nl&amp;lt;/literal&gt;&amp;lt;/binding&gt;$
&amp;lt;binding name=&quot;Beschrijving&quot;&gt;&amp;lt;literal&gt;Het product Liaan e-&lt;span typeof=&quot;mw:Entity&quot; data-parsoid='{&quot;src&quot;:&quot;&amp;amp;#8;&quot;,&quot;srcContent&quot;:&quot;\b&quot;,&quot;dsr&quot;:[694,698,null,null]}'&gt;^H&lt;/span&gt;Dienstverlening is ontwikkeld om uw organisatie uitgebreid te ondersteunen bij het implementeren van digitale dienstverlening / e-Formulieren.$
$
Met een abonnement op Liaan e-&lt;span typeof=&quot;mw:Entity&quot; data-parsoid='{&quot;src&quot;:&quot;&amp;amp;#8;&quot;,&quot;srcContent&quot;:&quot;\b&quot;,&quot;dsr&quot;:[872,876,null,null]}'&gt;^H&lt;/span&gt;Dienstverlening beschikt u over alle benodigde middelen om e-&lt;span typeof=&quot;mw:Entity&quot; data-parsoid='{&quot;src&quot;:&quot;&amp;amp;#8;&quot;,&quot;srcContent&quot;:&quot;\b&quot;,&quot;dsr&quot;:[937,941,null,null]}'&gt;^H&lt;/span&gt;Formulieren te ontwikkelen en te integreren in uw website en dus beschikbaar te stellen aan de inwoners van uw gemeente. $
Ook kunt u e-&lt;span typeof=&quot;mw:Entity&quot; data-parsoid='{&quot;src&quot;:&quot;&amp;amp;#8;&quot;,&quot;srcContent&quot;:&quot;\b&quot;,&quot;dsr&quot;:[1076,1080,null,null]}'&gt;^H&lt;/span&gt;Producten beschikbaar stellen aan uw frontoffice (Servicecenter) of backoffice.$
$
Het product beschikt over de noodzakelijke voorzieningen zoals DigiD ondersteuning (voorzien van de benodigde PKI&lt;span typeof=&quot;mw:Entity&quot; data-parsoid='{&quot;src&quot;:&quot;&amp;amp;#8;&quot;,&quot;srcContent&quot;:&quot;\b&quot;,&quot;dsr&quot;:[1274,1278,null,null]}'&gt;^H&lt;/span&gt;Overheidscertificaten, voorinvulling (zoals GBA&lt;span typeof=&quot;mw:Entity&quot; data-parsoid='{&quot;src&quot;:&quot;&amp;amp;#8;&quot;,&quot;srcContent&quot;:&quot;\b&quot;,&quot;dsr&quot;:[1325,1329,null,null]}'&gt;^H&lt;/span&gt;V, DKD) en (StuF)berichtenverkeer. $
$
Technische kennis is niet vereist, installatie niet nodig:$
Liaan e-Dienstverlening wordt volledig gehost in onze e-&lt;span typeof=&quot;mw:Entity&quot; data-parsoid='{&quot;src&quot;:&quot;&amp;amp;#8;&quot;,&quot;srcContent&quot;:&quot;\b&quot;,&quot;dsr&quot;:[1481,1485,null,null]}'&gt;^H&lt;/span&gt;loket.nl infrastructuur. Binnen deze infrastructuur beheert u al uw e-Formulieren met behulp van de zeer gebruiksvriendelijke webapplicatie e&lt;span typeof=&quot;mw:Entity&quot; data-parsoid='{&quot;src&quot;:&quot;&amp;amp;#8;&quot;,&quot;srcContent&quot;:&quot;\b&quot;,&quot;dsr&quot;:[1626,1630,null,null]}'&gt;^H&lt;/span&gt;-Beheer.&amp;lt;/literal&gt;&amp;lt;/binding&gt;$
&amp;lt;/result&gt;$
&lt;/pre&gt;$
$
&lt;p data-parsoid='{&quot;dsr&quot;:[1679,1726,0,0]}'&gt;You have to fix the format in your triplestore.&lt;/p&gt;$
$
&lt;p data-parsoid='{&quot;dsr&quot;:[1728,1732,0,0]}'&gt;Bye.&lt;/p&gt;&lt;/body&gt;</revision>$
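For anyone unfamiliar with the notation in the dump above, here is a rough Python approximation of what `cat -vte` does to each byte. It is a sketch of the commonly documented behaviour, not a reimplementation of the real tool. It also explains the `M-BM- ` sequences: they are the two UTF-8 bytes of a non-breaking space (0xC2 0xA0).

```python
def _render_7bit(code):
    # Control characters become caret notation (0x08 -> ^H, 0x09 -> ^I); DEL becomes ^?.
    if code < 0x20:
        return "^" + chr(code + 64)
    if code == 0x7F:
        return "^?"
    return chr(code)


def cat_vte(data):
    """Approximate `cat -vte` on a bytes object and return the rendered text."""
    out = []
    for byte in data:
        if byte == 0x0A:
            # -e: mark every line end with "$"
            out.append("$\n")
        elif byte >= 0x80:
            # -v: high-bit bytes get an "M-" prefix, then the low 7 bits are rendered
            out.append("M-" + _render_7bit(byte - 0x80))
        else:
            out.append(_render_7bit(byte))
    return "".join(out)


print(cat_vte(b"Liaan e-\x08Dienstverlening\n"), end="")  # Liaan e-^HDienstverlening$
print(repr(cat_vte("\u00a0".encode("utf-8"))))            # 'M-BM- '
```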

Note that <span typeof="mw:Entity" data-parsoid='{"src":"&amp;#8;","srcContent":"\b","dsr":[694,698,null,null]}'> means "there is an HTML entity here and it's &#8;, which is the \b character". It seems to me to be a bug in Parsoid that such characters can be output at all. Then again, this rev is 4 years old, so it's possible it was fixed on their end already.

Change 362169 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/extensions/Flow@master] filter out non-compliant characters (bad PCDATA) from revision text

https://gerrit.wikimedia.org/r/362169

Change 362173 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/extensions/Flow@master] clean up illegal chars in revision text retrieved during flow content dumps

https://gerrit.wikimedia.org/r/362173

Change 362169 merged by jenkins-bot:
[mediawiki/extensions/Flow@master] Dumps: filter out non-compliant characters (bad PCDATA) from revision text

https://gerrit.wikimedia.org/r/362169
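The general idea of filtering text before it goes into an XML dump can be sketched as follows. This is not the code from the merged change; it is a Python illustration that assumes offending characters are simply dropped (the actual patch may handle them differently), and the name `strip_invalid_xml_chars` is made up. The character class is the complement of the XML 1.0 `Char` production.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Drop characters that may not appear in an XML 1.0 document."""
    return _INVALID_XML_CHARS.sub("", text)


dirty = "Liaan e-\x08Dienstverlening"
clean = strip_invalid_xml_chars(dirty)
print(clean)  # Liaan e-Dienstverlening

# The cleaned text can now be embedded and parsed without error.
element = ET.fromstring("<revision>" + clean + "</revision>")
print(element.text)  # Liaan e-Dienstverlening
```

Tab, line feed and carriage return are left alone, since XML 1.0 permits them.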

Mattflaschen-WMF renamed this task from Bad revision text from Flow to Some Flow content contains control charcters (e.g. \b (backspace)). Jul 28 2017, 8:59 PM
Mattflaschen-WMF renamed this task from Some Flow content contains control charcters (e.g. \b (backspace)) to Some Flow content contains control characters (e.g. \b (backspace)).
Catrope claimed this task.

I think this is good to close now, since we're working around the presence of control characters.

Change 362173 abandoned by ArielGlenn:
Clean up illegal chars in revision text retrieved during flow content dumps

Reason:
This was included in later versions of I787a26ff6004a875b71ef38905904b7c489f22d4 and it might as well stay there

https://gerrit.wikimedia.org/r/362173