
Some Flow content contains control characters (e.g. \b (backspace))
Closed, Resolved · Public

Description

I've turned up a Flow revision whose text in external store has a ^H (\b) embedded in it.

Details on the revision:

page: https://www.mediawiki.org/wiki/Extension_talk:LinkedWiki
topic: Notice: Undefined index: Beschrijving in /var/www/wikifarm-mw1.19/extensions/LinkedWiki/LinkedWiki.php on line 283
post: https://www.mediawiki.org/w/index.php?title=Topic:Ret7qp83fy2cwmjd&topic_showPostId=rfb0t2cr56qwgrp5#flow-post-rfb0t2cr56qwgrp5
rev id (alnum): rfb0t2cr56qwgrp5
flags: utf-8,gzip,html,external
url: DB://cluster25/650451

Content after decompression has a ^H in the line


&lt;binding name="Beschrijving">&lt;literal>Het product Liaan e-<span typeof="mw:Entity" data-parsoid='{"src":"&amp;#8;","srcContent":"\b","dsr":[694,698,null,null]}'></span>Dienstverlening is ontwikkeld om uw organisatie uitgebreid te ondersteunen bij het implementeren van digitale dienstverlening / e-Formulieren.

between


null]}'> and </span>

I verified this by pulling the specific blob_text from external store and decompressing it, then running it through od -c.
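
A rough Python equivalent of that check, as a sketch only. It assumes the `gzip` flag means raw DEFLATE data (what PHP's gzdeflate() produces), which is an assumption about the storage format and is not confirmed anywhere in this task. The function names are made up for illustration.

```python
import zlib


def decompress_blob(blob: bytes) -> str:
    """Inflate a blob stored with the utf-8,gzip flags.

    Assumption: the data is raw DEFLATE with no zlib or gzip header,
    hence wbits=-15.
    """
    return zlib.decompress(blob, -15).decode("utf-8")


def find_control_chars(text: str):
    """Yield (offset, codepoint) for C0 controls other than tab, LF and CR."""
    for offset, ch in enumerate(text):
        if ord(ch) < 0x20 and ch not in "\t\n\r":
            yield offset, ord(ch)
```

Running `list(find_control_chars(decompress_blob(blob)))` on the fetched blob would list each backspace as codepoint 8 together with its offset, which is the same information od -c gives.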

This bad character is duly written out in the flow xml dumps, which breaks XMLReader() when we try to re-use these dumps for prefetch.

So there are two problems: 1) the ^H in the revision, 2) bad CDATA isn't stripped out before the revision content is written to the dump file.
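
For problem 2, one possible filter is to drop every character outside the XML 1.0 `Char` production before the revision text is written to the dump. The sketch below is Python for illustration only and is not the code from any Flow patch; `strip_illegal_xml_chars` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

# Matches anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_illegal_xml_chars(text: str) -> str:
    """Remove characters that an XML 1.0 parser would reject."""
    return _ILLEGAL_XML_CHARS.sub("", text)


if __name__ == "__main__":
    bad = "<revision>Liaan e-\bDienstverlening</revision>"
    try:
        ET.fromstring(bad)
    except ET.ParseError as err:
        # The parser refuses the raw backspace character.
        print("rejected:", err)
    # After filtering, the same text parses.
    print(ET.fromstring(strip_illegal_xml_chars(bad)).text)
```

Dropping the character (instead of replacing it with a placeholder) changes the content length, so anything that relies on `contentlength` matching the dumped text would need to tolerate that.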

Event Timeline

Restricted Application added a project: Collaboration-Team-Triage. · Jun 8 2017, 6:39 PM
Restricted Application added a subscriber: Aklapper.

I used the script in https://gerrit.wikimedia.org/r/#/c/357873/ to grab, uncompress and dump out the rev content. Example run (after placing it in the core maintenance directory):

php5 multiversion/MWScript.php maintenance/examineFlowRevisions.php --wiki=mediawikiwiki --flowrevid=rfb0t2cr56qwgrp5 --silent > /path/to/badrev.txt

I have checked all flow history files from the May 20th run. Of those, only mediawikiwiki and testwiki have issues. For mediawikiwiki there are 5 revisions in total, all with ^H in the middle; I have not yet checked the details for testwiki. I will post information on all problematic revision texts later tonight, or more likely, tomorrow.

The testwiki issue is the known failure to complete the run. It's been broken for a long time, but I can't find that ticket right now. This leaves just the 5 mediawikiwiki revisions.

The good news is that it's all the same revision. I've been grepping out the bad lines and running the resulting file(s) back through a short wrapper with XMLReader() and while there are 5 bad lines they are all in the same revision described above. For completeness' sake, here's the full revision text piped through cat -vte, so that the ^H are made visible:


<revision id="rfb0t2cr56qwgrp5" userid="290446" userwiki="mediawikiwiki" changetype="reply" type="post" typeid="rfb0t2cr56qwgrp5" flags="utf-8,html" modstate="" contentlength="1732" previouscontentlength="0" treeparentid="ret7qplnemlfjdtc" treedescendantid="rfb0t2cr56qwgrp5" treerevid="rfb0t2cr56qwgrp5" treeoriguserid="290446" treeoriguserwiki="mediawikiwiki" globaluserid="7576794" globaltreeoriguserid="7576794">&lt;body data-parsoid='{&quot;dsr&quot;:[0,1732,0,0]}' lang=&quot;en&quot; class=&quot;mw-content-ltr sitedir-ltr ltr mw-body mw-body-content mediawiki&quot; dir=&quot;ltr&quot;&gt;&lt;p data-parsoid='{&quot;dsr&quot;:[0,61,0,0]}'&gt;The parser PHP (xml_parse function) said&lt;span typeof=&quot;mw:DisplaySpace mw:Placeholder&quot; data-parsoid='{&quot;src&quot;:&quot; &quot;,&quot;isDisplayHack&quot;:true,&quot;dsr&quot;:[40,41,null,0]}'&gt;M-BM- &lt;/span&gt;: Invalid character.&lt;/p&gt;$
$
&lt;p data-parsoid='{&quot;dsr&quot;:[63,138,0,0]}'&gt;In the sparql result, there are strange characters in XML (start with e-)&lt;span typeof=&quot;mw:DisplaySpace mw:Placeholder&quot; data-parsoid='{&quot;src&quot;:&quot; &quot;,&quot;isDisplayHack&quot;:true,&quot;dsr&quot;:[136,137,null,0]}'&gt;M-BM- &lt;/span&gt;:&lt;/p&gt;$
&lt;pre data-parsoid='{&quot;stx&quot;:&quot;html&quot;,&quot;strippedNL&quot;:true,&quot;dsr&quot;:[139,1677,5,6]}'&gt; &amp;lt;result&gt;$
&amp;lt;binding name=&quot;Softwareproduct&quot;&gt;&amp;lt;uri&gt;http://model.i-catalogus.nl/gemma/SoftwareProduct/Liaan_e-Dienstverlening&amp;lt;/uri&gt;&amp;lt;/binding&gt;$
&amp;lt;binding name=&quot;Naam&quot;&gt;&amp;lt;literal&gt;Liaan e-Dienstverlening&amp;lt;/literal&gt;&amp;lt;/binding&gt;$
&amp;lt;binding name=&quot;Homepage&quot;&gt;&amp;lt;literal&gt;http://www.liaan.nl/produkten/produkt_info.php?id=22&amp;lt;/literal&gt;&amp;lt;/binding&gt;$
&amp;lt;binding name=&quot;WordtGeleverdDoor&quot;&gt;&amp;lt;literal&gt;Liaan&amp;lt;/literal&gt;&amp;lt;/binding&gt;$
&amp;lt;binding name=&quot;HomePageLeverancier&quot;&gt;&amp;lt;literal&gt;http://www.liaan.nl&amp;lt;/literal&gt;&amp;lt;/binding&gt;$
&amp;lt;binding name=&quot;Beschrijving&quot;&gt;&amp;lt;literal&gt;Het product Liaan e-&lt;span typeof=&quot;mw:Entity&quot; data-parsoid='{&quot;src&quot;:&quot;&amp;amp;#8;&quot;,&quot;srcContent&quot;:&quot;\b&quot;,&quot;dsr&quot;:[694,698,null,null]}'&gt;^H&lt;/span&gt;Dienstverlening is ontwikkeld om uw organisatie uitgebreid te ondersteunen bij het implementeren van digitale dienstverlening / e-Formulieren.$
$
Met een abonnement op Liaan e-&lt;span typeof=&quot;mw:Entity&quot; data-parsoid='{&quot;src&quot;:&quot;&amp;amp;#8;&quot;,&quot;srcContent&quot;:&quot;\b&quot;,&quot;dsr&quot;:[872,876,null,null]}'&gt;^H&lt;/span&gt;Dienstverlening beschikt u over alle benodigde middelen om e-&lt;span typeof=&quot;mw:Entity&quot; data-parsoid='{&quot;src&quot;:&quot;&amp;amp;#8;&quot;,&quot;srcContent&quot;:&quot;\b&quot;,&quot;dsr&quot;:[937,941,null,null]}'&gt;^H&lt;/span&gt;Formulieren te ontwikkelen en te integreren in uw website en dus beschikbaar te stellen aan de inwoners van uw gemeente. $
Ook kunt u e-&lt;span typeof=&quot;mw:Entity&quot; data-parsoid='{&quot;src&quot;:&quot;&amp;amp;#8;&quot;,&quot;srcContent&quot;:&quot;\b&quot;,&quot;dsr&quot;:[1076,1080,null,null]}'&gt;^H&lt;/span&gt;Producten beschikbaar stellen aan uw frontoffice (Servicecenter) of backoffice.$
$
Het product beschikt over de noodzakelijke voorzieningen zoals DigiD ondersteuning (voorzien van de benodigde PKI&lt;span typeof=&quot;mw:Entity&quot; data-parsoid='{&quot;src&quot;:&quot;&amp;amp;#8;&quot;,&quot;srcContent&quot;:&quot;\b&quot;,&quot;dsr&quot;:[1274,1278,null,null]}'&gt;^H&lt;/span&gt;Overheidscertificaten, voorinvulling (zoals GBA&lt;span typeof=&quot;mw:Entity&quot; data-parsoid='{&quot;src&quot;:&quot;&amp;amp;#8;&quot;,&quot;srcContent&quot;:&quot;\b&quot;,&quot;dsr&quot;:[1325,1329,null,null]}'&gt;^H&lt;/span&gt;V, DKD) en (StuF)berichtenverkeer. $
$
Technische kennis is niet vereist, installatie niet nodig:$
Liaan e-Dienstverlening wordt volledig gehost in onze e-&lt;span typeof=&quot;mw:Entity&quot; data-parsoid='{&quot;src&quot;:&quot;&amp;amp;#8;&quot;,&quot;srcContent&quot;:&quot;\b&quot;,&quot;dsr&quot;:[1481,1485,null,null]}'&gt;^H&lt;/span&gt;loket.nl infrastructuur. Binnen deze infrastructuur beheert u al uw e-Formulieren met behulp van de zeer gebruiksvriendelijke webapplicatie e&lt;span typeof=&quot;mw:Entity&quot; data-parsoid='{&quot;src&quot;:&quot;&amp;amp;#8;&quot;,&quot;srcContent&quot;:&quot;\b&quot;,&quot;dsr&quot;:[1626,1630,null,null]}'&gt;^H&lt;/span&gt;-Beheer.&amp;lt;/literal&gt;&amp;lt;/binding&gt;$
&amp;lt;/result&gt;$
&lt;/pre&gt;$
$
&lt;p data-parsoid='{&quot;dsr&quot;:[1679,1726,0,0]}'&gt;You have to fix the format in your triplestore.&lt;/p&gt;$
$
&lt;p data-parsoid='{&quot;dsr&quot;:[1728,1732,0,0]}'&gt;Bye.&lt;/p&gt;&lt;/body&gt;</revision>$
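
The grep step described above (pick out the bad lines of a dump and make the offending characters visible) can be approximated as follows. This is a sketch under assumptions: it is not the wrapper that was actually run, the names are hypothetical, and the caret rendering only imitates the control-character part of cat -v.

```python
import re

# Characters outside the XML 1.0 "Char" production.
_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def lines_with_illegal_chars(lines):
    """Yield (line number, [offsets]) for each line holding an illegal char."""
    for lineno, line in enumerate(lines, start=1):
        hits = [m.start() for m in _ILLEGAL.finditer(line)]
        if hits:
            yield lineno, hits


def make_visible(line: str) -> str:
    """Show C0 controls in caret notation (\\b becomes ^H), roughly like cat -v."""
    return "".join(
        "^" + chr(ord(ch) + 64) if ord(ch) < 0x20 and ch not in "\t\n" else ch
        for ch in line
    )
```

`lines_with_illegal_chars` accepts any iterable of strings, so an open dump file (opened in text mode with `encoding="utf-8"`) can be passed in directly.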

Note that <span typeof="mw:Entity" data-parsoid='{"src":"&amp;#8;","srcContent":"\b","dsr":[694,698,null,null]}'> means "there is an HTML entity here and it's &#8;, which is the \b character". It seems to me to be a bug in Parsoid that such characters can be output. Then again, this rev is 4 years old, so it's possible it was already fixed on their end.

Change 362169 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/extensions/Flow@master] filter out non-compliant characters (bad PCDATA) from revision text

https://gerrit.wikimedia.org/r/362169

Change 362173 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/extensions/Flow@master] clean up illegal chars in revision text retrieved during flow content dumps

https://gerrit.wikimedia.org/r/362173

Change 362169 merged by jenkins-bot:
[mediawiki/extensions/Flow@master] Dumps: filter out non-compliant characters (bad PCDATA) from revision text

https://gerrit.wikimedia.org/r/362169

Mattflaschen-WMF renamed this task from Bad revision text from Flow to Some Flow content contains control charcters (e.g. \b (backspace)). · Jul 28 2017, 8:59 PM
Mattflaschen-WMF renamed this task from Some Flow content contains control charcters (e.g. \b (backspace)) to Some Flow content contains control characters (e.g. \b (backspace)).
Catrope closed this task as Resolved. · Oct 11 2017, 4:56 PM
Catrope claimed this task.

I think this is good to close now, since we're working around the presence of control characters.

Change 362173 abandoned by ArielGlenn:
Clean up illegal chars in revision text retrieved during flow content dumps

Reason:
This was included in later versions of I787a26ff6004a875b71ef38905904b7c489f22d4 and it might as well stay there

https://gerrit.wikimedia.org/r/362173