
Bad UTF-8 in ThreadSignature breaks display and can't be exported
Open, LowPublic

Description

Export of one of the discussion threads (this is page ID 803932 in huwiki_p):

https://secure.wikimedia.org/wikipedia/hu/wiki/Speciális:Lapok_exportálása/Téma:Szerkesztővita:Dencey/Fölösleges_információk/válasz_(3)

contains invalid (probably truncated) UTF-8 in the thread poster's signature.

Hexdump of the export page reveals:

00000be0 74 3b 67 72 65 65 6e 26 71 75 6f 74 3b 20 66 61 |t;green&quot; fa|
00000bf0 63 65 3d 26 71 75 6f 74 3b 4c 75 63 69 64 61 20 |ce=&quot;Lucida |
00000c00 63 61 6c 6c 69 67 72 61 70 68 79 26 71 75 6f 74 |calligraphy&quot|
00000c10 3b 26 67 74 3b ce 93 ce bf cf 85 ce b2 ce b2 ce |;&gt;...........|
00000c20 bf cf 82 20 ce 98 ce b9 ce bb ce bf ce 3c 2f 54 |... .........</T|
00000c30 68 72 65 61 64 53 69 67 6e 61 74 75 72 65 3e 0a |hreadSignature>.|

The 0xCE byte at offset 0x00000c2c (the last byte before </ThreadSignature>) is a UTF-8 lead byte and must be followed by a continuation byte to form a valid UTF-8 sequence.
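
A quick way to confirm where decoding breaks is to let a strict UTF-8 decoder report the position. The sketch below is Python 3, the function name is hypothetical, and the bytes are copied by hand from the hexdump above, starting at the first Greek letter and ending just before </ThreadSignature>. It returns the index, within that byte string, of the first byte that cannot be decoded.

```python
# Signature bytes copied from the hexdump above, from the first Greek
# letter up to (not including) "</ThreadSignature>".
signature_tail = bytes.fromhex(
    "ce93 cebf cf85 ceb2 ceb2 cebf cf82 20 ce98 ceb9 cebb cebf ce"
)


def first_invalid_utf8_offset(data):
    """Return the index of the first undecodable byte, or None if valid."""
    try:
        data.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError as err:
        return err.start
    return None


offset = first_invalid_utf8_offset(signature_tail)
if offset is not None:
    print("invalid byte 0x%02x at index %d" % (signature_tail[offset], offset))
```

Dropping the final byte makes the same check pass, which matches the diagnosis that only the trailing lead byte is at fault.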

The XML dump process fails silently. The last page in these dumps:

http://download.wikimedia.org/huwiki/20110531/huwiki-20110531-pages-articles.xml.bz2

http://download.wikimedia.org/huwiki/20110614/huwiki-20110614-pages-articles.xml.bz2

is page ID 803931; no XML follows it, so the whole dump is invalid XML.

It gets compressed via bzip2, though.

This problem was reported on the pywikipedia mailing list by Bináris:

http://thread.gmane.org/gmane.comp.python.pywikipediabot.general/11335


Version: unspecified
Severity: major
URL: https://hu.wikipedia.org/wiki/Speci%C3%A1lis:Lapok_export%C3%A1l%C3%A1sa/T%C3%A9ma:Szerkeszt%C5%91vita:Dencey/F%C3%B6l%C3%B6sleges_inform%C3%A1ci%C3%B3k/v%C3%A1lasz_%283%29
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=47885

Details

Reference
bz29564

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 11:32 PM
bzimport set Reference to bz29564.
bzimport added a subscriber: Unknown Object (MLST).
saper created this task.Jun 24 2011, 1:30 PM
saper added a comment.Jun 24 2011, 1:49 PM

It looks like the database entry got truncated at the 256th byte:

select thread_signature from thread where thread_root=803932 \G

*************************** 1. row ***************************

thread_signature: <span title="bétaverzió"> <!--<font style="text-decoration: blink;">--><font color="red">♥</font><font color="white">♥</font><font color="green">♥</font> </font> [[User:Gubbubu|<font color="green" face="Lucida calligraphy">Γουββος ΘιλοÎ

The "thread_signature" field is a TINYBLOB (http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/LiquidThreads/lqt.sql?revision=72707&view=markup), but evidently no attempt is made to truncate UTF-8 content on a character boundary.

This means that the database entries need to be fixed first; adding the "shell" keyword and bumping priority.
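
For illustration, truncating to a byte limit without splitting a multi-byte character could look like the sketch below. This is Python 3 rather than the PHP that LiquidThreads is written in, the function name is hypothetical, and it is not the extension's actual code. The 255-byte default is the maximum length of a TINYBLOB.

```python
def truncate_utf8(text, max_bytes=255):
    """Encode text as UTF-8 and cut it to at most max_bytes bytes,
    never leaving a partial multi-byte sequence at the end."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return encoded
    cut = encoded[:max_bytes]
    # The input was valid UTF-8, so the only possible damage is an
    # incomplete sequence at the very end; errors="ignore" drops it.
    return cut.decode("utf-8", errors="ignore").encode("utf-8")


# 254 ASCII bytes followed by a two-byte Greek letter: 256 bytes in total.
sample = "a" * 254 + "\u03bf"
stored = truncate_utf8(sample)
print(len(stored))
```

The result may be a byte or more shorter than the limit, which is the price of ending on a character boundary.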

brion added a comment.Jun 24 2011, 6:00 PM

So we can split this into a few separate parts:

  • saving data into thread_signature fails to properly truncate long strings
  • LQT's extension to XML export fails to run UTF-8 validation & cleanup on output
  • old db entries potentially ought to get cleaned up (shell issue, but probably mostly irrelevant if the above is fixed)
brion added a comment.Jun 24 2011, 6:13 PM

r90723 fixes the XML export on trunk; one-line fix will be easy to merge to deployment.

Applies UtfNormal::cleanUp() on the XML chunk that LQT adds into the output stream; this is already applied on the rest of the export data via WikiExporter's xmlsafe() escaping wrapper etc.
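
As a rough analogue of this kind of output cleanup (not a port of UtfNormal::cleanUp(), whose other behaviour is not reproduced here), the Python 3 sketch below replaces every malformed sequence with U+FFFD REPLACEMENT CHARACTER before the data is written out. The function name and the sample bytes are assumptions made for illustration.

```python
def clean_utf8(data):
    """Return valid UTF-8 bytes, with each malformed sequence in data
    replaced by U+FFFD (encoded as EF BF BD)."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


# Four Greek letters followed by a dangling lead byte, as in the report.
broken = "\u0398\u03b9\u03bb\u03bf".encode("utf-8") + b"\xce"
cleaned = clean_utf8(broken)
print(cleaned.hex())
```

The cleaned bytes decode without error, so an XML parser would no longer reject the chunk, although the character that was cut off is still lost.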

saper added a comment.Jun 24 2011, 6:32 PM

Thanks for looking at this quickly.

I just went through the LQT wikis using the toolserver databases, issuing a query:

select thread_id, thread_signature from thread where length(thread_signature)=255;

sql enwikinews_p < problem.sql
sql enwiktionary_p < problem.sql
sql mediawikiwiki_p < problem.sql
sql ptwikibooks_p < problem.sql
sql strategywiki_p < problem.sql
sql sewikimedia_p < problem.sql
sql svwikisource_p < problem.sql
sql wikimania2010wiki_p < problem.sql
sql wikimania2011wiki_p < problem.sql

officewiki_p couldn't be checked because we don't have this one :)

Few wikis have signatures that long stored, but the above case in huwiki is the only one that ends with a broken UTF-8 sequence. Many signatures in the other databases ended up encoded as HTML entities, so they have no chance of breaking UTF-8 this way.

So it seems that only one row, the one with thread_id = 1288, needs to be updated in the huwiki_p database.
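
The check applied to each 255-byte signature can be sketched as follows. This is Python 3, and the function name, thread IDs and sample rows are made up for illustration; they are not the actual query results. An incremental decoder is fed the bytes without marking the end of input, and any bytes it is still holding afterwards are an incomplete trailing sequence.

```python
import codecs


def has_truncated_tail(data):
    """True if data ends in the middle of a multi-byte UTF-8 sequence."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    decoder.decode(data, final=False)  # keeps an incomplete tail buffered
    pending, _ = decoder.getstate()
    return len(pending) > 0


# Hypothetical (thread_id, thread_signature) rows, not real data.
rows = [
    (1, "\u0398\u03b9\u03bb\u03bf".encode("utf-8") + b"\xce"),  # cut mid-character
    (2, b"&#915;&#959;&#965;"),  # HTML entities are pure ASCII
    (3, "\u0398\u03b9\u03bb\u03bf".encode("utf-8")),  # complete UTF-8
]

flagged = [thread_id for thread_id, sig in rows if has_truncated_tail(sig)]
print(flagged)
```

A signature stored as HTML entities consists of single-byte characters only, so it can never be flagged by this check, whatever its length.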

Are the current dumps still missing a bunch of pages (as described in the original report)?

What content should go into the thread_signature field for thread_id 1288 in order to fix this manually for the one row?

Marcin: Could you answer comment 5, please?

saper added a comment.Jan 24 2013, 9:11 AM
  1. I just checked the current dump and it does not appear to be truncated after the above-mentioned page, but currently I can't find page ID 803931 in it. I'll double-check that, but a simple pywikipedia loop:

Python 2.7.3 (default, Sep 17 2012, 21:25:11)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import xmlreader
>>> z = xmlreader.XmlDump("huwiki-20121021-pages-articles.xml.bz2")
>>> for i in z.parse():
...     if i.id == 803931:
...         print repr(i)
...
Reading XML dump...

does not seem to give any results.

  2. To fix this entry in the database I would simply remove the last byte of the "thread_signature" field. Alternatively, the whole Greek text could be removed, changing this:

[[User:Gubbubu|<font color="green" face="Lucida calligraphy">Γουββος ΘιλοÎ

to

[[User:Gubbubu|Gubbubu]]

or something like that.

saper added a comment.Jan 24 2013, 9:43 AM

Sorry, I used the wrong dump above; I have now tried this one, with 0 results:

import xmlreader
z = xmlreader.XmlDump("huwiki-20130120-pages-meta-current.xml.bz2")
for i in z.parse():
    if i.id in [803931, 803932]:
        print repr(i)

Created attachment 11679
Dump of the text node of page 803932

Attached please find the result of running:

import xmlreader
out = open("803932.txt", "w")
z = xmlreader.XmlDump("huwiki-20130120-pages-meta-current.xml.bz2")
for i in z.parse():
    if i.id in ["803932"]:
        out.write(i.text.encode("utf-8"))
        break
out.close()

Interestingly, this body looks more complete than what is actually displayed at the URL of this bug. Is the output prepared for export of better quality than the rendered wiki page?


Created attachment 11680
XML dump of <page id="803932"/>

This is the node taken from the uncompressed dump.

It seems that the <ThreadSignature> part looks correct now:

00000380 62 75 7c 26 6c 74 3b 66 6f 6e 74 20 63 6f 6c 6f |bu|&lt;font colo|
00000390 72 3d 26 71 75 6f 74 3b 67 72 65 65 6e 26 71 75 |r=&quot;green&qu|
000003a0 6f 74 3b 20 66 61 63 65 3d 26 71 75 6f 74 3b 4c |ot; face=&quot;L|
000003b0 75 63 69 64 61 20 63 61 6c 6c 69 67 72 61 70 68 |ucida calligraph|
000003c0 79 26 71 75 6f 74 3b 26 67 74 3b ce 93 ce bf cf |y&quot;&gt;.....|
000003d0 85 ce b2 ce b2 ce bf cf 82 20 ce 98 ce b9 ce bb |......... ......|
000003e0 ce bf ef bf bd 3c 2f 54 68 72 65 61 64 53 69 67 |.....</ThreadSig|
000003f0 6e 61 74 75 72 65 3e 0a 3c 2f 44 69 73 63 75 73 |nature>.</Discus|

A few more bytes of the signature are now present: the dangling 0xCE has been replaced by EF BF BD, the UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER, and XML tools no longer complain about UTF-8.


saper added a comment.Jan 24 2013, 2:44 PM

To sum up:

  1. The dump looks okay.
  2. I am confused about the actual information in the database: the toolserver replica still shows truncated bytes in the database, and the webpage itself shows truncated wikitext, as does [[Special:Export]].

(In reply to comment #11)
> 2. I am confused about the actual information in the database: the toolserver
> replica still shows truncated bytes in the database, and the webpage itself
> shows truncated wikitext, as does [[Special:Export]].

To clarify: the content is truncated only before </ThreadSignature> and continues after it. The signature is also not displayed after that point, so this is a user-facing problem.

I'm reducing severity and updating the bug summary now that the export works.

Jorm removed a subscriber: Jorm.Dec 26 2015, 7:24 PM
Restricted Application added a subscriber: Aklapper.Dec 26 2015, 7:24 PM
Jdforrester-WMF lowered the priority of this task from Low to Lowest.Aug 4 2016, 11:34 PM
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

LiquidThreads has been replaced by StructuredDiscussions on all Wikimedia production wikis (except one, which will be done soon). It is no longer under active development or maintenance, so I'm re-classifying all open LQT tasks as "Lowest" priority.

Nemo_bis raised the priority of this task from Lowest to Low.Aug 5 2016, 7:37 AM