
Export function indicates wrong length (due to HTML entities converted into their ASCII representations)
Closed, Invalid, Public

Description

Special:Export indicates the wrong length in the exported XML.
It seems to me that HTML entities cause this problem. E.g. the export of the article "Vergleich (Zahlen)" from the German Wikipedia indicates in the text element: <text xml:space="preserve" bytes="23353">
Including the HTML entities, the exported text is 25659 bytes long; after converting all HTML entities into their ASCII representations, the article text is 23253 bytes long.
I would prefer the attribute to give the length as it appears within the XML, as that would make it easier to retrieve the content.

Event Timeline

Aklapper renamed this task from Export function indicates wrong length to Export function indicates wrong length (due to HTML entities converted into their ASCII representations). May 19 2018, 9:14 AM
Vvjjkkii renamed this task from Export function indicates wrong length (due to HTML entities converted into their ASCII representations) to urcaaaaaaa. Jul 1 2018, 1:09 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
CommunityTechBot renamed this task from urcaaaaaaa to Export function indicates wrong length (due to HTML entities converted into their ASCII representations). Jul 2 2018, 4:01 PM
CommunityTechBot raised the priority of this task from High to Needs Triage.
CommunityTechBot updated the task description.
CommunityTechBot added a subscriber: Aklapper.
Umherirrender subscribed.

The export is done in UTF-8 and with Unix newlines (only \n, not \r\n as on Windows).

The bytes attribute indicates the actual number of bytes, not the number of characters in the document. German umlauts take 2 bytes each, for example.
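To illustrate the difference between characters and bytes, here is a minimal Python sketch. The helper name `utf8_byte_length` and the sample string are made up for illustration and are not part of MediaWiki; the sketch only assumes UTF-8 encoding and Unix newlines, as described above.

```python
def utf8_byte_length(text: str) -> int:
    """Return the UTF-8 byte count of text after normalizing \r\n to \n.

    Hypothetical helper for illustration; not MediaWiki code.
    """
    normalized = text.replace("\r\n", "\n")
    return len(normalized.encode("utf-8"))


# Made-up sample: three umlaut/eszett characters and a Windows line ending.
sample = "Größe\r\nüber"

print(len(sample))               # number of characters, counting \r and \n
print(utf8_byte_length(sample))  # number of UTF-8 bytes with a Unix newline
```

Here the character count and the byte count differ both because ö, ß and ü each take 2 bytes in UTF-8 and because the \r is dropped.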

From a quick check, the current byte count of the page and that of the export match (after decoding the entities, which is normal processing in XML).
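A check like this can be sketched in Python with the standard library's `xml.etree.ElementTree`, which decodes entities while parsing. The `<text>` element below is a small made-up sample, not the actual export of the article, and `declared_and_actual_bytes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up element content as it would appear in the XML file, entities included.
RAW_INNER = "a &lt; b &amp; ü"
SAMPLE = f'<text xml:space="preserve" bytes="10">{RAW_INNER}</text>'


def declared_and_actual_bytes(text_element_xml: str) -> tuple[int, int]:
    """Return (bytes attribute, UTF-8 length of the entity-decoded text).

    Hypothetical helper for illustration; not MediaWiki code.
    """
    elem = ET.fromstring(text_element_xml)  # the parser decodes &lt;, &amp;, ...
    declared = int(elem.get("bytes"))
    decoded = elem.text or ""
    return declared, len(decoded.encode("utf-8"))


declared, actual = declared_and_actual_bytes(SAMPLE)
raw_length = len(RAW_INNER.encode("utf-8"))  # length with entities left encoded

print(declared, actual, raw_length)
```

In this sample the declared value equals the decoded length, while the raw length is larger because `&lt;` and `&amp;` take 4 and 5 bytes in the file but 1 byte each once decoded.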