Page MenuHomePhabricator

VisualEditor corrupts data and repeats the first ~10kB
Closed, ResolvedPublic0 Estimate Story Points

Description

Problem
Saving wikitext, or switching to source editor, from VisualEditor which has over ~10kB (long Lorem Ipsum text is enough) results in corrupted wikitext. Only the first ~10kB is taken, then appears some HTML tags (see below), and then appears the first ~10kB. This repeats again and again until the original size is reached.

<meta charset="utf-8" />
<meta property="mw:pageNamespace" content="0" />
<meta property="mw:pageId" content="1" />

<meta property="dc:modified" content="2017-12-13T15:01:56.000Z" />
<meta property="mw:revisionSHA1" content="5702e4d5fd9173246331a889294caf01a3ad3706" />
<meta property="isMainPage" content="true" />
<meta property="mw:html:version" content="1.6.0" />
[http://localhost/company/index.php?title=Main_Page]<title>Main Page</title><base href="http://localhost/company/index.php?title=" />[//localhost/company/load.php?modules=mediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.content.parsoid%7Cmediawiki.skinning.interface%7Cskins.vector.styles%7Csite.styles%7Cext.cite.style%7Cext.cite.styles%7Cmediawiki.page.gallery.styles&only=styles&skin=vector]<!--[if lt IE 9]><script src="//localhost/company/load.php?modules=html5shiv&only=scripts&skin=vector&sync=1"></script><script>html5.addElements('figure-inline');</script><![endif]-->

Steps to reproduce

  1. Fresh installation of MW + working VisualEditor extension only
  2. Edit any page
  3. Add long Lorem ipsum text
  4. Save or switch to source editor
  5. Text will be corrupted (repeated)

Environment

  • Windows 10 / Windows Server 2008 R2+
  • Node.js 8.9.3 LTS
  • PHP 7.0.26
  • Firefox / Chrome latest version
  • IIS
  • MW 1.30 stable / 1.31 alpha from GIT
  • VisualEditor for 1.30 / for 1.31 from GIT
  • Parsoid 0.8.0
  • DB engine for MW is SQLite

config.yaml for Parsoid

worker_heartbeat_timeout: 300000

logging:
    level: info
services:
  - module: lib/index.js
    entrypoint: apiServiceWorker
    conf:
        mwApis:
        - uri: 'http://localhost/company/api.php'
          domain: 'company'

LocalSettings.php

<?php
if ( !defined( 'MEDIAWIKI' ) ) {
	exit;
}

$wgSitename = "Company";

$wgScriptPath = "/company";

$wgServer = "http://localhost";

$wgResourceBasePath = $wgScriptPath;

$wgLogo = "$wgResourceBasePath/resources/assets/wiki.png";


$wgEnableEmail = true;
$wgEnableUserEmail = true;

$wgEmergencyContact = "apache@localhost";
$wgPasswordSender = "apache@localhost";

$wgEnotifUserTalk = false;
$wgEnotifWatchlist = false;
$wgEmailAuthentication = true;

$wgDBtype = "sqlite";
$wgDBserver = "";
$wgDBname = "wiki";
$wgDBuser = "";
$wgDBpassword = "";

$wgSQLiteDataDir = "C:\\Dev\\Web\\company\\App_Data";
$wgObjectCaches[CACHE_DB] = [
	'class' => 'SqlBagOStuff',
	'loggroup' => 'SQLBagOStuff',
	'server' => [
		'type' => 'sqlite',
		'dbname' => 'wikicache',
		'tablePrefix' => '',
		'dbDirectory' => $wgSQLiteDataDir,
		'flags' => 0
	]
];

$wgMainCacheType = CACHE_NONE;
$wgMemCachedServers = [];

$wgEnableUploads = false;

$wgUseInstantCommons = false;

$wgPingback = false;

$wgShellLocale = "C.UTF-8";

$wgLanguageCode = "en";

$wgSecretKey = "d85e1b836ff04d1075ffb06cd3ee7d64bcc99a10ee701587f68a909c548f1160";

$wgAuthenticationTokenVersion = "1";

$wgUpgradeKey = "9c4e319200f92f43";

$wgRightsUrl = "";
$wgRightsText = "";
$wgRightsIcon = "";

$wgDiff3 = "";

$wgDefaultSkin = "vector";

wfLoadSkin( 'Vector' );

wfLoadExtension( 'VisualEditor' );

$wgDefaultUserOptions['visualeditor-enable'] = 1;
$wgDefaultUserOptions['visualeditor-editor'] = "visualeditor";
$wgHiddenPrefs[] = 'visualeditor-enable';
$wgVisualEditorShowBetaWelcome = false;
$wgVisualEditorAvailableNamespaces = [
    NS_MAIN => true,
    NS_USER => true,
    102 => true,
    "_merge_strategy" => "array_plus"
];

$wgVirtualRestConfig['modules']['parsoid'] = array(
	'url' => 'http://localhost:8000',
	'domain' => 'company'
);

Example picture after save

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptDec 20 2017, 12:54 PM
Czech.Fox updated the task description. (Show Details)Dec 20 2017, 12:55 PM

Command line "node bin/server.js" returns these messages when this issue occurs:

{"name":"lib/index.js","hostname":"DEVELOPER","pid":6324,"level":50,"logType":"error/html2wt/link","wiki":"wiki$0","title":"Main_Page","oldId":"31","reqId":null,"userAgent":"VisualEditor-MediaWiki/1.30.0","msg":"Encountered <link rel=\"dc:replaces\" resource=\"mwr:revision/30\"> -- serializing as extlink and dropping <a> attributes unsupported in wikitext.","longMsg":"Encountered\n<link rel=\"dc:replaces\" resource=\"mwr:revision/30\">\n-- serializing as extlink and dropping <a> attributes unsupported in wikitext.","levelPath":"error/html2wt/link","time":"2017-12-21T09:07:50.488Z","v":0}

or

{"name":"lib/index.js","hostname":"DEVELOPER","pid":6324,"level":30,"logType":"info/html2wt","wiki":"wiki$0","title":"Main_Page","oldId":"27","reqId":null,"userAgent":"VisualEditor-MediaWiki/1.30.0","msg":"Incompatible constraints (merge): {\"a\":{},\"b\":{\"min\":0,\"max\":1},\"min\":1,\"max\":1} {\"a\":{},\"b\":{\"min\":2,\"max\":2},\"min\":2,\"max\":2,\"constraintInfo\":{\"onSOL\":false,\"sepType\":\"sibling\",\"nodeA\":\"P\",\"nodeB\":\"P\"}} {\"a\":{\"min\":1,\"max\":2},\"b\":{\"min\":0,\"max\":1},\"min\":1,\"max\":1}","longMsg":"Incompatible constraints (merge):\n{\"a\":{},\"b\":{\"min\":0,\"max\":1},\"min\":1,\"max\":1}\n{\"a\":{},\"b\":{\"min\":2,\"max\":2},\"min\":2,\"max\":2,\"constraintInfo\":{\"onSOL\":false,\"sepType\":\"sibling\",\"nodeA\":\"P\",\"nodeB\":\"P\"}}\n{\"a\":{\"min\":1,\"max\":2},\"b\":{\"min\":0,\"max\":1},\"min\":1,\"max\":1}","levelPath":"info/html2wt","time":"2017-12-21T09:05:57.474Z","v":0}
Czech.Fox updated the task description. (Show Details)Dec 21 2017, 9:23 AM

Do you have an example of the test you're using? Please put it in a paste somewhere.

Czech.Fox added a comment.EditedJan 2 2018, 11:20 AM

In attached URL is ZIP package with wiki. It's a clean installation of 1.30 with VisualEditor. Main page was edited and saved with 50 paragraphs of Lorem ipsum in plain text from lipsum.com, no formatting. The package should be unpacked to folder "C:/Dev/Web/" and run under "http://localhost/T183356" .

https://drive.google.com/file/d/1NocZvIuhGwoqTxW6-ZfQiWhOg1pPvHcp/view?usp=sharing

Configuration for parsoid:

worker_heartbeat_timeout: 300000

logging:
    level: info
services:
  - module: lib/index.js
    entrypoint: apiServiceWorker
    conf:
        mwApis:
        - uri: 'http://localhost/T183356/api.php'
          domain: 'T183356'

Thanks. 50 paragraphs of plaintext should not cause any problems.

The first error message seems to indicate that Parsoid is receiving a corrupted DOM to serialize.

Encountered <link rel=\"dc:replaces\" resource=\"mwr:revision/30\"> -- serializing as extlink and dropping <a> attributes unsupported in wikitext.

It would be helpful to see what VE is sending. Make some edits to the page and hit save, but before submitting, run the following command in the browser console and paste that somewhere for us to inspect.

From https://www.mediawiki.org/wiki/Parsoid/Debugging#Dumping_HTML_DOM_in_VE

copy(ve.init.target.docToSave.body.outerHTML)

I'm attaching the result of the command. To me, it looks fine at this point.

I'm also attaching data from browser dev tools after hitting the save. 2 api.php calls are made. In the second call, the response already contains corrupted data.

1st request



2nd request



To me, it looks fine at this point.

Agreed.

In the second call, the response already contains corrupted data.

Yup, so the first request results in a cache miss. The payload is the second request seems fine, however, the response looks like a conflation of Parsoid's html and the php parser's output.

{"visualeditoredit":{"content":"<div class=\"mw-parser-output\"><p>Lorem ipsum dolor 
...
\n&lt;meta charset=\"utf-8\" /&gt;\n&lt;meta property=\"mw:pageNamespace\" content=\"0\" /&gt;\n&lt;m
...
ie blandit turpis, a ul\n</p>\n<!-- \nNewPP limit report\nCached time: 20180103115659\nCache expiry: 86400\nDynamic content: false\nCPU time usage: 0.000 seconds\nReal time usage: 0.004 seconds\nPreprocessor visited node count: 8/1000000\nPreprocessor generated node count: 22/1000000\nPost\u2010expand include size: 0/2097152 bytes\nTemplate argument size: 0/2097152 bytes\nHighest expansion depth: 1/40\nExpensive parser function count: 0/100\n-->\n<!--\nTransclusion expansion time report (%,ms,calls,template)\n100.00%    0.000      1 -total\n-->\n</div>\n<!-- Saved in parser cache with key wiki:pcache:idhash:1-0!canonical and timestamp 20180103115659 and revision id 8\n -->\n",

which, if I had to venture a guess, suggests a caching error, or at least something to do with sqlite.

Can you try commenting out from your LocalSettings.php,

$wgObjectCaches[CACHE_DB] = [
	'class' => 'SqlBagOStuff',
	'loggroup' => 'SQLBagOStuff',
	'server' => [
		'type' => 'sqlite',
		'dbname' => 'wikicache',
		'tablePrefix' => '',
		'dbDirectory' => $wgSQLiteDataDir,
		'flags' => 0
	]
];
Czech.Fox added a comment.EditedJan 4 2018, 8:25 AM

After commenting out, the VE almost doesn't work because of an error in api.php calls.

When opening VE it displays a message about HTTP 500. After clicking OK, HTTP 500 may appear again (random), or the VE loads correctly.

The call about HTTP 500 is (GET method):
http://localhost/T183356/api.php?action=visualeditor&format=json&paction=parse&page=Main_Page&uselang=en


When trying to save changes in VE the api call is either never finished, or crashes with the same HTTP 500 as above. So the editor displays error 500 or 0.

When activating $wgObjectCaches[CACHE_DB] parameter back, it all works as before.

Hmm, it's throwing on,

Function: Wikimedia\Rdbms\Database::query (MessageCache::loadFromDB(en)-big)

You should have $wgMainCacheType = CACHE_NONE; in the LocalSettings.php, which $wgMessageCacheType should inherit from,

https://www.mediawiki.org/wiki/Manual:$wgMessageCacheType

I guess try being explicit about all three of these while we're trying to figure this out,

$wgMainCacheType = CACHE_NONE;
$wgMessageCacheType = CACHE_NONE;
$wgParserCacheType = CACHE_NONE;

I set explicitly all three and is fine now. I tested again only with default $wgMainCacheType = CACHE_NONE; (which is generated by the installation page), but in this case it crashes. I tried fresh restart of IIS and Nodejs, and clean all cookies.

Console command is ok. 1st call of api.php about badcachekey is same. I'm attaching just the 2nd call.


And this is my LocalSettings.php

Hmm, guess that wasn't it?

Since the payload VE is sending inflates to something reasonable but Parsoid seems to be trying to serializing parts of the header as if there're in the body (the first error in T183356#3853438), maybe let's try logging the html it's receiving. Can you a logging statement to Parsoid and restart the service?

env.log('warning', html);

at https://github.com/wikimedia/parsoid/blob/86fa722032ec3c9f17a516b9b452bc790389360a/lib/api/routes.js#L590

however, the response looks like a conflation of Parsoid's html and the php parser's output.

The content should be the result of the php parser's render so the page can update w/o refreshing. Bit of confusion on my part there.

https://github.com/wikimedia/mediawiki-extensions-VisualEditor/blob/master/ApiVisualEditorEdit.php#L330-L332

Arlolra triaged this task as Medium priority.Jan 4 2018, 11:15 PM

I changed the command to env.log('info', html); because I did not see output in CLI otherwise.

Here is the complete output from CLI

I did not see output in CLI otherwise.

Sorry, should have been "warn" instead of "warning".

Here is the complete output

Ok, looks like it's already corrupted by the time it hits there.

The path it takes through MW's action API is as follow,

https://github.com/wikimedia/mediawiki-extensions-VisualEditor/blob/master/ApiVisualEditorEdit.php#L278

to

https://github.com/wikimedia/mediawiki-extensions-VisualEditor/blob/master/ApiVisualEditorEdit.php#L155

Maybe the problem is the decompression there? I would try to error_log() the result of self::tryDeflate( $params['html'] ) to confirm if that's where things are going wrong.

Here are results:

I added error_log(self::tryDeflate( $params['html'] )); before line https://github.com/wikimedia/mediawiki-extensions-VisualEditor/blob/master/ApiVisualEditorEdit.php#L154 with result:

Then I added error_log($wikitext); before line https://github.com/wikimedia/mediawiki-extensions-VisualEditor/blob/master/ApiVisualEditorEdit.php#L157 with result (here I see corrupted text):

I digged one step deeper, and added error_log($path . "\n\n" . json_encode($request)); before line https://github.com/wikimedia/mediawiki-extensions-VisualEditor/blob/master/ApiVisualEditor.php#L96

and error_log($path . "\n\n" . json_encode($response)); after that line

Deskana set the point value for this task to 0.

I was debugging more the issue (tried also PHP 5.6.33 x86 NTS), and stopped in MultiHttpClient.php file. Here I don't know how to continue with debugging.

I logged the request and response. Request is fine, but response is wrong. When we debugged Parsoid part, it was already corrupted in input.

I added error_log(json_encode($req)); to https://github.com/wikimedia/mediawiki/blob/master/includes/libs/MultiHttpClient.php#L400

and error_log($data); before line https://github.com/wikimedia/mediawiki/blob/master/includes/libs/MultiHttpClient.php#L413

Can you help me where to continue with debugging, because I don't know the architecture.

It certainly seems like what's being POSTed is fine, but we verified in T183356#3877500 that what's received is already corrupted. Is there any proxy sitting in between the two?

You can try taking the content from

I added error_log(json_encode($req)); to https://github.com/wikimedia/mediawiki/blob/master/includes/libs/MultiHttpClient.php#L400

and removing MediaWiki from the equation. Try sending something like,

curl -d ... -X POST http://localhost:8000/T183356/v3/transform/html/to/wikitext/Main_Page/50

directly to Parsoid and seeing if it serializes it as expected.

To simplify things, you can enable devAPI: true around here, https://github.com/wikimedia/parsoid/blob/master/config.example.yaml#L79
and then visit http://localhost:8000/_html to paste just the body.html there and submit.

Maybe you're hitting some sort of php-curl max post field size limit or something? Maybe it's some "application/x-www-form-urlencoded" or "multipart/form-data" encoding issue? Unfortunately, this isn't something we've experienced and seems specific to your local setup.

Czech.Fox added a comment.EditedJan 11 2018, 12:15 PM

Thank you for other debugging tips.

I tried devAPI on Parsoid and the result seems fine. Running pure command always resulted in No html was supplied. no matter what I wrote, so I skipped this part.

After some research I added curl_setopt($ch, CURLOPT_PROXY, '127.0.0.1:8888'); to https://github.com/wikimedia/mediawiki/blob/master/includes/libs/MultiHttpClient.php#L301 and used Fiddler4 to catch the communication. I don't know why but I wasn't able to otherwise catch any communication between PHP-curl and Nodejs in Fiddler4 or Wireshark.

The request comming from php-cgi.exe to localhost:8000 is already broken at that moment. Attaching raw request/response with header. The request already repeats part of text and then limits. Strange to me.


Flow in Fiddler4

Is there any proxy sitting in between the two?

No proxy.

Unfortunately, this isn't something we've experienced and seems specific to your local setup.

I'm aware of this now. MW is not usually run from Windows machine. I will do another test later on another machine in different network.

Aha!

we got an interesting regression reported by @jeroen that I just managed to reproduce with this test case and the amended commit. It shows that the output data is somehow repeating after the first 16K!

https://github.com/jeroen/curl/issues/126
https://github.com/curl/curl/pull/1988

Czech.Fox added a comment.EditedJan 12 2018, 8:34 AM

Yes! Yes!

Thank you Arlolra for the finding.

I counted ~10K for Lorem ipsum only, and with html tags can be 16K.

I tried 7.56 this evening and was able to reproduce this. It was fixed in 5f9e2ca (#1988) which is in 7.56.1. I suggest updating to curl 7.57. If you still have problems after that then investigate the encoding.
https://github.com/curl/curl/issues/2198

I see on PHP site that they have 7.56.0 for compilation (libcurl-7.56.0-2-vc15-x86.zip). So, it's still with this bug for Windows PHP. http://windows.php.net/downloads/php-sdk/deps/vc15/x86/

I see now

1/12/2018 11:59 AM 2225683 libcurl-7.57.0-vc15-x86.zip

Sorry im kinda new to this. So we have to wait for a new PHP build or ?
I tried getting the Curl 7.57, but if i did it correct, i am still getting the same truncate error.

I see now
1/12/2018 11:59 AM 2225683 libcurl-7.57.0-vc15-x86.zip

Super. Good to see.

Sorry im kinda new to this. So we have to wait for a new PHP build or ?
I tried getting the Curl 7.57, but if i did it correct, i am still getting the same truncate error.

Yes, we have to wait for new Windows PHP build.

Windows PHP is built with libcurl inside, so there is no way to change it, unless you make your own build of PHP binaries.
Versions 5.6.33/7.0.27/7.1.13 contain libcurl-7.56.0.
In order to find out which version the build contains, download PHP source code, and open file /win32/build/libs_version.txt

I tested PHP 7.1.14RC1 (php-7.1.14RC1-nts-Win32-VC14-x64.zip) from http://windows.php.net/qa/ and the issue is gone ! I tried 100 paragraphs of Lorem ipsum, and after save it is ok. Parsoid console does not report any error.

Arlolra closed this task as Resolved.Jan 22 2018, 5:18 PM
Arlolra claimed this task.

Glad to hear it :)

Restricted Application added a project: User-Ryasmeen. · View Herald TranscriptJan 22 2018, 5:18 PM