Page MenuHomePhabricator

HTML comments cause invisible text in tables
Closed, ResolvedPublic

Description

Author: beesley

Description:
If you use the following table, the "Weird invisible text" will be invisible
when you preview the page before saving it. It shows up normally if you save the
page and only affects the page preview. If you remove <!--bar-->, it all works
normally.

{|

-
valign="top"

<!--foo--> Weird invisible text

}

<!--bar-->


Version: 1.3.x
Severity: minor
URL: http://en.wikipedia.org/w/wiki.phtml?title=User:Angela/Sandbox&oldid=5879312

Details

Reference
bz487

Event Timeline

bzimport raised the priority of this task from to Medium.Nov 21 2014, 6:56 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz487.
bzimport added a subscriber: Unknown Object (MLST).

I can reproduce the disappearing text, but I don't see it showing "normally if you save the page".

beesley wrote:

http://en.wikipedia.org/w/wiki.phtml?title=User:Angela/Sandbox&oldid=5879312 is
a saved version of it. The text is not invisible there (which is what I meant by
showing normally). It's only when you preview the page that it's invisible.

It seems to stay invisible if there's any following text:
http://meta.wikimedia.org/w/wiki.phtml?title=Sandbox&oldid=60172

Perhaps the preview/editing-related stuff triggers the same thing.

wmahan_04 wrote:

Make preview more consistent with displayed article

The fact that the preview behaves differently from viewing the article is due
to the way newlines are appended to the preview text (I'm not sure why). If
this patch is applied, "Weird invisible text" shows up in the preview, as
expected.

But if you add <p><p> to the end of the testcase, you can reproduce the problem
both in the preview and in the saved article, as Brion said.

Attached:

The patch may be a good thing (please check that it produces correct spacing in the output with both preview before & preview after edit box options
set), but the root problem is the vanishing text itself...

wmahan_04 wrote:

Indeed, the root problem is this line in Parser.php:

  1. Remove HTML comments $text = preg_replace( '/(\\n *<!--.*--> *(?=\\n)|<!--.*-->)/sU', '$2', $text );

The regular expression is supposed to remove comments, but it transforms this:

BEGIN text

{|

-
valign="top"

<!--foo--> Weird invisible text

}

<!--bar-->

<p>

END text

into this:

BEGIN text

{|

-
valign="top"

<p>

END text

The desired result of the regex is:

BEGIN text

{|

-
valign="top"

Weird invisible text

}

<p>

END text

The regex will need to be fixed; I'll try to take a look at this more later.

jeluf wrote:

replacing it with

$text = preg_replace( '/(\\n *<!--.*--> *|<!--.*?-->)/sU', '', $text );

works for me. I don't see why this should be anchored to an end-of-line.
Tested it with some combinations, and didn't break any trivial example.

Committed to CVS HEAD for further testing.

wmahan_04 wrote:

The problem with that is if you have something like

abc
<!-- comment --> def

your code eats the newline and gives "abc def", while the original code gives

abc
def

and the second seems more correct to me. This is really tricky.

jeluf wrote:

abc
def

would start preformatted text. But that should only happen if the line starts
with a blank. It doesn't. It starts with a <!--. So at least the space should
also be eaten:

abc
def

But,

abc
<!-- comment -->
def

should not become

abc

def

since that would add a paragraph, which is not intended.

But:

  • abc

<!-- comment --> def

currently becomes

  • abc def

and this probably should not happen

wmahan_04 wrote:

I was just pointing out that your code represents a change in behavior from the
original. In addition to the problem you mentioned, what about this:

abc

<!-- comment --> def

That looks more like preformatted text, but your regexp would still make "abc
def", generating only normal text.

The only thing I have been able to convince myself is correct so far is this:

$text = preg_replace( '/\\n *<!--.*--> *(?=\\n)/sUe',

'(strstr("$1", "-->")) ? "$0" : ""', $text );

$text = preg_replace( '/<!--.*-->/sU', '', $text );

it's slow and quite ugly, but I think at least does what we want.

wmahan_04 wrote:

Err, I meant

$text = preg_replace( '/\\n *<!--(.*)--> *(?=\\n)/sUe',

'(strstr("$1", "-->")) ? "$0" : ""', $text );

$text = preg_replace( '/<!--.*-->/sU', '', $text );

This gets your example right:

abc
<!-- comment -->
def

But another constraint is that

abc
<!-- comment -->
<!-- comment -->
def

should give the same thing; at least, that's how it works now. To make that work
you can't capture the newline following the first comment, because then the
second comment won't match the pattern and have its newline removed. But without
capturing the newline at the end of a comment, how can you know whether to
remove trailing whitespace?

In short, I'm not sure there's a good way to do this without either changing the
current behavior, or doing two separate passes as I did above.

wmahan_04 wrote:

Custom function to remove comments

I got tired of trying to make a regexp do the right thing, so I wrote a custom
function to strip out the comments by hand. I hope this will be more robust
than the previous methods, and in my tests, it was faster. :)

Let me know what you think.

Attached:

wmahan_04 wrote:

I just realized the change JeLuF committed is very broken. Now in the text

abc <!-- test -->
blah blah blah blah
<!-- test -->

all the "blah"s get removed.

The "U" modifier makes all quantifiers ungreedy by default, so the ? acually
enables greediness. Oops!

I also think the problem with the problem with lists and preformatted text must
be addressed. My second patch would fix this, but if it's too ugly, I can look
for another solution....

wmahan_04 wrote:

Sorry for the spam, but I think this will do the trick:

$text = preg_replace_callback( '/(\\n)??( *)<!--.*-->( *(?=\\1))??/sU',
removeComment, $text );
function removeComment($m) {

if ($m[1] === "\n")
        return '';
else
        return $m[2] . $m[3];

}

although the custom function still seems faster for larger inputs.

jeluf wrote:

Committed the custom function to CVS HEAD.