
Edit summaries containing many stacked combining characters ("Zalgo") can overflow the line box
Closed, Declined (Public)

Description

This is a semi-standard Unicode "abuse": stacking many combining characters on one base character makes some rendering engines, in some contexts, draw diacritics that extend far beyond the usual line box, often overlapping adjacent lines.

See example here:
https://test2.wikipedia.org/w/index.php?title=Test_page_ugg&action=history

Enwiki VPT discussion: https://en.wikipedia.org/w/index.php?title=Wikipedia:Village_pump_(technical)&oldid=828758923#Strange_page_markings

Example:

Example for T188865 (418×1 px, 49 KB)

Event Timeline

Suggest a blacklist or whitelist be used for input validation of edit summaries, or have the parser ignore these somehow on display.

Xaosflux renamed this task from Edit summaries causing certain control characters lead to dispay issues in histories and diffs to Edit summaries containing certain control characters lead to dispay issues in histories and diffs. Mar 4 2018, 4:24 PM
Xaosflux updated the task description.

I do not see any software misbehavior here as the "display issues" look like correct rendering?

Actually the point is to have the software deny characters with so many superscript/subscript characters attached to the first character. It should be at most 4 stacked characters.

To clarify, in an experiment, @Samtar used ridiculous text in the edit summary and made his contributions page unreadable. I had to delete the edit summaries to clean it up.

@Aklapper see the image attached to the description

If it is actually desired to support that many vertically shifted characters (which I think is a bad idea), the boundary between rows should be maintained to prevent characters from one entry from overlapping another.

Anomie renamed this task from Edit summaries containing certain control characters lead to dispay issues in histories and diffs to Edit summaries containing many stacked combining characters can overflow the line box. Mar 4 2018, 10:31 PM
Anomie updated the task description.

These are not "control characters", they're Unicode combining characters.

It's not as simple as blacklisting or whitelisting certain characters, since these are valid in some languages. It'd have to be a check for "excessive" combining characters, whatever the appropriate definition of "excessive" might be. I don't know how many combining characters on a single base character are valid in various languages.

Similarly, ignoring them on display would mean not displaying diacritics for languages that need them. Again, unless it only stripped "excessive" combining characters.

I also note this isn't new, T32408: Histories and diffs being obscured by 'watermark' text was about the same issue (enwiki admins can look at this revision-deleted edit). The longer limit just allows more characters in an edit summary abusing this.
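For illustration only, here is a minimal Python sketch of what a check for "excessive" combining characters might look like. The function names are hypothetical, and two things are assumptions rather than anything settled in this task: treating only the Unicode categories Mn and Me as stacking marks, and the default limit of 4 (simply the figure floated earlier in this thread, not a value validated against real languages).

```python
import unicodedata

# Assumption: only nonspacing (Mn) and enclosing (Me) marks are counted.
# Spacing marks (Mc) are left alone because they do not stack vertically.
STACKING_CATEGORIES = ("Mn", "Me")


def max_combining_run(text: str) -> int:
    """Return the length of the longest run of consecutive combining marks."""
    longest = 0
    current = 0
    for ch in text:
        if unicodedata.category(ch) in STACKING_CATEGORIES:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def strip_excess_combining(text: str, limit: int = 4) -> str:
    """Keep at most `limit` combining marks in each run; drop the rest.

    The default of 4 is a placeholder taken from this discussion,
    not a linguistically verified bound.
    """
    kept = []
    run = 0
    for ch in text:
        if unicodedata.category(ch) in STACKING_CATEGORIES:
            run += 1
            if run > limit:
                continue  # drop marks beyond the limit
        else:
            run = 0
        kept.append(ch)
    return "".join(kept)
```

A summary such as `"a" + "\u0301" * 10` would report a run of 10 and be trimmed to four marks, while ordinary accented text like `"e\u0301"` passes through unchanged. Whether any fixed limit is safe for every script is exactly the open question raised above.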

I’d like to think I’ve seen a lot of languages and of all the ones I’ve seen the characters usually never had more than 2. So I think it’s safe to cap it at 4 combining characters.

  1. @Cyberpower678 Perhaps we can make an edit filter for high counts of U+0300 to U+036F in the summaries
  2. @Anomie without ignoring, can these not still be bound to prevent flowing to another row during parsing (by making the current row taller)?

Undoubtedly, we can, and I'm hoping you can teach me some filter basics. As for separating the rows out more, I don't think that's possible unless the parser recognized where the character was flowing to. That would be a significant modification from my viewpoint.
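As a rough idea of the pattern such a filter could use, here is a sketch in Python (not in any wiki filter syntax). The name `looks_like_zalgo` and the threshold of five consecutive marks are assumptions made for illustration. The range U+0300 to U+036F is the one proposed above.

```python
import re

# Five or more consecutive code points from U+0300..U+036F
# (the Combining Diacritical Marks block). The threshold is an assumption.
ZALGO_RUN = re.compile("[\u0300-\u036f]{5,}")


def looks_like_zalgo(summary: str) -> bool:
    """Return True if the summary contains a long run of combining marks."""
    return ZALGO_RUN.search(summary) is not None
```

Note that this range covers only one block. Combining marks also live in U+1AB0 to U+1AFF, U+1DC0 to U+1DFF, U+20D0 to U+20FF and U+FE20 to U+FE2F, so a pattern limited to U+0300 to U+036F would miss stacks built from those.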

I’d like to think I’ve seen a lot of languages and of all the ones I’ve seen the characters usually never had more than 2. So I think it’s safe to cap it at 4 combining characters.

I've found that it's seldom safe to make assumptions in matters such as this. Rather than choosing a limit just because we think it's safe, we should probably get input from people who're familiar with encoding of languages from around the world.

  1. @Anomie without ignoring, can these not still be bound to prevent flowing to another row during parsing (by making the current row taller) ?

In some quick testing I've been unable to find a method that works reliably. Browsers seem not to take the diacritics into account when calculating the size of the text.

The best I can come up with would be to put a block element with an overflow-y: hidden CSS rule inside each <li>, to chop them off.
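To make the shape of that markup concrete, here is a small Python sketch that builds one list item with the summary wrapped in a clipping block. The helper name `wrap_summary_item` and the use of an inline style are assumptions for illustration. This is not how MediaWiki generates history rows.

```python
from html import escape


def wrap_summary_item(summary: str) -> str:
    """Build an <li> whose content sits in a block that clips vertical overflow."""
    # The inner block element carries the overflow-y: hidden rule, so marks
    # that extend above or below the line box are cut off at its edges.
    clipped = f'<div style="overflow-y: hidden">{escape(summary)}</div>'
    return f"<li>{clipped}</li>"
```

In practice the rule would more likely live in a stylesheet than in an inline attribute. The sketch only shows where the block element sits relative to the `<li>`.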

unless the parser recognized

Note "the parser" has nothing to do with this. It's inside the browser and/or OS's font rendering code.

When I referred to the parser, I meant that it would have to detect the overflow and then instruct the MW renderer to add extra space when rendering the output, but that is a significant change that is not worth the benefit.

Edit summaries don't go through the parser at all. They're processed by the Linker class, see formatComment and sub-methods.

And, as I said, it depends on the browser and font. On my system Firefox handles that summary rather differently:

screenshot.png (80×330 px, 16 KB)

@Aklapper see the image attached to the description

@Xaosflux: I did that before I added my comment "I do not see any software misbehavior here as the 'display issues' look like correct rendering?".
To me this task looks like trying to come up with a technical solution to an 'issue' only perceived by people who do not write in languages that use those characters. And I personally don't think it makes any sense to write and maintain code to 'fix' this. It's up to editors to write edit summaries that make sense, not to technology.

@Aklapper the "display issue" is that the bounds of one row are intruding on another row as in the example above. While this may change somewhat with browser preferences, as in Anomie's example, stock Firefox and stock Chrome both demonstrate the disruption when viewing the page while logged out.

True. That sounds like an issue with rendering performed by web browsers and out of scope for MediaWiki though. (I might be wrong.)

If it means anything, I've definitely seen this on Twitter and other major websites, where the text overlapped adjacent posts. Indeed it doesn't seem like an issue with the software, per se. I don't think even the browser is wrong to show it like this, but I can see how it could be disruptive on wikis like English Wikipedia. If you wanted to prevent it I would use CSS, as Anomie suggested.

Change 416700 had a related patch set uploaded (by Bartosz Dziewoński; owner: Bartosz Dziewoński):
[mediawiki/core@master] Limit impact of Zalgo text (multiple combining marks) in edit summaries

https://gerrit.wikimedia.org/r/416700

I am not really convinced that this patch is needed. In my opinion the right reaction here is to ban the offenders. But if it upsets people so much, we can stop the combining marks from overflowing to neighbouring lines of text.

Actually, this doesn't seem to be so easy. The patch also makes the list bullets disappear, which is actually reasonable, since they are shown outside of the list item, same as the text we want to hide. We would need to invent something to prevent that, and I don't really have time to spend on this bug. Perhaps someone else wants to pick this up. Sorry.

Before
image.png (184×971 px, 70 KB)
After
image.png (184×971 px, 48 KB)

Change 416700 abandoned by Bartosz Dziewoński:
Limit impact of Zalgo text (multiple combining marks) in edit summaries

Reason:
See

https://gerrit.wikimedia.org/r/416700

Copying over what I said on the patch: I think this might be a "solution looking for a problem" type thing. To my knowledge, we haven't actually observed this sort of disruption beyond those test edits. And you make a good point: scripts, gadgets and the like may intentionally put a dropdown or similar in this area of the DOM, and other MediaWiki installations may too, so it is hard to make assumptions.

I'd leave it up to the individual wikis to do something about it, if they want to. With CSS I think you can use overflow-y: hidden (only hiding vertically), so that the bullets aren't cut off.

Aklapper renamed this task from Edit summaries containing many stacked combining characters can overflow the line box to Edit summaries containing many stacked combining characters ("Zalgo") can overflow the line box. May 12 2018, 5:26 PM