
padleft:, padright:, and nowiki do not get on (guess the same for all parser functions); UNIQ exposed
Open, Low, Public

Description

Author: conrad.irwin

Description:
{{padleft:<nowiki>{{{1}}}</nowiki>|50}} -> 000000000{{{1}}} (which has length 16)
or -> 000000000<nowiki>{{{1}}}</nowiki> (which has length 33).

(I guess the length of the UNIQ blob thingies is 41 chars)

Of course, the horrible template hack {{str len}} measures this incorrectly (as it is based on the same mechanism):

{{str len|{{padleft:<nowiki>muchmuchmuchmuchmuchmuchlonger</nowiki>|50}}}} -> 50

The situation is worse when a <nowiki> is used for the padding:

{{padleft:<nowiki>{{{1}}}</nowiki>|84|<nowiki>{{{1}}}</nowiki>}}
-> UNIQ6d07314c6f88f655-nowiki-00000003-QIN{{{1}}}

{{padleft:<nowiki>{{{1}}}</nowiki>|86|<nowiki>{{{1}}}</nowiki>}}
-> {{{1}}}{{{1}}}


Version: unspecified
Severity: minor
URL: http://www.mediawiki.org/wiki/User:MarkAHershberger/BugDemo#Bug_22555

Details

Reference
bz22555

Event Timeline

bzimport raised the priority of this task to Low. Nov 21 2014, 11:00 PM
bzimport added a project: MediaWiki-Parser.
bzimport set Reference to bz22555.
bzimport added a subscriber: Unknown Object (MLST).

See also the demonstration of the bug/limitation documented on

http://en.wikipedia.org/wiki/Template:Str_left

And why this occurs: clearly, the {{padleft:}} parser function does not correctly filter its third parameter (which specifies which characters to use to pad the string given in the first parameter).

Note that the template documentation page above also demonstrates that padleft: does not properly handle HTML character references (such as &nbsp; or &#32;) found in this third parameter.

Exposing the "\x07UNIQ-...-nowiki-INUQ\x07" in the generated result may also reveal some internal states of the wiki server (Is there a possible leakage of secure information, allowing external attacks by revealing the value of the unique id ?).

All those horrible template hacks should come to an end. It would be much simpler to deploy parser functions that correctly perform basic string operations, filter out content that should be invisible (such as nowiki), and do not have the string length limitation, i.e.
{{#len:string}}
{{#left:string|maxlength|skip}}
{{#right:string|maxlength|skip}}

These functions should work with international characters encoded as UTF-8 sequences, should discard all nowiki unique ids (includeonly, noinclude and onlyinclude wiki markup, as well as HTML comments, are already preprocessed, as are inclusions of other templates or built-in functions within parameters), and should accept "fromindex" values where:

  • 0 means the position just before the start of string for strleft, or just past the end of string for strright

Negative or out-of-range values should be accepted, so that a "fromindex" and a "toindex" can simply be deduced from the parameters:

  • a missing skip parameter is interpreted as 0 for strleft, or as strlen for strright.
  • a negative maxlength is interpreted as 0 (return nothing).
  • overflowing or missing maxlength is interpreted as meaning the maximum number of characters in the left or right direction (according to the function name).
  • in #left: fromindex=skip ; toindex=skip+maxlength
  • in #right: toindex=strlen-skip, fromindex=strlen-skip-maxlength

e.g.: {{#left:text|1|n}} will return the n-th character (possibly multiple UTF-8 bytes), even if it's just a Unicode format control or combining character. But a shortcut function could be: {{#char:text|n}}
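A minimal sketch of the index arithmetic described above, written in Python purely for illustration (the names str_left/str_right are hypothetical; the default skip of 0 follows the "just before the start / just past the end" convention given earlier, and nowiki placeholders are not modelled here):

def _clamp(i, n):
    # accept negative or out-of-range indices by clamping into [0, n]
    return max(0, min(i, n))

def str_left(text, maxlength=None, skip=0):
    # rough model of the proposed {{#left:}}: skip characters from the left,
    # then take at most maxlength characters
    n = len(text)
    if maxlength is None or maxlength > n:
        maxlength = n              # missing/overflowing maxlength: take all remaining characters
    maxlength = max(0, maxlength)  # negative maxlength returns nothing
    return text[_clamp(skip, n):_clamp(skip + maxlength, n)]

def str_right(text, maxlength=None, skip=0):
    # rough model of the proposed {{#right:}}: toindex = strlen - skip,
    # fromindex = strlen - skip - maxlength
    n = len(text)
    if maxlength is None or maxlength > n:
        maxlength = n
    maxlength = max(0, maxlength)
    return text[_clamp(n - skip - maxlength, n):_clamp(n - skip, n)]

For example, str_left("abcdef", 2, 1) gives "bc" and str_right("abcdef", 2, 1) gives "de".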

Or there could also be a function like:

{{#cp:text|index}}

to return the (index+1)-th codepoint scalar value (the default value for index would be 0, i.e. the first character in the text), possibly extended with:

{{#cp:text|index|property tag}}

to return one of the supported Unicode character properties, using its tag as documented in the Unicode standard annex on regular expressions, i.e. the tags used in \p{tag}, for example (a rough sketch follows the list):

  • {{#cp:text|index|g}} to return the Unicode general category type (like "Lo")
  • {{#cp:text|index|script}} to return the Unicode script type (like "Latn")
  • {{#cp:text|index|Lo}} to return 1 (true) if the character is of general category "Lo", 0 (false) otherwise.
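As a rough illustration of this {{#cp:}} idea (the function name is hypothetical; only the general category is shown here, since script lookup needs data outside Python's standard library):

import unicodedata

def cp(text, index=0, tag=None):
    # hypothetical model of {{#cp:}}: return the codepoint at `index`,
    # or one of its Unicode properties
    ch = text[index]
    if tag is None:
        return ord(ch)                   # the codepoint scalar value
    if tag == "g":
        return unicodedata.category(ch)  # general category, e.g. "Lo"
    # any other tag is treated as a general-category test returning 1 or 0
    return 1 if unicodedata.category(ch) == tag else 0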

There is no need to treat presentational wiki or HTML markup specially. However, there should be a way to strip this markup from the input string using

{{#text:some wiki text}}

so that it strips the parameter down to the characters that would be generated as plain-text elements in HTML (removing all XML or HTML tags and inline wiki markup such as italic/bold quotes, as well as some magic keywords, category links and the targets of links, keeping only the visible text of a wiki link or of a thumbnailed image). It should also convert the remaining single quotes into character entities, for safe processing in other templates or parser functions.
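A very rough sketch of what such a {{#text:}} could do, assuming only regex-based stripping (real wikitext needs a proper parser, and the entity chosen for the quotes is just one possible choice):

import re

def text_only(wikitext):
    # keep only what would be rendered as plain text; illustration only
    s = re.sub(r'<!--.*?-->', '', wikitext, flags=re.S)   # HTML comments
    s = re.sub(r'<[^>]+>', '', s)                          # XML/HTML tags
    s = re.sub(r"'{2,}", '', s)                            # bold/italic quote runs
    # keep the label of piped links, the target of plain links
    s = re.sub(r'\[\[(?:[^|\]]*\|)?([^\]]*)\]\]', r'\1', s)
    return s.replace("'", '&#39;')                         # protect remaining quotes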

Note that the English Wikipedia Template:Trunc experiences similar problems, except that it uses {{padright:}} and places the template parameter string in the first parameter of {{padright:}}, which can also be truncated at the wrong position (including in the middle of a UNIQ sequence).

So both the {{padleft:}} and {{padright:}} parser functions should be corrected to skip over UNIQ-nowiki ids, when present, while counting characters, just as they already parse and skip over the multiple bytes of UTF-8-encoded non-ASCII characters.
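Conceptually, the fix amounts to not counting the strip markers when measuring length. A hedged sketch in Python (the marker pattern below is only a loose approximation of the "UNIQ...QINU" place-holders shown above; the exact delimiters and format are MediaWiki internals):

import re

# approximate pattern for the strip markers; NOT the exact internal format
STRIP_MARKER = re.compile(r'\x7fUNIQ[0-9a-f]*-\w+-[0-9A-Fa-f]+-QINU\x7f')

def visible_length(s):
    # count characters the way padleft:/padright: should: markers count as zero
    return len(STRIP_MARKER.sub('', s))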

conrad.irwin wrote:

Hi Phillippe,

This problem is nothing to do with Unicode (which MediaWiki is pretty good at, though I don't think anyone else in the world uses string indexing that merges combining characters with their base — nice though it would be).

The "basic" solution is just to unstripBoth() before passing anything to parser functions. However, the problem is working out what exactly the difference this will make to the assumptions on which all the existing extensions have been buikt, and fixing those extensions (or providing a mechanism whereby they don't break).

See Bug 8161 for a discussion of {{#text}}.

Obviously the nicest solution of all would be to use a real programming language :)

Related bug #16778

This seems to be a general problem with various parser functions, like:
#if:
#ifeq:
#ifexpr:
#expr:
#switch:
whose parameters do not have their "nowiki" sections properly filtered out where they should be; these nowiki sections will NEVER be returned, as the parameters are only used by the parser function itself, which computes and generates something else.

It is more complex to handle for padleft: and padright: (except for their length parameter), because parts of their parameters will be returned and should preserve some nowiki markup.


Note: Do we really need to generate such long UNIQ ids for <nowiki> sections? Should they simply be stripped out completely, or just replaced by a single control character (if they are only there to block whitespace trimming of template parameters)? After all, the final HTML tidy step will automagically remove this control character, which is invalid in HTML.

And if the only thing the UNIQ id does is prevent trimming in successive passes, isn't there a better way to represent whitespace that should be preserved?

  • For example, trimmable whitespace would remain as SPACE, TAB or NEWLINE, while all untrimmable whitespace would be temporarily encoded as some other C0 or C1 controls (such as \x0C for untrimmable SPACE, \x8A for untrimmable NEWLINE, \x89 for untrimmable TAB), and converted back to real whitespace just before the HTML generation step at the end of parsing, using a very fast byte-for-byte substitution that could reuse the same string buffer (a quick sketch of such a substitution follows this list).
  • It would greatly simplify the implementation of padleft: and padright: (when they parse UTF-8 characters, they would just have to skip over a single byte \x06, just like they skip over UTF-8 sequences...)
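A quick sketch of that byte-for-byte substitution, assuming the provisional encodings suggested above (\x0C, \x8A, \x89); purely illustrative:

# translation table restoring untrimmable whitespace just before HTML generation
RESTORE_WS = bytes.maketrans(b'\x0c\x8a\x89', b' \n\t')

def restore_untrimmable_whitespace(buf: bytes) -> bytes:
    # a single byte-for-byte pass over the buffer
    return buf.translate(RESTORE_WS)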

If you fear compatibility problems with the existing "padleft:" and "padright:", why not create "#padleft:" and "#padright:" that correct this legacy problem by stripping the UNIQ-nowiki ids?

The problem of stripping (and trimming) other presentational markup is clearly separate, but if stripping were available, it could be used within the call parameters of the existing padleft: and padright:

My opinion is that exposing the UNIQ ids is wrong in ALL cases. This should never be possible, and no extension should rely on any such assumption.

Horrible template hacks are already used on the English Wikipedia (notably in infoboxes, for computing a stripped-down URL and displaying only the domain name extracted from a specified, possibly long, URL).

I really think those infoboxes would be better off using a separate parameter (and if it's missing, simply displaying the full URL, even if it's long). For now this just generates parser errors (because the target URL is longer than 100 characters, which is not so uncommon).

Another note: Template:Str_left, which inherits this bug of {{padleft:}}, is used on the English Wikipedia in thousands of templates (notably stub templates and almost all infoboxes), so this bug can really affect a LOT of pages (probably hundreds of thousands of pages, including many talk pages and project pages).

The reason for this frequency is that those templates frequently use "Template:Str_trim" to trim spaces within their parameters, using a very costly method.

This also explains why so many English Wikipedia pages take SO LONG to render when they are edited. This bug, and the absence of lightweight support for basic string functions, is a SEVERE performance problem for Wikipedia, where the English version is consuming really too much CPU and memory resources.

I even suspect that this is one of the causes of the frequent crashes or unresponsive sites (and it affects all other Wikimedia projects that DON'T use these template hacks), and it also affects the speed of navigation on all sites.

Please consider providing TRUE basic string functions so that those very costly template hacks are no longer needed.

One of those missing functions is {{#trim:text}}, which is needed to trim the leading and trailing whitespace in positional template parameters (those not passed by name); instead, very costly template hacks are used to extract characters at specific positions, just to conditionally rebuild a string from those characters (and this is very often applied recursively across templates and embedded subtemplates).

Given the VERY HUGE usage of string manipulation templates on the English Wikipedia, the past decision NOT to support more efficient string manipulation parser functions in Wikipedia should be reconsidered.
The argument that "MediaWiki is not a programming language" has proven wrong, because MediaWiki is already used extensively in ways that need these string functions, but through very ugly template hacks that already try to bypass the decision not to use such "programming language" features, on almost all article pages (infoboxes and stub templates are present almost everywhere, sometimes several times on the same page).
Clearly this demonstrates that those string parser functions are really needed, and it would in fact be more reasonable to allow them directly as parser functions, in order to obsolete all these ugly [[en:category:String manipulation templates]] (str left, str right, trunc, str len, str rightc, str index, and so on...), or to rewrite their implementation in a much more efficient way to help the transition.

After all, MediaWiki is basically already a language for generating complex texts, and templates are already programming features. In comparison, WordPress already has these natural built-in functions, and sites built with WordPress are still not criticized for abusing these basic string manipulation functions.

This is the same rationale that was applied when #expr support was added, and then slightly extended to support a few trigonometric functions and safer conversions to integers (less limited and less bogus than the "round" operator).

Let's be pragmatic. The decision not to implement these features has already been bypassed extensively throughout the English Wikipedia, which already wants these functions and uses them in almost all articles of its main namespace (just count the number of pages, and the thousands of infobox and stub templates, already using these ugly string templates!).

matthew.britton wrote:

(In reply to comment #9)

bypassed extensively throughout English Wikipedia

When bad programmers die and go to hell, they have to read through the source code of the English Wikipedia's template namespace.

The "str find" template[1] searches for a string inside another string. It does this using the slowest, ugliest, most inefficient string searching algorithm ever created. I actually felt sick trying to read it.

Anyone who claims to have seen bad code should read these templates. Once they have done so, they will beg you for a 10,000-line undocumented VBScript function with which to cleanse their minds.

[1] [[Template:Str find]]

(In reply to comment #5)

Hi Phillippe,

This problem is nothing to do with Unicode...

I didn't write that. I just said that MediaWiki's ParserFunctions can already pass over UTF-8 byte sequences encoding a single code point, counting each such sequence as a single character in {{padleft:}} and {{padright:}} when counting characters. They should then also be able to pass over the "\x7FUNIQ-*" place-holders generated to protect spans of text that must not be reparsed as wiki code.

Note: if you use unstripBoth() when evaluating the parameters of padleft/padright, you risk re-exposing some wiki code present in the returned string, because it will no longer be protected by the place-holders, which will have disappeared from the source string before it is padded.

That's exactly why the code of padleft: and padright: has to pass over the place-holders when counting characters, if it has to truncate a source string that contains too many characters (no such counting is necessary if truncation does not occur and only padding characters are added).

When the maximum number of characters is reached, it should also be able to know if the last actual character counted was part of a nowiki section (protected by place-holders) or not (note that characters present in nowiki sections ARE to be counted, between these place-holders).

If the last character counted that reaches the maximum length was not within a nowiki section, truncation can occur at the position of this character. Otherwise, a new placeholder has to be generated to correctly maintain the nowiki restriction on the remaining part of the protected section.

This requires special code to handle the byte \x7F (and pass over the whole placeholder), and then to count the characters present in the referenced string that the placeholder represents. This code must know, for each position, whether it falls within a nowiki section or not.

For now the existing code just ignores the presence of placeholders, counting them as if they were normal characters. It only knows how to parse UTF-8 sequences when scanning the parameter, using a very basic method (it checks each byte's two highest bits, 7 and 6, to see whether it is a UTF-8 trailing byte; every trailing byte between \x80 and \xBF is simply not counted), but it makes no special case for the byte \x7F.

Note that using unstripBoth() will be less memory-efficient than using the string parsing loop, because it forces the parameter string to be reallocated and copied even before actually performing the truncation or padding. This is not necessary, as padleft/padright can work directly on the source string to perform either the truncation (at the correct position), or the padding (into a new string), or return the parameter itself (when the source string contains exactly the requested number of characters, not counting the placeholders and trailing bytes).

Writing this parsing loop is quite simple, but it requires managing one additional piece of state: whether the current character position is in a nowiki section or not, as signalled by the presence of placeholders whose UTF-8-encoded syntax only uses a fixed number of ASCII-only bytes.
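A sketch of such a scanning loop, in Python for brevity (the placeholder layout, a span delimited by 0x7F bytes, is an assumption here, not MediaWiki's actual format):

def count_chars(buf: bytes) -> int:
    # count characters in a UTF-8 byte string, not counting continuation
    # bytes (10xxxxxx) and skipping whole 0x7F-delimited placeholders
    count = 0
    i = 0
    n = len(buf)
    while i < n:
        b = buf[i]
        if b == 0x7F:
            end = buf.find(b'\x7f', i + 1)   # skip to the closing 0x7F
            i = n if end < 0 else end + 1
            continue
        if b & 0xC0 != 0x80:                 # not a UTF-8 trailing byte
            count += 1
        i += 1
    return count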


Anyway, I do think that the "UNIQ" encoding of nowiki placeholders is overkill (in fact this syntax is really overlong) when these nowiki sections are empty: it should be enough to represent these empty nowikis (which are used, for example, to protect leading or trailing spaces from whitespace stripping, or from interpretation as special wiki syntax at the beginning of a paragraph) simply as "\x7F\x7F" (you don't need anything else for an empty nowiki).

The double DEL is still needed, but only to disambiguate normal text from the inner unique id of a represented non-empty nowiki section, because those placeholders are also enclosed in a pair of DEL characters. You could even represent an empty nowiki as a single control byte "\x01" (which would simply be stripped out when generating the final HTML after the wiki code parsing).

The "UNIQ" id represented internally by the full placeholder is ONLY needed to represent a reference to a protected non-empty substring stored elsewhere in some strings table (this encoding is made so that the value of the id is preserved even if the text is converted to other case).

When performing successive substitutions of templates, we should not even use text appending (the whole final text can be reconstructed, recursively, by using these ids to represent stable substrings that need no further parsing in calling templates or parser functions: this could greatly speed up page generation, without copying large amounts of text across template substitutions each time).


Note that for parser functions that evaluate the actual text of the parameter in order to infer a value (for example the first parameter of #expr or #ifexpr, or the first parameter and case parameters of #switch before "="), only these fully processed parameters can effectively use unstripBoth().

This unstripBoth() should not be done for parameters that will be returned unchanged (if they are returned), such as the second or third parameter of #if* or the returned values in parameters 2 (and beyond) of #switch after the equals sign... For #switch, unstripBoth() should only be done after detecting the presence of the equals sign and splitting on it.

NOT fixed in the 1st parameter of {{#switch:...}}, which also still considers an input parameter like "&#x40;" or "&#64;" as different from "@".

NOT fixed in the 1st parameter of {{#expr:...}}, {{#if:...}}, {{#ifexpr:...}}, or in the 2 first parameters of {{#ifeq:...}}.

In all of these parser functions, the presence of a nowiki section will not work as expected. Those nowiki markers should also be stripped completely from these parameters (just as HTML comments already are) before processing the rest of the function, because nothing in these parameters will be part of the output of these functions, so nothing there needs protection against further wiki reparsing.

Finally, you still use overlong markers for those nowiki sections. A single DEL character could be used at the start and at the end of the text sections that need to be protected; the unique identifiers created in the current markers are overkill and serve absolutely no purpose. And an empty <nowiki/> could just as well be a single pair of DEL characters.

Note that the failure to treat numeric character references as equivalent to the natively UTF-8-encoded character is a separate bug. It does not affect just #switch but also:

"{{LC:&#x41;}}" just returns "&#x41;" (the same as "A", which is a CAPITAL letter !)

"{{LC:&#x4F;}}" just returns "&#x4f;" (same problem, only the hexadecimal digits are converted to lowercase, but this has no impact on the represented character itself which remains a CAPITAL letter).

Note that if there are nowiki sections in the input parameter of LC: and UC:, these markers should NOT be stripped.

So the return from

"{{LC:<nowiki>*</nowiki> SOME TEXT}}"

should first become:

"{{LC:\x7F*\x7X SOME TEXT}}"

(here \x7F means the DEL character alone, which is filtered out and rejected from the input wiki code as it is not plain-text). Then it will return:

"\x7F*\x7F some text"

If you strip the markers from the input, it will return:

"* some text"

And this would once again generate a list item according to the wiki syntax!

Hope this helps...

Concerning {{padleft:}} and {{padright:}}: if those functions have to return a string, it should still contain the nowiki markers, for the same reason as the one given for {{LC:...}}, because the output should not be reinterpreted as wiki syntax for any characters of the input that were (even partly) protected by a nowiki section.

So:

{{padright:|2|x<nowiki/>y}}

will in a first pass (before processing parser functions) be converted into

{{padright:|2|x\x7F\x7Fy}}

Then its result should be:

x\x7F\x7Fy

which only contains the 2 visible characters x and y (as expected), plus 2 DEL markers that will be filtered out after the end of the wiki parsing and just before the HTML generation.

This means that when counting the characters to output in the result of the function, you still have to output the DEL characters but count them as zero length, and you must also ensure that they are output in pairs.

So for:

{{padright:|2|xy<nowiki/>}}

will in a first pass (before processing parser functions) be converted into

{{padright:|2|xy\x7F\x7F}}

Then its result should be:

xy

(you can safely drop the rest of the filling parameter, because you have enough characters in the output, and the markers are still associated in pairs here, even if there are none of them).

But for:

{{padright:|2|x<nowiki>yz</nowiki>}}

will in a first pass (before processing parser functions) be converted into

{{padright:|2|x\x7Fyz\x7F}}

Then its result should be:

x\x7Fy\x7F

with an additional marker added after y, to preserve the pair started before y.
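A sketch of the truncation rule these three examples illustrate, assuming DEL (\x7F) is the only marker byte (hypothetical helper, not MediaWiki code):

DEL = '\x7f'

def pad_fill(fill, needed):
    # take `needed` visible characters from the padding string; DEL markers
    # count as zero width, and an unpaired marker is closed at the cut point
    out = []
    visible = 0
    dels = 0
    for ch in fill:
        if visible >= needed and dels % 2 == 0:
            break                      # enough characters and markers are balanced
        if ch == DEL:
            out.append(ch)
            dels += 1
        elif visible < needed:
            out.append(ch)
            visible += 1
    if dels % 2:
        out.append(DEL)                # keep markers paired
    return ''.join(out)

Called with the three fill strings above and needed=2, this gives x\x7F\x7Fy, xy, and x\x7Fy\x7F respectively.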

In summary:

  • some characters may be represented in the input as numeric character references or known named character references:
    • Those references may be converted early to UTF-8, EXCEPT IF they represent a character that may have a contextual meaning in the wiki or HTML syntax; all of them are ASCII punctuation, i.e. the characters in { < > & ' " = * # : ; - | }
    • Those exceptional characters should remain encoded as character references if they are in such a format, but these references can safely be unified into decimal character references, because the characters treated specially in the wiki syntax are all part of ASCII.
  • you must reject, in the 1st pass (before all processing of the wiki source), any DEL character present in the input.
  • you never need any UNIQ identifier sequence when converting <nowiki> sections.
    • you can just generate a single DEL character at the start and at the end.
    • all parser functions can safely pass over those DEL characters, and can safely discard them before processing, but only if the parameter itself is not used within the output of the parser function
    • otherwise, if the parameter is used partly or fully in the output, the output must make sure that there will be an even number of DEL characters in the final output
    • you can safely replace all sequences of 3 or more DELs by dropping them in pairs, e.g. replace 3 DELs by 1 DEL, 4 DELs by 2 DELs, 5 DELs by 1 DEL (see the sketch after this list)
    • if, at the end of generating the output of the parser function, there remains an odd number of DELs, then append an additional DEL to the output.
  • continue processing other parser functions.
  • continue by recognizing the special wiki syntax and HTML <elements> or special <elements>
  • when you have completely processed the wiki syntax, and just before generating the HTML, you can strip out all the remaining DELs that were only there to keep the wiki parser from acting on the protected text.
  • At the same time as you drop the remaining DELs, you may replace the remaining numeric character references for ASCII punctuation that plays no role in HTML, i.e. the characters in { ; : * # | - }, but not those in < ' & " >, which play a special role in HTML, either within a text element or within an attribute value.
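A sketch of those last two marker rules (collapse runs of three or more DELs by dropping pairs, then re-pair if an odd number remains); illustrative Python, not actual parser code:

import re

DEL = '\x7f'

def normalize_markers(out):
    # collapse runs of 3+ DELs: an even-length run becomes 2 DELs, an odd one becomes 1
    def collapse(m):
        return DEL * (2 if len(m.group(0)) % 2 == 0 else 1)
    out = re.sub(DEL + '{3,}', collapse, out)
    if out.count(DEL) % 2:
        out += DEL                     # append one DEL so markers stay paired
    return out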