padleft and padright do not handle multibyte characters properly
Closed, ResolvedPublic


Author: rememberthedot

Proposed patch v1

The built-in parser functions "padleft" and "padright" do not handle multibyte characters properly. For example, including {{padright:Hello-|7|Æ}} anywhere on a page blanks the entire page. Including multibyte characters in the first parameter produces unexpected results, for example:

{{padleft:Æ|5}} = 000Æ when 0000Æ was expected
{{padleft:本|5}} = 00本 when 0000本 was expected

These problems severely reduce the usefulness of the padleft and padright functions because there is no guarantee that the actual output will even remotely resemble the expected output. The output could have the wrong amount of padding, or blank the page entirely.

I'm including a patch I wrote that cleans up MediaWiki's pad function and gives it proper support for multibyte characters. It also adds a couple test cases to parserTests.txt.

Version: 1.14.x
Severity: major

Attached: Proposed_patch_v1


bzimport set Reference to bz16852.
bzimport added a subscriber: Unknown Object (MLST).
demon added a comment.Jan 2 2009, 1:31 PM
  • Bug 12324 has been marked as a duplicate of this bug. ***

ayg wrote:

+ $lengthOfPadding = mb_strlen( $padding );
+ if ( $lengthOfPadding == 0 ) return $string;
+ #the remaining length to add counts down to 0 as padding is added
+ $length = min( $length, 500 ) - mb_strlen( $string );
+ $finalPadding = ''; # $finalPadding is just $padding repeated enough times so that strlen( $string ) + strlen ( $finalPadding ) == $length
+ while ( $length > 0 ) {
+ $finalPadding .= mb_substr( $padding, 0, $length );
+ $length -= $lengthOfPadding;
+ }

This seems pretty elaborate. Why don't you just do:

		$char = mb_substr( $padding, 0, 1 );
		if ( strlen( $padding ) == 0 ) return $string;
		$length = min( $length, 500 ) - mb_strlen( $string );
		$finalPadding = str_repeat( $char, $length );

I don't understand a lot of what you've written. Like:

+ $lengthOfPadding = mb_strlen( $padding );

Why do you care? You're only padding with the first character anyway, aren't you?

+ $finalPadding .= mb_substr( $padding, 0, $length );

Isn't $length usually going to be *longer* than $padding? And again, why do you care about more than the first character? Current code only uses the first character of the padding and simply discards the rest -- if you're changing that (other than by using only the first *Unicode* character), you should say so, and say why. You should especially address the question of what's expected to happen if $length isn't a multiple of $lengthOfPadding.

rememberthedot wrote:

All right, those are good comments. I tried to make the pad function behave like PHP's str_pad function as much as possible. That included adding the ability for $padding to be more than one character. If the padding does not evenly fill the desired length, it is truncated. For example, {{padright:abc|8|def}} = abcdefde

You're right that I should have explained this better earlier. I was hoping to make padleft and padright just as flexible as PHP's str_pad, so that we can design more powerful wikitemplates than is currently possible. Since padleft and padright currently break with Unicode input, we really can't design reliable templates at all. If support for multibyte characters is to be added, and I feel that it should be, it would be a good idea to add it at the same time as we fix the glaring bugs. This will help prevent backwards-compatibility problems - the new functionality will be added at the same time that this function becomes reliable enough for general use.

ayg wrote:

I wouldn't consider dropping non-initial padding characters to be a bug, more like an omission. But in that case I see what you're doing here, the patch makes sense now. Committed as r45734.

rememberthedot wrote:

I agree. Thank you!

Add Comment