
"Corrupted"/truncated text on deletion log entry in Recent Changes
Closed, Resolved (Public)

Description

While viewing Recent Changes, I recently noticed that the text for deleted articles (and also
for user creation log entries) is sometimes truncated, and all the text following it on the page
becomes italicized. An example of such corrupted text (taken from ka: on 30 March at 12:01 UTC):

(წაშლილთა სია); 13:01 . . Zangala (განხილვა | წვლილი | ბლოკირება) (წაშლილია "კატეგორია:დაუსრულებელი
სტატიები ბიოლოგია": შინაარსი იყო: 'კატეგორია:ბიოლოგია[[კატე᩼/span>

As you can see, the text was cut at [[კატე... (a link from the deleted page's contents),
and then there is that square. I believe it results from the first byte of a truncated
multi-byte Unicode character in the text being combined with the '<' character that begins
the following </span> tag.

As mentioned above, this also happens on user creation log entries in Recent Changes when the
text is somewhat longer (for example, with a longer username).


Version: unspecified
Severity: normal
URL: http://ka.wikipedia.org/wiki/special:Log

Details

Reference
bz5401

Event Timeline

bzimport raised the priority of this task to Medium. (Nov 21 2014, 9:10 PM)
bzimport set Reference to bz5401.
bzimport added a subscriber: Unknown Object (MLST).

(წაშლილთა სია); 13:01 . . Zangala (განხილვა | წვლილი | ბლოკირება) (წაშლილია "კატეგორია:დაუსრულებელი
სტატიები ბიოლოგია": შინაარსი იყო: 'კატეგორია:ბიოლოგია[[კატე᩼/span>

can be translated as

(deletion log); 13:01 . . Zangala (talk | contribs | block) (deleted "Category:Unfinished Biology articles":
content was: 'Category:Biology[[Cate᩼/span>

Wiki.Melancholie wrote:

*** Bug 2386 has been marked as a duplicate of this bug. ***

Wiki.Melancholie wrote:

Oops, bug 2386 is not a duplicate; but see bug 2386 comment #2!

gangleri wrote:

Hallo Malafaya!

I changed the URL to [[ka:special:Log]] because I assume that
http://ka.wikimedia.org/ does not exist yet. I also changed "Component" to
"Internationalization" because this seems more appropriate than "Categories".

Please consistently use {{ns:special}}, {{ns:project}}, {{ns:user}} etc. in the
*translations* / *localizations* at [[ka:special:Allmessages]]. This would make
it easier to verify your setup.

If the problem is still present, I would suggest that you view the page's
source code in your browser, copy it, and attach it with the type "HTML
source (text/html)".

Good luck and best regards reinhardt [[user:gangleri]]

gangleri wrote:

Hallo again!

Looking at
http://ka.wikipedia.org/w/index.php?title=special:Log&type=&user=Zangala
and searching for
კატეგორია:ბიოლოგია
I found some entries (with timestamps other than those mentioned, probably because of your/my
different settings in [[ka:special:Preferences]] > 'Date and time').

I assume that the background of your request is the 'discrepancy' between what
you enter in the "Summary"-field, "Reason for deletion"-field, "Reason for
move"-field, "Reason for protection"-field etc. and what you get.

These fields have a limited size. I suppose that the GUI (browser, Java,
MediaWiki software) counts characters but does *not* take into account the *final*
length once the characters are UTF-8 encoded in the database. Truncation happens later,
somewhere in the MediaWiki software.

You will / might see less depending on how long the comments etc. are and how
many UTF-8 characters (requiring two or three bytes) you are using.
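
To illustrate the gap between character count and stored size, here is a minimal Python sketch (not MediaWiki code; the helper name `utf8_length` and the sample strings are chosen only for illustration). It compares the number of characters in a comment with the number of bytes it occupies once UTF-8 encoded:

```python
# Illustration only: character count versus UTF-8 byte length.
samples = {
    "ascii": "Category:Biology",
    "georgian": "კატეგორია:ბიოლოგია",
}


def utf8_length(text: str) -> int:
    """Number of bytes the text occupies once UTF-8 encoded."""
    return len(text.encode("utf-8"))


for name, text in samples.items():
    # Georgian letters take three bytes each in UTF-8; ASCII takes one.
    print(name, "characters:", len(text), "bytes:", utf8_length(text))
```

A field limit expressed in characters therefore allows far fewer Georgian characters than ASCII ones when the underlying storage limit is in bytes.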

I would say this is "behaviour as today", and one should find out which existing
bug report this duplicates.

There are several ways to fix this:
a) The length limit on the field should account for the final encoded size
stored in the database.
b) "Preview" / "Confirm" should warn about truncation; this could lead to
multiple posts instead of one, which can irritate contributors.
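
Option a) could be sketched as follows in Python. This is an assumption-laden illustration, not the MediaWiki fix: the function name `truncate_utf8` and the `max_bytes` parameter are hypothetical, and the real field limit is not stated in this report. The idea is to enforce the limit in bytes while only ever cutting at a character boundary:

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    if max_bytes <= 0:
        return ""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The byte slice may end partway through a character; errors="ignore"
    # drops that incomplete trailing sequence instead of keeping stray bytes.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


shortened = truncate_utf8("[[კატე", 9)
print(shortened, len(shortened.encode("utf-8")))
```

The result can be shorter than `max_bytes` by up to three bytes, which is the price of never storing half a character.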

best regards reinhardt [[user:gangleri]]

Hi reinhardt.

The problem here is not the truncation per se. Truncating a character longer than
one byte in its middle (say, a 3-byte UTF-8 character that gets cut IN THE DATABASE
after its 1st byte) causes a strange character to be output.
Browsers like IE, which don't take into account that truncation of UTF-8 characters
may occur, then fail to render the page properly.
Nikerabbit has investigated this problem and Brion says it's a known issue.

Please note that I'm not talking about truncation of the contents (which happens on
every wiki, even in English) but about truncation within the last character's
bytes (which only happens if there are UTF-8 characters longer than one byte, as
in many Asian languages).
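
The mid-character cut described here can be reproduced outside MediaWiki. The Python sketch below is only an illustration: the helper `ends_mid_character` is a hypothetical name, and the input is assumed to be valid UTF-8 apart from a possibly incomplete final character. It cuts an encoded string after the first byte of its last character and shows that the result is no longer valid UTF-8:

```python
import codecs


def ends_mid_character(data: bytes) -> bool:
    """Return True if data stops partway through a multi-byte UTF-8 character."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    # With final=False the decoder buffers an incomplete trailing sequence
    # instead of raising, so leftover bytes signal a cut inside a character.
    decoder.decode(data, final=False)
    pending, _ = decoder.getstate()
    return len(pending) > 0


comment = "[[კატე".encode("utf-8")  # 2 ASCII bytes + 4 letters of 3 bytes each
cut = comment[:-2]                   # keeps only the first byte of the last letter

print(ends_mid_character(comment))
print(ends_mid_character(cut))

try:
    cut.decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err.reason)
```

A strict decoder rejects the cut string outright; a lenient renderer has to guess what the stray lead byte means, which matches the garbled output in the report.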

Gangleri, please stop adding comments to this bug; the problem is well known and
understood, and the fix is forthcoming. :)

robchur wrote:

*** This bug has been marked as a duplicate of 332 ***