
HTML Entity (Hex) Encoding Problem
Closed, Declined (Public)

Description

Some characters are shown as ? on the generated wiki pages. For example:

&#xFC; is converted to the letter ü correctly, but &#x15F; is not converted to ş as it should be. Instead, a ? is shown.

Reference: http://www.fileformat.info/info/unicode/char/search.htm?q=%26%23x15F&preview=entity

Versions

Html2Wiki-REL1_30-2cebb1f.tar
MediaWiki v1.30
Apache2 v2.4.18
PHP v7.2.2
Ubuntu 16.04
Mysql v15.1
Wiki Default Charset Encoding = tr_TR

LocalSettings.php

$wgShellLocale = "C.UTF-8";
$wgLanguageCode = "tr";

All other extensions

# Enabled extensions. Most of the extensions are enabled by adding
# wfLoadExtensions('ExtensionName');
# to LocalSettings.php. Check specific extension documentation for more details.
# The following extensions were automatically enabled:
wfLoadExtension( 'Cite' );
wfLoadExtension( 'CiteThisPage' );
wfLoadExtension( 'ConfirmEdit' );
wfLoadExtension( 'Gadgets' );
wfLoadExtension( 'ImageMap' );
wfLoadExtension( 'InputBox' );
wfLoadExtension( 'Interwiki' );
wfLoadExtension( 'LocalisationUpdate' );
wfLoadExtension( 'Nuke' );
wfLoadExtension( 'ParserFunctions' );
wfLoadExtension( 'PdfHandler' );
wfLoadExtension( 'Poem' );
wfLoadExtension( 'Renameuser' );
wfLoadExtension( 'SpamBlacklist' );
wfLoadExtension( 'SyntaxHighlight_GeSHi' );
wfLoadExtension( 'TitleBlacklist' );
wfLoadExtension( 'WikiEditor' );
wfLoadExtension( 'Html2Wiki' );
$wgNamespacesWithSubpages[NS_MAIN] = true;
wfLoadExtension( 'Nuke' );

Steps to reproduce

  1. Upload a ZIP file on the Special:Html2Wiki page. The ZIP contains several HTML files in which non-ASCII characters are written as hexadecimal HTML entities.

Example input (part of the HTML code)

<span lang="TR" style="font-size:11.0pt;line-height:115%;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;">Allah
i&#xE7;in sevmek ne demek diye d&#xFC;&#x15F;&#xFC;nm&#xFC;&#x15F;t&#xFC;m yak&#x131;nlarda kendi d&#xFC;nyamda.
Eskiden Allah i&#xE7;in sevmeyi beraber &rdquo;dini&rdquo; bir&#x15F;eyler yapt&#x131;&#x11F;&#x131;m&#x131;z arkada&#x15F;lar&#x131;
sevmek diye d&#xFC;&#x15F;&#xFC;n&#xFC;rd&#xFC;m. Bazen Allah i&#xE7;in sevmenin kar&#x15F;&#x131;l&#x131;ks&#x131;z sevmek olarak
yorumlad&#x131;&#x11F;&#x131;n&#x131; da g&#xF6;r&#xFC;yorum. Bug&#xFC;nlerde Allah i&#xE7;in sevmenin &#xE7;ok daha farkl&#x131;
oldu&#x11F;unu anl&#x131;yorum. Kendi i&#xE7;imde yapt&#x131;&#x11F;&#x131;m muhasebeyi sizlerle de payla&#x15F;mak
istiyorum.&nbsp;</span>
  2. Go to one of the generated wiki pages.
  3. Some characters are displayed as ?.

Example output (part of the HTML code)

Allah için sevmek ne demek diye dü?ünmü?tüm yak?nlarda kendi dünyamda. Eskiden Allah için sevmeyi beraber ?dini? bir?eyler yapt???m?z arkada?lar? sevmek diye dü?ünürdüm. Bazen Allah için sevmenin kar??l?ks?z sevmek olarak yorumlad???n? da görüyorum. Bugünlerde Allah için sevmenin çok daha farkl? oldu?unu anl?yorum. Kendi içimde yapt???m muhasebeyi sizlerle de payla?mak istiyorum.

Expected output (part of the HTML code)

Allah için sevmek ne demek diye düşünmüştüm yakınlarda kendi dünyamda. Eskiden Allah için sevmeyi beraber ”dini” birşeyler yaptığımız arkadaşları sevmek diye düşünürdüm. Bazen Allah için sevmenin karşılıksız sevmek olarak yorumladığını da görüyorum. Bugünlerde Allah için sevmenin çok daha farklı olduğunu anlıyorum. Kendi içimde yaptığım muhasebeyi sizlerle de paylaşmak istiyorum.

The problem occurs every time I convert HTML to wiki pages, regardless of which HTML file I upload. It also occurs when I upload a single HTML file.
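The pattern of question marks in the example output is consistent with the text passing through a single-byte, Latin-1-like encoding at some point after the entities are decoded: ü, ç and ö exist in ISO-8859-1 and survive, while ş, ı, ğ and the curly quote do not. The Python sketch below reproduces that pattern under this assumption only. It does not show where in Html2Wiki, PHP, or the database such a conversion would happen, and the function name `simulate_latin1_loss` is made up for illustration.

```python
import html


def simulate_latin1_loss(fragment: str) -> str:
    """Decode HTML entities, then force the text through ISO-8859-1.

    Characters without an ISO-8859-1 code point are replaced by "?".
    This models an assumed failure mode, not a confirmed diagnosis.
    """
    decoded = html.unescape(fragment)
    return decoded.encode("latin-1", errors="replace").decode("latin-1")


sample = "d&#xFC;&#x15F;&#xFC;nm&#xFC;&#x15F;t&#xFC;m yak&#x131;nlarda"

expected = html.unescape(sample)         # "düşünmüştüm yakınlarda"
observed = simulate_latin1_loss(sample)  # "dü?ünmü?tüm yak?nlarda"
```

The simulated result matches the corresponding words in the example output above, including the `&rdquo;` quotes turning into ?.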

Event Timeline

Hi @Bisherbas, thanks for taking the time to report this!
Unfortunately this report lacks some information. If you have time and can still reproduce the problem: Please add a more complete description to this report (MediaWiki version information, extension version information, database backend version and default charset encoding, etc). You can edit the task description by clicking Edit Task.
Ideally, exact and clear steps to reproduce should allow any other person to follow these steps (without having to interpret those steps) and see the same results. Problems that others can reliably reproduce can get fixed faster. Thanks!

Bisherbas updated the task description.

@Aklapper Thanks! Hope I have been thorough enough. If not please do let me know what I have missed.

Which character set and collation is your MySQL database set to?
Which character set is defined in the meta data in the HTML pages to import, if any?

I am using the default MySQL settings (I am not very familiar with it). I ran the commands from https://stackoverflow.com/questions/1049728/how-do-i-see-what-character-set-a-mysql-database-table-column-is and MySQL returned: Empty set (0.00 sec)

Entire HTML code (No charset defined)

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Allah i&#xE7;in sevmek</title>
</head>
<body>

<h1>Allah i&#xE7;in sevmek</h1>
<hr>
<ul>
<li><em>Subject</em>: Allah i&#xE7;in sevmek</li>
<li><em>From</em>: b.c. &lt;b...@gmail.com&gt;</li>
<li><em>Date</em>: Tue, 9 Oct 2012 06:18:35 +0300</li>
</ul>
<div><span lang="TR" style="font-size:11.0pt;line-height:115%;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;">esselam&#xFC; aleyk&#xFC;m,</span></div><div><span lang="TR" style="font-size:11.0pt;line-height:115%;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;"><br>



</span></div><span lang="TR" style="font-size:11.0pt;line-height:115%;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;">Allah

i&#xE7;in sevmek ne demek diye d&#xFC;&#x15F;&#xFC;nm&#xFC;&#x15F;t&#xFC;m yak&#x131;nlarda kendi d&#xFC;nyamda.

Eskiden Allah i&#xE7;in sevmeyi beraber &rdquo;dini&rdquo; bir&#x15F;eyler yapt&#x131;&#x11F;&#x131;m&#x131;z arkada&#x15F;lar&#x131;

sevmek diye d&#xFC;&#x15F;&#xFC;n&#xFC;rd&#xFC;m. Bazen Allah i&#xE7;in sevmenin kar&#x15F;&#x131;l&#x131;ks&#x131;z sevmek olarak

yorumlad&#x131;&#x11F;&#x131;n&#x131; da g&#xF6;r&#xFC;yorum. Bug&#xFC;nlerde Allah i&#xE7;in sevmenin &#xE7;ok daha farkl&#x131;

oldu&#x11F;unu anl&#x131;yorum. Kendi i&#xE7;imde yapt&#x131;&#x11F;&#x131;m muhasebeyi sizlerle de payla&#x15F;mak

istiyorum.&nbsp;</span>
</body>
</html>
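To check a batch of files for a declared charset before uploading, something like the following Python sketch could be used. It relies only on the standard library's `html.parser`. The class and function names are made up here, and it only looks at `<meta>` tags, not at HTTP headers or byte-order marks.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Record the first charset declared in a <meta> tag, if any."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        # HTML5 form: <meta charset="utf-8">
        if attributes.get("charset"):
            self.charset = attributes["charset"].strip()
            return
        # HTML 4 form: <meta http-equiv="Content-Type" content="...; charset=...">
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            for part in (attributes.get("content") or "").split(";"):
                key, _, value = part.strip().partition("=")
                if key.lower() == "charset" and value:
                    self.charset = value.strip()
                    return


def declared_charset(html_text: str):
    """Return the declared charset of an HTML document, or None."""
    finder = CharsetFinder()
    finder.feed(html_text)
    return finder.charset
```

For the document above, which has no `<meta>` tag at all, `declared_charset` returns `None`.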

Please find out what your "default setting" is. I currently believe this is a misconfiguration issue (throwing bytes to be interpreted as charset X into a database while the database is configured to interpret those bytes as charset Y) and not a software bug. :)
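For illustration, the kind of mismatch described above can be reproduced in a few lines of Python. This is a generic sketch, not a reproduction of this wiki's setup: the choice of UTF-8 as the stored charset and Windows-1252 as the charset used to read the bytes back is an assumption, and `misread` is a hypothetical helper.

```python
def misread(text: str, stored_as: str = "utf-8", read_as: str = "cp1252") -> str:
    """Encode text in one charset and decode the same bytes in another."""
    return text.encode(stored_as).decode(read_as, errors="replace")


garbled_s = misread("ş")  # "ÅŸ"  (bytes C5 9F read as two cp1252 characters)
garbled_u = misread("ü")  # "Ã¼"  (bytes C3 BC read as two cp1252 characters)
```

With this particular pair of charsets, every non-ASCII character is affected, including ü, and each one turns into two characters rather than a single ?. Other charset pairs behave differently.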

OK, I have modified my.cnf as follows:

[client]
default-character-set=utf8
[mysqld]
character-set-server = utf8

and the mysql output shows

MariaDB [(none)]> SHOW VARIABLES LIKE 'character_set%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | utf8mb4                    |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

The problem still persists after these changes: I restarted the server and re-uploaded the same HTML file from scratch after updating the MySQL settings.
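The variable listing above mixes utf8 and utf8mb4. In MySQL and MariaDB, utf8 is the three-byte subset of UTF-8 (characters up to U+FFFF), while utf8mb4 covers the full range. A small Python check, offered as a sketch with a hypothetical helper name, shows that the Turkish letters in the sample need two bytes each and the curly quote needs three, so this particular difference would not be expected to affect them.

```python
def fits_in_three_byte_utf8(text: str) -> bool:
    """Return True if every character encodes to at most 3 bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


sample = "düşünmüştüm yakınlarda ”dini” yaptığımız"

# UTF-8 byte width of each character that matters in this report.
widths = {ch: len(ch.encode("utf-8")) for ch in "üşığ”"}
# widths == {"ü": 2, "ş": 2, "ı": 2, "ğ": 2, "”": 3}

sample_fits = fits_in_three_byte_utf8(sample)  # True
```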

> OK, I have modified my.cnf as follows

What were the previous settings?

> Problem still persists after these changes.

To avoid misunderstandings: You deleted the imported data and re-uploaded the ZIP file on Special:Html2Wiki and after that, the problem still exists?

That's right. In fact, for troubleshooting purposes, I am working with a single HTML file for now. And it is the one I posted above.

The previous my.cnf did not define any charset. It was simply blank.

I can confirm that the charset of the entire MediaWiki database is utf8. The MySQL output of

SELECT default_character_set_name FROM information_schema.SCHEMATA S WHERE schema_name = "wikidb";

is

+----------------------------+
| default_character_set_name |
+----------------------------+
| utf8                       |
+----------------------------+

I am confident that mysql is now using the my.cnf with the following content:

[client-server]
!includedir /etc/mysql/conf.d/
!includedir /etc/mysql/mariadb.conf.d/
[client]
default-character-set=utf8
[mysqld]
character-set-server = utf8

I also made sure that all the fallback files contain exactly the same content as above.

bish@UBUNTU:~$ which mysqld
/usr/sbin/mysqld
bish@UBUNTU:~$ /usr/sbin/mysqld --verbose --help | grep -A 1 "Default options"
Default options are read from the following files in the given order:
/etc/my.cnf /etc/mysql/my.cnf ~/.my.cnf

I don't think it's a database configuration issue, because when I copy and paste the exact same HTML code into a new page manually, MediaWiki shows all characters properly. Don't you think? See the image below.

https://imgur.com/aiYSCFr