generateSitemap.php - Error: 1300 Invalid utf8 character string: 'F0A8AE'
Open, Low, Public

Description

Hi! I do not speak English. But I will try to describe the problem.

Two MediaWiki installations:

  1. MW_Shared - provides the shared user tables
  2. MW_Standalone - connects to the MW_Shared user tables

MW_Standalone LocalSettings.php:

$wgSharedDB = "MW_Shared";
$wgSharedTables[] = "ipblocks";
$wgSharedPrefix = "shared_";

When a user creates their user page on MW_Standalone (e.g., User:TestUser), generateSitemap.php shows an error:

/usr/bin/php /public_html/maintenance/generateSitemap.php --fspath /public_html/sitemap --server "https://domain.com" --urlpath "https://domain.com/sitemap" --identifier="domaincom"
0 ()
        /public_html/sitemap/sitemap-domaincom-NS_0-0.xml.gz
A database query error has occurred.
Query: SELECT  user_name,up_value  FROM `prj_shared`.`shared_user` LEFT JOIN `prj_shared`.`shared_user_properties` ON ((user_id = up_user) AND up_property = 'gender')  WHERE user_name = '𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁𨮁喃'
Function: GenderCache::doQuery/domaincomWikiTitleCodec::getNamespaceName
Error: 1300 Invalid utf8 character string: 'F0A8AE' (127.0.0.1)

If I delete the user's page, everything is fine.
MW ver. 1.25.1
MariaDB database collation: utf8_general_ci
PHP 5.6.11

Event Timeline

Base64.zion raised the priority of this task from to High.
Base64.zion updated the task description. (Show Details)
Base64.zion subscribed.
Base64.zion set Security to None.

MW ver. 1.25.1

Are you sure both databases/tables are using the same collation? And that MediaWiki knows it?

Yes. I'm an experienced administrator. Both databases use utf8_general_ci, and so do the tables.

Physikerwelt lowered the priority of this task from High to Low. May 27 2017, 6:18 PM

As @Reedy said, it seems to be a charset problem. You might want to tell MySQL that your user table is binary.

ALTER TABLE user CONVERT TO CHARACTER SET binary;

I was trying to see how it could be fixed for other char sets, but I could not find bug 17961 that was mentioned in https://phabricator.wikimedia.org/rMW056b1daada36d7807d5502cdbc4b1fddb546dcee

ALTER TABLE user CONVERT TO CHARACTER SET binary; helped me

I looked at this bug, and my first analysis is that, depending on the language (in particular the number of bytes in the translation of the namespace name "User"), the 64 UTF-8 characters are sometimes split in the middle of a character, so the last character becomes invalid (e.g. only the first 2 bytes of a 4-byte character).

If I understand the original code in 5188821115ff (written in 2005) correctly, there is no special need for non-ASCII characters, except fun :) So I replaced the 63 4-byte characters + 1 3-byte character with 255 1-byte characters, and it works.

diff --git a/maintenance/generateSitemap.php b/maintenance/generateSitemap.php
index 6060567bf2c..ff8fe0fddb4 100644
--- a/maintenance/generateSitemap.php
+++ b/maintenance/generateSitemap.php
@@ -549 +549 @@ class GenerateSitemap extends Maintenance {
-               $title = Title::makeTitle( $namespace, str_repeat( "\u{28B81}", 63 ) . "\u{5583}" );
+               $title = Title::makeTitle( $namespace, str_repeat( "a", 255 ) );
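
To make the hypothesis above concrete, here is a minimal sketch of how a byte-based cut can leave a malformed character behind. The helper isValidUtf8() and the two cut lengths are assumptions chosen for illustration; they are not taken from MediaWiki, and the thread does not confirm where (or whether) such a cut happens.

```php
<?php
// Placeholder title text from the diff above: 63 copies of U+28B81
// (4 bytes each in UTF-8) followed by one U+5583 (3 bytes).
$text = str_repeat( "\u{28B81}", 63 ) . "\u{5583}";

// Hypothetical helper (not part of MediaWiki): true if $s is well-formed UTF-8.
// preg_match() with the /u modifier returns false on malformed input.
function isValidUtf8( string $s ): bool {
	return preg_match( '//u', $s ) === 1;
}

$bytes = strlen( $text );                  // 63 * 4 + 3 = 255 bytes
$chars = preg_match_all( '/./su', $text ); // 64 characters

// Assumed truncation points, chosen only for illustration.
$cutOnBoundary = substr( $text, 0, 252 );  // exactly 63 whole characters
$cutMidChar    = substr( $text, 0, 250 );  // 62 characters + 2 stray bytes

printf( "%d bytes, %d characters\n", $bytes, $chars );
var_dump( isValidUtf8( $cutOnBoundary ) ); // well-formed
var_dump( isValidUtf8( $cutMidChar ) );    // malformed
```

A cut at a multiple of 4 bytes stays valid, while any other cut inside the run of 4-byte characters leaves a partial sequence.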

@Seb35: Thanks for investigating this! Would you like to propose that as a patch in Gerrit?

Change 661767 had a related patch set uploaded (by Seb35; owner: Seb35):
[mediawiki/core@master] Fix edge case in maintenance/generateSitemap.php

https://gerrit.wikimedia.org/r/661767

Here it is! With a small comment in the commit message to explain the issue.

We just had a wiki run into this same issue. Has this been merged into core yet?

No. The last activity on the patch was on April 30th, 2021.

Change 808009 had a related patch set uploaded (by PleaseStand; author: PleaseStand):

[mediawiki/core@master] generateSitemap.php: Fix a couple limit checking bugs

https://gerrit.wikimedia.org/r/808009

I think this problem is something you would only run into if the wiki has gendered User namespace aliases (which causes the database query to happen), and the character set of the user_name column is the old 3-byte "utf8" (in which U+28B81 would be out of range).
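
As a rough check of this explanation, the sketch below prints the UTF-8 bytes of the two placeholder characters and tests whether a string contains a character outside the 3-byte range. The helper hasFourByteChar() is hypothetical (not a MediaWiki function) and is only meant to illustrate the point.

```php
<?php
// Hypothetical helper (not a MediaWiki function): true if $s contains a code
// point above U+FFFF, i.e. one that needs 4 bytes in UTF-8 and therefore
// does not fit in a MySQL/MariaDB 3-byte "utf8" column.
function hasFourByteChar( string $s ): bool {
	return preg_match( '/[\x{10000}-\x{10FFFF}]/u', $s ) === 1;
}

$filler = "\u{28B81}"; // repeated 63 times in the placeholder title
$last   = "\u{5583}";  // final character of the placeholder title

// The first three bytes of U+28B81 match the 'F0A8AE' in the error message.
echo strtoupper( bin2hex( $filler ) ), "\n"; // F0A8AE81
echo strtoupper( bin2hex( $last ) ), "\n";   // E59683

var_dump( hasFourByteChar( $filler ) ); // bool(true)
var_dump( hasFourByteChar( $last ) );   // bool(false)
```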

Percent-encoding makes the URL with the non-ASCII characters much longer, so using a string of 255 a's would break the size check. Another option would be to instead replace the 4-byte characters with twice the number of 2-byte characters. The check may still be flawed though because of the gendered User namespace issue (aliases may be of different byte lengths), and there are better ways to implement such a size check. The patch set I uploaded (see above) is one of them.
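
The length difference is easy to see with a small sketch. It uses PHP's rawurlencode() as a stand-in for MediaWiki's own URL encoding, which is an assumption made only for illustration; the variable names are likewise hypothetical.

```php
<?php
// The two candidate placeholder titles discussed in this thread.
$nonAscii = str_repeat( "\u{28B81}", 63 ) . "\u{5583}"; // 255 bytes of non-ASCII text
$ascii    = str_repeat( 'a', 255 );                     // 255 bytes of ASCII text

// rawurlencode() stands in for MediaWiki's URL encoding here (assumption).
// Every non-ASCII byte becomes a three-character %XX escape.
$nonAsciiUrlLength = strlen( rawurlencode( $nonAscii ) ); // 255 * 3 = 765
$asciiUrlLength    = strlen( rawurlencode( $ascii ) );    // 255, letters are not escaped

printf( "non-ASCII: %d, ASCII: %d\n", $nonAsciiUrlLength, $asciiUrlLength );
```

Both titles are 255 bytes long before encoding, but the non-ASCII one is three times as long in a URL, so the ASCII placeholder would underestimate the worst-case entry size.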

Today I re-checked this bug on current master but was not able to reproduce it (I cannot remember the language where there was a bug, and it seems it’s not written in the bug report).

Anyway your patch is better than the current code.

Seb35 removed Seb35 as the assignee of this task. Aug 16 2022, 7:12 PM

Change 661767 abandoned by Seb35:

[mediawiki/core@master] Fix edge case in maintenance/generateSitemap.php

Reason:

Some issues with this patch and in favor of I26d592c2f1fe78d0c504a35f615375630969f44b

https://gerrit.wikimedia.org/r/661767