Page MenuHomePhabricator

Unable to view page information on [[Difracció]] article at ca.wikipedia.org
Closed, ResolvedPublicPRODUCTION ERROR

Description

This follows-up from T388924: PHP Deprecated: preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated which masked a PHP 8.1 warning about bad input being passed between Parser internals (it seems a string somewhere is turning into a null, but isn't meant to be. Once this is fixed, we should probably revert that in favour of native typehints).

That same request actualy emitted two production errors, one of which is a fatal error that affects PHP 7.4 and PHP 8.1 both, and remains reproducible today even with the deprecation warning resolved.

https://ca.wikipedia.org/w/index.php?title=Difracció&action=info

From T388924:

PHP Deprecated: preg_replace(): Passing null to parameter #3 ($subject) of type array|string is deprecated

at MediaWiki\Parser\MagicWord->matchStartAndRemove(null)
at …
at MediaWiki\Message\Message->parseText(string)
at MediaWiki\Message\Message->__toString()
at InfoAction->onView()
RuntimeException: PCRE failure

at MediaWiki\Parser\Parser->handleExternalLinks(string)
at …
at MediaWiki\Message\Message->__toString()
at InfoAction->onView()

The same stack trace is the same as these tasks:

Event Timeline

Restricted Application changed the subtype of this task from "Task" to "Production Error". · View Herald TranscriptMar 20 2025, 9:48 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

Change #1128051 merged by jenkins-bot:

[mediawiki/core@master] MagicWord::replace*: Make sure we don't pass null into preg_match/preg_replace

https://gerrit.wikimedia.org/r/1128051

I've reverted the above patch on mwdebug1002 and tried to reproduce the error at the original URL:
https://ca.wikipedia.org/w/index.php?title=Difracció&action=info

And added the following instrumentation:

--- includes/actions/InfoAction.php
+++ includes/actions/InfoAction.php
    foreach ( $pageInfo as $header => $infoTable ) {
+       trigger_error( 'Ad-Hoc krinkle ' . $header );
---
+++ includes/parser/Parser.php
--- a/includes/parser/Parser.php
+++ b/includes/parser/Parser.php
@@ -2168,6 +2168,7 @@ class Parser {
    private function handleExternalLinks( $text ) {
        $bits = preg_split( $this->mExtLinkBracketedRegex, $text, -1, PREG_SPLIT_DELIM_CAPTURE );
        if ( $bits === false ) {
+           trigger_error( 'Ad-Hoc krinkle ' . substr( $text, 0, 100 ) );
            throw new RuntimeException( "PCRE failure" );
        }

Which yields:

  • PHP Notice: Ad-Hoc krinkle header-basic
  • PHP Notice: Ad-Hoc krinkle header-restrictions
  • PHP Notice: Ad-Hoc krinkle header-edits
  • PHP Notice: Ad-Hoc krinkle (pageinfo-firstuser: Pl�cid P�rez Bru)
  • RuntimeException: PCRE failure

And https://ca.wikipedia.org/wiki/MediaWiki:Pageinfo-firstuser contains:

{{GENDER:$1|Creador|Creadora}} de la pàgina

Which is passed:

mediawiki/includees/actions/InfoAction.php
		$firstRev = $this->revisionLookup->getFirstRevision( $this->getTitle() );
		$firstRevUserName = $firstRevUser ? $firstRevUser->getName() : '[HIDDEN]';
		$pageInfo['header-edits'][] = [
			$this->msg( 'pageinfo-firstuser', $firstRevUserName ),
			Linker::revUserTools( $firstRev )
		];

So it seems that the first revision of this ca.wikipedia.org article, has an author whose username string is somehow corrupt and thus throws off PCRE.

https://ca.wikipedia.org/w/index.php?title=Difracci%C3%B3&action=history&dir=prev&limit=1

The first revision of this page is not displayed on the history page, presumably due to the same underying corruption.

From Quarry:

SELECT FROM revision WHERE rev_id = 39311;
-- rev_actor: 342864
-- rev_timestamp: 2003-11-20 17:50:48

SELECT * FROM actor WHERE actor_id = 342864; 
-- actor_id: 42864
-- actor_user:
-- actor_name: Pl??cid P??rez Bru

Screenshot 2025-03-20 at 14.41.19.png (180×2 px, 22 KB)

The lack of user ID suggests this was an imported revision, probably from the pre-MediaWiki days?

SELECT COUNT(*)
FROM revision
WHERE rev_actor = 342864
ORDER BY rev_timestamp ASC;

--- 1

SELECT * FROM actor
WHERE actor_name = (SELECT actor_name FROM actor WHERE actor_id=342864 LIMIT 1);

-- 1 row

The author did not have any other edits (neither by actor ID, nor by name).

The question here is basically: How do we want to address corrupted/invalid UTF-8 strings that made it into tables like actor.actor_name.

Options:

  1. Address them at the high-level and last-mile by requiring every call site that performs a string or regex operation to check and normalize UTF-8, and for each of thes paths to decide and be responsible for having an exit path that does not involve a fatal error. That is, they need to normalize it, or support end-to-end in their feature a way to propagate and reserve an invalid name (e.g. by empty string or some other placeholder).
    • This seems most-expensive and would not be what I recommend. It means we will spent the next twenty years playing whack-a-mole as we go in circles with cleaning up code, simplifying it, re-introducing issues that we only observe in production, report them as production errors that end up deprioritized and meanwhile these pages remain inaccessible to end-users and spam our logs.
  1. Address them at the low-level straight out of the database. E.g. classes like RevisionStore or ActorStore can be made responsible for normalizing any corrupt/bad data at runtime and thus tolerate. This is much cheaper than the first option, but by supporting this here, we'd mask/tolerate it for all data, and likely miss detection of genuine regressions. It'd also add overhead to normalize this.
  1. Modify the source data in these rows on a case-by-case basis (i.e. come up with a more reasonable representation and manually fix the data).
    • This would essentially backport and retroactively apply the guruantees we uphold today. This would introduce the least ceveats and downstream burden on developers, and makes it impossible for these errors to come up again in the future. The downside is that it's more manual work to find these rows, decide how to represent them, and fix them. There is also a risk of doing this badly, and thus corrupting live data. We'd want to make sure to export any mass-changed rows to a restorable binary representation on disk first. Or (more likely) not do any mass-changes in the first place.
SELECT actor_id, actor_user, actor_name, HEX(actor_name) FROM actor
WHERE actor_id=342864;
506CE06369642050E972657A20427275
'506CE06369642050E972657A20427275'.replace(/(..)/g, '\\u00$1')
"\\u0050\\u006C\\u00E0\\u0063\\u0069\\u0064\\u0020\\u0050\\u00E9\\u0072\\u0065\\u007A\\u0020\\u0042\\u0072\\u0075"
JSON.parse('"' + '506CE06369642050E972657A20427275'.replace(/(..)/g, '\\u00$1') + '"')
"Plàcid Pérez Bru"

That seems like a plausible rendering. It is a valid Catalan phrase, too, and corresponds to an account that was active between 2003 and 2013 on ca.wikipedia.org:

https://ca.wikipedia.org/wiki/Usuari:Plàcid_Pérez_Bru

However, this could I suppose also have been an impersonation attempt and corrupted for other reasons. So rather than attach the edit there, I suggest we keep it as an unregistered edit (presumed imported), but normalize it to the modern format for imported usernames, by using the imported> prefix. Keeping it as unregistered edit but with fixed encoding and no user ID and no prefix would be confusing, I think, since that would still link to the above user's contributions etc. It is only in the database that we see the lack of user ID association.

Mentioned in SAL (#wikimedia-operations) [2025-03-21T07:54:25Z] <Krinkle> krinkle@mwmaint: Fix actor_name encoding on cawiki for 1 row: actor_id=342864, per T389559

Krinkle claimed this task.
Read-only prep
[07:38 UTC] krinkle at mwmaint1002.eqiad.wmnet in ~
$ sql cawiki;

(cawiki)> SELECT actor_id, actor_user, actor_name, HEX(actor_name) FROM actor WHERE actor_id=342864;
+----------+------------+------------------+----------------------------------+
| actor_id | actor_user | actor_name       | HEX(actor_name)                  |
+----------+------------+------------------+----------------------------------+
|   342864 |       NULL | Pl?cid Prez Bru  | 506CE06369642050E972657A20427275 |
+----------+------------+------------------+----------------------------------+
1 row in set (0.000 sec)

wikiadmin2023@10.64.32.139(cawiki)> SELECT user_id, user_name, HEX(user_name) FROM user WHERE user_id=9 LIMIT 1;
+---------+--------------------+--------------------------------------+
| user_id | user_name          | HEX(user_name)                       |
+---------+--------------------+--------------------------------------+
|       9 | Plàcid Pérez Bru   | 506CC3A06369642050C3A972657A20427275 |
+---------+--------------------+--------------------------------------+
1 row in set (0.000 sec)

(cawiki)> SELECT actor_name, HEX(actor_name) FROM actor WHERE actor_user=9;
+--------------------+--------------------------------------+
| actor_name         | HEX(actor_name)                      |
+--------------------+--------------------------------------+
| Plàcid Pérez Bru   | 506CC3A06369642050C3A972657A20427275 |
+--------------------+--------------------------------------+
1 row in set (0.000 sec)


(cawiki)> SELECT page_title, HEX(page_title) FROM page WHERE page_id=5700 LIMIT 1;
+--------------------+--------------------------------------+
| page_title         | HEX(page_title)                      |
+--------------------+--------------------------------------+
| Plàcid_Pérez_Bru   | 506CC3A06369645F50C3A972657A5F427275 |
+--------------------+--------------------------------------+
1 row in set (0.000 sec)
  • bad actor_name
    • 506CE06369642050E972657A20427275
    • 506CE__06369642050E9__72657A20427275
  • good actor_name
    • 506CC3A06369642050C3A972657A20427275
  • good user_name
    • 506CC3A06369642050C3A972657A20427275
  • good page_title
    • 506CC3A06369645F50C3A972657A5F427275 (note: underscores 5F, instead of spaces 20)
Database recovery
[07:48 UTC] krinkle at mwmaint1002.eqiad.wmnet in ~
$ sql cawiki --write
(cawiki)> UPDATE actor SET actor_name=CONCAT('imported>', UNHEX('506CC3A06369642050C3A972657A20427275')) WHERE actor_id=342864 LIMIT 1;
Query OK, 1 row affected
Rows matched: 1  Changed: 1

https://ca.wikipedia.org/w/index.php?title=Difracció&action=info is now accessible again.

a.png (1×1 px, 357 KB)

And https://ca.wikipedia.org/w/index.php?title=Difracci%C3%B3&action=history&dir=prev&limit=1 now shows the name instead of a hole in the layout:

Screenshot 2025-03-21 at 00.53.06.png (118×1 px, 40 KB)

For future reference, a generic conversation can be done using CONVERT() to interpret the data as latin1, and converting it again to utf8.

Read-only via Toolforge
[07:57 UTC] krinkle at tools-login
$ sql cawiki

MariaDB [cawiki_p]> SET @bad = UNHEX('506CE06369642050E972657A20427275');
Query OK, 0 rows affected (0.001 sec)

MariaDB [cawiki_p]> SELECT @bad;
+------------------+
| @bad             |
+------------------+
| Pl?cid P?rez Bru   |
+------------------+
1 row in set (0.001 sec)

MariaDB [cawiki_p]> SELECT CONVERT(@bad USING latin1), HEX(CONVERT(CONVERT(@bad USING latin1), CHAR CHARACTER SET utf8));
+----------------------------+-------------------------------------------------------------------+
| CONVERT(@bad USING latin1) | HEX(CONVERT(CONVERT(@bad USING latin1), CHAR CHARACTER SET utf8)) |
+----------------------------+-------------------------------------------------------------------+
| Plàcid Pérez Bru           | 506CC3A06369642050C3A972657A20427275                              |
+----------------------------+-------------------------------------------------------------------+
1 row in set (0.001 sec)

The resulting hex is the same as what I ended up setting above.

If you use this to update one or more rows in-place by referencing the column in its own value, that means the query is no longer indempotent, which isn't great for live hacks (e.g. if you accidentally run it twice, it corrupts it in another way instead of fixing it). Also, you do still need CONCAT() to insert the "imported>" prefix. Without that you'll likey get a unique constraint error (as expected), or produce a logically invalid entry (bad, missing user ID).

I suggest going through all usernames/actor names on all wikis calling mb_check_encoding on them and see what's left.

I would suggest using something like badutf8> rather than imported> - the fact that the edit was imported is a guess, not something you know.

I would suggest using something like badutf8> rather than imported> - the fact that the edit was imported is a guess, not something you know.

Whether it was an import, or a bug causing the revision to be stored in the format exclusively used for imports now (non-IP name with user_id=0) is a distinction without a difference. Introducing a new reserved prefix in the software that we have to forever keep and dance around seems more trouble than it is worth, but it is worth considering. I suggest discussing this further on the parent task.

Importing already can create prefixed usernames with arbitrary prefixes. For example there's a deleted edit attributed to "aug2001>Adsl-63-205-153-170.dsl.lsan03.pacbell.net" in the deleted history of "Human–computer_interaction/Temp" on enwiki. "imported>" is just a convention, not a hard assumption.