
Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them
Open, Needs Triage | Public | PRODUCTION ERROR

Description

When I try to display some of the oldest revisions on Esperanto Wikipedia, I am getting an error message:

Fatal exception of type "RuntimeException"

This applies to most revisions of "Main Page", listed at https://eo.wikipedia.org/w/index.php?title=Main_Page&action=history – in particular revisions 4062 through 4097 (though there may be more with the same issue).

Is the issue possibly related to the page texts missing from the database? If that is the case, is there any chance of uploading them if I manage to recover them from early tarballs (dumps)?


Error
normalized_message
[{reqId}] {exception_url}   RuntimeException: PCRE failure
exception.trace
from /srv/mediawiki/php-1.42.0-wmf.5/includes/parser/Parser.php(2119)
#0 /srv/mediawiki/php-1.42.0-wmf.5/includes/parser/Parser.php(1574): Parser->handleExternalLinks(string)
#1 /srv/mediawiki/php-1.42.0-wmf.5/includes/parser/Parser.php(651): Parser->internalParse(string)
#2 /srv/mediawiki/php-1.42.0-wmf.5/includes/content/WikitextContentHandler.php(420): Parser->parse(string, MediaWiki\Title\Title, ParserOptions, boolean, boolean, integer)
#3 /srv/mediawiki/php-1.42.0-wmf.5/includes/content/ContentHandler.php(1759): WikitextContentHandler->fillParserOutput(WikitextContent, MediaWiki\Content\Renderer\ContentParseParams, ParserOutput)
#4 /srv/mediawiki/php-1.42.0-wmf.5/includes/content/Renderer/ContentRenderer.php(47): ContentHandler->getParserOutput(WikitextContent, MediaWiki\Content\Renderer\ContentParseParams)
#5 /srv/mediawiki/php-1.42.0-wmf.5/includes/Revision/RenderedRevision.php(260): MediaWiki\Content\Renderer\ContentRenderer->getParserOutput(WikitextContent, MediaWiki\Title\Title, integer, ParserOptions, boolean)
#6 /srv/mediawiki/php-1.42.0-wmf.5/includes/Revision/RenderedRevision.php(232): MediaWiki\Revision\RenderedRevision->getSlotParserOutputUncached(WikitextContent, boolean)
#7 /srv/mediawiki/php-1.42.0-wmf.5/includes/Revision/RevisionRenderer.php(226): MediaWiki\Revision\RenderedRevision->getSlotParserOutput(string, array)
#8 /srv/mediawiki/php-1.42.0-wmf.5/includes/Revision/RevisionRenderer.php(164): MediaWiki\Revision\RevisionRenderer->combineSlotOutput(MediaWiki\Revision\RenderedRevision, ParserOptions, array)
#9 [internal function]: MediaWiki\Revision\RevisionRenderer->MediaWiki\Revision\{closure}(MediaWiki\Revision\RenderedRevision, array)
#10 /srv/mediawiki/php-1.42.0-wmf.5/includes/Revision/RenderedRevision.php(199): call_user_func(Closure, MediaWiki\Revision\RenderedRevision, array)
#11 /srv/mediawiki/php-1.42.0-wmf.5/includes/poolcounter/PoolWorkArticleView.php(84): MediaWiki\Revision\RenderedRevision->getRevisionParserOutput()
#12 /srv/mediawiki/php-1.42.0-wmf.5/includes/poolcounter/PoolWorkArticleViewOld.php(66): PoolWorkArticleView->renderRevision()
#13 /srv/mediawiki/php-1.42.0-wmf.5/includes/poolcounter/PoolCounterWork.php(167): PoolWorkArticleViewOld->doWork()
#14 /srv/mediawiki/php-1.42.0-wmf.5/includes/page/ParserOutputAccess.php(307): PoolCounterWork->execute()
#15 /srv/mediawiki/php-1.42.0-wmf.5/includes/page/Article.php(756): MediaWiki\Page\ParserOutputAccess->getParserOutput(WikiPage, ParserOptions, MediaWiki\Revision\RevisionStoreRecord, integer)
#16 /srv/mediawiki/php-1.42.0-wmf.5/includes/page/Article.php(559): Article->generateContentOutput(MediaWiki\User\User, ParserOptions, integer, MediaWiki\Output\OutputPage, array)
#17 /srv/mediawiki/php-1.42.0-wmf.5/includes/actions/ViewAction.php(78): Article->view()
#18 /srv/mediawiki/php-1.42.0-wmf.5/includes/MediaWiki.php(583): ViewAction->show()
#19 /srv/mediawiki/php-1.42.0-wmf.5/includes/MediaWiki.php(363): MediaWiki->performAction(Article, MediaWiki\Title\Title)
#20 /srv/mediawiki/php-1.42.0-wmf.5/includes/MediaWiki.php(960): MediaWiki->performRequest()
#21 /srv/mediawiki/php-1.42.0-wmf.5/includes/MediaWiki.php(613): MediaWiki->main()
#22 /srv/mediawiki/php-1.42.0-wmf.5/index.php(50): MediaWiki->run()
#23 /srv/mediawiki/php-1.42.0-wmf.5/index.php(46): wfIndexMain()
#24 /srv/mediawiki/w/index.php(3): require(string)
#25 {main}

Event Timeline

Pppery renamed this task from "RuntimeException: PCRE failure" displaying oldest revisions on eowiki to Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them. (Mar 2 2025, 9:25 PM)

(Before merging each of these, I am checking that the revision really is valid Windows-1252 and that the error is caused by this rather than by some other corruption bug with the same symptom.)

I guess we could add a new configuration flag, $wgValidateUtf8ForOldRevisions, and run preg_match('//u', $articleText) every time we fetch an old revision from the database. I don't think we'd want to turn that on for most wikis, but we could certainly afford the small performance hit of enabling it for the Esperanto wiki.

I think it could be done faster than that. Fast enough that you could just do it all the time, without the config flag.

PHP 8.3 includes this commit by Alex Dowad, which introduced an AVX2 accelerated implementation of mb_check_encoding(), with caching of the result in the zval. It's so fast that you could use it redundantly throughout the codebase. On my laptop, it does about 1 GB/s on the first call on a given string, and ~20ns per call after that to check the cache.

mb_scrub() checks the new cache but doesn't set it, so it's best to use it in combination with mb_check_encoding().

The old mb_check_encoding() in PHP 7.4 does about 140 MB/s. I think that's tolerable in the context of a parser cache miss. I'll put it up for review.
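For readers who want to experiment with the kind of check discussed above, here is a minimal Python sketch of the same idea: decide whether a byte string is well-formed UTF-8 before handing it to code that assumes it is. This is only an analogy. The function name is hypothetical, and the proposal in this task uses PHP's mb_check_encoding(), not this code.

```python
def is_valid_utf8(blob: bytes) -> bool:
    """Return True if blob is well-formed UTF-8 (hypothetical helper)."""
    try:
        blob.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # Properly encoded Esperanto text passes the check.
    print(is_valid_utf8("ĉu".encode("utf-8")))
    # A lone 0xE9 byte ("é" in Windows-1252) is not valid UTF-8.
    print(is_valid_utf8(b"caf\xe9"))
```

Pure ASCII text always passes, which is why many affected revisions are still largely readable despite the wrong encoding.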

Change #1124221 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] Fix invalid UTF-8 in Parser::parse() and SqlBlobStore

https://gerrit.wikimedia.org/r/1124221

Change #1124565 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] maintenance: Also check for utf-8 encoding in findBadBlobs

https://gerrit.wikimedia.org/r/1124565

findBadBlobs can basically work out of the box going through revisions, which is nice, but it doesn't check whether the content is valid UTF-8. The above would help here. Then I can run a search across all wikis to hunt these down and see what I can do about them.

Also noting that just removing bad encodings will mess up the checksum. If it's not that many, I'd say let's just mark them but let's see how many we are dealing with first.

Is it useful to mark these as bad? The effect of that would be that instead of some content being visible (which is often reasonably readable despite being in the wrong encoding since on many wikis most content is ASCII) no content is visible at all, which seems worse.

Re-encoding from whatever character set they are in to UTF-8 would best reflect the author's intention. But replacing the non-ASCII characters with U+FFFD will look better in diffs since the corrupted article text would have been fixed by an editor in a subsequent revision. Either way, the checksums could be updated.
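To make the two options concrete, here is a rough Python sketch: one function re-encodes from an assumed legacy encoding, the other replaces invalid sequences with U+FFFD. The helper names and the sample text are hypothetical, and SHA-1 is used only to illustrate that either repair changes the content digest; this is not MediaWiki code.

```python
import hashlib


def reencode(blob: bytes, source_encoding: str) -> bytes:
    """Interpret blob in a legacy encoding and return it as UTF-8."""
    return blob.decode(source_encoding).encode("utf-8")


def replace_invalid(blob: bytes) -> bytes:
    """Replace each invalid UTF-8 sequence with U+FFFD."""
    return blob.decode("utf-8", errors="replace").encode("utf-8")


def sha1_hex(blob: bytes) -> str:
    """Digest used here only to show that the stored checksum would change."""
    return hashlib.sha1(blob).hexdigest()


if __name__ == "__main__":
    original = b"caf\xe9 au lait"  # 0xE9 is "é" in Windows-1252, invalid as UTF-8 here
    repairs = {
        "re-encoded": reencode(original, "cp1252"),
        "replaced": replace_invalid(original),
    }
    for label, fixed in repairs.items():
        changed = sha1_hex(fixed) != sha1_hex(original)
        print(label, fixed.decode("utf-8"), "checksum changed:", changed)
```

Note that Python's cp1252 codec raises an error on the few bytes that Windows-1252 leaves undefined, so a real repair script would need to decide how to handle those.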

Marking them as bad is fine as a quick and simple solution. If I understand correctly, it preserves the original blob in the database in case some future generation wants to recover the text.

Change #1124565 merged by jenkins-bot:

[mediawiki/core@master] maintenance: Also check for utf-8 encoding in findBadBlobs

https://gerrit.wikimedia.org/r/1124565

Change #1124761 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.44.0-wmf.18] maintenance: Also check for utf-8 encoding in findBadBlobs

https://gerrit.wikimedia.org/r/1124761

Change #1124762 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.44.0-wmf.19] maintenance: Also check for utf-8 encoding in findBadBlobs

https://gerrit.wikimedia.org/r/1124762

Change #1124761 merged by jenkins-bot:

[mediawiki/core@wmf/1.44.0-wmf.18] maintenance: Also check for utf-8 encoding in findBadBlobs

https://gerrit.wikimedia.org/r/1124761

Change #1124762 merged by jenkins-bot:

[mediawiki/core@wmf/1.44.0-wmf.19] maintenance: Also check for utf-8 encoding in findBadBlobs

https://gerrit.wikimedia.org/r/1124762

Mentioned in SAL (#wikimedia-operations) [2025-03-05T12:39:16Z] <ladsgroup@deploy2002> Started scap sync-world: Backport for [[gerrit:1124761|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]], [[gerrit:1124762|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-05T12:42:40Z] <ladsgroup@deploy2002> ladsgroup: Backport for [[gerrit:1124761|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]], [[gerrit:1124762|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-05T13:11:52Z] <ladsgroup@deploy2002> Started scap sync-world: Backport for [[gerrit:1124761|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]], [[gerrit:1124762|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-05T13:15:04Z] <ladsgroup@deploy2002> ladsgroup: Backport for [[gerrit:1124761|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]], [[gerrit:1124762|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-05T13:23:24Z] <ladsgroup@deploy2002> Finished scap sync-world: Backport for [[gerrit:1124761|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]], [[gerrit:1124762|maintenance: Also check for utf-8 encoding in findBadBlobs (T351953)]] (duration: 11m 31s)

Now findBadBlobs shows them as broken:

ladsgroup@mwmaint2002:~$ mwscript findBadBlobs.php  --wiki eowiki --revisions 4062,4063,4064
DEPRECATION WARNING: Maintenance scripts are moving to Kubernetes. See
https://wikitech.wikimedia.org/wiki/Maintenance_scripts for the new process.
Maintenance hosts will be going away; please submit feedback promptly if
maintenance scripts on Kubernetes don't work for you. (T341553)
Scanning 3 ids
	! Found bad blob on revision 4062 from 20011128234605 (main slot): content_id=8381, address=<es:DB://cluster5/5320/0a57e6057e76cd9536ef9ebb9ff9a647?flags=utf-8>, error='Invalid UTF-8', type='invalid-utf-8'. ID:	4062
	! Found bad blob on revision 4063 from 20011203185221 (main slot): content_id=8382, address=<es:DB://cluster5/5320/e6095fcd90e9dfd010f3d2484b304aee?flags=utf-8>, error='Invalid UTF-8', type='invalid-utf-8'. ID:	4063
	! Found bad blob on revision 4064 from 20011203185255 (main slot): content_id=8383, address=<es:DB://cluster5/5320/977243872771c9c695e2af325250d734?flags=utf-8>, error='Invalid UTF-8', type='invalid-utf-8'. ID:	4064
	- Scanned a batch of 3 revisions
Found 3 bad revisions.

I'm running on all of eowiki up to 2010 to see how many are like this.

Update: I'm running this on every wiki, for all revisions up to 2007. So far not that many have been found, which is good. It will take a while to finish, but once the list is complete I'm going to put it somewhere and, for now, mark all of them as bad blobs to prevent them from triggering an exception in production.

Then I will try to find a way to either scrub them or decode them from another encoding and re-encode them as UTF-8. Scrubbing requires changing the checksum, and decoding/re-encoding requires figuring out which encoding is the right one, which is not trivial.
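A small Python sketch of why picking the encoding is not trivial: single-byte legacy encodings accept almost any byte, so several candidates decode the same blob without error but yield different text. The candidate list is an assumption made for illustration; only Windows-1252 is actually named in this task.

```python
from typing import Dict, Iterable

# Hypothetical candidate list, not taken from the task.
CANDIDATES = ("cp1252", "iso8859_3", "iso8859_2")


def candidate_decodings(blob: bytes, encodings: Iterable[str] = CANDIDATES) -> Dict[str, str]:
    """Return every candidate encoding that decodes blob without error."""
    results = {}
    for name in encodings:
        try:
            results[name] = blob.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the byte sequence
    return results


if __name__ == "__main__":
    # The same two bytes decode "successfully" under each candidate,
    # so a successful decode alone does not identify the right encoding.
    for name, text in candidate_decodings(b"\xe6u").items():
        print(name, text)
```

Choosing between the results needs extra evidence, such as the wiki's language or a later revision in which an editor corrected the text.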

(The check is still ongoing, currently almost done on enwiki)

Mentioned in SAL (#wikimedia-operations) [2025-03-12T18:36:22Z] <Amir1> marking ~3K revisions with bad blobs (T351953)

I just marked 2827 revisions as bad blobs (by running tail -n +3 bad_blobs_12_mar | sed -n 's/^\(.*wik.*\)\:.* revision \([0-9]*\) from.*/\1 \2/p' | xargs -l bash -c 'mwscript findBadBlobs.php --wiki $0 --revisions $1 --mark \"Invalid UTF-8\"'). I put the list of them in P74206. The script hasn't finished yet, so there might be more, but this should take care of around 90% of the wikis.
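The same extraction can be sketched in Python for anyone who finds the sed expression hard to read. It assumes, as the sed pattern implies, that each log line is prefixed with the wiki name and a colon; that prefix, the function names and the sample lines are assumptions. The sketch only builds command strings and does not run anything.

```python
import re
from collections import defaultdict

# Assumed line shape: "<wiki>: ... Found bad blob on revision <id> from <timestamp> ..."
LINE_RE = re.compile(r"^(?P<wiki>[^\s:]*wik[^\s:]*):.* revision (?P<rev>\d+) from")


def group_bad_revisions(lines):
    """Map each wiki to the list of bad revision IDs found in the log lines."""
    grouped = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            grouped[match.group("wiki")].append(int(match.group("rev")))
    return dict(grouped)


def mark_commands(grouped, reason="Invalid UTF-8"):
    """Build one findBadBlobs.php invocation per wiki (strings only)."""
    commands = []
    for wiki, revisions in sorted(grouped.items()):
        rev_list = ",".join(str(rev) for rev in revisions)
        commands.append(
            f'mwscript findBadBlobs.php --wiki {wiki} --revisions {rev_list} --mark "{reason}"'
        )
    return commands


if __name__ == "__main__":
    sample = [
        "eowiki:\t! Found bad blob on revision 4062 from 20011128234605 (main slot)",
        "eowiki:\t! Found bad blob on revision 4063 from 20011203185221 (main slot)",
        "\t- Scanned a batch of 3 revisions",
    ]
    for command in mark_commands(group_bad_revisions(sample)):
        print(command)
```

Unlike the xargs pipeline, which issues one command per revision, this groups revisions per wiki and passes them as a comma-separated list, the form used in the findBadBlobs.php run shown earlier.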

I looked at the enwiki diffs there, to see if there was another old corruption task I could file and investigate. Aside from two cases I already knew about, they seem to be random, with no pattern whatsoever I could see.

I did discover T388708 and an old copyvio from 2005 to revdel, though.

I think I know why that happened: the rev_id is far too high for an edit from 2002. It's probably deletion/undeletion that made the rev_id high. I don't have an easy way to find cases like this, but I will try; I will probably patch findBadBlobs to check timestamps instead.

Change #1127659 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] findBadBlobs: Allow for timestamp based search via --scan-to

https://gerrit.wikimedia.org/r/1127659

Change #1127659 merged by jenkins-bot:

[mediawiki/core@master] findBadBlobs: Allow for timestamp based search via --scan-to

https://gerrit.wikimedia.org/r/1127659

I marked 702 more revisions as bad revisions (list: P74235). Tomorrow morning I will backport the above patch and re-run findBadBlobs across all wikis.

Change #1128366 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.44.0-wmf.20] findBadBlobs: Allow for timestamp based search via --scan-to

https://gerrit.wikimedia.org/r/1128366

Change #1128366 merged by jenkins-bot:

[mediawiki/core@wmf/1.44.0-wmf.20] findBadBlobs: Allow for timestamp based search via --scan-to

https://gerrit.wikimedia.org/r/1128366

Mentioned in SAL (#wikimedia-operations) [2025-03-17T10:50:33Z] <ladsgroup@deploy2002> Started scap sync-world: Backport for [[gerrit:1128366|findBadBlobs: Allow for timestamp based search via --scan-to (T351953)]], [[gerrit:1128365|media: Make SvgHandler respect physicalWidth when building URL for thumb (T360589)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-17T10:55:11Z] <ladsgroup@deploy2002> ladsgroup: Backport for [[gerrit:1128366|findBadBlobs: Allow for timestamp based search via --scan-to (T351953)]], [[gerrit:1128365|media: Make SvgHandler respect physicalWidth when building URL for thumb (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-17T11:04:43Z] <ladsgroup@deploy2002> Finished scap sync-world: Backport for [[gerrit:1128366|findBadBlobs: Allow for timestamp based search via --scan-to (T351953)]], [[gerrit:1128365|media: Make SvgHandler respect physicalWidth when building URL for thumb (T360589)]] (duration: 14m 09s)

Running it again. It'll take a while.

PCRE exceptions are now much much better:

grafik.png (362×943 px, 32 KB)

Some are still lingering though. I will see what I can do about those.

Change #1124221 abandoned by Tim Starling:

[mediawiki/core@master] Fix invalid UTF-8 in Parser::parse()

Reason:

old

https://gerrit.wikimedia.org/r/1124221

Change #1214446 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] findBadBlobs: Fix the --scan-to option

https://gerrit.wikimedia.org/r/1214446

Change #1214446 merged by jenkins-bot:

[mediawiki/core@master] findBadBlobs: Fix the --scan-to option

https://gerrit.wikimedia.org/r/1214446

Change #1214592 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.46.0-wmf.5] findBadBlobs: Fix the --scan-to option

https://gerrit.wikimedia.org/r/1214592

Change #1214593 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.46.0-wmf.4] findBadBlobs: Fix the --scan-to option

https://gerrit.wikimedia.org/r/1214593

Change #1214598 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@REL1_45] findBadBlobs: Fix the --scan-to option

https://gerrit.wikimedia.org/r/1214598

Change #1214592 merged by jenkins-bot:

[mediawiki/core@wmf/1.46.0-wmf.5] findBadBlobs: Fix the --scan-to option

https://gerrit.wikimedia.org/r/1214592

Change #1214593 merged by jenkins-bot:

[mediawiki/core@wmf/1.46.0-wmf.4] findBadBlobs: Fix the --scan-to option

https://gerrit.wikimedia.org/r/1214593

Mentioned in SAL (#wikimedia-operations) [2025-12-03T18:19:54Z] <ladsgroup@deploy2002> Started scap sync-world: Backport for [[gerrit:1214592|findBadBlobs: Fix the --scan-to option (T351953)]], [[gerrit:1214593|findBadBlobs: Fix the --scan-to option (T351953)]]

Mentioned in SAL (#wikimedia-operations) [2025-12-03T18:22:14Z] <ladsgroup@deploy2002> ladsgroup: Backport for [[gerrit:1214592|findBadBlobs: Fix the --scan-to option (T351953)]], [[gerrit:1214593|findBadBlobs: Fix the --scan-to option (T351953)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Change #1214598 merged by jenkins-bot:

[mediawiki/core@REL1_45] findBadBlobs: Fix the --scan-to option

https://gerrit.wikimedia.org/r/1214598

Mentioned in SAL (#wikimedia-operations) [2025-12-03T18:26:42Z] <ladsgroup@deploy2002> Finished scap sync-world: Backport for [[gerrit:1214592|findBadBlobs: Fix the --scan-to option (T351953)]], [[gerrit:1214593|findBadBlobs: Fix the --scan-to option (T351953)]] (duration: 06m 48s)

I'm seeing the check for archive failing sometimes with this:

Wikimedia\Assert\ParameterAssertionException from line 72 of /srv/mediawiki/php-1.46.0-wmf.4/vendor/wikimedia/assert/src/Assert.php: Bad value for parameter $page: must represent a proper page
#0 /srv/mediawiki/php-1.46.0-wmf.4/includes/Revision/RevisionArchiveRecord.php(80): Wikimedia\Assert\Assert::parameter(false, '$page', 'must represent ...')
#1 /srv/mediawiki/php-1.46.0-wmf.4/includes/Revision/RevisionStore.php(1639): MediaWiki\Revision\RevisionArchiveRecord->__construct(Object(MediaWiki\Title\Title), Object(MediaWiki\User\UserIdentityValue), Object(MediaWiki\CommentStore\CommentStoreComment), Object(stdClass), Object(MediaWiki\Revision\RevisionSlots), false)
#2 /srv/mediawiki/php-1.46.0-wmf.4/includes/Revision/RevisionStore.php(2038): MediaWiki\Revision\RevisionStore->newRevisionFromArchiveRowAndSlots(Object(stdClass), Object(MediaWiki\Revision\RevisionSlots), 0, Object(MediaWiki\Title\Title))
#3 [internal function]: MediaWiki\Revision\RevisionStore->MediaWiki\Revision\{closure}(Object(stdClass))
#4 /srv/mediawiki/php-1.46.0-wmf.4/includes/Revision/RevisionStore.php(2017): array_map(Object(Closure), Array)
#5 /srv/mediawiki/php-1.46.0-wmf.4/maintenance/findBadBlobs.php(287): MediaWiki\Revision\RevisionStore->newRevisionsFromBatch(Object(Wikimedia\Rdbms\MysqliResultWrapper), Array)
#6 /srv/mediawiki/php-1.46.0-wmf.4/maintenance/findBadBlobs.php(216): FindBadBlobs->loadArchiveByRevisionId(10451, 34930298, 1000)
#7 /srv/mediawiki/php-1.46.0-wmf.4/maintenance/findBadBlobs.php(128): FindBadBlobs->scanRevisionsByTimestamp()
#8 /srv/mediawiki/php-1.46.0-wmf.4/maintenance/includes/MaintenanceRunner.php(696): FindBadBlobs->execute()
#9 /srv/mediawiki/php-1.46.0-wmf.4/maintenance/run.php(53): MediaWiki\Maintenance\MaintenanceRunner->run()
#10 /srv/mediawiki/multiversion/MWScript.php(221): require_once('/srv/mediawiki/...')

It's not a big deal per se, since archived revisions are an even smaller chunk and have an even smaller blast radius. But it breaks the check script that iterates over the dblist.

Change #1214673 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] RevisionStore: Catch ParameterAssertionException too

https://gerrit.wikimedia.org/r/1214673

Change #1214673 merged by jenkins-bot:

[mediawiki/core@master] RevisionStore: Catch ParameterAssertionException too

https://gerrit.wikimedia.org/r/1214673

Change #1215164 had a related patch set uploaded (by Jforrester; author: Amir Sarabadani):

[mediawiki/core@wmf/1.46.0-wmf.5] RevisionStore: Catch ParameterAssertionException too

https://gerrit.wikimedia.org/r/1215164

Change #1215164 merged by jenkins-bot:

[mediawiki/core@wmf/1.46.0-wmf.5] RevisionStore: Catch ParameterAssertionException too

https://gerrit.wikimedia.org/r/1215164

Mentioned in SAL (#wikimedia-operations) [2025-12-04T14:52:35Z] <ladsgroup@deploy2002> Started scap sync-world: Backport for [[gerrit:1215164|RevisionStore: Catch ParameterAssertionException too (T351953)]]

Mentioned in SAL (#wikimedia-operations) [2025-12-04T14:54:39Z] <ladsgroup@deploy2002> jforrester, ladsgroup: Backport for [[gerrit:1215164|RevisionStore: Catch ParameterAssertionException too (T351953)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-12-04T15:02:01Z] <ladsgroup@deploy2002> Finished scap sync-world: Backport for [[gerrit:1215164|RevisionStore: Catch ParameterAssertionException too (T351953)]] (duration: 09m 26s)

ladsgroup@deploy2002:~$ mwscript-k8s --follow -- findBadBlobs.php --wiki guwiktionary --mark --revisions 20576
⏳ Starting findBadBlobs.php on Kubernetes as job mw-script.codfw.llaqsrum ...
🚀 Job is running.
📜 Streaming logs:
The --mark must be used together with --revisions

https://tenor.com/de/view/cmon-court-see-look-cardi-b-gif-7970983851329845307

Mentioned in SAL (#wikimedia-operations) [2025-12-05T00:26:18Z] <Amir1> ladsgroup@deploy2002:~$ mwscript-k8s --follow -- findBadBlobs.php --wiki guwiktionary --mark "Corrupted UTF-8 (T351953)" --revisions 20576

Mentioned in SAL (#wikimedia-operations) [2025-12-05T00:27:35Z] <Amir1> ladsgroup@deploy2002:~$ mwscript-k8s --follow -- findBadBlobs.php --wiki huwikiquote --mark "Corrupted UTF-8 (T351953)" --revisions 3804,3808,3811,3813,3814,3818,3825

That should be all of the issues in small.dblist wikis. Now running the medium wikis.

ladsgroup@deploy2002:~$ mwscript-k8s --dblist=medium -- findBadBlobs.php --scan-from 2001-01-01T00:00:00 --scan-to 2007-01-01T00:00:00
⏳ Starting findBadBlobs.php on Kubernetes as job mw-script.codfw.jmzv03wd ...
🚀 Job is running. For streaming logs, run:
K8S_CLUSTER=codfw KUBECONFIG=/etc/kubernetes/mw-script-codfw.config kubectl logs -f job/mw-script.codfw.jmzv03wd mediawiki-jmzv03wd-app

Medium breaks often, let me try this.

Marked these revisions:

mwscript-k8s --follow -- findBadBlobs.php --wiki dewiktionary --mark "Corrupted UTF-8 (T351953)" --revisions 30288,67988
mwscript-k8s --follow -- findBadBlobs.php --wiki elwiki --mark "Corrupted UTF-8 (T351953)" --revisions 26381,30551
mwscript-k8s --follow -- findBadBlobs.php --wiki enwikibooks --mark "Corrupted UTF-8 (T351953)" --revisions 12927,24553,70674,70677,70678,3401459,3403406,3403510,3405223,3405715
mwscript-k8s --follow -- findBadBlobs.php --wiki enwikiquote --mark "Corrupted UTF-8 (T351953)" --revisions 8434,10751
mwscript-k8s --follow -- findBadBlobs.php --wiki itwiktionary --mark "Corrupted UTF-8 (T351953)" --revisions 5841,5845
mwscript-k8s --follow -- findBadBlobs.php --wiki nlwiktionary --mark "Corrupted UTF-8 (T351953)" --revisions 2230,2323,2471
mwscript-k8s --follow -- findBadBlobs.php --wiki sourceswiki --mark "Corrupted UTF-8 (T351953)" --revisions 1330,11990,11992,11993,11994,11995

Progress:

Medium wikis:

  • Up to 2003: Marked
  • Between 2003 and 2004: Marked
  • Between 2004 and 2005: Running
  • Between 2005 and 2006
  • Between 2006 and 2007

Large wikis:

  • Up to 2003
  • Between 2003 and 2004
  • Between 2004 and 2005
  • Between 2005 and 2006
  • Between 2006 and 2007
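The time windows in the progress list can be generated rather than typed by hand. Below is a minimal Python sketch that turns a list of boundary years into --scan-from/--scan-to pairs for a given dblist. The function names are hypothetical, the timestamp format is copied from the medium-wiki command above, and whether the --scan-to bound is inclusive or exclusive is not stated in this task. It only prints command strings.

```python
def scan_windows(boundary_years):
    """Turn [2001, 2003, 2004, ...] into consecutive (start, end) timestamp pairs."""
    stamps = [f"{year}-01-01T00:00:00" for year in boundary_years]
    return list(zip(stamps[:-1], stamps[1:]))


def scan_commands(dblist, boundary_years):
    """Build one findBadBlobs.php scan command per time window (strings only)."""
    return [
        f"mwscript-k8s --dblist={dblist} -- findBadBlobs.php --scan-from {start} --scan-to {end}"
        for start, end in scan_windows(boundary_years)
    ]


if __name__ == "__main__":
    # Boundaries matching the progress list: up to 2003, then one window per year.
    for command in scan_commands("medium", [2001, 2003, 2004, 2005, 2006, 2007]):
        print(command)
```

Smaller windows keep each job short, which helps when a long-running scan tends to break partway through.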

Mentioned in SAL (#wikimedia-operations) [2025-12-16T21:58:21Z] <Amir1> mwscript-k8s --follow -- findBadBlobs.php --wiki elwiki --mark "Corrupted UTF-8 (T351953)" --revisions 26381,30551 (T351953)

mwscript-k8s --follow -- findBadBlobs.php --wiki plwiktionary --mark "Corrupted UTF-8 (T351953)" --revisions 4182,3067,4196
mwscript-k8s --follow -- findBadBlobs.php --wiki dawiktionary --mark "Corrupted UTF-8 (T351953)" --revisions 630