Page MenuHomePhabricator

Actor Migration breaks images in 1.34.x
Open, HighPublicBUG REPORT

Description

Product Version

MediaWiki 1.34.4 (e34e7f2)
PHP 7.2.33-1+0~20201008.48+debian9~1.gbpcb3068 (apache2handler)
MariaDB 10.1.47-MariaDB-0+deb9u1
ICU 65.1
Elasticsearch 6.8.12

Entry point URLs

Entry point URL
Article path /wiki/en/$1
Script path /wiki/en
index.php /wiki/en/index.php
api.php /wiki/en/api.php
load.php /wiki/en/load.php


Brief

I upgraded 11 (public) wikis in a farm from REL1_32 to REL1_34 and encountered multiple failures. In the end, after seeming to succeed, we found that access to over 1,000 images was 'lost'. Only after manipulating the database records for users, revisions, and images were we able to restore access to the images. See below for details and attempted solutions (with code).

Steps to Reproduce:

I tried to run the maintenance/migrateActors.php script beforehand (using SCHEMA_COMPAT_NEW in LocalSettings.php). For preparation, I ran maintenance/cleanupUsersWithNoId.php --force --prefix=mw

migrateActors would fail with

Query: INSERT  INTO `actor` (actor_name) VALUES ('')
Function: MigrateActors::addActorsForRows
Error: 1062 Duplicate entry '' for key 'actor_name' (localhost)

I solved this (the wrong way) by removing the record in actor:

mysql wiki_fr -e 'DELETE FROM wiki_fr.actor WHERE CAST(actor_name AS CHAR) = "";'

But it did allow actorMigration to complete. For example:

Completed migration, updated 21262 row(s) with 1 new actor(s), 0 error(s)
Beginning migration of log_search
Completed migration, updated 0 row(s) with 0 new actor(s), 0 error(s)

real    2m48.582s
user    0m4.468s
sys     0m0.808s

But we also ran into the problem of Duplicate entry for actor_name for 'Redirect fixer' (as compared to '')

Wikimedia\Rdbms\DBQueryError from line 1603 of /opt/htdocs/mediawiki/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 
Query: INSERT  INTO `actor` (actor_name) VALUES ('Redirect fixer')
Function: MigrateActors::addActorsForRows
Error: 1062 Duplicate entry 'Redirect fixer' for key 'actor_name' (localhost)

I tried fixing this (incorrectly) by changing rev_user to 0 in revision where rev_user was 7425 ('Redirect fixer); then running maintenance/cleanupUserWithNoId.php and maintenance/migrateActors.php. Not only is this approach incorrect (because we don't have a user_id 0; and rev_user 7425 was not the problem), but it still failed on similar records in the logging table.

Completed migration, updated 81604 row(s) with 0 new actor(s), 0 error(s)
Beginning migration of logging.log_user and logging.log_user_text to logging.log_actor
... log_id=100
... log_id=200
... log_id=300
Wikimedia\Rdbms\DBQueryError from line 1603 of /opt/htdocs/mediawiki/includes/libs/rdbms/database/Database.php: A database query error has occurred. Did you forget to run your application's database schema updater after upgrading? 
Query: INSERT  INTO `actor` (actor_name) VALUES ('')
Function: MigrateActors::addActorsForRows
Error: 1062 Duplicate entry '' for key 'actor_name' (localhost)

We looked at archive, logging, image, oldimage, filearchive, and ipblocks
for records with empty user_text. The only ones found were in logging (and they were old records.) These records had various 'logging_user' values which I believe were non-existent in the user table. Not knowing how to 'fix' these records, we deleted them (-- affected rows).

delete from `wiki_de`.`logging` where log_user_text = ""; -- 349
delete from `wiki_en`.`logging` where log_user_text = ""; -- 21733
delete from `wiki_es`.`logging` where log_user_text = ""; -- 22853
delete from `wiki_fr`.`logging` where log_user_text = ""; -- 332
delete from `wiki_it`.`logging` where log_user_text = ""; -- 0
delete from `wiki_ja`.`logging` where log_user_text = ""; -- 257
delete from `wiki_ko`.`logging` where log_user_text = ""; -- 24
delete from `wiki_pt`.`logging` where log_user_text = ""; -- 535
delete from `wiki_ru`.`logging` where log_user_text = ""; -- 0
delete from `wiki_sv`.`logging` where log_user_text = ""; -- 2979
delete from `wiki_zh`.`logging` where log_user_text = ""; -- 154

After all this, we could only get through the maintenance/update.php process by deleting the actor record for 'Redirect fixer' a few times. (Redirect fixer has a actor_user value of NULL)

Not exactly a smooth upgrade experience, but at least now we are running 34.x and everything "looked good".

But the images

When it was reported that many previous images were failing to appear on the wiki, we determined that it was because the system declines to show them when the proper actor relation does not exist in the image table. Here's an example query that must be satisfied for an image to show on it's File: page:

SELECT 
    img_name,
    img_size,
    img_width,
    img_height,
    img_metadata,
    img_bits,
    img_media_type,
    img_major_mime,
    img_minor_mime,
    img_timestamp,
    img_sha1,
    comment_img_description.comment_text AS `img_description_text`,
    comment_img_description.comment_data AS `img_description_data`,
    comment_img_description.comment_id AS `img_description_cid`,
    actor_img_user.actor_user AS `img_user`,
    actor_img_user.actor_name AS `img_user_text`,
    img_actor,
    img_metadata
FROM
    `image`
        JOIN
    `comment` `comment_img_description` ON ((comment_img_description.comment_id = img_description_id))
        JOIN
    `actor` `actor_img_user` ON ((actor_img_user.actor_id = img_actor))
WHERE
    img_name = 'Flag_of_Norway.png'
LIMIT 1;

The problem images had img_actor value of 0 in the image table. (Zero is a non-existent record ID in the actor table).

We tried to fix all images with this new fixImages.php maintenance script.

But, that didn't resolve all cases.

Fixed 415 out of 1185 images
570 user records missing
770 actor records missing
0 duplicate user records

Robert Vogel User:Osnard encountered the same problem with Images and offered a prepareActorMigration.php script to fix up "orphan" records (to be used prior to migration). Using his script lightly modified, even after migration to 1.34, I was able to re-run my fixImages.php script to complete fixing up the image table.

Fixed 766 out of 766 images
0 user records missing
0 actor records missing
0 duplicate user records

At this point it looks like everything is as good as it can be -- although we certainly lost data in the process. I'm reporting here to hopefully help others not make the same mistakes I did; and to possibly get improvements into the migration process or documentation so that upgrades go more smoothly.

Must purge cache

One final sticking point: The images still would not appear on the wiki without a 'purge' of the page cache for each one. Not wanting to purge the entire wiki cache (for performance reasons); nor wanting to click ~2000 URLs, I used the purgePage.php maintenance script

WIKI=en php /opt/htdocs/wiki/mediawiki/maintenance/purgePage.php < /opt/htdocs/wiki/mediawiki/maintenance/imagelist.txt

note that the purgeList.php script is for a slightly different use case.

Event Timeline

freephile updated the task description. (Show Details)
Reedy renamed this task from Actor Migration breaks images in 34.x to Actor Migration breaks images in 1.34.x.Oct 28 2020, 3:22 PM

Just a note on purgeList.php: It only purges squid/varnish, not parser cache (as noted on the manual page). Use purgePage.php for that.

About your wiki farm setup, are you using shared user tables? What tables are you sharing?

No shared tables. Just shared file repo.

The output seems to indicate that it was requesting the ?action=history page instead of ?action=purge.

It's not requesting those pages, those are the list of URLs it's sending to the cache to be purged

@Reedy @Ciencia_Al_Poder Thanks for the heads up on purgePage v. purgeList (If there's a 50/50 chance of doing the wrong thing, I do the wrong thing 100% of the time!)

Just a little more info on what a 'broken image' looks like in the database.

-- A broken image
select img_name, img_actor from image where img_name = 'Flag_of_Norway.png'; -- img_actor = 0
select page_id from page where page_title = 'Flag_of_Norway.png'; -- the page id is 15395
select rev_user, rev_user_text from revision where rev_page = 15395; -- There are 4 entries in revisions, while page info says the second one 20027 Claudiaj64 is the Creator
select rev_user, rev_user_text from revision where rev_page = 15395 having min(rev_id); -- the original creator was user 145 Jtrev
select * from user where user_id in (145, 20027); -- Between Jtrev and Claudia, Claudia is the only one in the user table
select * from user where user_name = 'Jtrev'; -- Jtrev is just plain missing. Not found by name or id
select * from actor where actor_user = 20027; -- Claudia has actor id of 29 on beta, 30 on prod
update image set img_actor = 30 where img_name = 'Flag_of_Norway.png'; -- one row affected, but img still not visible (due to caching)

@freephile Have you got SQL dumps from before the migration?

@tstarling I believe I do -- from 9/11/2020, but the main DB backup is 12GB compressed. Due to space constraints, I only have these on a 'snapshot' -- not a live system. If you give me a public SSH key, I could spin up an environment temporarily (cost us $20/day) from one of these snapshots, and give you direct SSH access. I do believe that I already deleted some records from the logging table (where log_user_text = '') prior to snapshot. I don't have earlier backups.

To be clear: this would be a VPS at Digital Ocean running the entire stack using MW 1.32 64 GB RAM / 8 CPUs
1.17 TB SSD Disk
7 TB transfer

How big is the MySQL data directory?

Never mind, I need to read T229092 a bit more closely before I'm ready for that. I'm just looking for any test case, the smaller the better, prior to running cleanupUsersWithNoId.php, in order to confirm that the corruption of log_user_text was pre-existing and if so, whether there is a hint as the to cause.