Undeleted djvu files show incorrect metadata: 0x0 size, no page number info
Open, Needs Triage, Public, BUG REPORT

Description

After undeletion, DjVu files that were deleted in 2015 become useless: they show a 0x0 size and no page number information.
E.g.:
https://commons.wikimedia.org/wiki/File:Skibinski_pamietnik0001.djvu
https://commons.wikimedia.org/wiki/File:Le_Goffic_-_Le_Crucifi%C3%A9_de_Kerali%C3%A8s.djvu

Event Timeline

Same problem with unhiding a file version that was hidden in August 2021:
https://commons.wikimedia.org/wiki/File:PL_Miriam_-_U_poet%C3%B3w.djvu

Anoop changed the subtype of this task from "Task" to "Bug Report". Sat, Jan 1, 2:13 AM
Anoop added a project: MediaWiki-DjVu.

This is affected by the missing split of big metadata for the filehistory (oldimage table).

The original task description is affected by the missing split of big metadata for the file archive (filearchive table).

The work was done as part of T275268, but refreshImageMetadata.php was only run for the current file versions.

This is a big problem for all Wikisources, as we need to restore a bunch of deleted files or revisions on Public Domain Day each year, and missing metadata (mainly the number of pages) would break hundreds or thousands of existing pages that rely on it.
How can we ensure that an undeleted file will have correct metadata? It seems we need to delay Public Domain Day until this is fixed.

An attempt to reupload (even the failed one) restores the metadata in Commons:
https://commons.wikimedia.org/wiki/File:Skibinski_pamietnik0001.djvu
But this does not restore the metadata in Wikisource:
https://pl.wikisource.org/wiki/Plik:Skibinski_pamietnik0001.djvu
so it cannot be used even as an ugly workaround.

An upload stores the metadata in the new way and it works after that, but that does not sound like a good idea.

The main task was T192866; I am adding some reviewers from there to this task.

The API output of the metadata still contains the xml key, but https://gerrit.wikimedia.org/r/c/mediawiki/core/+/740320/ removes the interpretation of that key:

https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=metadata&titles=File:Skibinski_pamietnik0001.djvu

"pages": {
    "42588070": {
        "pageid": 42588070,
        "ns": 6,
        "title": "File:Skibinski pamietnik0001.djvu",
        "imagerepository": "local",
        "imageinfo": [
            {
                "metadata": [
                    {
                        "name": "xml",
                        "value": "<?xml version=\"1.0\" ?>\n<!DOCTYPE DjVuXML PUBLIC \"-//W3C//DTD DjVuXML 1.1//EN\" \"pubtext/DjVuXML-s.dtd\">\n<mw-djvu><DjVuXML>\n<HEAD></HEAD>\n<BODY><OBJECT
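
For anyone who wants to check other files for the same symptom, here is a rough Python sketch of the API query above. It only looks for a metadata item named "xml" in the imageinfo response, which is the legacy key visible in the output above. The helper names (build_query_url, has_legacy_xml_metadata, fetch) and the User-Agent string are made up for this example, and it assumes the default JSON response layout where "pages" is keyed by page ID.

```python
import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"


def build_query_url(title, api=API):
    """Build the same imageinfo/metadata query as the URL quoted in this task."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "metadata",
        "titles": title,
        "format": "json",
    }
    return api + "?" + urllib.parse.urlencode(params)


def has_legacy_xml_metadata(response):
    """Return True if any imageinfo entry carries a metadata item named 'xml'."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for info in page.get("imageinfo", []):
            # "metadata" may be null when nothing is stored for the file.
            for item in info.get("metadata") or []:
                if item.get("name") == "xml":
                    return True
    return False


def fetch(title):
    """Fetch and decode the API response (needs network access)."""
    request = urllib.request.Request(
        build_query_url(title),
        # Hypothetical User-Agent; replace with your own contact details.
        headers={"User-Agent": "djvu-metadata-check/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Usage would be something like has_legacy_xml_metadata(fetch("File:Skibinski_pamietnik0001.djvu")). A True result only says the old key is still exposed by the API; it does not by itself prove that the page count is broken.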

I will look into this. I suspected this would become an issue but I found only 100 deleted djvu files in the db back then. Maybe I misread something. Anyway. I will take a look.

So running the refresh metadata script on them fixes the issue (Skibinski_pamietnik0001.djvu is now fixed on both Commons and plwikisource). I don't know if it's possible to do this on all deleted files. I will take a look.

… I found only 100 deleted djvu files in the db …

I don't have actual data, but my gut feeling is that this number must be off by at least an order of magnitude. Besides files that are deleted for the usual reasons (mostly copyright), there must be several thousand files that have been transwikied to Commons from a Wikisource, and all of these are candidates for being restored for various reasons. A prime example is that Commons' copyright policy requires works to be in the public domain both in their country of origin and in the US, while the Wikisources may only consider US status. Undeleting these locally when they get deleted on Commons isn't uncommon. Undeleting files on Commons when the relevant copyright expires can't be all that rare an event either, but I don't follow the relevant fora, so I could be way off on that.

wikiadmin@10.64.0.220(commonswiki)> select count(*) from filearchive where fa_media_type = 'OFFICE' limit 5;
+----------+
| count(*) |
+----------+
|    92509 |
+----------+
1 row in set (37.633 sec)

wikiadmin@10.64.0.220(commonswiki)> select count(*) from filearchive where fa_media_type = 'OFFICE' and fa_major_mime = 'image' limit 5;
+----------+
| count(*) |
+----------+
|      736 |
+----------+
1 row in set (3.213 sec)

It doesn't matter. I will write a patch to run this on both the filearchive and oldimage tables.

Change 751079 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] maintenance: Add support for oldimage table metadata refresh

https://gerrit.wikimedia.org/r/751079

Merging and running this patch would fix some of the issues but not all. For "filearchive" table fixes I need to find a separate way. In the meantime, if any file has been undeleted and now has this issue, just list it here and I will run the script on it.

There is still a problem with the 2nd version of
https://pl.wikisource.org/wiki/Plik:PL_Miriam_-_U_poet%C3%B3w.djvu
I would like to make it the top version but I am afraid to break things that depend on its metadata.

Don't worry about that; just make it the top version and then let me know so I can run the script (until I get the patch merged and deployed).

Change 751079 merged by jenkins-bot:

[mediawiki/core@master] maintenance: Add support for oldimage table metadata refresh

https://gerrit.wikimedia.org/r/751079

Change 751526 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.38.0-wmf.16] maintenance: Add support for oldimage table metadata refresh

https://gerrit.wikimedia.org/r/751526

Change 751527 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.38.0-wmf.13] maintenance: Add support for oldimage table metadata refresh

https://gerrit.wikimedia.org/r/751527

Change 751527 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.13] maintenance: Add support for oldimage table metadata refresh

https://gerrit.wikimedia.org/r/751527

Mentioned in SAL (#wikimedia-operations) [2022-01-05T02:09:08Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.13/maintenance/refreshImageMetadata.php: Backport: [[gerrit:751527|maintenance: Add support for oldimage table metadata refresh (T298417)]] (duration: 01m 08s)

Change 751526 merged by jenkins-bot:

[mediawiki/core@wmf/1.38.0-wmf.16] maintenance: Add support for oldimage table metadata refresh

https://gerrit.wikimedia.org/r/751526

Mentioned in SAL (#wikimedia-operations) [2022-01-05T02:11:19Z] <ladsgroup@deploy1002> Synchronized php-1.38.0-wmf.16/maintenance/refreshImageMetadata.php: Backport: [[gerrit:751526|maintenance: Add support for oldimage table metadata refresh (T298417)]] (duration: 01m 07s)

Mentioned in SAL (#wikimedia-operations) [2022-01-05T02:13:23Z] <Amir1> running foreachwikiindblist all maintenance/refreshImageMetadata.php --force --verbose --mediatype=OFFICE --oldimage (T298417)

So the oldimage cleanup has been done and you should now be able to see non-current metadata. The next problem is deleted files, which is already on a much smaller scale.

Don't worry about that; just make it the top version and then let me know so I can run the script (until I get the patch merged and deployed).

There is a problem with this file after unhiding/restoring its oldest version:
https://commons.wikimedia.org/wiki/File:PL_Nowele_obce_(antologia).djvu