
Provide a dump of PDF/DjVU metadata
Open, Needs Triage, Public

Description

With T275268 and T192866, metadata for PDF and DjVu files is stored as blobs in External Storage and no longer in the image table. The metadata in the image table just points to the text table, which in turn points to the ES content servers.

The issue with this is that it is no longer possible to access the metadata from the image table SQL dump alone, and the text table does not seem to be dumped for Wikimedia Commons. So I would like to request that the PDF/DjVu metadata content be dumped, or perhaps some other way to provide those blobs.

Event Timeline

Bugreporter subscribed.

First, contents of pages are not stored in the text table; they are stored in External Storage. In addition, the text table is going to be emptied soon; see T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production).

Note that we already have dumps for all Commons pages, including one for all (non-deleted, non-hidden) revisions.

(Reopen, will retitle soon)

Bugreporter renamed this task from "Provide a dump of text table to be able to access image metadata" to "Provide a dump of PDF/DjVU metadata". Feb 5 2022, 3:41 PM
Bugreporter updated the task description.

I see. So those dumps would not be SQL but something else?

Thanks for retitling. This sounds good.

This seems a very reasonable request. If the data is accessible in batch through the full dumps, this will be lower priority (a feature request rather than a bug), but if document metadata that used to be available has become completely unavailable, this should get higher priority (a bug), as internal restructuring should not really alter functionality. In the second case, this may require a separate dump process, as is done for Wikidata statements.

Adding @ArielGlenn and @Ladsgroup to the thread to triage whether this information is no longer part of the full dumps after the database restructuring.

@Mitar I ask for your patience (and really apologize for that); multimedia support is not at its best moment (hopefully that can change soon), so it may take some time for your request to be processed.

The metadata is still accessible through the API, but yeah, a dump would be nice.

And it was accessible through a dump in the past, before the database structure change.

@Mitar can you give a list here of the data that is no longer available in the image table dump and that is not available from an export of the File: page for a specific document? That will help determine what is missing.

Let's take this example: https://commons.wikimedia.org/wiki/File:17de_Mai_1872.djvu

If I access the metadata through the API, I get: https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=metadata&titles=File:17de_Mai_1872.djvu&format=json

{
    "name": "xml",
    "value": "<?xml version=\"1.0\" ?>\n<!DOCTYPE DjVuXML PUBLIC \"-//W3C//DTD DjVuXML 1.1//EN\" \"pubtext/DjVuXML-s.dtd\">\n<mw-djvu><DjVuXML>\n<HEAD></HEAD>\n<BODY><OBJECT height=\"4850\" width=\"3079\">\n<PARAM name=\"DPI\" value=\"550\" />\n<PARAM name=\"GAMMA\" value=\"2.2\" />\n</OBJECT>\n<OBJECT height=\"4850\" width=\"3079\">\n<PARAM name=\"DPI\" value=\"550\" />\n<PARAM name=\"GAMMA\" value=\"2.2\" />\n</OBJECT>\n<OBJECT height=\"4850\" width=\"3079\">\n<PARAM name=\"DPI\" value=\"550\" />\n<PARAM name=\"GAMMA\" value=\"2.2\" />\n</OBJECT>\n<OBJECT height=\"4850\" width=\"3079\">\n<PARAM name=\"DPI\" value=\"550\" />\n<PARAM name=\"GAMMA\" value=\"2.2\" />\n</OBJECT>\n</BODY>\n</DjVuXML>\n<DjVuTxt>\n<HEAD></HEAD>\n<BODY>\n<PAGE value=\"17de Mai 1872 &#10;\" />\n<PAGE value=\"&quot;^tfa vi elsker dette Landet, &#10;* Som det stiger frem &#10;Furet, veirbidt over Våndet &#10;Med de tusen Hjem, &#10;Elsker, elsker det og tænker &#10;Paa vor Far og Mor &#10;Og den Saganat, som sænker &#10;Drømmer paa vor Jord. &#10;Dette Land har Harald bjerget &#10;Med sin Kjæmperad, &#10;Dette Land har Haakon værget, &#10;Medens Øjvind kvad; &#10;Paa det Land har Olav malet &#10;Korset med sit Blod, &#10;Fra dets Høie Sverre talet &#10;Koma midt imod. &#10;Bønder sine Øxer brynte, &#10;Hvor en Hær drog frem; &#10;Tordenskjold langs Kysten lynte, &#10;Saa den lystes hjem. &#10;Kvinder selv stod op og strede, &#10;Som de vare Mænd; &#10;Andre kunde bare græde, &#10;Men det kom igjen! &#10;\" />\n<PAGE value=\"Haarde Tider har vi døiet, &#10;Blev tilsidst forstødt; &#10;Men i værste Nød blaaøiet &#10;Frihed blev os født. &#10;Det gav Faderkraft at bære &#10;Hungersnød og Krig, &#10;Det gav Døden selv sin ære — &#10;Og det gav Forlig! &#10;Fienden sit Vaaben kasted, &#10;Op Visiret foer, &#10;Vi med Undren mod ham hasted; &#10;Thi han var vor Bror. &#10;Drevne frem paa Stand af Skammen &#10;Gik vi søder paa; &#10;Nu vi staar tre Brødre sammen &#10;Og skal saadan staa! &#10;Norske Mand i Hus og Hytte, &#10;Tak din store Gud! &#10;Landet vilde han beskytte, &#10;Skjønt det mørkt saa ud. &#10;Alt, hvad Fædrene har kjæmpet, &#10;Mødrene har grædt, &#10;Har den Herre stille læmpet, &#10;Saa vi vandt vor Ret! &#10;Ja, vi elsker dette Landet, &#10;Som elet stiger frem &#10;Furet, veirbidt over Våndet &#10;Mecl de tusen Hjem. &#10;Og som Fædres Kamp har hævet &#10;Det af Nød til Seir, &#10;Ogsaa vi, nåar det blir krævet, &#10;For dets Fred slaar Leir! &#10;Bjørnstjerne Bjørnson. &#10;\" />\n<PAGE value=\"4 &#10;I KRISTIANIA. H . B.Larsene Bogtrykkeri. &#10;/ &#10;\" />\n</BODY>\n</DjVuTxt>\n</mw-djvu>"
}

I do not see that in the page export.
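
For illustration only, here is a minimal sketch of how the XML value above could be parsed on the reader side, assuming the DjVuXML/DjVuTxt structure shown in this example (this is not the canonical MediaWiki code):

import xml.etree.ElementTree as ET

def parse_djvu_metadata(xml_value):
    # xml_value is the string from the "value" field above, already JSON-decoded.
    root = ET.fromstring(xml_value)  # root element is <mw-djvu>
    # Page geometry: one <OBJECT> per page inside <DjVuXML>, with <PARAM> children.
    pages = []
    for obj in root.iter("OBJECT"):
        params = {p.get("name"): p.get("value") for p in obj.iter("PARAM")}
        pages.append({"width": obj.get("width"),
                      "height": obj.get("height"),
                      "dpi": params.get("DPI")})
    # Extracted text: one <PAGE value="..."> per page inside <DjVuTxt>.
    text = [page.get("value") for page in root.iter("PAGE")]
    return pages, text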

Another example: https://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=metadata&titles=File:2017_week_30_Daily_Weather_Map_summary_NOAA.pdf&format=json

The important metadata I am interested in is the page count, which is available there. Another useful piece of metadata that is often available is the plain text extracted from documents.
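
To make that concrete, a minimal sketch of fetching this metadata per file through the imageinfo API (the same query as the URLs above); it assumes the response shape shown in the DjVu example, where metadata is a list of name/value entries, and uses the third-party requests package:

import requests

API = "https://commons.wikimedia.org/w/api.php"

def fetch_metadata(title):
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "metadata",
        "titles": title,
        "format": "json",
    }
    data = requests.get(API, params=params, timeout=30).json()
    for page in data["query"]["pages"].values():
        for info in page.get("imageinfo", []):
            # Each entry is a {"name": ..., "value": ...} pair, as in the
            # DjVu example above; for PDFs the page count is expected to be
            # among these entries.
            for entry in info.get("metadata") or []:
                print(entry["name"], "=>", str(entry["value"])[:80])

fetch_metadata("File:2017_week_30_Daily_Weather_Map_summary_NOAA.pdf")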

That was never part of the page export. You can access it through the API, but there is no dump. One option would be to make a dump of the full imageinfo API output; that would be nice.

Got it. I'll have a look at the live database and see what exactly lets me get to the particular external store entry that has the info formerly contained in the image table. Adding other information previously not dumped would be out of the scope of this task.

It might be as simple as dumping content from all slots, but we won't be ready to do that until more parallelization of the existing dumps is enabled (the code is ready but needs much more thorough testing).

So in the metadata column of the image table there is now instead a value like tt:12345, where 12345 is the ID of the blob in external storage. Generally every file has two such blobs: one for data (the metadata) and one for text (the text extracted from the document).

A dump could be a tar.bz2 archive of files, where each filename would correspond to this ID and the contents would be the blob contents.
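
Something like the following rough sketch, where fetch_blob() is a hypothetical stand-in for whatever would actually read a blob out of External Storage (it is not an existing MediaWiki API):

import io
import tarfile

def write_blob_dump(blob_ids, fetch_blob, out_path="metadata-blobs.tar.bz2"):
    # One archive member per blob, named after its ID, containing the raw blob bytes.
    with tarfile.open(out_path, "w:bz2") as archive:
        for blob_id in blob_ids:
            payload = fetch_blob(blob_id)  # hypothetical: returns the blob as bytes
            member = tarfile.TarInfo(name=str(blob_id))
            member.size = len(payload)
            archive.addfile(member, io.BytesIO(payload))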

When I inspect the rows for those two entries in the image table in production, I see the metadata in the image table itself. Hrm Hrm.

Example:

wikiadmin@10.64.16.175(commonswiki)> select * from image where img_name = '2017_week_30_Daily_Weather_Map_summary_NOAA.pdf';

...
| img_name                                        | img_size | img_width | img_height | img_metadata                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     | img_bits | img_media_type | img_major_mime | img_minor_mime | img_description_id | img_actor | img_timestamp  | img_sha1                        |
...
| 2017_week_30_Daily_Weather_Map_summary_NOAA.pdf | 20031833 |      1275 |       1650 | {"data":{"Title":"DWM.fm","Author":"hpc.forecaster","Creator":"FrameMaker 11.0.2","Producer":"Acrobat Distiller 11.0 (Windows)","CreationDate":"Wed Sep 21 12:50:44 2016 UTC","ModDate":"Wed Aug  2 19:32:51 2017 UTC","Tagged":"no","UserProperties":"no","Suspects":"no","Form":"none","JavaScript":"yes","Pages":"16","Encrypted":"no","pages":{"1":{"Page size":"612 x 792 pts (letter)","Page rot":"0"},"2":{"Page size":"612 x 792 pts (letter)","Page rot":"0"},"3":{"Page size":"612 x 792 pts (letter)","Page rot":"0"},"4":{"Page size":"612 x 792 pts (letter)","Page rot":"0"},"5":{"Page size":"612 x 792 pts (letter)","Page rot":"0"},"6":{"Page size":"612 x 792 pts (letter)","Page rot":"0"},"7":{"Page size":"612 x 792 pts (letter)","Page rot":"0"},"8":{"Page size":"612 x 792 pts (letter)","Page rot":"0"},"9":{"Page size":"612 x 792 pts (letter)","Page rot":"0"},"10":{"Page size":"612 x 792 pts (letter)","Page rot":"0"},"11":{"Page size":"612 x 792 pts (letter)","Page rot":"0"},"12":{"Page size":"612 x 792 pts (letter)","Page rot":"0"},"13":{"Page size":"612 x 792 pts (letter)","Page rot":"0"},"14":{"Page size":"612 x 792 pts (letter)","Page rot":"0"},"15":{"Page size":"612 x 792 pts (letter)","Page rot":"0"},"16":{"Page size":"612 x 792 pts (letter)","Page rot":"0"}},"File size":"20031833 bytes","Optimized":"yes","PDF version":"1.6","mergedMetadata":{"ObjectName":{"x-default":"DWM.fm","_type":"lang"},"Artist":{"0":"hpc.forecaster","_type":"ol"},"DateTimeDigitized":"2016:09:21 12:50:44","Software":"FrameMaker 11.0.2","DateTime":"2017:08:02 11:32:51","DateTimeMetadata":"2017:08:02 11:32:51","pdf-Producer":"Acrobat Distiller 11.0 (Windows)","pdf-Encrypted":"no","pdf-PageSize":["612 x 792 pts (letter)"],"pdf-Version":"1.6"}},"blobs":[]} |        0 | OFFICE         | application    | pdf            |          229762306 |    838823 | 20220207102118 | r9cdycscyscr2fzwvf5ylvbd9o33c2e |

It only goes to ES if it's bigger than a certain limit; see the patches in T275268.

It only goes to ES if it's bigger than a certain limit; see the patches in T275268.

Ah ha! And one more question: is there any sort of MCR-ish way to get to these blobs in external store, or is the only reference the one in the image table for those specific rows?

I don't know off the top of my head, but I can look later. In the meantime, take a look at the code behind the API (with iiprop=metadata). It looks this up through file metadata functions that automatically handle the ES lookup.

See LocalFile::getMetadataArray

It looks like there are about 2.7 million such items on Commons, and for each one of those we are talking about a separate DB query to get the metadata from the external store.

We should probably figure out a better process for handling schema changes of this sort in the future, so that changes in how we dump things can be part of the work and planned for at the time.

I doubt I can get to this any time soon, unfortunately. And I realize that means that data once dumped is now no longer available, which is a regression as far as dumped data goes. We can't even feasibly tell people to download the original pdfs/djvus and generate the metadata themselves. The one silver lining in this is that at least that metadata is not required for import/setting up a mirror. But for a dumps user it's still a loss.

We should probably figure out a better process for handling schema changes of this sort in the future, so that changes in how we dump things can be part of the work and planned for at the time.

Well, it was announced back when the work started; I don't know what to do when people don't read wikitech-l announcements: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/CZOLC5IQWEHDH45DJILNQJBMML4VP65A/

I doubt I can get to this any time soon, unfortunately. And I realize that means that data once dumped is now no longer available, which is a regression as far as dumped data goes. We can't even feasibly tell people to download the original pdfs/djvus and generate the metadata themselves. The one silver lining in this is that at least that metadata is not required for import/setting up a mirror. But for a dumps user it's still a loss.

While I understand it's a loss, I need to explain that this was absolutely necessary and a ticking time bomb. It increased the time to take a backup of Commons from three hours to fifteen hours, basically made any ALTER TABLE on the image table in Commons impossible, and was a big factor in the Commons primary going read-only for hours (I can dig up the incident); it was mentioned as the biggest database problem back then.

The data is still accessible through the API, and this only affects the text part of most (not all) PDF and DjVu files, not other metadata information or other file types.

We should probably figure out a better process for handling schema changes of this sort in the future, so that changes in how we dump things can be part of the work and planned for at the time.

Well, it was announced back when the work started; I don't know what to do when people don't read wikitech-l announcements: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/CZOLC5IQWEHDH45DJILNQJBMML4VP65A/

Oh sure, and I knew about it and of course supported the change, but it didn't occur to me that we needed to account for dumping the data once it was moved, nor (it seems) to anyone else; at least, no plans were made for it at the time by anyone. That's what I mean: we should make sure we do that in future schema changes.

It looks like there are about 2.7 million such items on Commons, and for each one of those we are talking about a separate DB query to get the metadata from the external store.

The data is still accessible through the API, and this only affects the text part of most (not all) PDF and DjVu files, not other metadata information or other file types.

Could we maybe then limit the first iteration of this blob dump to just the metadata (data) blobs, not the text blobs? There should be far fewer of those.

It still needs a special maintenance script to do the work, and that's where I have time constraints; I just can't get to it. I'm truly sorry.

No worries. For now I have made it work by doing API requests.

It would be great, though, if my bot could get the apihighlimits flag so that I can make more API requests in parallel. I made a bot request on Commons some time ago, but then I thought that dumps would be enough, so I did not pursue it further. I think I would like to request apihighlimits again now, as a workaround for this issue.
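
For completeness, a rough sketch of the batching that apihighlimits helps with: the query API accepts multiple titles joined with "|" per request (normally up to 50, more with the apihighlimits right; exact limits depend on configuration), so the per-file requests can be grouped like this (continuation handling omitted):

import requests

API = "https://commons.wikimedia.org/w/api.php"

def fetch_metadata_batch(titles, batch_size=50):
    results = {}
    for i in range(0, len(titles), batch_size):
        chunk = titles[i:i + batch_size]
        params = {
            "action": "query",
            "prop": "imageinfo",
            "iiprop": "metadata",
            "titles": "|".join(chunk),
            "format": "json",
        }
        data = requests.get(API, params=params, timeout=30).json()
        for page in data["query"]["pages"].values():
            info = page.get("imageinfo", [])
            results[page["title"]] = info[0].get("metadata") if info else None
    return results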