
Avoid capacity issues from the image table holding the text of pdf/djvu files as part of their metadata
Open, Needs Triage, Public

Description

The image table is getting really big (T222224#6738823), and it turns out that ~90% of the space taken by the image table is the text tag in the metadata of pdf/djvu files (T28741#6750249). This needs addressing as it's causing infrastructure issues.

One way to do it would be to compress it; another would be to push it to other storage, such as another table; a third would be to delete it and read it directly from the file when needed. See T28741#6750249 onwards for more discussion.

Event Timeline

Jdforrester-WMF renamed this task from "image table holds text of pdf/djvu files as part of metadata" to "Avoid capacity issues from the image table holding the text of pdf/djvu files as part of their metadata". Feb 20 2021, 5:52 PM
Jdforrester-WMF added a subscriber: Jdforrester-WMF.

Just to comment that compressing the table would give some benefit, but in the long run, it won't give us much.
I think the whole approach of the image table needs to be refactored, as having a single table isn't something we can support for much longer, for the reasons already laid out in the tickets you mentioned.

Yes agreed. To quote what I said there:

My ideal solution would be to first start compressing the values (or any sort of quick/easy fix), then do the alter tables (adding a PK, renaming the table, etc.), and once that's done, find a long-term solution for pdf/djvu, and at last the MCR work.

(I don't think we should compress the table altogether; we should just compress the text values, similar to the ES tables.)
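
For illustration, a minimal sketch of what compressing only the text values could look like, assuming a hypothetical pair of helpers around gzdeflate/gzinflate and a 'text' metadata key; none of this is existing MediaWiki code:

```
<?php
// Minimal sketch only: compress the large text value inside the pdf/djvu
// metadata array before it is serialized into img_metadata, using the same
// gzdeflate/gzinflate primitives the ES tables rely on. The 'text' key and
// the helper names are illustrative, not existing MediaWiki code.

function compressTextValue( array $metadata ): array {
	if ( isset( $metadata['text'] ) && strlen( $metadata['text'] ) > 1024 ) {
		$metadata['text'] = gzdeflate( $metadata['text'] );
		$metadata['text_compression'] = 'deflate'; // marker telling readers to inflate
	}
	return $metadata;
}

function expandTextValue( array $metadata ): array {
	if ( ( $metadata['text_compression'] ?? null ) === 'deflate' ) {
		$metadata['text'] = gzinflate( $metadata['text'] );
		unset( $metadata['text_compression'] );
	}
	return $metadata;
}
```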

Just to comment that compressing the table would give some benefit, but in the long run, it won't give us much.
I think the whole approach of the image table needs to be refactored, as having a single table isn't something we can support for much longer, for the reasons already laid out in the tickets you mentioned.

I'd like to better understand in what way a single table is problematic. We have much taller tables, right?
I mean, the image table could be made much narrower by normalizing (the low-hanging fruit being pdf/djvu metadata). Would that solve the problem? If not, why not? What else would be needed?

I agree that the image table as it is is far from ideal. But I'd like to better understand what we are optimizing for and what problem we are solving before we dive into the refactor.

Since this ticket is about pdf/djvu metadata, which is uncontroversial and straightforward, this discussion should probably live somewhere else, perhaps on T28741.

Just to comment that compressing the table would give some benefit, but in the long run, it won't give us much.
I think the whole approach of the image table needs to be refactored, as having a single table isn't something we can support for much longer, for the reasons already laid out in the tickets you mentioned.

I'd like to better understand in what way a single table is problematic. We have much taller tables, right?

Apart from what Jaime commented regarding backup times, having single massive tables is problematic for a number of reasons:

  • Operating on huge tables is almost (or simply) impossible for many reasons, especially with ALTER statements, because you normally need twice the disk space to alter them. We saw that with the revision table before MCR: it was simply impossible to alter it.
  • With big massive tables, the optimizer can become unpredictable and choose weird (and costly) query plans all of a sudden; we have multiple examples at: https://phabricator.wikimedia.org/project/view/4313/
    • Unexpected full scans can take ages
  • They cannot fit in memory, and hence they need to touch disk more often. The more we can fit in memory (innodb_buffer_pool_size), the better for performance.
  • If we were to shard horizontally (i.e. multiple tables based on user_id MOD 100), that cannot be done with a single table (see the sketch after this list).
  • Previously stated comments about generating backups and recovery times.
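
(Purely to illustrate the sharding point above: the physical table name would have to be derived from the key, which is exactly what a single monolithic table cannot accommodate. The function below is a hypothetical sketch, not anything that exists in MediaWiki.)

```
<?php
// Hypothetical sketch of the "user_id MOD 100" sharding example.
function getShardTableName( string $baseTable, int $key ): string {
	return sprintf( '%s_%02d', $baseTable, $key % 100 ); // e.g. image_00 .. image_99
}

// Every read and write would then have to go through the router:
$table = getShardTableName( 'image', 4217 ); // "image_17"
```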

Let's discuss further steps on T28741, and not hijack this task :)

Let's discuss further steps on T28741, and not hijack this task :)

I commented there, see T28741#6848127.

This is apparently a duplicate of T32906.

Has anyone got any thoughts on the internal API? Ideally PdfHandler should not have to know about BlobStore. People who write MediaHandler subclasses should be media experts, not DB experts.

Option 1

One solution would be to make FileRepo aware of the structure of metadata arrays and to use a size threshold to put individual metadata items into ES as necessary. Currently, MediaHandler::getMetadata() is supposed to return a PHP-serialized string, which is an odd convention: originally it was an opaque string, but PHP serialization became the recommended format starting from about 2013. We could deprecate MediaHandler::getMetadata() and instead have MediaHandler::getMetadataArray(), which would return an array of JSON-serializable values from a file on disk. Then FileRepo would decide where to put each of the values and how to serialize them.

File::getMetadata() would be deprecated in favour of File::getMetadataItem(), which would retrieve and unserialize a specified metadata item. For performance, there could also be a batch retrieval function.

Considering that the use case is OCR digitisation of entire books, and that ProofreadPage only wants the text for a given page, and that after T271493 CirrusSearch only wants a 50KB extract, it might be reasonable to have one item per page rather than putting the whole text layer in a single item. That way, ProofreadPage only needs to load the text it really wants, rather than loading the whole book and throwing away the other pages.
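
A rough sketch of how the pieces of Option 1 could fit together; the method names follow the proposal above but do not exist yet, and the per-page keys are only an illustration:

```
<?php
// Sketch of the Option 1 interface (proposed, not existing code).
// The handler returns a plain, JSON-serializable array read from the file on disk:
class SketchPdfHandler /* would extend MediaHandler */ {
	public function getMetadataArray( string $path ): array {
		// Real code would shell out to pdftotext/pdfinfo; inlined for brevity.
		return [
			'pageCount' => 2,
			'text_page_1' => '...text layer of page 1...',
			'text_page_2' => '...text layer of page 2...',
		];
	}
}

// FileRepo (not the handler) then decides where each value lives: small items
// stay in img_metadata, items over a size threshold go to ES via BlobStore,
// and only a blob address is kept in the row.
//
// A caller such as ProofreadPage asks for exactly the item it needs, e.g.:
//   $text = $file->getMetadataItem( 'text_page_' . $pageNumber );
```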

Option 2

A narrower solution is to add MediaHandler::getTextLayerFromPath(). The text layer would be a separate key in the array returned by MWFileProps, and LocalFile::recordUpload3() would be responsible for deciding whether to write it to ES. LocalFile::maybeUpgradeRow() could split up existing metadata. The many calls to File::getMetadata() in the MediaHandler subclasses would remain as is, except for the ones from getPageText() which would be replaced by a call to File::getTextLayer().

One difficulty with this is confusion between the MediaHandler methods that retrieve information from a file on disk (getMetadata() and getImageSize()) and those that retrieve information from a supplied File object.
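
A correspondingly narrow sketch of Option 2; again these are proposed rather than existing methods, and the threshold and key names are made up:

```
<?php
// Sketch of Option 2 (proposed, not existing code).
// The handler only knows how to extract the text layer from a file on disk:
class SketchDjVuHandler /* would extend MediaHandler */ {
	public function getTextLayerFromPath( string $path ): ?string {
		// Real code would shell out to djvutxt or similar; elided here.
		return '...full text layer...';
	}
}

// MWFileProps would carry the text layer under its own key, and
// LocalFile::recordUpload3() would decide whether it is large enough to be
// written to ES instead of img_metadata, roughly:
//
//   if ( strlen( $props['textLayer'] ?? '' ) > $threshold ) {
//       $address = $blobStore->storeBlob( $props['textLayer'] );
//       // keep only $address in the image row
//   }
```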

Option 3

The narrowest of all solutions is to just have PdfHandler write to BlobStore directly, and to read from BlobStore in PdfHandler::getPageText(). The page text in the metadata array would be replaced by a blob reference. But note that we don't actually know when MediaHandler::getMetadata() is called whether the upload will be successful. We would be taking unverified temporary files uploaded by users, and putting their text layers into permanent archival storage, with no way to track down unused text short of scanning the whole of ES for orphans. And as I said previously, it lacks separation of concerns.
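
For comparison, the metadata stored under Option 3 would end up looking roughly like this (illustrative only; the key name and the blob address format are assumptions):

```
<?php
// Illustration of Option 3 (not implemented): PdfHandler writes the text layer
// to BlobStore itself and keeps only a blob reference in the metadata array.
$metadata = [
	'pageCount' => 120,
	// instead of the full text layer, a reference such as:
	'text_blob' => 'tt:123456', // hypothetical address returned by BlobStore::storeBlob()
];
// PdfHandler::getPageText() would then have to resolve that reference back
// through BlobStore, and nothing cleans the blob up if the upload never completes.
```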

Regarding ApiQueryImageInfo and ForeignAPIRepo:

  • ApiQueryImageInfo already depends on the metadata being JSON-serializable, since it unserializes the return value of File::getMetadata() and then adds it to the structured output.
  • In option 1, there is a backwards-compatible concept of a metadata array, so the API would work as before. Some metadata formats would change.
  • In option 2, the text layer would become a separate iiprop, and clients would have to opt in to receive it. If desired, ForeignAPIFile could retrieve the text layer from the remote API on demand in a separate request (see the example request after this list).
  • In option 3, the concept of metadata serialization and transfer is broken; it's hard to see how this would work.
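
As a concrete illustration of the Option 2 opt-in, a client wanting the text layer would have to request it explicitly, along the lines of api.php?action=query&prop=imageinfo&titles=File:Example.pdf&iiprop=metadata|textlayer (the "textlayer" prop name is purely hypothetical here; only the opt-in shape is the point).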

Change 687522 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/core@master] filerepo: Store and retrieve file metadata from blob store if it's large

https://gerrit.wikimedia.org/r/687522

I really like the first option and made a patch that does the first step. I don't think this ticket is a duplicate of T32906, but I think it's part of it. Basically, the patch I just made (the first part of option one) addresses this ticket but not the ticket you mentioned; we can slowly move towards that as part of the bigger picture.

Funnily enough, the patch is extremely small, and I think we can merge it and start cleaning up Commons soon. I hope.

Change 687544 had a related patch set uploaded (by Ladsgroup; author: Ladsgroup):

[mediawiki/core@master] Add maintenance/rebuildFileMetadata.php

https://gerrit.wikimedia.org/r/687544

Just to note, treating the OCR text layer as metadata is conceptually a bit awkward: it is a separate representation of the file, one that happens to be automatically generated. It's more akin to a MIME email message that contains both a text/html representation and a text/plain representation. Still speaking conceptually, metadata about the text layer would be things like "Does this file have a text layer?", "What format/text encoding does the text layer use?", "Is the text layer for this file in a structured format?", and "What is the size in bytes of the text layer for this file?".

As a contributor on Wikisource (one who spends a lot of time mucking around inside DjVu files and their text layers), I have always thought of the text layers as a separate slot of the same page, and as an obvious candidate for something like Multi-Content Revisions for storage.

On Wikisource, the text layer embedded in the DjVu/PDF file is a starting point to save time transcribing from scratch; but once used, it is replaced with a wikitext representation of that page's text. Before the text layer is used, it is sometimes useful to update it (because the original was broken, or because it is old and we have better OCR engines now, or...), which is currently done by regenerating the file manually and reuploading, or sometimes by generating OCR on the fly and discarding the text from the file. After the wikipage has been created, there are various use cases for either writing back the corrected text or generating new representations from it (for example, one could imagine writing back wikitext to that slot so the work Wikisource does can be reused by other projects by way of the Commons integration, or exposing it to external consumers like web search engines, with full structured metadata through SDC).

And going that route would also make the solution easier to generalise, such that JPEG/PNG/TIFF/etc. images of text (of which we have a boatload) could have a text layer, created after upload either ad hoc or in a batch. This would also lay important groundwork (necessary but not sufficient) for being able to use scan images directly as the source data for Wikisource, instead of having to go by way of a PDF or DjVu container (usually generated from the scan images, with generational loss and deliberate cropping and compression).

There is opportunity here to take a nice solid step forward while solving the pressing "img_metadata is melting the DB" issue, is what I'm saying.

What you're saying is basically T96384#5234876 and should be done, but that's much further down the road. For example, we can't do ALTER TABLE on the image table anymore, meaning more work on that table is blocked until it reaches a good size first. We can revisit once we are there.

Change 687522 abandoned by Ladsgroup:

[mediawiki/core@master] filerepo: Store and retrieve file metadata from blob store if it's large

Reason:

https://gerrit.wikimedia.org/r/687522

Change 687544 abandoned by Ladsgroup:

[mediawiki/core@master] Add maintenance/rebuildFileMetadata.php

Reason:

https://gerrit.wikimedia.org/r/687544

Change 693298 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] [WIP] Use the unserialized form of image metadata internally

https://gerrit.wikimedia.org/r/693298

Change 697935 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] [WIP] Optionally split out parts of file metadata to BlobStore

https://gerrit.wikimedia.org/r/697935

Change 693298 merged by jenkins-bot:

[mediawiki/core@master] Use the unserialized form of image metadata internally

https://gerrit.wikimedia.org/r/693298