PDFs and DjVu files without text content pollute metadata with empty entries per page
Closed, Resolved · Public · BUG REPORT

Description

Steps to replicate the issue (include links if applicable):

API response: https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=imageinfo&titles=File%3ABulletin%20de%20la%20société%20des%20bibliophiles%20bretons%20et%20de%20l'histoire%20de%20Bretagne%20(8e%20année)%2C%201885.pdf&formatversion=2&iiprop=timestamp%7Cuser%7Cmetadata%7Cdimensions%7Cmediatype%7Csize

What happens?:

The "text" entry in the metadata contains one empty value for every page of the file:

{
      "name": "text",
      "value": [
          {
              "name": 0,
              "value": ""
          },
          {
              "name": 1,
              "value": ""
          },
          {
              "name": 2,
              "value": ""
          },
          {
              "name": 3,
              "value": ""
          },

What should have happened instead?:
This is very wasteful. We should avoid generating these entries in the metadata, and return an empty string only when the text of a page is actually requested and that page has none.
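
The proposed behaviour could be sketched roughly as follows. This is only an illustration in Python, not the code from the actual patches; the function names `compact_page_texts` and `get_page_text` are hypothetical, and it assumes page text is held as a list indexed by 0-based page number, as in the API output above.

```python
from typing import Dict, List


def compact_page_texts(page_texts: List[str]) -> Dict[int, str]:
    """Keep only pages that have text, keyed by 0-based page index.

    Hypothetical helper: pages whose text is the empty string are
    dropped instead of being stored as empty entries.
    """
    return {index: text for index, text in enumerate(page_texts) if text}


def get_page_text(compact: Dict[int, str], page_index: int, page_count: int) -> str:
    """Return the stored text for a page, or "" if the page has none.

    Hypothetical helper: the empty string is produced only at lookup
    time, so it never has to be stored in the metadata.
    """
    if not 0 <= page_index < page_count:
        raise IndexError(f"page {page_index} out of range (0..{page_count - 1})")
    return compact.get(page_index, "")


if __name__ == "__main__":
    # A four-page file where only the third page has a text layer.
    stored = compact_page_texts(["", "", "Some text", ""])
    print(stored)
    print(repr(get_page_text(stored, 0, 4)))
```

With this shape, a scanned file with no text layer at all stores an empty mapping rather than one empty entry per page.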

Software version (on Special:Version page; skip for WMF-hosted wikis like Wikipedia):

Other information (browser name/version, screenshots, etc.):

Event Timeline

Change #1276420 had a related patch set uploaded (by TheDJ; author: TheDJ):

[mediawiki/extensions/PdfHandler@master] Avoid unneeded metadata entries for pages without text

https://gerrit.wikimedia.org/r/1276420

Change #1276422 had a related patch set uploaded (by TheDJ; author: TheDJ):

[mediawiki/core@master] Avoid unneeded metadata entries for pages without text

https://gerrit.wikimedia.org/r/1276422

Change #1276422 merged by jenkins-bot:

[mediawiki/core@master] Avoid unneeded metadata entries for pages without text

https://gerrit.wikimedia.org/r/1276422

Change #1276420 merged by jenkins-bot:

[mediawiki/extensions/PdfHandler@master] Avoid unneeded metadata entries for pages without text

https://gerrit.wikimedia.org/r/1276420

TheDJ claimed this task.