
img_metadata queries for PDF files saturate s4 slaves
Closed, Resolved, Public

Description

We are seeing a recurrence of T96360, this time for PDF files instead of DjVu.
Most of the queries request the same files, which have ~13 MB of img_metadata. Some of the requests come from search-engine bots, but not all of them.
This morning (UTC), for example, there were ~12k queries on db1081 requesting those fields, which saturated the 1 Gb link.
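The reported numbers can be sanity-checked with some back-of-the-envelope arithmetic (a sketch; the 13 MB payload and 12k query count are taken from the report above, everything else is simple unit conversion):

```python
# Rough saturation math for the figures in this report.
payload_mb = 13          # ~13 MB of img_metadata per query
queries = 12_000         # ~12k queries that morning
link_mb_per_s = 1000 / 8  # a 1 Gb/s link moves at most 125 MB/s

total_mb = payload_mb * queries           # ~156,000 MB (~156 GB) shipped
qps_to_saturate = link_mb_per_s / payload_mb  # ~9.6 such queries/s fill the link
seconds_at_line_rate = total_mb / link_mb_per_s  # ~21 minutes of pure saturation

print(total_mb, round(qps_to_saturate, 1), round(seconds_at_line_rate))
```

In other words, fewer than ten of these queries per second are enough to saturate the link, which is consistent with the traffic graph below.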

db1081 traffic graph:

db1081.png (436×828 px, 163 KB)

Query sample:

SELECT /* LocalFile::loadExtraFromDB xxx.xxx.xxx.xxx */ img_metadata FROM `image` WHERE img_name = 'Catalog_of_Copyright_Entries_1977_Books_and_Pamphlets_Jan-June.pdf' AND img_timestamp = '20160426090826' LIMIT 1

The content of img_metadata is a PHP serialized array.
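For illustration, here is a minimal sketch of what PHP's serialize() produces for a flat array of string keys and integer values (the real img_metadata blobs nest much deeper and embed the extracted page text, which is what makes them multi-megabyte; this sketch also assumes ASCII keys, since PHP counts bytes, not characters):

```python
def php_serialize_flat(d):
    """Serialize a flat dict of str -> int the way PHP's serialize() would."""
    parts = []
    for key, value in d.items():
        # PHP encodes a string as s:<byte length>:"<bytes>"; and an int as i:<n>;
        parts.append(f's:{len(key)}:"{key}";i:{value};')
    # An array is a:<count>:{<key><value>...}
    return f"a:{len(d)}:{{{''.join(parts)}}}"

print(php_serialize_flat({"width": 800, "height": 600}))
# → a:2:{s:5:"width";i:800;s:6:"height";i:600;}
```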

Event Timeline

Change 314229 had a related patch set uploaded (by Aaron Schulz):
Add page dimension caching and avoid metadata tree loading use in doTransform()

https://gerrit.wikimedia.org/r/314229
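The general idea of the patch can be sketched as follows (a hypothetical illustration, not the actual MediaWiki code; all names here are invented). doTransform() only needs a page's width and height, so those are cached separately, and the full multi-megabyte metadata blob is only deserialized on a cache miss:

```python
# Hypothetical sketch of per-page dimension caching.
_dim_cache = {}

def get_page_dimensions(file_name, page, load_full_metadata):
    """Return (width, height) for one page, touching the big blob at most once."""
    key = (file_name, page)
    if key not in _dim_cache:
        # Expensive path: pulls the whole serialized metadata blob from the DB.
        meta = load_full_metadata(file_name)
        # Keep only the small piece a transform actually needs.
        _dim_cache[key] = meta["pages"][page]
    return _dim_cache[key]
```

Subsequent transforms of the same page then hit the small cache instead of re-fetching the ~13 MB img_metadata row.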

Thanks @aaron - once it is pushed I will keep an eye on the graphs to see if this mitigates the spikes.

I strongly recommend investing in T32906 -- storing the text blobs and the like for DjVu and PDF in a structured way instead of in the metadata blob.

The workaround patch should do for now though; I've +2'd it.

Change 314229 merged by jenkins-bot:
Add page dimension caching and avoid metadata tree loading use in doTransform()

https://gerrit.wikimedia.org/r/314229

jcrespo added a subscriber: jcrespo.

As this is being worked on at the MediaWiki level, I am going to move us to mere observers.

I don't recall seeing this issue on s4 since https://gerrit.wikimedia.org/r/314229 landed. Is this still an issue, or can we close it in favor of T32906?

aaron claimed this task.

Makes sense.