
img_metadata queries for PDF files saturate s4 slaves
Closed, Resolved · Public

Description

We are seeing a recurrence of T96360, this time for PDF files instead of DjVu.
Most of the queries request the same files, which have ~13 MB of img_metadata. Some of the requests come from search engine bots, but not all of them.
This morning (UTC), for example, there were ~12k queries on db1081 requesting this column, and they saturated the 1 Gbit/s link. At ~13 MB per row, roughly ten such queries per second are enough to fill a 1 Gbit/s (~125 MB/s) link.

db1081 traffic graph:

Query sample:

SELECT /* LocalFile::loadExtraFromDB xxx.xxx.xxx.xxx */ img_metadata FROM `image` WHERE img_name = 'Catalog_of_Copyright_Entries_1977_Books_and_Pamphlets_Jan-June.pdf' AND img_timestamp = '20160426090826' LIMIT 1

The content of img_metadata is a PHP serialized array.
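For reference, the blob can be inspected by unserializing it. A minimal sketch below, assuming $dbr is a replica IDatabase handle; the top-level key names vary by media handler, but for PDFs the array bundles per-page dimensions together with the extracted text of every page, which is what makes these rows so large.

<?php
// Minimal sketch: fetch one img_metadata blob and inspect it.
// Assumes $dbr is a replica IDatabase handle.
$blob = $dbr->selectField(
	'image',
	'img_metadata',
	[ 'img_name' => 'Catalog_of_Copyright_Entries_1977_Books_and_Pamphlets_Jan-June.pdf' ],
	__METHOD__
);
$meta = unserialize( $blob );
printf(
	"blob size: %.1f MB, top-level keys: %s\n",
	strlen( $blob ) / 1e6,
	implode( ', ', array_keys( $meta ) )
);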

Event Timeline

Volans created this task. · Oct 4 2016, 10:12 AM
Restricted Application added a subscriber: Aklapper. · Oct 4 2016, 10:12 AM

Change 314229 had a related patch set uploaded (by Aaron Schulz):
Add page dimension caching and avoid metadata tree loading use in doTransform()

https://gerrit.wikimedia.org/r/314229
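
For context, the idea is to cache just the per-page dimensions so that doTransform() does not need to load and unserialize the whole metadata blob for every thumbnail request. A minimal sketch of that approach, assuming a WANObjectCache instance ($cache), a File object ($file) and its MediaHandler ($handler); the cache key and TTL are illustrative assumptions, not the actual code of change 314229:

<?php
// Illustrative sketch only, not the actual patch.
// Cache per-page width/height keyed by file SHA-1 and page number.
$dims = $cache->getWithSetCallback(
	$cache->makeKey( 'file-page-dims', $file->getSha1(), $page ),
	$cache::TTL_MONTH,
	function () use ( $handler, $file, $page ) {
		// Only a cache miss falls back to the full metadata load.
		return $handler->getPageDimensions( $file, $page );
	}
);
// doTransform() can then size the thumbnail from $dims['width'] and
// $dims['height'] without pulling the multi-megabyte img_metadata blob.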

Thanks @aaron - once it is pushed I will keep an eye on the graphs to see if this mitigates the spikes

brion added a subscriber: brion. · Oct 7 2016, 10:31 PM

I strongly recommend investing in T32906 -- storing the text blobs and such for DjVu and PDF in a structured way instead of in the metadata blob.

The workaround patch should do for now though; I've +2'd it.
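
For illustration, a structured layout along the lines of T32906 could split the per-page text out of img_metadata into its own rows, so a request only reads the pages it needs. A minimal sketch; the table name file_page_text, its columns and the 'text' metadata key are hypothetical, not the schema proposed in T32906:

<?php
// Hypothetical per-page storage instead of one ~13 MB serialized blob.
// Assumes $dbw is an IDatabase handle, $imgName the file name and
// $blob the existing img_metadata value.
$meta = unserialize( $blob );
$rows = [];
foreach ( ( $meta['text'] ?? [] ) as $pageNum => $pageText ) {
	$rows[] = [
		'fpt_img_name' => $imgName,
		'fpt_page' => $pageNum,
		'fpt_text' => $pageText,
	];
}
$dbw->insert( 'file_page_text', $rows, __METHOD__ );

// A consumer then fetches only the page it needs:
$pageText = $dbw->selectField(
	'file_page_text',
	'fpt_text',
	[ 'fpt_img_name' => $imgName, 'fpt_page' => 7 ],
	__METHOD__
);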

Change 314229 merged by jenkins-bot:
Add page dimension caching and avoid metadata tree loading use in doTransform()

https://gerrit.wikimedia.org/r/314229

jcrespo added a subscriber: jcrespo.

As this is being worked on at the MediaWiki level, I am going to move us into the role of mere observers.

I don't recall seeing this issue on s4 since https://gerrit.wikimedia.org/r/314229 landed. Is this still an issue, or can we close it in favor of T32906?

aaron closed this task as Resolved. · Dec 2 2016, 12:04 AM
aaron claimed this task.

Makes sense.