img_metadata queries for DjVu files regularly saturate s4 slaves
Closed, ResolvedPublic

Description

For the past few days, we've been seeing s4 slaves saturate their network port (1GbE) occasionally, in spikes that effectively quadruple their normal network usage.

[Attachment: db1070.png (373×747 px, 40 KB)]

As far as I can see, these can be attributed to img_metadata queries about DjVu files, all coming from search engine IPs (e.g. Googlebot/MSNbot crawling some URL).

Example from a db1070 processlist a moment ago:

SELECT /* ForeignDBFile::loadExtraFromDB 207.46.13.84 */  img_metadata  FROM `image`   WHERE img_name = 'Dictionary_of_National_Biography_volume_27.djvu' AND img_timestamp = '20121213235036'  LIMIT 1 
SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.12 */  img_metadata  FROM `image`   WHERE img_name = 'Popular_Science_Monthly_Volume_68.djvu' AND img_timestamp = '20090517032649'  LIMIT 1  
SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.20 */  img_metadata  FROM `image`   WHERE img_name = 'United_States_Statutes_at_Large_Volume_116_Part_1.djvu' AND img_timestamp = '20110920010238'  LIMIT 1  
SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.20 */  img_metadata  FROM `image`   WHERE img_name = 'Title_3_CFR_2000_Compilation.djvu' AND img_timestamp = '20111001220528'  LIMIT 1       
SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.4 */  img_metadata  FROM `image`   WHERE img_name = 'United_States_Statutes_at_Large_Volume_115_Part_3.djvu' AND img_timestamp = '20110701153753'  LIMIT 1   
SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.20 */  img_metadata  FROM `image`   WHERE img_name = 'The_Methodist_Hymn-Book_Illustrated.djvu' AND img_timestamp = '20081102105445'  LIMIT 1        
SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.20 */  img_metadata  FROM `image`   WHERE img_name = 'The_Lusiad_(Camões,_tr._Mickle,_1791),_Volume_1.djvu' AND img_timestamp = '20100612172724'  LIMIT 1    
SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.20 */  img_metadata  FROM `image`   WHERE img_name = 'Riders_of_the_Purple_Sage.djvu' AND img_timestamp = '20130615001915'  LIMIT 1

The output of e.g. the first query is a 2.5 MB <mw-djvu> XML document; the rest are similar, which explains the network saturation.
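
For anyone wanting to quantify this, here is a minimal sketch (assuming it is run from eval.php on an s4 app server; the LIKE filter and LIMIT are illustrative only, not part of anything in this task) that lists the heaviest DjVu metadata rows:

	// Hypothetical eval.php snippet: find the largest img_metadata blobs
	// for DjVu files. The LIKE filter and LIMIT are illustrative only.
	$dbr = wfGetDB( DB_SLAVE );
	$res = $dbr->select(
		'image',
		array( 'img_name', 'len' => 'LENGTH(img_metadata)' ),
		array( "img_name LIKE '%.djvu'" ),
		__METHOD__,
		array( 'ORDER BY' => 'len DESC', 'LIMIT' => 5 )
	);
	foreach ( $res as $row ) {
		echo "{$row->img_name}: {$row->len} bytes\n";
	}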

Event Timeline

faidon raised the priority of this task to Unbreak Now!.
faidon updated the task description.
faidon added subscribers: faidon, Springle.
GOIII set Security to None.
GOIII subscribed.

these can be attributed to img_metadata queries about DjVu files, all coming from search engine IPs (e.g. Googlebot/MSNbot crawling some URL).

CC'ing Sustainability and Multimedia - anybody have any more insights?

This seems to be caused by this call chain:

It is not clear to me why the xml is loaded at all:

		// Fetch XML and check it, to give a more informative error message than the one which
		// normaliseParams will inevitably give.
		$xml = $image->getMetadata();
		if ( !$xml ) {
			$width = isset( $params['width'] ) ? $params['width'] : 0;
			$height = isset( $params['height'] ) ? $params['height'] : 0;
			return new MediaTransformError( 'thumbnail_error', $width, $height,
				wfMessage( 'djvu_no_xml' )->text() );
		}

$xml is not referenced any further in doTransform.

What is the story re: XMLs and DjVuImage.php?
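
If the eager fetch really only exists for the error message, one possible shape of a fix (a sketch of the idea only, not the actual merged patch) is to drop it and rely on the check that already follows:

	// Sketch of the idea (not the actual patch): skip the eager
	// $image->getMetadata() call and let the existing normaliseParams()
	// check just below the quoted block report the failure instead.
	if ( !$this->normaliseParams( $image, $params ) ) {
		return new TransformParameterError( $params );
	}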

I thought Aaron merged this and backported it, but this is happening again right now. Example query:

SELECT /* ForeignDBFile::loadExtraFromDB 66.249.79.94 */  img_metadata  FROM `image`   WHERE img_name = 'Revue_des_Deux_Mondes_-_1874_-_tome_3.djvu' AND img_timestamp = '20100802045839'  LIMIT 1

BBlack subscribed.

Triage-assigning this to @GWicke so it has an owner, since he's submitted the only patch for this so far. Will CC Aaron as well...

This has kept happening over and over all this time (it happened again just now). Every time it does, the s4 slaves get saturated and are slower to respond to other queries, hence the Unbreak Now! priority.

Can we get some attention to this bug please? Maybe @aaron or @Gilles?

It probably just hits <<if ( $page > $this->pageCount( $image ) )>> and loads the metadata again with another code path.

img_metadata is *supposed* to be lightweight and easily loaded. The fact that the Pdf/DjVu handlers don't do this makes things pretty awkward. They all need a ground-up rewrite to avoid this (and other problems).
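
To illustrate the contrast (both values below are hypothetical, for scale only):

	// Hypothetical values, for scale only.
	// Most handlers: img_metadata is a small serialized array.
	$light = serialize( array( 'width' => 2481, 'height' => 3508 ) ); // tens of bytes

	// DjVu handler: img_metadata is the entire <mw-djvu> document,
	// including per-page OCR text (megabytes for a large volume).
	$heavy = '<mw-djvu><DjVuXML>...</DjVuXML><DjVuTxt>...</DjVuTxt></mw-djvu>';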

Change 208304 had a related patch set uploaded (by Aaron Schulz):
Removed DjVu pageCount() check and let it surface in the shell command

https://gerrit.wikimedia.org/r/208304

Change 208306 had a related patch set uploaded (by Aaron Schulz):
Removed DjVu pageCount() check and let it surface in the shell command

https://gerrit.wikimedia.org/r/208306

Change 208307 had a related patch set uploaded (by Aaron Schulz):
Removed DjVu pageCount() check and let it surface in the shell command

https://gerrit.wikimedia.org/r/208307

Change 208304 merged by jenkins-bot:
Removed DjVu pageCount() check and let it surface in the shell command

https://gerrit.wikimedia.org/r/208304

Change 208306 merged by jenkins-bot:
Removed DjVu pageCount() check and let it surface in the shell command

https://gerrit.wikimedia.org/r/208306

Change 208307 merged by jenkins-bot:
Removed DjVu pageCount() check and let it surface in the shell command

https://gerrit.wikimedia.org/r/208307

It probably just hits <<if ( $page > $this->pageCount( $image ) )>> and loads the metadata again with another code path.

img_metadata is *supposed* to be lightweight and easily loaded. The fact that the Pdf/DjVu handlers don't do this makes things pretty awkward. They all need a ground-up rewrite to avoid this (and other problems).

Thanks! Is there a separate bug for tracking this?

I thought there was, but I didn't find it (just something about upgradeRow). It should be created.

Change 209983 had a related patch set uploaded (by Aaron Schulz):
Added common metadata caching to the djvu handler

https://gerrit.wikimedia.org/r/209983

Change 209984 had a related patch set uploaded (by Aaron Schulz):
Set isExpensiveToThumbnail for the djvu handler

https://gerrit.wikimedia.org/r/209984

Change 209985 had a related patch set uploaded (by Aaron Schulz):
Set isExpensiveToThumbnail for the djvu handler

https://gerrit.wikimedia.org/r/209985

Change 209986 had a related patch set uploaded (by Aaron Schulz):
Set isExpensiveToThumbnail for the djvu handler

https://gerrit.wikimedia.org/r/209986

Change 209984 merged by jenkins-bot:
Set isExpensiveToThumbnail for the djvu handler

https://gerrit.wikimedia.org/r/209984

Change 209985 merged by jenkins-bot:
Set isExpensiveToThumbnail for the djvu handler

https://gerrit.wikimedia.org/r/209985

Change 209986 merged by jenkins-bot:
Set isExpensiveToThumbnail for the djvu handler

https://gerrit.wikimedia.org/r/209986

It is not clear to me why the xml is loaded at all:

You quoted the comment that explains why:

		// Fetch XML and check it, to give a more informative error message than the one which
		// normaliseParams will inevitably give.

If $image->getMetadata() returns something falsey, the $this->normaliseParams() call just after the quoted block will fail (most likely because $this->pageCount(), $this->getPageDimensions(), and $this->getPageText() return false instead of something expected), and the error message from that failure path was thought to be too uninformative.

Whether it's really too uninformative I have no idea.
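
Schematically, the trade-off looks like this (a sketch of the two paths, assuming the standard doTransform flow):

	// Schematic sketch of the two paths, assuming the standard flow.
	// Eager path: a specific message, at the cost of pulling the full
	// multi-megabyte XML blob from the DB just to test for falsey.
	if ( !$image->getMetadata() ) {
		return new MediaTransformError( 'thumbnail_error', $width, $height,
			wfMessage( 'djvu_no_xml' )->text() );
	}
	// Lazy path: no fetch here; normaliseParams() fails later via
	// pageCount()/getPageDimensions() and yields only a generic error.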

Change 209983 merged by jenkins-bot:
Added common metadata caching to the djvu handler

https://gerrit.wikimedia.org/r/209983

It might make sense to revert one of those patches, since it still hits getMetaTree anyway. We have PoolCounter (merged) and hopefully caching-based patches (pending) to address this.
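
For reference, the isExpensiveToThumbnail change is essentially a one-method override (sketched from the patch summary; the merged code may differ):

	// Sketched from the patch summary; the merged code may differ.
	// Marking DjVu thumbnailing as expensive lets PoolCounter gate it.
	class DjVuHandler extends ImageHandler {
		public function isExpensiveToThumbnail( $file ) {
			return true;
		}
	}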

Just for the record, s4 had ~2 hours of this again today, mostly from Googlebot IPs.

/srv/mediawiki/w/index.php;MediaWiki::run;MediaWiki::main;MediaWiki::performRequest;MediaWiki::performAction;ViewAction::show;CategoryPage::view;CategoryPage::closeShowCategory;CategoryViewer::getHTML;CategoryViewer::getImageSection;TraditionalImageGallery::toHTML;File::transform;File::canRender;ImageHandler::canRender;LocalFile::getWidth;DjVuHandler::getPageDimensions;DjVuHandler::getMetaTree;SimpleXMLElement::__construct 1

It might make sense to revert one of those patches, since it still hits getMetaTree anyway. We have PoolCounter (merged) and hopefully caching-based patches (pending) to address this.

I don't think that's true for cache hits?

I was looking at 54816e2071fc3e38ad581a264967bd46cbb3647e. The problem Sean mentioned is that the code for just getting a thumbnail URL (doing basic checks, not rendering anything yet) *still* wants to load getMetaTree, as in the category view stack above. PoolCounter does not cover this case, so something more will be needed there.
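
One possible shape for the caching side (a per-request memoization sketch; the names and the caching layer here are assumptions, not necessarily what the merged patch does):

	// Per-request memoization sketch; names and layer are assumptions,
	// not necessarily what the merged patch does.
	private $metaTreeCache = array();

	function getMetaTree( $image, $gettext = false ) {
		$key = $image->getName() . ( $gettext ? ':t' : ':x' ); // hypothetical key
		if ( !isset( $this->metaTreeCache[$key] ) ) {
			// Parse the XML once; repeated dimension checks (e.g. a whole
			// category gallery) then reuse the tree instead of reparsing.
			$this->metaTreeCache[$key] = new SimpleXMLElement( $image->getMetadata() );
		}
		return $this->metaTreeCache[$key];
	}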

Change 210409 had a related patch set uploaded (by Aaron Schulz):
Added common metadata caching to the djvu handler

https://gerrit.wikimedia.org/r/210409

Change 210608 had a related patch set uploaded (by Aaron Schulz):
Temporary hack to drain excess refreshLinks jobs

https://gerrit.wikimedia.org/r/210608

Change 210610 had a related patch set uploaded (by Aaron Schulz):
Temporary hack to drain excess refreshLinks jobs

https://gerrit.wikimedia.org/r/210610

Change 210611 had a related patch set uploaded (by Aaron Schulz):
Temporary hack to drain excess refreshLinks jobs

https://gerrit.wikimedia.org/r/210611

Ignore the last 3 Gerrit comments.

Change 210409 merged by jenkins-bot:
Added common metadata caching to the djvu handler

https://gerrit.wikimedia.org/r/210409

Change 210608 abandoned by Aaron Schulz:
Temporary hack to drain excess refreshLinks jobs

https://gerrit.wikimedia.org/r/210608