img_metadata queries for DjVu files regularly saturate s4 slaves
Closed, ResolvedPublic

Description

For the past few days, we've been seeing s4 slaves saturate their network port (1GbE) occasionally, in spikes that effectively quadruple their normal network usage.

[Attachment: db1070.png (373×747 px, 40 KB)]

As far as I can see, these can be attributed to img_metadata queries about DjVu files, all coming from search engine IPs (e.g. Googlebot/MSNbot crawling some URL).

Example from a db1070 processlist a moment ago:

SELECT /* ForeignDBFile::loadExtraFromDB 207.46.13.84 */  img_metadata  FROM `image`   WHERE img_name = 'Dictionary_of_National_Biography_volume_27.djvu' AND img_timestamp = '20121213235036'  LIMIT 1 
SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.12 */  img_metadata  FROM `image`   WHERE img_name = 'Popular_Science_Monthly_Volume_68.djvu' AND img_timestamp = '20090517032649'  LIMIT 1  
SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.20 */  img_metadata  FROM `image`   WHERE img_name = 'United_States_Statutes_at_Large_Volume_116_Part_1.djvu' AND img_timestamp = '20110920010238'  LIMIT 1  
SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.20 */  img_metadata  FROM `image`   WHERE img_name = 'Title_3_CFR_2000_Compilation.djvu' AND img_timestamp = '20111001220528'  LIMIT 1       
SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.4 */  img_metadata  FROM `image`   WHERE img_name = 'United_States_Statutes_at_Large_Volume_115_Part_3.djvu' AND img_timestamp = '20110701153753'  LIMIT 1   
SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.20 */  img_metadata  FROM `image`   WHERE img_name = 'The_Methodist_Hymn-Book_Illustrated.djvu' AND img_timestamp = '20081102105445'  LIMIT 1        
SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.20 */  img_metadata  FROM `image`   WHERE img_name = 'The_Lusiad_(Camões,_tr._Mickle,_1791),_Volume_1.djvu' AND img_timestamp = '20100612172724'  LIMIT 1    
SELECT /* ForeignDBFile::loadExtraFromDB 66.249.67.20 */  img_metadata  FROM `image`   WHERE img_name = 'Riders_of_the_Purple_Sage.djvu' AND img_timestamp = '20130615001915'  LIMIT 1

The output of e.g. the first query is a 2.5 MB <mw-djvu> XML document; the rest are similar, which explains the network saturation.
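
For anyone wanting to quantify this, here is a minimal sketch (assuming it is run from eval.php on an s4 app server; the LIKE filter and LIMIT are illustrative only, not part of anything in this task) that lists the heaviest DjVu metadata rows:

	// Hypothetical eval.php snippet: find the largest img_metadata blobs
	// for DjVu files. The LIKE filter and LIMIT are illustrative only.
	$dbr = wfGetDB( DB_SLAVE );
	$res = $dbr->select(
		'image',
		array( 'img_name', 'len' => 'LENGTH(img_metadata)' ),
		array( "img_name LIKE '%.djvu'" ),
		__METHOD__,
		array( 'ORDER BY' => 'len DESC', 'LIMIT' => 5 )
	);
	foreach ( $res as $row ) {
		echo "{$row->img_name}: {$row->len} bytes\n";
	}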

Event Timeline

faidon raised the priority of this task to Unbreak Now!.
faidon updated the task description.
faidon added subscribers: faidon, Springle.
GOIII set Security to None.
GOIII subscribed.

these can be attributed to img_metadata queries about DjVu files, all coming from search engine IPs (e.g. Googlebot/MSNbot crawling some URL).

CC'ing Sustainability and Multimedia - anybody have any more insights?

This seems to be caused by this call chain:

It is not clear to me why the xml is loaded at all:

		// Fetch XML and check it, to give a more informative error message than the one which
		// normaliseParams will inevitably give.
		$xml = $image->getMetadata();
		if ( !$xml ) {
			$width = isset( $params['width'] ) ? $params['width'] : 0;
			$height = isset( $params['height'] ) ? $params['height'] : 0;
			return new MediaTransformError( 'thumbnail_error', $width, $height,
				wfMessage( 'djvu_no_xml' )->text() );
		}

$xml is not referenced any further in doTransform.

What is the story re: XMLs and DjVuImage.php?
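
If the eager fetch really only exists for the error message, one possible shape of a fix (a sketch of the idea only, not the actual merged patch) is to drop it and rely on the check that already follows:

	// Sketch of the idea (not the actual patch): skip the eager
	// $image->getMetadata() call and let the existing normaliseParams()
	// check just below the quoted block report the failure instead.
	if ( !$this->normaliseParams( $image, $params ) ) {
		return new TransformParameterError( $params );
	}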

I thought Aaron merged this and backported it, but this is happening again right now. Example query:

SELECT /* ForeignDBFile::loadExtraFromDB 66.249.79.94 */  img_metadata  FROM `image`   WHERE img_name = 'Revue_des_Deux_Mondes_-_1874_-_tome_3.djvu' AND img_timestamp = '20100802045839'  LIMIT 1

BBlack subscribed.

Triage-assigning this to @GWicke so it has an owner, since he's submitted the only patch for this so far. Will CC Aaron as well...

This has kept happening over and over all this time (it happened again just now). Every time it does, the s4 slaves get saturated and are slower to respond to other queries, hence the Unbreak Now! priority.

Can we get some attention to this bug please? Maybe @aaron or @Gilles?

It probably just hits <<if ( $page > $this->pageCount( $image ) )>> and loads the metadata again with another code path.

img_metadata is *supposed* to be lightweight and easily loaded. The fact that the Pdf/DjVu handlers don't do this makes things pretty awkward. They all need a ground-up rewrite to avoid this (and other problems).
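
To illustrate the contrast (both values below are hypothetical, for scale only):

	// Hypothetical values, for scale only.
	// Most handlers: img_metadata is a small serialized array.
	$light = serialize( array( 'width' => 2481, 'height' => 3508 ) ); // tens of bytes

	// DjVu handler: img_metadata is the entire <mw-djvu> document,
	// including per-page OCR text (megabytes for a large volume).
	$heavy = '<mw-djvu><DjVuXML>...</DjVuXML><DjVuTxt>...</DjVuTxt></mw-djvu>';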

Change 208304 had a related patch set uploaded (by Aaron Schulz):
Removed DjVu pageCount() check and let it surface in the shell command

https://gerrit.wikimedia.org/r/208304

Change 208306 had a related patch set uploaded (by Aaron Schulz):
Removed DjVu pageCount() check and let it surface in the shell command

https://gerrit.wikimedia.org/r/208306

Change 208307 had a related patch set uploaded (by Aaron Schulz):
Removed DjVu pageCount() check and let it surface in the shell command

https://gerrit.wikimedia.org/r/208307

Change 208304 merged by jenkins-bot:
Removed DjVu pageCount() check and let it surface in the shell command

https://gerrit.wikimedia.org/r/208304

Change 208306 merged by jenkins-bot:
Removed DjVu pageCount() check and let it surface in the shell command

https://gerrit.wikimedia.org/r/208306

Change 208307 merged by jenkins-bot:
Removed DjVu pageCount() check and let it surface in the shell command

https://gerrit.wikimedia.org/r/208307

It probably just hits <<if ( $page > $this->pageCount( $image ) )>> and loads the metadata again with another code path.

img_metadata is *supposed* to be lightweight and easily loaded. The fact that the Pdf/DjVu handlers don't do this makes things pretty awkward. They all need a ground-up rewrite to avoid this (and other problems).

Thanks! Is there a separate bug for tracking this?

I thought there was, but I didn't find it (just something about upgradeRow). It should be created.

Change 209983 had a related patch set uploaded (by Aaron Schulz):
Added common metadata caching to the djvu handler

https://gerrit.wikimedia.org/r/209983

Change 209984 had a related patch set uploaded (by Aaron Schulz):
Set isExpensiveToThumbnail for the djvu handler

https://gerrit.wikimedia.org/r/209984

Change 209985 had a related patch set uploaded (by Aaron Schulz):
Set isExpensiveToThumbnail for the djvu handler

https://gerrit.wikimedia.org/r/209985

Change 209986 had a related patch set uploaded (by Aaron Schulz):
Set isExpensiveToThumbnail for the djvu handler

https://gerrit.wikimedia.org/r/209986

Change 209984 merged by jenkins-bot:
Set isExpensiveToThumbnail for the djvu handler

https://gerrit.wikimedia.org/r/209984

Change 209985 merged by jenkins-bot:
Set isExpensiveToThumbnail for the djvu handler

https://gerrit.wikimedia.org/r/209985

Change 209986 merged by jenkins-bot:
Set isExpensiveToThumbnail for the djvu handler

https://gerrit.wikimedia.org/r/209986

It is not clear to me why the xml is loaded at all:

You quoted the comment that explains why:

		// Fetch XML and check it, to give a more informative error message than the one which
		// normaliseParams will inevitably give.

If $image->getMetadata() returns something falsey, the $this->normaliseParams() call just after the quoted block will fail (most likely because $this->pageCount(), $this->getPageDimensions(), and $this->getPageText() return false instead of something expected), and the error message from that failure path was thought to be too uninformative.

Whether it's really too uninformative I have no idea.
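
Schematically, the trade-off looks like this (a sketch of the two paths, assuming the standard doTransform flow):

	// Schematic sketch of the two paths, assuming the standard flow.
	// Eager path: a specific message, at the cost of pulling the full
	// multi-megabyte XML blob from the DB just to test for falsey.
	if ( !$image->getMetadata() ) {
		return new MediaTransformError( 'thumbnail_error', $width, $height,
			wfMessage( 'djvu_no_xml' )->text() );
	}
	// Lazy path: no fetch here; normaliseParams() fails later via
	// pageCount()/getPageDimensions() and yields only a generic error.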

Change 209983 merged by jenkins-bot:
Added common metadata caching to the djvu handler

https://gerrit.wikimedia.org/r/209983

It might make sense to revert one of those patches, since it still hits getMetaTree anyway. We have PoolCounter (merged) and hopefully caching-based patches (pending) to address this.
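
For reference, the isExpensiveToThumbnail change is essentially a one-method override (sketched from the patch summary; the merged code may differ):

	// Sketched from the patch summary; the merged code may differ.
	// Marking DjVu thumbnailing as expensive lets PoolCounter gate it.
	class DjVuHandler extends ImageHandler {
		public function isExpensiveToThumbnail( $file ) {
			return true;
		}
	}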

Just for the record, s4 had ~2 hours of this again today, mostly from Googlebot IPs.

/srv/mediawiki/w/index.php;MediaWiki::run;MediaWiki::main;MediaWiki::performRequest;MediaWiki::performAction;ViewAction::show;CategoryPage::view;CategoryPage::closeShowCategory;CategoryViewer::getHTML;CategoryViewer::getImageSection;TraditionalImageGallery::toHTML;File::transform;File::canRender;ImageHandler::canRender;LocalFile::getWidth;DjVuHandler::getPageDimensions;DjVuHandler::getMetaTree;SimpleXMLElement::__construct 1

It might make sense to revert one of those patches, since it still hits getMetaTree anyway. We have PoolCounter (merged) and hopefully caching-based patches (pending) to address this.

I don't think that's true for cache hits?

I was looking at 54816e2071fc3e38ad581a264967bd46cbb3647e. The problem Sean mentioned is that the code for just getting a thumbnail URL (doing basic checks, not rendering anything yet) *still* wants to load getMetaTree, as in the category view stack above. PoolCounter does not cover this case, so something more will be needed there.
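
One possible shape for the caching side (a per-request memoization sketch; the names and the caching layer here are assumptions, not necessarily what the merged patch does):

	// Per-request memoization sketch; names and layer are assumptions,
	// not necessarily what the merged patch does.
	private $metaTreeCache = array();

	function getMetaTree( $image, $gettext = false ) {
		$key = $image->getName() . ( $gettext ? ':t' : ':x' ); // hypothetical key
		if ( !isset( $this->metaTreeCache[$key] ) ) {
			// Parse the XML once; repeated dimension checks (e.g. a whole
			// category gallery) then reuse the tree instead of reparsing.
			$this->metaTreeCache[$key] = new SimpleXMLElement( $image->getMetadata() );
		}
		return $this->metaTreeCache[$key];
	}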

Change 210409 had a related patch set uploaded (by Aaron Schulz):
Added common metadata caching to the djvu handler

https://gerrit.wikimedia.org/r/210409

Change 210608 had a related patch set uploaded (by Aaron Schulz):
Temporary hack to drain excess refreshLinks jobs

https://gerrit.wikimedia.org/r/210608

Change 210610 had a related patch set uploaded (by Aaron Schulz):
Temporary hack to drain excess refreshLinks jobs

https://gerrit.wikimedia.org/r/210610

Change 210611 had a related patch set uploaded (by Aaron Schulz):
Temporary hack to drain excess refreshLinks jobs

https://gerrit.wikimedia.org/r/210611

Ignore the last 3 Gerrit comments.

Change 210409 merged by jenkins-bot:
Added common metadata caching to the djvu handler

https://gerrit.wikimedia.org/r/210409

Change 210608 abandoned by Aaron Schulz:
Temporary hack to drain excess refreshLinks jobs

https://gerrit.wikimedia.org/r/210608