Page MenuHomePhabricator

PageImages shouldn't return images that are not in the lead section
Closed, ResolvedPublic5 Story Points

Description

Story
As a user, I want to see the appropriate image when I preview an article, so that I know whether I want to read it or not
Background

PageImages should not return images if they're not in the lead section.

Example: [[Intellectual disability]]'s page image is way, way below the fold and is therefore not a helpful representation of the article in question. In this case it would be more helpful to not have an image at all.

See T91683: Allow editors control of the page image
Acceptance criteria

  • Only images in the lead section OR infobox should appear in related pages, hovercards, lead images
  • There should be a feature flag which allows us to turn this on or off and it should default to false
  • It should be enabled in mediawiki-config after an announcement is sent to wikitech-l

See also: T151276 [if this change fixes this problem please merge as duplicate and add to test plan]

Related Objects

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Alsee added a subscriber: Alsee.Sep 3 2016, 5:06 PM

Sexual assault returns a seriously bad selection. The image is a living young woman which, when used as PageImage paired with the page title, looks BADLY like she was just Sexually Assaulted moments ago.
Jyllands-Posten Muhammad cartoons controversy reaches all the way down to the 15th article section, for the 4th article image, tangentially related, from 700 years ago
Fabia Drake returns an image of Tallulah Bankhead
I don't have a link handy, but there was a biography that not only returned the wrong person, it returned the opposite gender.

PageImage is being used to select a Hero image to "represent the article". Images in subsections are generally very specific to the section-topic and generally are not appropriate to represent the article subject itself.

Alsee raised the priority of this task from Low to Normal.Sep 3 2016, 5:06 PM

Wikidata has a better image in the majority of these cases, so your examples only support T95026: PageImages should be able to pick Wikidata's P18 as chosen image.

Alsee added a comment.Sep 5 2016, 12:26 AM

Wikidata was zero help. Jylland had no image, Fabia Drake returned the SAME wrong image, and Sexual Assault returned an offensively bad carton image. I removed the two bad images from Wikidata.

bd808 removed a subscriber: bd808.Sep 5 2016, 4:28 AM
ovasileva raised the priority of this task from Normal to High.Sep 15 2016, 2:55 PM
ovasileva moved this task from To Triage to Upcoming on the Readers-Web-Backlog board.
ovasileva updated the task description. (Show Details)Sep 28 2016, 2:08 PM
β€’ Jdlrobson updated the task description. (Show Details)
MBinder_WMF set the point value for this task to 5.Oct 3 2016, 4:42 PM
ovasileva updated the task description. (Show Details)
Alsee removed a subscriber: Alsee.Oct 14 2016, 7:28 AM
ovasileva moved this task from Epics/Goals to Upcoming on the Readers-Web-Backlog board.
phuedx updated the task description. (Show Details)Nov 15 2016, 5:31 PM
bmansurov moved this task from To Do to Doing on the Reading-Web-Sprint-86-πŸ”ͺπŸ¦ƒ board.
bmansurov added a subscriber: bmansurov.

Hey @Jdlrobson I didn't get around to working on this task today. I was thinking of implementing it as follows. Feel free to ignore if you disagree.

Since the feature is asked to be turned on and off at any given time, we need to make sure that the current behavior of saving page images stays the same. In addition, we'd save a new page property to the database indicating whether the page image is in the lead section. Since it's possible that both free and non-free images are different from each other and appear in the lead section, I was thinking of saving a pair of boolean values indicating each image.

Still trying to work out how to associate section number with image. PageImages is setup to listen to 3 hooks- AfterParserFetchFileAndTitle, ParserMakeImageParams, AfterParserFetchFileAndTitle and populate a list of images as images are encountered. None of these hooks provide any idea of where in the page the image is. In fact sections are done as an after thought

Right now I'm wondering if this is worth of a complete overhaul where images are extracted inside the OutputPageParserOutput hook by turning $po->mText into a DOMDocument and parsing out the images from the HTML... scary.

PageImages currently listens to 3 hooks which are run when images are processed and added to the page.
At the time these hooks execute, the contents of parser text are not available neither is knowledge of sections.

The sequence of events is

  • doHeadings is run converting = characters to heading element markup e.g. == becomes <h2>
  • hook BeforeParserFetchFileAndTitle is run
  • hook ParserMakeImageParams is executed.
  • ParserFileProcessingHookHandlers::processFile is run

I see two ways forward...

  1. Move processing to end of save process and convert HTML to a DOMTree and extract images
  2. Surface and inspect the parser HTML text.

Option #2 would involve the following post-process step via the ParserAfterParse hook and seems like it would be pretty reliable:

                $pi = $this->getOutput()->getExtensionData( 'pageImages' ) 
		if ( count($pi ) > 0 ) {
			$x = strpos( $text, '<h' ); // index of start of first heading
			$leadImages = [];
			foreach ( $pi as $p ) {
				$y = strpos( $text, $p['filename'] );
				if ( $x < $y ) {
				        // Images are parsed in order of occurance so we can avoid further strpos calls once we've found one beyond the first heading
					break;
				} else {
					$leadimages[] = $p['filename'];
				}
			}
		}
		$po->getOutput()->setExtensionData( 'pageImages', $leadImages )

#2 feels a little hacky but when you get over the fact this is way the parser works, it's not so bad.
#1 seems like a no-go due to performance concerns.

Ideally, we'd be able to avoid parsing those additional pages images. This would need a core change to run a hook after the execution of doHeadings that surfaces the text.

Thoughts needed before I actually go ahead and write any code... so moving to code review.

Maybe, extract section 0 and make a separate parse on it? Gonna be generally faster than #1 on large pages and less fragile than #2.

I'm struggling to find a suitable hook that only runs during a save and that provides access to the HTML of the lead section

Still not found the hook I need but the code is pretty simple once I do:

$poLead = $content->getSection( 0 )->getParserOutput( $article->getTitle() );
$pageImages = $poLead->getExtensionData( 'pageImages' ) ?: [];

This give images only in the lead section.. I'm just not sure which hook to put it in as I'll want to do this at a time I can save the result via setExtensionData

Change 323989 had a related patch set uploaded (by Jdlrobson):
Restrict page images to lead section

https://gerrit.wikimedia.org/r/323989

@ovasileva do we want the switch to act instantaneously? How often are we planning on turning it on and off? If we're going to use the lead section images only in the future, I suggest we remove the requirement that asks to make the feature switchable. This would simplify the implementation. If, however, we want the keep the ability to toggle between modes and want the API return the updated results immediately, then we'd have to do more work on the current patch.

@bmansurov - I am okay with switching to lead section images only completely. While I agree that use cases might exist against this, I haven't been able to identify one yet. If a significant one appears in the future, we can switch then.

phuedx added a subscriber: phuedx.Dec 2 2016, 8:24 AM

If there's no flag and the feature changes the standard response of the API without additional user interaction – in this case, a query parameter – then:

  1. It ain't a feature.
  2. The third AC should be changed to:

The change shouldn't be merged until it's announced on wikitech-l and mediawiki-api (or mediawiki-api-announce).

We agreed in standup to keep the flag and enable it in a follow up (see extension.json note in the latest commit).

I will start a thread about feature flagging.

@phuedx I've updated T152115 with your suggestion.

Change 323989 merged by jenkins-bot:
Restrict page images to lead section

https://gerrit.wikimedia.org/r/323989

I've setup the staging server for testing. Here are the two pages that I tested:

  • An API query for the first link should not return any images:

http://reading-web-staging.wmflabs.org/w/api.php?action=query&prop=pageimages&titles=PageImages&pithumbsize=100&pilicense=any

  • An API query for the second link should return an image:

http://reading-web-staging.wmflabs.org/w/api.php?action=query&prop=pageimages&titles=PageImages2&pithumbsize=100&pilicense=any

Perhaps you should also test adding an image map outside the lead section and figure out what happens then? There has been a lot of totally counterintuitive behaviour with image maps and page images, so that seems worth testing.

@Deskana, thanks for suggesting good test cases. I think I've found a bug. Basically, imagemaps are ignored when in lead section (last test case below).

The API returns the image in the lead section: http://reading-web-staging.wmflabs.org/w/api.php?action=query&prop=pageimages&titles=PageImages3&pithumbsize=100&pilicense=any

The API still returns the image in the lead section: http://reading-web-staging.wmflabs.org/w/api.php?action=query&prop=pageimages&titles=PageImages4&pithumbsize=100&pilicense=any

The API doesn't return any images either: http://reading-web-staging.wmflabs.org/w/api.php?action=query&prop=pageimages&titles=PageImages5&pithumbsize=100&pilicense=any

The API doesn't return the image map, but it should: http://reading-web-staging.wmflabs.org/w/api.php?action=query&prop=pageimages&titles=PageImages6&pithumbsize=100&pilicense=any

@Jdlrobson, do you think we should capture the problem above in a separate task as it's more related to image maps than to the current implementation?

This doesn't seem related? When $wgPageImagesLeadSectionOnly = false PageImages6 also doesn't return a pageimage and even if I put the imagemap inside a section it doesn't have one.

It wouldn't surprise me if ImageMaps (given they are provided by an extension) are not monitored by PageImages. There's an argument to be made that they shouldn't but that's a story for another ticket.

ovasileva closed this task as Resolved.Dec 14 2016, 8:54 PM

resolving this one. reminder: T152115: Restricted lead images to lead section will be scheduled for first sprint of next year