Improve ORES articlequality feature extraction for images
Closed, Resolved, Public


Currently, ORES wp10 uses an 'image_links' feature that counts images in an article only if they use the normal [[File:Something.jpg]] markup format. Images within gallery tags are not counted, and lead images in infoboxes are not counted.

In both cases, counting those images would give an image count that better reflects article quality.

See for example this article:

ORES thinks it has one image, instead of the six it actually has:

Event Timeline

awight renamed this task from Improve ORES wp10 feature extraction for images to Improve ORES articlequality feature extraction for images. Sep 26 2018, 6:41 PM
Harej triaged this task as Medium priority. Apr 9 2019, 9:11 PM
Harej moved this task from Research & analysis to New development on the Machine-Learning-Team board.
Harej moved this task from New development to Ready to go on the Machine-Learning-Team board.

Hello @Harej, this is something I'd like to work on.

Awesome! Welcome! Are you familiar with all of the interesting variants of wiki markup that might imply an image? I think listing out all of the variants is a good place to start.

I can do some research on that. Thanks for the swift reply, @Halfak. I'll get back to you.

If you get on IRC, join us in #wikimedia-ai on freenode. My team hangs out there while we work.

It looks like we define image links here:

I think we'll want to develop some strategy for processing <gallery> tags and anything else you find. That will probably need to go in our base framework somewhere around here:

We use mwparserfromhell (docs). At a glance, I couldn't find an obvious way that it handles <gallery>.

I saw this. I think it covers all of the extended wiki markup for images.

From a quick scan of the page, it looks like there are two syntaxes:

  • [[File:Something.jpg|...]]
  • <gallery> ... </gallery>
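As a rough illustration of those two syntaxes, here is a minimal sketch that pulls file names out of both using only Python's standard `re` module. It is not how revscoring or mwparserfromhell handle this; the function names, the patterns, and the sample text are all hypothetical, and it assumes that the `Image:` alias is accepted alongside `File:` and that the namespace prefix is optional on lines inside a gallery.

```python
import re

# Assumption: file links start with a "File:" or "Image:" namespace prefix.
FILE_LINK_RE = re.compile(
    r"\[\[\s*(?:File|Image)\s*:\s*([^|\]\n]+)", re.IGNORECASE)
# Assumption: galleries are not nested and always have a closing tag.
GALLERY_RE = re.compile(
    r"<gallery\b[^>]*>(.*?)</gallery\s*>", re.IGNORECASE | re.DOTALL)


def file_link_images(wikitext):
    """Return file names referenced with [[File:...]] markup."""
    return [m.group(1).strip() for m in FILE_LINK_RE.finditer(wikitext)]


def gallery_images(wikitext):
    """Return file names listed one per line inside <gallery> blocks."""
    names = []
    for block in GALLERY_RE.finditer(wikitext):
        for line in block.group(1).splitlines():
            # Everything after the first pipe is caption or options.
            entry = line.split("|", 1)[0].strip()
            # Drop the namespace prefix when it is present.
            entry = re.sub(r"^(?:File|Image)\s*:\s*", "", entry,
                           flags=re.IGNORECASE)
            if entry:
                names.append(entry)
    return names


sample = """[[File:Something.jpg|thumb|A caption]]
<gallery>
File:One.png|First
Two.svg
</gallery>
"""

print(file_link_images(sample))
print(gallery_images(sample))
```

Keeping the two extractors separate means the existing link count and a new gallery count could be reported as distinct features or summed, whichever turns out to be more useful.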

From the notes that @Ragesoss gave us, we'll also want to capture images that appear in infoboxes. This can get interesting. Here's a chunk of markup from Accession of Albania to the European Union:

{{Infobox EU accession bid
| logo                =  
| status              = Candidate
| nation              = Albania
| national_denonym    = Albanian
| map                 = European Union Albania Locator.svg

We probably want to look for specific parameters of infoboxes and accept a set of file extensions.

  • Parameter names: "image", "map", "file", etc.
  • File extensions: "jpg", "png", "svg", etc.
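One way the parameter and extension check could look, sketched with the standard `re` module on the infobox chunk quoted above. The parameter and extension tuples are illustrative and incomplete ("logo", "jpeg" and "gif" are additions for the example), the function name is hypothetical, and the pattern assumes one `| name = value` pair per line, which real templates do not guarantee.

```python
import re

# Hypothetical, deliberately incomplete lists for illustration.
IMAGE_PARAMS = ("image", "map", "file", "logo")
IMAGE_EXTENSIONS = ("jpg", "jpeg", "png", "svg", "gif")

# Assumption: each template parameter sits on its own line as "| name = value".
PARAM_RE = re.compile(
    r"^[ \t]*\|[ \t]*([^=|\n]+?)[ \t]*=[ \t]*(.*?)[ \t]*$", re.MULTILINE)


def infobox_images(wikitext):
    """Return parameter values that look like image file names."""
    images = []
    for name, value in PARAM_RE.findall(wikitext):
        # Accept names such as "image_flag" by matching on substrings.
        if not any(key in name.lower() for key in IMAGE_PARAMS):
            continue
        # Require a dot followed by a known image extension.
        if "." in value and value.rsplit(".", 1)[-1].lower() in IMAGE_EXTENSIONS:
            images.append(value)
    return images


sample = """{{Infobox EU accession bid
| logo                =
| status              = Candidate
| nation              = Albania
| national_denonym    = Albanian
| map                 = European Union Albania Locator.svg
"""

print(infobox_images(sample))
```

Requiring both a matching parameter name and a known extension keeps empty parameters such as `logo` above, and non-file values such as a pixel size, out of the count.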

Great. It's all coming together in my head now. How do I test my code changes, though, to make sure that they work? I'm about to assign myself this task and get started.

I'd add tests here:

I think you'll be able to get gallery tags by scanning and processing: revscoring.features.wikitext.revision.datasources.tags

Oh. I think "test" is the wrong word here. I meant "run". So how do I run the code and see my changes in action?

@Halfak I'd like to assign myself this task. Can I go ahead?

Hello @Halfak I think this task requires PRs on the revscoring and articlequality repos.

I just opened one for the revscoring repo:

I rebuilt the models. See the result in this PR:

It doesn't look like we're seeing any meaningful performance improvement, BUT I think this is still valuable because people use the features that ORES extracts in their own analysis.

I think it's valuable too. I hope the performance didn't drop, though?

@Ragesoss, any thoughts as the original filer?