Index captions as description fields not label
Closed, Resolved (Public)

Description

MediaInfo captions are being indexed as label fields. This is suboptimal because these fields are mapped for completion.
We should fix this before we start to use captions more broadly, because label fields consume much more space than needed (exact-match and prefix subfields).

Event Timeline

Note, however, that we don't have a field like labels_all for descriptions, right? Also, hascaption is now an alias to haslabel, so they both work on the same field. It could be moved to be an alias for hasdescription, of course, but then it is not clear how hascaption:* would work, if at all.

It depends on the solution we decide to follow. If we decide to use descriptions and hascaption:* is an important use case, then we need a new field, e.g. description_count.
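To make the trade-off concrete, here is a minimal sketch of how the two forms of hascaption might translate into Elasticsearch query bodies if captions lived in description fields. The field names `descriptions.<lang>` and `description_count` are assumptions for illustration (the latter is the hypothetical field suggested above), not the actual WikibaseCirrusSearch mapping.

```python
def has_caption_query(lang: str) -> dict:
    """Build an Elasticsearch query body matching documents that have a caption.

    Field names are hypothetical: `descriptions.<lang>` holds per-language
    caption text and `description_count` holds the number of captions.
    """
    if lang == "*":
        # No single field covers every language, so fall back to a count field.
        return {"range": {"description_count": {"gte": 1}}}
    # A specific language can be answered with a plain `exists` query.
    return {"exists": {"field": f"descriptions.{lang}"}}
```

Under these assumptions, `hascaption:en` needs no extra field, while `hascaption:*` is the case that requires something like a count field.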

Change 519602 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/WikibaseMediaInfo@master] [WIP] Index MediaInfo labels as separate caption fields

https://gerrit.wikimedia.org/r/519602

We're storing captions in labels because captions are essentially labels for images, and we already have a 'description' field in UploadWizard for wikitext.

As you say we don't need completion, we just need people to be able to search the caption text. We're pushing all captions into opening_text which is probably adequate for most use cases (though I guess won't cover stemming).

FWIW we don't expect to ever add descriptions to MediaInfo entities. Obviously, if there was a consensus that the caption data should be stored there rather than in labels, we could consider moving it, but as far as we can see there is no need to have both labels and descriptions for MediaInfo entities.

Just to add some clarity on something Cormac said above:

FWIW we don't expect to ever add descriptions to MediaInfo entities.

I'd amend this to say that at the moment we're about 80% sure we won't add descriptions, but it's still quite possible that the Commons community (either as a group or via a prolific bot writer) decides that multilingual descriptions in wikitext aren't good enough and starts migrating over to structured descriptions.

Would representing captions as captions rather than labels in the JSON structure help?
Or, rather than pretending MediaInfo entities are just like items / properties, should we have totally different handling for them when it comes to indexing?

Not really sure which JSON structure you mean @Addshore - but yeah, perhaps we should just explicitly index this stuff differently when doing the indexing. That's easy enough atm, but raises 2 questions:

  1. ATM slot data is written into the elastic document by the hook onCirrusSearchBuildDocumentParse(). If we were going to have slot data automatically written to the document (and I guess that's the plan) we'd have to come up with a way of configuring how it's indexed.
  2. If we end up adding structured descriptions, and we want to index those as descriptions too, what then? The current code doesn't lend itself very well to concatenating label and description and writing both into one field. I guess it could be refactored, but it might be tricky.
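The second question can be illustrated with a small sketch of what the indexing step might do if captions and structured descriptions had to share one field. This is not the WikibaseMediaInfo code; the document shape (a `descriptions` key mapping language codes to lists of strings) and the function name are hypothetical.

```python
def index_captions(doc: dict, captions: dict) -> dict:
    """Return a copy of `doc` with captions written under `descriptions`.

    `captions` maps language code -> caption text. If a language already
    has description text, the caption is appended instead of overwriting it.
    The document layout is an assumption made for illustration only.
    """
    result = dict(doc)
    # Copy the per-language lists so the input document is left untouched.
    descriptions = {
        lang: list(texts) for lang, texts in doc.get("descriptions", {}).items()
    }
    for lang, text in captions.items():
        descriptions.setdefault(lang, []).append(text)
    result["descriptions"] = descriptions
    return result


# Hypothetical document that already carries a structured description.
doc = {"title": "File:Example.jpg", "descriptions": {"en": ["A longer description"]}}
merged = index_captions(doc, {"en": "A cat", "fr": "Un chat"})
```

Here the English caption ends up next to the existing description in the same field, which is the concatenation case the comment describes as awkward in the current code.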

So, I mean in the JSON that is stored and also the JSON that is presented to consumers of the data.
If we are saying they are nothing like labels as we know them, why are we internally storing them / treating them as labels?

Well, there are two issues here:

  1. Captions are semantically not like labels, so it's semantically wrong to store them as labels.
  2. Since captions are stored as labels, they are indexed as labels, which means they are indexed for prefix completion search. This wastes resources and applies analyzers that are wrong for the searches that people would actually do on captions.
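A rough sketch of why point 2 matters for index size, assuming prefix search is backed by edge n-grams (the analyzers actually applied to label fields may differ). The helper names and the sample caption are made up for illustration.

```python
def edge_ngrams(text: str, min_len: int = 1) -> list:
    """Return every prefix of `text`, from `min_len` characters up to the full string."""
    normalized = text.lower()
    return [normalized[:i] for i in range(min_len, len(normalized) + 1)]


def word_tokens(text: str) -> list:
    """Return lowercased whitespace-separated tokens, roughly what a plain text field indexes."""
    return text.lower().split()


caption = "Sunset over the harbour"  # made-up sample caption
prefix_terms = edge_ngrams(caption)
plain_terms = word_tokens(caption)
print(len(prefix_terms), len(plain_terms))  # prints "23 4"
```

The number of prefix terms grows with the length of the caption in characters, while a plain text field only grows with the number of words, so sentence-length captions are a poor fit for a completion-style field.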

I am so far mainly concerned with (2), as this means both unnecessary load on our index servers and maybe broken searches too. But it is somewhat related to (1) because, as I understand it, the code assumes that anything stored in the labels field is a label. Maybe we could override this in the WikibaseMediaInfo extension and index it differently, not sure.

Erm ... this is getting a bit philosophical, but I don't really see that labels and descriptions have much semantic meaning associated with them, except that one would expect a label to be shorter than a description.

If we are saying they are nothing like labels as we know them, why are we internally storing them / treating them as labels?

Correct me if I'm wrong here, but as far as I can tell 'labels' in the Wikidata sense simply means 'concise descriptions of something that may also have a longer description, and that we index for prefix completion search'. If that's correct, then captions are not "nothing like labels as we know them": they're still short descriptions of things that may also have long descriptions. They just need to be indexed differently.

ATM label data from the MediaInfo slot is written to the elastic doc via the hook onCirrusSearchBuildDocumentParse, so I don't think it'll be difficult to update it so label is indexed in a different way. Perhaps ultimately this approach would make T190066 more difficult to implement, but I don't know what the plan for implementation of that is one way or the other

If that's correct, then captions are not "nothing like labels as we know them"

Well, ok, yes, not "nothing like", but the dominant use of labels in search, namely prefix search, is irrelevant here. So we need to adjust for that.

Change 522473 had a related patch set uploaded (by Cparle; owner: Cparle):
[mediawiki/extensions/WikibaseMediaInfo@master] Index captions as description field rather than label

https://gerrit.wikimedia.org/r/522473

Change 523679 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/WikibaseCirrusSearch@master] Query description fields with incaption keyword

https://gerrit.wikimedia.org/r/523679

Change 522473 merged by jenkins-bot:
[mediawiki/extensions/WikibaseMediaInfo@master] Index captions as description field rather than label

https://gerrit.wikimedia.org/r/522473

Change 523679 merged by jenkins-bot:
[mediawiki/extensions/WikibaseCirrusSearch@master] Query description fields with incaption keyword

https://gerrit.wikimedia.org/r/523679

Change 519602 abandoned by Cparle:
[WIP] Index MediaInfo labels as separate caption fields

https://gerrit.wikimedia.org/r/519602

Change 544091 had a related patch set uploaded (by EBernhardson; owner: EBernhardson):
[mediawiki/extensions/WikibaseCirrusSearch@master] Update HasDataForLangFeature for caption move to descriptions

https://gerrit.wikimedia.org/r/544091

Change 544091 merged by jenkins-bot:
[mediawiki/extensions/WikibaseCirrusSearch@master] Update HasDataForLangFeature for caption move to descriptions

https://gerrit.wikimedia.org/r/544091