Backfill terms index
Closed, Declined · Public

Description

We have this:

We'll probably be writing terms to wb_terms (see parent task T223792), but older entities will not have entries in that index until they are modified.

We want this:

We need to backfill wb_terms for older entries, from before we start(ed) storing terms there as soon as they change.
It looks like Wikibase/repo already has a maintenance script that does this (or at least something similar enough): maintenance/RebuildTermSqlIndex.php

Acceptance Criteria:

  • wb_terms has entries for all older entities

Event Timeline

Restricted Application removed a project: Patch-For-Review. Jul 12 2019, 7:49 AM
Restricted Application added a project: Multimedia. Jul 12 2019, 7:49 AM

I thought the wb_terms table was being deprecated ... hmmm, maybe I'm mixing it up with something else

Nope, you're not mixing it up with something else; it is indeed undergoing redesign.
I've created T227848 to look into that, and what it'd mean in this context.

Yes, it is correct that wb_terms is being redesigned. It is still being used in production for the moment, and there will be announcements for when we might transition to reading from the new store.

@matthiasmullie just to clarify: as far as I know, wb_terms must have been populated long ago with old entries (I might be wrong, though, and still need to check back; pinging @Addshore @hoo @Ladsgroup @Lucas_Werkmeister_WMDE). I'm curious how/where you noticed that wb_terms must be missing old entries. Or is this specifically for the Wikibase instance used on Commons? (My current understanding is that there are no use cases for wb_terms on Commons yet, as they come from the federated Wikidata connection.)

Ladsgroup added a subscriber: jcrespo. Edited Jul 12 2019, 8:22 PM

@matthiasmullie Let me stop you right there: populating wb_terms on Commons was intentionally stopped (cc. @jcrespo), and it should be a clean table, because it's pretty inefficient and it makes storage in the Commons tables even more scarce. This table is being replaced by a set of new tables (which can also be populated(!) by rebuildTermIndex.php) that are normalized and ten times (really) smaller. We are at the point of read_old/write_both on properties and will go to read_new/write_both soon. Then we will start the work on items, which might take months. When designing the new system, I pushed to add support for mediainfo in the new system too, but I don't think that passed through. @alaa_wmde knows better.
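To make the read_old/write_both and read_new/write_both stages concrete, here is a minimal sketch in Python. It is not Wikibase code: the `Stage` names, the `TermStore` class, and the "old"/"new" store labels are all hypothetical, chosen only to mirror the wording of the comment above. It also shows why a rebuild step is needed before reads switch to the new store: terms written before dual writes began exist only in the old store.

```python
from enum import Enum


class Stage(Enum):
    # Hypothetical stage names, modelled on the wording used in this thread.
    READ_OLD_WRITE_OLD = 1
    READ_OLD_WRITE_BOTH = 2
    READ_NEW_WRITE_BOTH = 3
    READ_NEW_WRITE_NEW = 4


def stores_to_write(stage):
    """Return the names of the stores a term write goes to at this stage."""
    if stage is Stage.READ_OLD_WRITE_OLD:
        return ["old"]
    if stage is Stage.READ_NEW_WRITE_NEW:
        return ["new"]
    return ["old", "new"]


def store_to_read(stage):
    """Return the name of the store that serves reads at this stage."""
    if stage in (Stage.READ_OLD_WRITE_OLD, Stage.READ_OLD_WRITE_BOTH):
        return "old"
    return "new"


class TermStore:
    """Toy in-memory stand-in for an old and a new term table."""

    def __init__(self, stage):
        self.stage = stage
        self.tables = {"old": {}, "new": {}}

    def save_label(self, entity_id, language, text):
        for name in stores_to_write(self.stage):
            self.tables[name][(entity_id, language)] = text

    def get_label(self, entity_id, language):
        table = self.tables[store_to_read(self.stage)]
        return table.get((entity_id, language))
```

In this sketch, a label saved during `READ_OLD_WRITE_OLD` is not visible after switching to `READ_NEW_WRITE_BOTH` unless something copies it into the new store first.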

When designing the new system, I pushed to add support for mediainfo in the new system too, but I don't think that passed through. @alaa_wmde knows better.

Yep, we will not support other entity types at the moment. When the need arises, we will consider the options. It could mean extending the existing design or doing a separate one. That depends on many factors that should only be considered when they are real and not hypothetical.

Abit added a subscriber: Abit. Jul 17 2019, 5:42 PM

@alaa_wmde, as @Lucas_Werkmeister_WMDE says, we do indeed have a clear use case for MediaInfo now: we need to be able to find the M entity that is added to Lua syntax on Commons. Can you please include this use case in your redesign?

The redesign for items and properties has already happened.
Mediainfo could benefit from a different design if there are only captions.

Are there only captions?
Are they unique to a media info entity?
How long can they be?
Will the use of this index only be going from caption to entity in Lua? Or are there other needs?

Probably best to spin this off into a different ticket covering creating the system.

What kind of lookups should there be for MediaInfo? Can the ElasticSearch index that we already have be used for it?

If elastic already has captions indexed then yes this could probably just use elastic. (There would be slightly more delay and potential unpredictability using elastic than with MySQL tables like wb_terms.)

However, I would be wary, as we have previously discussed using elastic for similar lookups for item labels, and there was worry about the scale and the increased traffic it would cause on elastic.

I know media info isn't there yet in terms of these caption to entity lookups, but it will continue to increase, and there is no reason it couldn't even surpass the rate of the requests of this type for items.

If elastic already has captions indexed then yes this could probably just use elastic.

Yes, we do index captions. The question is of course how you want them indexed, which depends on the type of lookup that is required. Until now we were indexing them as labels, and we just moved to indexing them as descriptions instead (T226722), but it would be nice to know how the index is intended to be used.

I know media info isn't there yet in terms of these caption to entity lookups, but it will continue to increase, and there is no reason it couldn't even surpass the rate of the requests of this type for items.

So is it a full caption text to item lookup (exact match), or some kind of search inside the text, or prefix completion, or something else? I am not sure about the use case and what the matching requirements are. Understanding this would probably be required to see whether Elastic is appropriate or not.
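The three lookup styles named in the question above behave quite differently, which is why the choice of backend depends on them. The following is a toy Python sketch over an in-memory dictionary; the sample captions and entity IDs are made up, and the function names are hypothetical. It says nothing about how wb_terms or Elastic actually implement these lookups.

```python
# Made-up sample data: entity ID -> caption text.
SAMPLE_CAPTIONS = {
    "M1": "Sunset over the harbour",
    "M2": "Sunset over the hills",
    "M3": "Harbour at dawn",
}


def exact_match(captions, text):
    """Entities whose caption equals the given text exactly."""
    return sorted(eid for eid, cap in captions.items() if cap == text)


def prefix_match(captions, prefix):
    """Entities whose caption starts with the prefix (case-insensitive)."""
    needle = prefix.casefold()
    return sorted(
        eid for eid, cap in captions.items() if cap.casefold().startswith(needle)
    )


def search_inside(captions, fragment):
    """Entities whose caption contains the fragment anywhere (case-insensitive)."""
    needle = fragment.casefold()
    return sorted(eid for eid, cap in captions.items() if needle in cap.casefold())
```

An exact match can return at most the entities that share one caption, while prefix and substring lookups can return many, so the expected result size and traffic differ between the three.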

The redesign for items and properties has already happened.
Mediainfo could benefit from a different design if there are only captions.
Are there only captions?
Are they unique to a media info entity?
How long can they be?
Will the use of this index only be going from caption to entity in Lua? Or are there other needs?
Probably best to spin this off into a different ticket covering creating the system.

It's called a caption for the users, but if you have a look at https://commons.wikimedia.org/w/api.php?action=wbgetentities&ids=M62798946 (today's picture of the day), you'll see it's just labels. So it's basically the same as a Wikidata item; I'm not even sure the length limit for labels was raised. The claims section is also the same, just using the wrong key, so that will bite you in Lua too.

This task is a blocker for T223792 so I don't think looking at Elastic will help.
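As a rough illustration of the point that captions are exposed as labels, the sketch below builds the wbgetentities URL quoted above and pulls the label values out of a response-shaped dictionary. The helper names are hypothetical, `format=json` is an addition that is not in the quoted URL, and the sample response is made up rather than copied from the real entity. No network request is made.

```python
from urllib.parse import urlencode

API_URL = "https://commons.wikimedia.org/w/api.php"


def wbgetentities_url(entity_id):
    """Build a wbgetentities URL for one entity ID (format=json is added here)."""
    query = urlencode(
        {"action": "wbgetentities", "ids": entity_id, "format": "json"}
    )
    return f"{API_URL}?{query}"


def captions_by_language(response, entity_id):
    """Return {language: caption text} from the entity's "labels" section."""
    entity = response.get("entities", {}).get(entity_id, {})
    labels = entity.get("labels", {})
    return {lang: entry["value"] for lang, entry in labels.items()}


# Made-up response fragment in the usual Wikibase label shape.
sample_response = {
    "entities": {
        "M62798946": {
            "labels": {
                "en": {"language": "en", "value": "An example caption"},
                "de": {"language": "de", "value": "Eine Beispielbeschreibung"},
            }
        }
    }
}
```

Calling `captions_by_language(sample_response, "M62798946")` returns the caption text keyed by language code, read from the same `labels` structure a Wikidata item would use.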

you'll see it's just labels

Yes and no. They are stored in the labels slot, but that doesn't mean they are the same in all aspects. E.g. labels are used to look up items all the time, but would captions be used to look up files? I don't think they'd be used the same way, since semantically they are more like descriptions there. So that's what I am trying to figure out. Also, why do captions even matter for T223792? Shouldn't getEntity look up by ID (which in the case of MediaInfo is the straight page ID), not by caption?

Also, why do captions even matter for T223792? Shouldn't getEntity look up by ID (which in the case of MediaInfo is the straight page ID), not by caption?

Indeed, if we are not trying to look up by caption but only by ID, then none of this should be needed, and it is likely that just some wiring somewhere needs fixing.
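If the lookup really is by ID, no term index is involved at all. Going by the statement in this thread that a MediaInfo ID is the straight page ID, the mapping can be sketched as below. The function names are hypothetical and this is not the actual Wikibase or Lua wiring.

```python
def mediainfo_id_for_page(page_id):
    """Return the MediaInfo entity ID for a file page ID, e.g. 62798946 -> "M62798946"."""
    if isinstance(page_id, bool) or not isinstance(page_id, int) or page_id <= 0:
        raise ValueError(f"page_id must be a positive integer, got {page_id!r}")
    return f"M{page_id}"


def page_id_for_mediainfo(entity_id):
    """Return the page ID encoded in a MediaInfo entity ID, e.g. "M62798946" -> 62798946."""
    number = entity_id[1:]
    if not (entity_id.startswith("M") and number.isascii() and number.isdigit()):
        raise ValueError(f"not a MediaInfo ID: {entity_id!r}")
    return int(number)
```

Both directions are plain string operations, so nothing needs to be backfilled for this kind of lookup.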

matthiasmullie closed this task as Declined. Jul 26 2019, 12:07 PM

I'll decline this task, since we probably won't end up wanting or needing wb_terms (not yet, at least).
This could be reopened if at some point later on we find that we do have a valid use case for it.