Page MenuHomePhabricator

Implement TermLookup based on Elastic
Closed, ResolvedPublic

Description

We need a scalable way to look up labels and description for an entity, in a given set of languages.
The current implementation using the wb_terms table does not perform well for large wikis.
An alternative approach would be to use Elastic for looking up labels.

Open question: can we re-use the information we store in Elastic for CirrusSearch (compare T88534: [Story] Implement EntitySearch service on top of Elastic), or do we need a special mapping just for the label lookup?

Event Timeline

daniel moved this task from Inbox to Push on the User-Daniel board.Jan 5 2017, 6:58 PM

Just to be clear, which lookup(s) we're talking about:

  • term -> entity ID
  • term+language -> entity ID
  • entity+language -> term(s)
  • entity -> term(s) in any language

I think we have all pieces for this, but not sure what API should be there or whether we still want this?

@Lydia_Pintscher, @daniel Would like a bit more clarity about this one - what needs to be done here (if at all)?

We do have a TermLookup interface with a few implementations, one that directly accesses the entity, and one that accesses wb_terms. This ticket suggests an other implementation that queries the Cirrus/Elastic index.

The features of TermLookup are currently used in quite a lot of very different places:

  • In the getLabelByLanguage function in Lua.
  • In the LinkBegin hook handler that makes sure proper labels are displayed when linking to Wikidata entities.
  • In the LabelDescriptionLookup implementations, which are again used in about two dozen different places.

It might be possible to replace some of these usages with new Cirrus/Elastic based infrastructure that does not implement TermLookup. However, there are so many places, it doesn't seem to be a good idea to replace them all. The easiest way forward seems to be a new Cirrus/Elastic based TermLookup that can then be used wherever necessary, and slowly replace especially the remaining wb_terms lookups.

Smalyshev moved this task from Backlog to Next on the User-Smalyshev board.Feb 13 2018, 8:18 PM
Smalyshev added a comment.EditedMar 1 2018, 9:38 PM

@thiemowmde Thanks, now I understand the idea much better. However, it seems to be that it might be better to implement TermIndex, not TermLookup - since otherwise we do not get to benefit from caching/prefetching functionality that specific TermLookup classes implement on top of TermIndex and PrefetchingTermLookup. Does it make sense or am I missing something here?

The fact that TermIndex does both reads and writes is a bit of a problem (as we have only one service and if we replace it with Elastic then the write part for it won't work and other clients that still use wb_terms will be broken) so there might be a need to split the TermIndex maybe?

Smalyshev added a comment.EditedMar 1 2018, 9:45 PM

Also noticing something strange: WikibaseRepo::getPrefetchingTermLookup returns PrefetchingTermLookup, but implementation uses $this->getWikibaseServices()->getTermBuffer() which returns TermBuffer - which doesn't have to be implementing PrefetchingTermLookup (even though existing concrete implementations of TermBuffer seem to implement it).

thiemowmde edited subscribers, added: WMDE-leszek; removed: Rzuwig, JanZerebecki.Mar 2 2018, 10:00 AM

Yea, there is an interface violation in WikibaseClient::getPrefetchingTermLookup as well as WikibaseRepo::getPrefetchingTermLookup. I uploaded https://gerrit.wikimedia.org/r/415816 as a reminder. For what this ticket here is about you should ignore this and the separate TermBuffer interface and only care about classes implementing PrefetchingTermLookup.

Please try to avoid fiddling with TermIndex, if possible. It is way to heavily bound to the wb_terms table. I feel it exposes to much of the wb_terms table structure, as well as details of the WHERE clause.

When looking at classes like BufferingTermLookup I have similar concerns. It is way to heavily bound to TermIndex, and not worth fiddling with.

I would recommend sticking to the PrefetchingTermLookup interface, and possibly creating a new caching layer that is different from BufferingTermLookup.

@daniel, what do you think?

So if I look into PrefetchingTermLookup, it's a sum of TermBuffer and TermLookup, which is:

public function prefetchTerms( array $entityIds, array $termTypes = null, array $languageCodes = null )
public function getPrefetchedTerm( EntityId $entityId, $termType, $languageCode )
public function getLabel( EntityId $entityId, $languageCode );
public function getLabels( EntityId $entityId, array $languageCodes );
public function getDescription( EntityId $entityId, $languageCode );
public function getDescriptions( EntityId $entityId, array $languageCodes );

I don't see where term types are defined and how they relate to labels/descriptions - though I guess 'label' is type for labels and 'description' for descriptions, but could there be more? Is Elastic implementation supposed to support them or should throw if they're encountered? There are constants in TermIndexEntry, but I assume those are implementation details for SQL part and I should stay away from them?EntityTermLookupBase seems to just use plain strings, not constants. Is it ok to keep doing this?

Also, there's an overlap of coverage - you can use getLabel or you can use getPrefetchedTerm(... 'label' ...). When which one is supposed to be used? Should getLabel use prefetched data if available?

I see there's a bunch of code dealing with prefetching, buffering, resolving different repos etc. organizing - should all this code be reimplemented for ElasticSearch implementation of prefetchTerms or can it be somehow reused? All that code relies on TermBuffer/TermIndex now.

As I understand, now TermLookup methods are supposed to be dealing with entities from federated repos too, and also working on the WikibaseClient side where there's no repo config. However, if we want to have ElasticSearch implementation, this means we need access to configuration of target repo, and more specifically, to its CirrusSearch settings (at least basic ones like index names, cluster names, etc. - we probably don't need search profiles for just fetching labels). I am not sure those are easily available. So do we need this capability for ElasticSearch implementation? If yes, we'd need to figure out how to get the right configs... Or we could duplicate config in client configs - that would probably work for index names - which are also very conventional and derived from wiki names, so we could also create them probably - but connection settings still need to come from somewhere, unless we assume everybody uses the same endpoint. Which is I think the case in production, but not sure how safe it is to rely on it.

Smalyshev moved this task from Next to Doing on the User-Smalyshev board.Mar 2 2018, 9:10 PM
@Smalyshev wrote:

I don't see where term types are defined […]

That's exactly the issue I referred to. These "term types" being strings is mostly an implementation detail of how the wb_terms table is designed. The currently specified term types can be found in TermIndexEntry. Please use these constants to make it easier to find places working with these term types.

  • TermIndexEntry::TYPE_LABEL = "label"
  • TermIndexEntry::TYPE_ALIAS = "alias"
  • TermIndexEntry::TYPE_DESCRIPTION = "description"

but could there be more?

We played around with ideas to have other types for Lexicographical data, but we are most probably not going to do this. So these 3 types are all that exist.

Is Elastic implementation supposed to support them […]

Please either return no result if a caller asks for an unknown term type, or throw an exception.

Also, there's an overlap of coverage - you can use getLabel or you can use getPrefetchedTerm(... 'label' ...).

This is just one of the many problems of this combination of older and newer interfaces. Please focus on the more narrow methods (everything without $termType) if possible.

code dealing with prefetching, buffering, resolving different repos […] should all this code be reimplemented for ElasticSearch implementation of prefetchTerms or can it be somehow reused? All that code relies on TermBuffer/TermIndex now.

We might need to split the existing implementations. But I believe it is easier to start over and not touch the existing implementations. For example, you can start by implementing nothing but a TermLookup. If this works, you can add a small TermBuffer implementation on top that allows for prefetching. Since TermBuffer alone is not very useful, you might need to use the PrefetchingTermLookup interface as a decorator instead:

class PrefetchingElasticTermLookup implements PrefetchingTermLookup {

	private $lookup;

	public function __construct( ElasticTermLookup $lookup ) {
		$this->lookup = $lookup;
	}

	public function prefetchTerms( array $entityIds, array $termTypes = null, array $languageCodes = null ) {
		/* TBD */
	}

	public function getPrefetchedTerm( EntityId $entityId, $termType, $languageCode ) {
		switch ( $termType ) {
			case TermIndexEntry::TYPE_LABEL:
				return $this->getLabel( $entityId, $languageCode );
			case TermIndexEntry::TYPE_DESCRIPTION:
				return $this->getDescription( $entityId, $languageCode );
			default:
				throw new \InvalidArgumentException( "Not defined for \$termType \"$termType\"" );
		}
	}

	public function getLabel( EntityId $entityId, $languageCode ) {
		$this->lookup->getLabel( $entityId, $languageCode );
	}

	public function getLabels( EntityId $entityId, array $languageCodes ) {
		$this->lookup->getLabels( $entityId, $languageCodes );
	}

	public function getDescription( EntityId $entityId, $languageCode ) {
		$this->lookup->getDescription( $entityId, $languageCode );
	}

	public function getDescriptions( EntityId $entityId, array $languageCodes ) {
		$this->lookup->getDescriptions( $entityId, $languageCodes );
	}

}

TermLookup methods are supposed to be dealing with entities from federated repos too, and also working on the WikibaseClient side […]

This might indeed be needed. However, I suggest to not care for these for the moment. We can add such functionality later via dispatchers, decorators, and such.

Another API quirk: getLabel(s) and getDescription(s) return null when the term does not exist. getPrefetchedTerm returns null when term has not been prefetched and false when it does not exist. Probably needs to be unified, otherwise it is annoying to integrate those two.

Change 416641 had a related patch set uploaded (by Smalyshev; owner: Smalyshev):
[mediawiki/extensions/Wikibase@master] [WIP] Implement TermLookup with ElasticSearch

https://gerrit.wikimedia.org/r/416641

Smalyshev moved this task from Doing to In review on the User-Smalyshev board.Mar 13 2018, 7:38 AM

Change 416641 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Implement TermLookup with ElasticSearch

https://gerrit.wikimedia.org/r/416641

Smalyshev closed this task as Resolved.Mar 18 2018, 9:23 PM

I think this is done, though not enabled anywhere yet - I imagine we need some plan with regard to where it is going to be enabled?

Note: there are several places in the code that call TermIndex::getTermsOfEntities(). These would need to be converted to using a TermLookup in order to benefit from this.

To enable this, the entry for 'PrefetchingTermLookup' in data-access/src/PerRepositoryServiceWiring.php needs to be changed to instantiate an ElasticTermLookup instead of a BufferingTermLookup, depending on config.