Page MenuHomePhabricator

Implement PropertyLabelResolver using normalized storage
Open, Needs TriagePublic

Description

The current implementation, TermPropertyLabelResolver, uses a TermIndex to get all the property labels in a certain language at once, cache that in memcached, and then look up property IDs by label. We should do something similar in the normalized schema – i. e., still pre-fetch everything (and still cache it, using the same key if possible) instead of actually searching for a single text in the term storage – but without implementing TermIndex (because we don’t want to implement full search support yet).

However, the current interfaces for the new term store would only allow us to get all the term IDs for properties (effectively the entire content of wbt_property_terms), then resolve all those terms to a giant array, and then filter for labels in a certain language afterwards. We can’t afford to transfer all that data between PHP and the database, so instead the wbt_property_terms and wbt_term_in_lang parts will have to exchange JOIN conditions somehow, so that we can load all property labels in a certain language with just one query. (It still won’t be as efficient as in wb_terms, but with caching it should do for a while, and we can investigate improvements later.)

Event Timeline

Change 516465 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] Decouple TermPropertyLabelResolver from TermIndex

https://gerrit.wikimedia.org/r/516465

hoo added a comment.Jun 11 2019, 10:40 AM

We could change resolveTermIds( array $termIds ): array (from TermIdsResolver) to resolveTermIds( array $termIds, array $types = null, array $languages = null ): array where we restrict the result set to only contain thing of $types and/ or only things in language(s) $languages. I would change resolveGroupedTermIds accordingly.

If we don't want that, we could create new lookup methods for this like resolveAndFilterTermIds or set this option via a cloning setter (like TermIdsResolver::withLanguageFilter()) on TermIdsResolver implementations (which would then apply to all subsequent calls).

hoo added a comment.Jun 11 2019, 10:41 AM

@alaa_wmde @Lucas_Werkmeister_WMDE Does that make sense to ^?

Another thing that just came into my mind would be something like resolveTermIds( array $termIds, TermIdsResolverFilter[] $filters ): array and then have LanguageTermIdsResolverFilter and TypeTermIdsResolverFilter… but that's probably harder to integrate with the JOINs.

That makes sense to me – it also matches TermBuffer’s array $termTypes = null, array $languageCodes = null arguments. A separate method doesn’t seem necessary, and a cloning setter seems harder to use to me. And a TermIdsResolverFilter interface doesn’t really work, I think, because you can’t actually inject arbitrary instances of the interface there, the implementation will have to know how to filter for it in the database. (Also, that would open the door to a MatchingPrefixTermIdsResolverFilter, i. e. a kind of search, which we don’t actually want to support as far as I understand.)

Change 516465 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Decouple TermPropertyLabelResolver from TermIndex

https://gerrit.wikimedia.org/r/516465

That makes sense to me – it also matches TermBuffer’s array $termTypes = null, array $languageCodes = null arguments. A separate method doesn’t seem necessary, and a cloning setter seems harder to use to me. And a TermIdsResolverFilter interface doesn’t really work, I think, because you can’t actually inject arbitrary instances of the interface there, the implementation will have to know how to filter for it in the database. (Also, that would open the door to a MatchingPrefixTermIdsResolverFilter, i. e. a kind of search, which we don’t actually want to support as far as I understand.)

Make sense to me too!

Change 516661 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[mediawiki/extensions/Wikibase@master] TermIdsResolver: Allow filtering terms by type and lang

https://gerrit.wikimedia.org/r/516661

hoo added a comment.EditedJun 13 2019, 10:36 AM

I just realized that limiting the type and language in TermIdsResolver wont do the trick, as we then still need to get all wbpt_term_in_lang_ids first.

I will thus need to join wbt_property_terms with wbt_term_in_lang and wbt_text_in_lang when getting the term ids to resolve :/

Yeah, that’s what I meant with this part of the task description:

However, the current interfaces for the new term store would only allow us to get all the term IDs for properties (effectively the entire content of wbt_property_terms), then resolve all those terms to a giant array, and then filter for labels in a certain language afterwards. We can’t afford to transfer all that data between PHP and the database, so instead the wbt_property_terms and wbt_term_in_lang parts will have to exchange JOIN conditions somehow, so that we can load all property labels in a certain language with just one query. (It still won’t be as efficient as in wb_terms, but with caching it should do for a while, and we can investigate improvements later.)

I was thinking of something like ActorMigration::getJoin() or CommentStore::getJoin(), so you can still keep the components somewhat separate.

Rough sketch with Alaa – here, DatabaseTermIdsResolver gets the extra join conditions, and DatabasePropertyTermStore would call that special method if it detects its injected TermIdsResolver is a DatabaseTermIdsResolver.

	// in DatabaseTermIdsResolver

	public function resolveGroupedTermConditions( array $groupedWhatevers ) {
		foreach ( $groupedWhatevers as $groupedWhatever ) {
			list ( $tables, $conditions ) = $groupedWhatever;

			$result = $db->select(
				array_merge( 
					[ 'wbt_term_in_lang', 'wbt_text_in_lang', 'wbt_text' ],
					$tables
				),
				[ 'wbtl_id', 'wbtl_type_id', 'wbxl_language', 'wbx_text' ],
				array_merge( [
					'wbtl_text_in_lang_id=wbxl_id',
					'wbxl_text_id=wbx_id',
				], $conditions ),
				__METHOD__
			);
			$this->preloadTypes( $result );
			foreach ( $result as $row ) {
				foreach ( $groupNamesByTermIds[$row->wbtl_id] as $groupName ) {
					$this->addResultTerms( $groupedTerms[$groupName], $row );
				}
			}
			return $groupedTerms;
		}
	}

	// in DatabasePropertyTermStore

	$this->resolver->resolveGroupedTermConditions( [
		'P1' => [
			// extra tables
			[ 'wbt_property_terms' ],
			// extra conditions
			[
				'wbpt_term_in_lang_id=wbtl_id',
				'wbpt_property_id' => $numericPropertyIds,
			],
		],
	] );

For the unkown types acquiring issue, we thought we would add another interface that DatabaseTypeIdsResolver would implement and DatabasePropertyTermStore would use.

The interface would be called, we thought TypeIdsLookup with a function lookupTypeIds( array $types) : array // returns type ids for known types

Moved back to Ready To Go until sub-tasks are done so that this tasks logic can use them

alaa_wmde added a subscriber: hoo.
alaa_wmde added a comment.EditedJun 14 2019, 2:26 PM

So the change https://gerrit.wikimedia.org/r/516661 now introduces filtering and is rebased on and adjusted to use TypeIdsLookup and DatabaseTermIdsResolver::resolveTermsViaJoin.

Change 517096 had a related patch set uploaded (by Alaa Sarhan; owner: Alaa Sarhan):
[mediawiki/extensions/Wikibase@master] WIP. Add CachedDatabasePropertyLabelResolver implementation using new store.

https://gerrit.wikimedia.org/r/517096

Change 516661 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] TermIdsResolver: Allow filtering terms by type and lang

https://gerrit.wikimedia.org/r/516661

Change 517096 merged by jenkins-bot:
[mediawiki/extensions/Wikibase@master] Add CachedDatabasePropertyLabelResolver implementation using new store.

https://gerrit.wikimedia.org/r/517096

Addshore moved this task from incoming to in progress on the Wikidata board.Jun 21 2019, 11:37 PM