
[Checkpoint 5] Update Read Logic
Open, Needs TriagePublic

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mar 26 2019, 4:46 PM
alaa_wmde renamed this task from [Checkpoint 6] Update Read Logic to [Checkpoint 6] Update Read Logic? or next blaze. · Mar 26 2019, 4:46 PM
alaa_wmde renamed this task from [Checkpoint 6] Update Read Logic? or next blaze to [Checkpoint 5] Update Read Logic. · Apr 23 2019, 9:08 PM

I think the interface that WikibaseRepo and WikibaseClient use the most is PrefetchingTermLookup, which combines the following two interfaces:

  • TermLookup, a straightforward entity→term mapping with the following functions:
    • getLabel( EntityId $entityId, $languageCode )
    • getLabels( EntityId $entityId, array $languageCodes )
    • getDescription( EntityId $entityId, $languageCode )
    • getDescriptions( EntityId $entityId, array $languageCodes )
  • TermBuffer, a more efficient interface for getting terms of several entities in batch, with the following functions:
    • prefetchTerms( array $entityIds, array $termTypes, array $languageCodes )
    • getPrefetchedTerm( EntityId $entityId, $termType, $languageCode )

(There is also a TermIndex interface, but that includes search functions too, and I don’t think we’re interested in supporting those yet.)
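For illustration, the expected call pattern of TermBuffer (prefetch once, then buffered reads) can be modeled like this. This is a Python sketch over an in-memory dict, not the actual PHP implementation; the store contents are made up:

```python
# Python model of the TermBuffer interface above (illustration only; the
# real interface lives in Wikibase's PHP codebase, and the in-memory
# "store" here is a hypothetical stand-in for the database).

class SimpleTermBuffer:
    """prefetchTerms()/getPrefetchedTerm(), modeled over a dict."""

    def __init__(self, store):
        self._store = store   # maps (entity_id, term_type, language) -> text
        self._buffer = {}

    def prefetch_terms(self, entity_ids, term_types, language_codes):
        # In PHP: prefetchTerms( $entityIds, $termTypes, $languageCodes ).
        for entity_id in entity_ids:
            for term_type in term_types:
                for language in language_codes:
                    key = (entity_id, term_type, language)
                    if key in self._store:
                        self._buffer[key] = self._store[key]

    def get_prefetched_term(self, entity_id, term_type, language_code):
        # Reads only from the buffer; never touches the store again.
        return self._buffer.get((entity_id, term_type, language_code))

store = {("P31", "label", "en"): "instance of"}
buffer = SimpleTermBuffer(store)
buffer.prefetch_terms(["P31"], ["label"], ["en", "de"])
print(buffer.get_prefetched_term("P31", "label", "en"))  # instance of
```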

A TermLookup would be very straightforward to implement as a wrapper around a PropertyTermStore or ItemTermStore, and a simple TermBuffer could be implemented by iterating over the $entityIds, but that means two database queries per entity ID (get term IDs and resolve them), losing the batch aspect. I think to properly and efficiently implement this, we’ll want an implementation that’s separate from PropertyTermStore and ItemTermStore.
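To make the cost of the naive approach concrete, here is a Python sketch with a hypothetical counting store standing in for the database: iterating over the entity IDs costs two round trips per entity, one to fetch the term IDs and one to resolve them.

```python
class CountingStore:
    """Stand-in for the database; counts round trips. Data is hypothetical."""

    def __init__(self):
        self.queries = 0
        self._term_ids = {"Q1": [1, 2], "Q2": [3]}   # entity -> term IDs
        self._terms = {1: ("label", "en", "universe"),
                       2: ("description", "en", "totality of everything"),
                       3: ("label", "en", "Earth")}

    def get_term_ids(self, entity_id):
        self.queries += 1   # query 1: look up the entity's term IDs
        return self._term_ids.get(entity_id, [])

    def resolve_term_ids(self, term_ids):
        self.queries += 1   # query 2: resolve those IDs to actual terms
        return [self._terms[i] for i in term_ids]

def naive_prefetch(store, entity_ids):
    """TermBuffer built by iterating over the entity IDs: the batch
    aspect is lost, because every entity triggers its own two queries."""
    terms = {}
    for entity_id in entity_ids:
        term_ids = store.get_term_ids(entity_id)
        terms[entity_id] = store.resolve_term_ids(term_ids)
    return terms

store = CountingStore()
terms = naive_prefetch(store, ["Q1", "Q2"])
print(store.queries)   # 4: two queries per entity instead of one batch
```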

And unfortunately, I don’t think that implementation can use the same TermIdsResolver either, at least not in its current form: if we combine the term IDs of multiple properties/items into one batch and ask the TermIdsResolver to resolve all of them, we won’t be able to tell which term belongs to which entity. That leaves us with three options:

  • we still issue one query per entity, and the only benefit from batching is that we got the term IDs for all the entities in one query;
  • we refine that interface somehow (or introduce a second one more suitable for batching); or
  • our implementation skips over all that abstraction and knows about the underlying wbt_term_in_lang etc. tables (but that’s super ugly again).

alaa_wmde added a comment. · Edited · Jun 4 2019, 2:15 PM

> (There is also a TermIndex interface, but that includes search functions too, and I don’t think we’re interested in supporting those yet.)

I originally thought that would be included under this checkpoint (as separate tasks).

> we need to refine that interface somehow (or introduce a second one more suitable for batching)

Yeah, we can go with this for batching. The new interface (or the existing one, enhanced) would probably need to accept an array of arrays of term IDs, then replace each inner array of term IDs with its resolved representation (which can be the same one the current TermIdsResolver uses), preserving the top-level keys of the input array; those keys, in turn, can be the entity IDs.

Yeah, I like the array-of-arrays idea. It could be a separate method on TermIdsResolver or a separate interface; in either case, I think DatabaseTermIdsResolver should still be the single database implementation, and if we go for a separate interface, it should implement that one as well, rather than having a separate class just for that purpose.
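A rough Python sketch of that refinement (the names are hypothetical; in PHP this would be a method on TermIdsResolver or on a sibling interface implemented by DatabaseTermIdsResolver): the union of all term IDs is resolved in a single query, and the preserved top-level keys let each resolved term be handed back to its entity.

```python
class Store:
    """Stand-in for the database side of the resolver; counts round
    trips. The term data is made up for illustration."""

    def __init__(self):
        self.queries = 0
        self._terms = {1: ("label", "en", "universe"),
                       2: ("description", "en", "totality of everything"),
                       3: ("label", "en", "Earth")}

    def resolve_term_ids(self, term_ids):
        self.queries += 1   # one batched query for all IDs at once
        return [self._terms[i] for i in term_ids]

def resolve_grouped_term_ids(store, grouped_ids):
    """Resolve several groups of term IDs in one query, preserving the
    top-level keys of the input array (here: entity IDs)."""
    all_ids = sorted({i for ids in grouped_ids.values() for i in ids})
    by_id = dict(zip(all_ids, store.resolve_term_ids(all_ids)))
    return {key: [by_id[i] for i in ids]
            for key, ids in grouped_ids.items()}

store = Store()
grouped = {"Q1": [1, 2], "Q2": [3]}   # term IDs already fetched per entity
terms = resolve_grouped_term_ids(store, grouped)
print(store.queries)   # 1: the whole batch resolved in a single query
print(terms["Q2"])     # [('label', 'en', 'Earth')]
```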


Since Wikidata doesn’t use wb_terms for searching (T188993), and we’re not supporting third parties with the normalized schema at this point, we don’t need to implement full search yet. However, there is one class that uses TermIndex’s search functionality for other purposes: TermPropertyLabelResolver “searches” for all the property labels (case-insensitively, with no prefix search and no limit), effectively preloading a map from label to property ID (which it also caches in memcached) to accelerate looking up properties by label. (This is used in WikibaseClient, where we allow users to request data by specifying a property label instead of a property ID.)

To support this case, we’ll also have to implement a separate PropertyLabelResolver. Like TermPropertyLabelResolver, it should prefetch all property labels and store them in memcached (using the same key if possible). However, the current interfaces for the new term store would only let us get all the term IDs for properties (effectively the entire content of wbt_property_terms), resolve all those terms, and filter for labels in a certain language afterwards. We can’t afford to transfer all that data between PHP and the database, so instead the wbt_property_terms and wbt_term_in_lang parts will have to exchange JOIN conditions somehow, so that we can load all property labels in a certain language with just one query. (It still won’t be as efficient as wb_terms, but with caching it should do for a while, and we can investigate improvements later.)
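A sketch of that single query, using SQLite from Python as a stand-in for the real database. The table and column names follow the normalized schema as I understand it (wbt_property_terms → wbt_term_in_lang → wbt_text_in_lang → wbt_text); the rows and the numeric type ID standing for “label” are made up for illustration:

```python
import sqlite3

# Minimal in-memory model of the normalized term store tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE wbt_property_terms (wbpt_property_id INT, wbpt_term_in_lang_id INT);
    CREATE TABLE wbt_term_in_lang  (wbtl_id INT, wbtl_type_id INT, wbtl_text_in_lang_id INT);
    CREATE TABLE wbt_text_in_lang  (wbxl_id INT, wbxl_language TEXT, wbxl_text_id INT);
    CREATE TABLE wbt_text          (wbx_id INT, wbx_text TEXT);

    -- hypothetical data: P31's English label, with type ID 1 meaning 'label'
    INSERT INTO wbt_property_terms VALUES (31, 10);
    INSERT INTO wbt_term_in_lang   VALUES (10, 1, 20);
    INSERT INTO wbt_text_in_lang   VALUES (20, 'en', 30);
    INSERT INTO wbt_text           VALUES (30, 'instance of');
""")

# One query loads every property label in a given language by joining
# through the whole chain, instead of resolving term IDs in PHP.
rows = db.execute("""
    SELECT wbpt_property_id, wbx_text
    FROM wbt_property_terms
    JOIN wbt_term_in_lang ON wbtl_id = wbpt_term_in_lang_id
    JOIN wbt_text_in_lang ON wbxl_id = wbtl_text_in_lang_id
    JOIN wbt_text         ON wbx_id  = wbxl_text_id
    WHERE wbtl_type_id = 1 AND wbxl_language = 'en'
""").fetchall()
print(rows)  # [(31, 'instance of')]
```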

Addshore moved this task from incoming to in progress on the Wikidata board.Jun 21 2019, 11:37 PM