investigate how language fallbacks can be done in entity search
Closed, Resolved, Public

Description

We want to show the user results including results from fallback languages in the entity selector and search. What needs to be done to get that?

Event Timeline

I investigated this "top down", from the API entry point to the database. These are the things that need to happen (in reverse chronological order):

  • The SearchEntities module needs another boolean parameter named fallback or languageFallback that triggers the use of fallback languages. We could also support multiple values for the language attribute.
  • The SearchEntities module needs to know a fallback chain instead of a single language code. To allow this, a LanguageFallbackChainFactory needs to be obtained from global context in the constructor.
  • The private getEntries method needs to be modified to include the actual language of the matching term in the result.
  • The call to SearchIndex::getTermsOfEntities needs to be modified to include all relevant languages.
  • The signature of TermIndex::getMatchingIds needs to change, so SearchEntities::searchEntities can pass all languages in the fallback chain.
    • I suggest changing the signature to getMatchingIds( $term, $termTypes, $languageCodes, $entityTypes, $options ), no longer using "template terms" to specify the search criteria.
    • The only other user of getMatchingIds is SpecialTermDisambiguation, which could easily be adapted to the new signature.
    • TermSQLIndex::getMatchingIds will need to be adapted to the new signature. No changes to the database structure are needed.
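
The proposed signature can be sketched as follows. This is a minimal illustration in Python rather than the actual PHP code: the `Term` record, the in-memory `TERMS` list, the case-insensitive prefix match, and the `limit` option are all assumptions made for the example. Only the parameter order mirrors the suggested getMatchingIds( $term, $termTypes, $languageCodes, $entityTypes, $options ).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    entity_id: str
    entity_type: str  # e.g. "item" or "property"
    term_type: str    # e.g. "label" or "alias"
    language: str
    text: str


# Toy stand-in for the term table; the rows are made up for illustration.
TERMS = [
    Term("Q1", "item", "label", "de", "Berlin"),
    Term("Q1", "item", "label", "en", "Berlin"),
    Term("Q2", "item", "label", "en", "Bern"),
    Term("Q3", "item", "alias", "fr", "Berne"),
]


def get_matching_ids(term, term_types, language_codes, entity_types, options=None):
    """Return IDs of entities with a matching term in ANY of the given languages."""
    limit = (options or {}).get("limit")  # hypothetical option name
    prefix = term.casefold()
    ids = []
    for t in TERMS:
        if (
            t.term_type in term_types
            and t.language in language_codes
            and t.entity_type in entity_types
            and t.text.casefold().startswith(prefix)
            and t.entity_id not in ids  # one hit per entity, first match wins
        ):
            ids.append(t.entity_id)
            if limit is not None and len(ids) >= limit:
                break
    return ids
```

Passing a list of language codes instead of one "template term" per language means the caller can hand over the whole fallback chain in a single call.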

We should further investigate whether it would be possible to get rid of the subsequent call to getTermsOfEntities. Currently, the matching terms are read from the database twice. If we modify and use getMatchingTerms instead of getMatchingIds, this should be possible. getMatchingIds should probably be removed completely.
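
To illustrate why returning terms instead of IDs avoids the second read, here is a rough Python sketch (not the PHP implementation). The `Term` tuple, the `TERMS` data, and the shape of the result entries are assumptions for the example; the point is only that each matching term already carries its language and text, so no follow-up lookup is needed.

```python
from typing import NamedTuple


class Term(NamedTuple):
    entity_id: str
    term_type: str
    language: str
    text: str


# Made-up rows standing in for the term table.
TERMS = [
    Term("Q1", "label", "de", "Berlin"),
    Term("Q1", "label", "en", "Berlin"),
    Term("Q2", "label", "en", "Bern"),
    Term("Q3", "alias", "fr", "Berne"),
]


def get_matching_terms(term, term_types, language_codes):
    """Single pass over the index that returns the matching terms themselves."""
    prefix = term.casefold()
    return [
        t for t in TERMS
        if t.term_type in term_types
        and t.language in language_codes
        and t.text.casefold().startswith(prefix)
    ]


def search_entities(term, language_codes):
    """Build one entry per entity directly from the matched terms."""
    entries = {}
    for t in get_matching_terms(term, ("label", "alias"), language_codes):
        # The first matching term per entity is kept; it already tells us
        # which language actually matched.
        entries.setdefault(
            t.entity_id,
            {"id": t.entity_id, "match": {"type": t.term_type,
                                          "language": t.language,
                                          "text": t.text}},
        )
    return list(entries.values())
```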

Note that none of the above will cause the difference in language to be considered when ranking the matches.

I consider this done. Any comments?

For searching, I would like to be able to 'fall back' more generously to more languages. For example, I would like to search for the Arabic spelling of something in the suggester without having to change my language in ULS. Despite the flaws in Special:Search, I am super glad at least this works there :)

The fallback chain should obviously be preferred when ranking the results, though.
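
One way to read this: matches from any language are allowed, but they are ordered by where their language sits in the fallback chain, with languages outside the chain last. A small Python sketch under that assumption follows; the `rank_by_fallback` name and the dict shape of a match are hypothetical.

```python
def rank_by_fallback(matches, fallback_chain):
    """Order matches by the position of their language in the fallback chain.

    Languages that are not in the chain sort after all chain languages.
    sorted() is stable, so the incoming order is kept within one language.
    """
    position = {code: i for i, code in enumerate(fallback_chain)}
    outside = len(fallback_chain)
    return sorted(matches, key=lambda m: position.get(m["language"], outside))


matches = [
    {"id": "Q10", "language": "en"},
    {"id": "Q20", "language": "ar"},
    {"id": "Q30", "language": "de"},
    {"id": "Q40", "language": "en"},
]
ranked = rank_by_fallback(matches, ["de", "en"])
```

With the chain `["de", "en"]`, the German match comes first, the two English matches keep their relative order, and the Arabic match is still returned but ranked last.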

I think we should try to unify the backend code for Special:Search and this. Maybe searchentities would use a separate index of labels and aliases only (in Elastic eventually, and for now in wb_terms), but the logic of which languages to consider, falling back to any IMHO, might be generalized and shared.

The details Daniel gives about SearchEntities seem generally good, though. I am more concerned about what we consider as fallback languages for this (since we are searching, it should be broader) vs. for display and formatting.

Let's start with the fallbacks we already have in place. We can still expand later but I'd like it if we could keep it consistent at least for now.

I thought Katie did. If not then please reopen.

Some questions:

  • Who/what (else) is using SearchEntities?
  • How will we build language fallback chains? Who will configure them? I think that's pretty important for deciding what the API looks like (bool vs list of languages)
  • Can we really get away with not touching the sorting? This would create an issue when an uninteresting item has a matching label/alias in a fallback language but not in the searched language. I can imagine something like searching for ›Billion‹ (German) and finding ›billion‹ (English).
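
On the bool-versus-list question above, both API shapes can be reduced to the same internal list of language codes. The Python sketch below is only an illustration: the `FALLBACKS` table is a hypothetical stand-in for whatever configuration ends up defining the chains, and `resolve_languages` is not an existing function.

```python
# Hypothetical fallback table; real chains would come from configuration.
FALLBACKS = {
    "de-at": ["de", "en"],
    "de": ["en"],
    "en": [],
}


def resolve_languages(language, fallback=False):
    """Turn the API input into the ordered list of languages to search.

    `language` may be a single code or a list of codes. The fallback table
    is consulted only when `fallback` is set.
    """
    codes = [language] if isinstance(language, str) else list(language)
    if fallback:
        for code in list(codes):  # iterate over a copy while appending
            for fb in FALLBACKS.get(code, []):
                if fb not in codes:
                    codes.append(fb)
    return codes
```

Under these assumptions, `resolve_languages("de-at", fallback=True)` yields `["de-at", "de", "en"]`, while an explicit list such as `["de", "fr"]` is used as given unless the flag is also set.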