
Add an image: search keyword for articles which have infoboxes
Closed, Resolved, Public

Description

In T285817: Add an image: load static file to search index, a list of pages which have image recommendations was imported into the search index (as recommendation.image weighted tags) for use by the GrowthExperiments suggested edits feature. Initial versions of the suggested edits code can't handle infoboxes; our plan was to filter these out via additional search terms. Unfortunately, we have found (T291232: Add an image: exclude certain articles) that the number of infobox templates is too large for that, even taking into account that some infobox templates reuse others as building blocks. So we need to rethink our approach.

One option we'd like to discuss is creating a custom search keyword for whether an article has an infobox. We can generate a list of infobox templates; after filtering out redundancies (where one infobox template is a wrapper around another infobox template), we get a list of 100-200 templates for each of the four wikis where we currently need this feature. That's far too many to include in a hastemplate: query due to search string length limits; is it possible to include them in a custom query, or would the high number of templates cause problems internally (like the Elasticsearch query getting too large)?

The idea would be that we list the templates on a JSON wiki page that can be kept up-to-date by the wiki community; the GrowthExperiments extension defines a hasinfobox search keyword, which is translated to a query similar to hastemplate: but instead of taking arguments, it would use the template list from the wiki configuration page. The keyword would then be combined with hasrecommendation:image when searching for pages to suggest edits to.
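To make the configuration idea concrete, here is a minimal sketch of reading a template list from the JSON text of such a wiki page. The `infobox_templates` key and the function name are hypothetical; the task does not specify the page's actual schema. The sketch only normalizes whitespace and underscores and drops duplicates.

```python
import json


def parse_infobox_config(raw_json: str, key: str = "infobox_templates") -> list[str]:
    """Return a de-duplicated, order-preserving list of template titles.

    `key` is a hypothetical field name; the real config schema may differ.
    """
    data = json.loads(raw_json)
    titles = data.get(key, [])
    if not isinstance(titles, list):
        raise ValueError(f"expected a list under {key!r}")

    seen: set[str] = set()
    result: list[str] = []
    for title in titles:
        if not isinstance(title, str):
            continue  # ignore malformed entries rather than failing the search
        # Underscores and spaces are interchangeable in MediaWiki titles.
        normalized = title.strip().replace("_", " ")
        if normalized and normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result
```

Skipping malformed entries instead of raising is a design choice for a community-edited page: one bad line should not disable the filter.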

The other idea we could come up with is T292140: Reimport image recommendation data into search index.

Event Timeline

We can certainly add a new keyword where the list of things to filter is stored in mediawiki-config; as for the number of terms to search for, anything below 1000 is acceptable. But generally speaking, I think we should try to reimport the recommendations, as that might let us fix similar issues next time.

@dcausse, given that we should reimport (T292140) instead, is it safe to close:decline this ticket?

Reimporting is more flexible in that it can handle arbitrary constraints on what should be considered a valid recommendation. The new keyword would be more flexible in that the (more limited) constraints would be evaluated in real time. So it could handle scenarios where 1) an article does not have an infobox at the time of generating the search index import, but an infobox is added later; 2) due to a mistake or omission in the definition of how to detect infoboxes, some infoboxes are missed when generating the search index import.

It's a trade-off, but I think the keyword approach would be more convenient from our perspective. As for size limits, for the four pilot wikis where image recommendations will initially be deployed, we are talking about 100-200 templates. For large wikis there might be a lot more (or not; it also depends on how organized the wiki is about using common building blocks for their templates) so this is not necessarily something that could work long-term; but we also don't intend to manage the image recommendation tags in the search index by batch import in the long term, so we are not looking for a long-term solution either way.

The specific implementation would look like this:

  • GrowthExperiments has a service for loading community-controlled per-wiki configuration data from pages like en:MediaWiki:NewcomerTasks.json, with a memcached layer on top.
  • The community would come up with a list of "seed templates" for infoboxes (these are either infobox templates or infobox building block templates, and together cover all infoboxes on the given wiki) and maintain the configuration page over time. On the wikis where we initially plan to deploy image recommendations, this configuration is around 100-200 templates (somewhere between 1K-10K characters added to the search query).
  • GrowthExperiments implements the CirrusSearchAddQueryFeatures hook, adds InfoboxFeature.
  • InfoboxFeature is basically HasTemplateFeature, just with a static list instead of taking the list of templates from user input.
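The core of the last two steps can be sketched as a function that turns a static template list into an Elasticsearch filter and excludes matching pages. This is an illustration in Python, not CirrusSearch's actual PHP code: the `template.keyword` field name, the `Template:` prefix handling, and both function names are assumptions, and HasTemplateFeature may build its query differently. Only the `terms`, `bool`, and `match_none` query types are standard Elasticsearch query DSL.

```python
def build_template_filter(templates, field="template.keyword", prefix="Template:"):
    """Build a query matching pages that use any template in the list.

    `field` and `prefix` are assumptions about how the index stores templates.
    """
    if not templates:
        # With no configured templates, nothing counts as having an infobox.
        return {"match_none": {}}
    values = [t if t.startswith(prefix) else prefix + t for t in templates]
    return {"terms": {field: values}}


def exclude_infobox_pages(base_query, templates):
    """Wrap a base query so that pages using any listed template are dropped."""
    return {
        "bool": {
            "must": [base_query],
            "must_not": [build_template_filter(templates)],
        }
    }
```

With 100-200 templates, the `terms` clause stays well below the roughly 1000-term limit mentioned earlier in this thread.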

I see three potential concerns here:

  • The new search keyword would be available on a few wikis only. This might confuse users who somehow figure out it exists (it wouldn't be publicly exposed). Doesn't seem like a significant problem for an internal feature.
  • Query performance. I think a 10K-character Elasticsearch request is fine, and filtering on an indexed field with a large list of values is also fine, but my understanding of ES is superficial at best.
  • Initialization performance. Retrieving the list of templates would usually take a memcached request (a few milliseconds) and, rarely, a DB read in the case of a miss (maybe a few dozen milliseconds). If that's problematic, we can lazy-load it, i.e. only look up the configuration when the search keyword is actually used.
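The lazy-loading fallback mentioned in the last point could look roughly like the sketch below. The class name and the `loader` callable are hypothetical stand-ins for the GrowthExperiments configuration service; the point is only that the lookup is deferred until the keyword is first used and then reused.

```python
from typing import Callable, Optional


class LazyTemplateList:
    """Defer loading the template list until it is first requested."""

    def __init__(self, loader: Callable[[], list[str]]):
        # `loader` stands in for the memcached-backed config lookup (hypothetical).
        self._loader = loader
        self._templates: Optional[list[str]] = None

    def get(self) -> list[str]:
        if self._templates is None:
            # First use: pay the cache or DB cost once per request.
            self._templates = self._loader()
        return self._templates
```

Registering the keyword would then cost nothing for searches that never use it; only queries that actually contain the keyword trigger the lookup.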

Would be happy to hear feedback; if you don't think this approach is concerning, then I'd prefer it given my comment above and that the image recommendation search index will have multiple clients eventually and most don't care about infoboxes.

@Tgr, it sounds like this solution is better for both teams, so let's go with this option. If your team is able to do the work here, Search can provide feedback and review patches.

MMiller_WMF renamed this task from Search keyword for articles which have infoboxes to Add an image: search keyword for articles which have infoboxes. Oct 18 2021, 5:56 PM
MMiller_WMF triaged this task as High priority.

Change 732708 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] EditGrowthConfig: Add GEInfoboxTemplates

https://gerrit.wikimedia.org/r/732708

Change 732777 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] Search: Add InfoboxFeature

https://gerrit.wikimedia.org/r/732777

Change 732781 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[mediawiki/extensions/GrowthExperiments@master] ImageRecommendation: Exclude infobox articles in search term

https://gerrit.wikimedia.org/r/732781

Change 732708 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] EditGrowthConfig: Add infobox templates field

https://gerrit.wikimedia.org/r/732708

Change 734954 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[integration/config@master] Zuul: [GrowthExperiments] Add CirrusSearch to phan dependencies

https://gerrit.wikimedia.org/r/734954

Change 734954 merged by jenkins-bot:

[integration/config@master] Zuul: [GrowthExperiments] Add CirrusSearch to phan dependencies

https://gerrit.wikimedia.org/r/734954

Change 732777 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] Search: Add TemplateCollectionFeature

https://gerrit.wikimedia.org/r/732777

Change 732781 merged by jenkins-bot:

[mediawiki/extensions/GrowthExperiments@master] ImageRecommendation: Exclude infobox articles in search term

https://gerrit.wikimedia.org/r/732781