
Expose ORES drafttopic data in ElasticSearch via a custom CirrusSearch keyword
Closed, Resolved (Public)

Description

Once ORES drafttopic data is in ElasticSearch (via some combination of T240556: Load ORES articletopic data into ElasticSearch via the weekly bulk update and T240558: Update ORES articletopic data score in ElasticSearch when an article gets edited), expose it to the search field via some keyword such as topic:<topics> (e.g. topic:Culture.Arts|Geography.Maps|STEM.Science; or maybe we'd want some user-friendlier topic names, maybe even localizable ones). This should probably happen via the CirrusSearchAddQueryFeatures hook in the ORES extension.

Event Timeline

Two things to consider here which might affect other parts of the pipeline:

  • do we want to support searching for larger groups? E.g. should there be a way to find all Culture.* topics, not just one specific subtopic such as Culture.Language and Literature? (This is probably not relevant for Newcomer tasks but relevant for the UX if we want to expose topic search as a generic reader feature.)
  • what should the topic values look like? topic:"Culture.Language and Literature" is not exactly user-friendly, so there should probably exist some mapping from user-friendly (maybe even localizable) keywords so I can type topic:literature and the extension can convert it.
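A minimal sketch of the mapping idea raised in the two points above, assuming a plain lookup table. The keywords on the left (literature, arts, maps, science) and the `.*` group syntax are hypothetical; only the ORES topic names on the right appear in this task.

```python
from __future__ import annotations

# Hypothetical user-friendly keywords mapped to ORES topic names.
# The keywords are made up for illustration; a real mapping could be localized per wiki.
KEYWORD_TO_TOPIC = {
    "literature": "Culture.Language and Literature",
    "arts": "Culture.Arts",
    "maps": "Geography.Maps",
    "science": "STEM.Science",
}


def resolve_topics(keyword: str) -> list[str]:
    """Resolve a keyword to ORES topics; a trailing '.*' selects a whole group."""
    if keyword.endswith(".*"):
        prefix = keyword[:-1]  # "Culture.*" -> "Culture."
        topics = set(KEYWORD_TO_TOPIC.values())
        return sorted(t for t in topics if t.startswith(prefix))
    topic = KEYWORD_TO_TOPIC.get(keyword.lower())
    return [topic] if topic else []
```

With this table, `resolve_topics("literature")` yields the single ORES topic, while `resolve_topics("Culture.*")` expands to every known Culture subtopic.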

I suggest a keyword slightly less ambiguous such as hastopic or hasdrafttopic.
I agree that there should be a mapping, if this keyword is going to be used directly by users it might be helpful to allow them to search a topic translated into the wiki language instead of using English.

For searching I think the keyword must act as a scoring keyword similar to what morelikethis does; the syntax could even allow a way to boost a particular topic:
hastopic:"Culture.Language and Literature^2|Culture.Arts".
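As a rough illustration of the proposed syntax, the keyword value could be split on `|` with an optional `^boost` suffix per topic. This is only a sketch of the parsing idea, not CirrusSearch code; the function name and the default boost of 1.0 are assumptions.

```python
from __future__ import annotations


def parse_topic_query(value: str) -> list[tuple[str, float]]:
    """Split 'A^2|B' into [('A', 2.0), ('B', 1.0)].

    A malformed boost such as 'A^x' raises ValueError.
    """
    result = []
    for part in value.split("|"):
        part = part.strip()
        if not part:
            continue  # ignore empty segments such as a trailing '|'
        name, sep, boost = part.rpartition("^")
        if sep and name:
            result.append((name, float(boost)))
        else:
            result.append((part, 1.0))  # no boost given: assume neutral weight
    return result
```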

For the extension from where to hook into CirrusSearchAddQueryFeatures, I suggest CirrusSearch instead of ORES since this is where we added the mapping already.

I suggest a keyword slightly less ambiguous such as hastopic or hasdrafttopic.

AIUI the name is from the original use case of categorizing drafts (since if it's not a draft it's probably already categorized manually, which is better quality) but there isn't anything draft-specific about the process, so having "draft" in the search keyword would just confuse people.
@Halfak any thoughts?

hasXXX would to me suggest that this keyword would filter out articles which do not have XXX but otherwise not affect scoring much - that's how for example hastemplate works, if I understand correctly. A morelike-based topic search would influence scoring, right? That is, if I search for foo topic:bar then an article that has relatively few occurrences of foo but a very high bar score could be sorted before another one with lots of foo but a low bar score.
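To make the difference concrete, here is a toy ranking sketch. The additive formula, the weight, and the numbers are all assumptions for illustration; this is not how Elasticsearch or CirrusSearch actually computes scores.

```python
def rank(articles, topic_weight=5.0):
    """Order articles by text score plus a weighted topic score (toy formula)."""
    return sorted(
        articles,
        key=lambda a: a["text_score"] + topic_weight * a["topic_score"],
        reverse=True,
    )


articles = [
    # Many occurrences of "foo", but a low score for topic "bar".
    {"title": "Lots of foo", "text_score": 3.0, "topic_score": 0.1},
    # Few occurrences of "foo", but a very high score for topic "bar".
    {"title": "Mostly bar", "text_score": 1.0, "topic_score": 0.9},
]

scoring_order = [a["title"] for a in rank(articles)]
# With topic_weight=0 the topic has no effect, as with a pure filter keyword.
filter_order = [a["title"] for a in rank(articles, topic_weight=0.0)]
```

Here `scoring_order` puts "Mostly bar" first (1.0 + 5 × 0.9 = 5.5 versus 3.0 + 5 × 0.1 = 3.5), whereas `filter_order` keeps "Lots of foo" on top.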

Perhaps prefer-topic:something then?
My concern here is mostly about reusing existing words in the special syntax, which risks swallowing queries that are valid sentences. For instance, when I copy/paste a text and search for it, e.g. searching for Special topic: Electric aircraft, I probably don't mean the keyword.
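The swallowing problem can be shown with a deliberately naive keyword matcher. The regular expression below is an assumption made for illustration and is much simpler than the real CirrusSearch query parser.

```python
from __future__ import annotations

import re


def find_keyword(query: str, keyword: str) -> str | None:
    """Return the value following 'keyword:' in the query, or None if absent."""
    pattern = r"(?:^|\s)" + re.escape(keyword) + r":\s*(\S.*)"
    match = re.search(pattern, query)
    return match.group(1) if match else None


pasted = "Special topic: Electric aircraft"
swallowed = find_keyword(pasted, "topic")      # the common word is treated as a keyword
untouched = find_keyword(pasted, "hastopic")   # a less common word leaves the text alone
```

A keyword that is also an everyday word turns the pasted sentence into a keyword search, while a rarer name does not match.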

Yeah, that's a good point.

AIUI none of the existing keywords have similar semantics: inXXX and hasXXX just filter without affecting the scoring, and prefer-XXX and boost-XXX affect the scoring but do not reduce the result set. Admittedly my understanding of search UX is vague, but as a user avoiding those naming schemes would probably be less confusing for me.

Maybe something like about-topic:? Or by-topic:?

Indeed, the only keyword that will do some filtering but also affect ranking is morelike but not sure we can base any naming pattern on it. about-topic: sounds fine to me (@TJones might have some suggestions perhaps?).

My go-to answer for naming things is always Norse mythology.. so vafþrúðnir: seems like a good keyword! He knew a lot, though he lost a contest of knowledge to Odin—but Odin cheated! And it's probably equally horrible to type in all languages except Icelandic. ;)

But seriously... from the discussion so far, I'm not sure whether we're looking for a filter, a booster, or a combination of the two. And while anyone can learn to associate an arbitrary function with an arbitrary keyword, it helps when the keyword gives good hints about its function.

I don't have strong feelings about hyphens in keywords, other than maintaining consistency. If all the existing boost keywords use a hyphen, a new one should, too. If none of the has keywords have a hyphen, a new one should not. I also agree with David that we don't want something that would be too easy to accidentally cut-n-paste from somewhere else.

With all that in mind:

  • For a filter, hastopic: sounds good.
  • For a booster, I strongly prefer boost-topic: over about-topic:. about sounds a little like a filter to me. I like boost over prefer but it might be harder to localize—though both already exist, so any added translation burden is likely small.
  • A combination keyword is harder. about-topic: is okay, but does sound more like a filter. I kind of like apropos:. Like morelike: it stands outside the existing naming patterns.
    • Hmm. moreon: is too much like moron, but moreabout: would establish a moreXXX: pattern of boosting a bit and filtering a bit, and it isn't terrible. (And more somehow mitigates the filterishness of about.)

I hope that helps some, or at least that vafþrúðnir: was mildly amusing! Naming things is hard.

Hey folks, it looks like articletopic (a slightly different model that we now have in production) is a better option than the drafttopic model. Once we release the native models for ar, cs, ko, and viwiki, they'll only have an articletopic model for use.

abouttopic: would be a decent search keyword, too.

Open questions:

  • what should be the exact search logic? Should it act like a filter (limit the result set to pages for which all the specified topics are above some (what?) threshold), or more like a scoring function (hits where the specified topics have higher ORES scores get scored higher by ES)? Since ORES scores will be stored as a text field, the latter seems easier and more natural.

The current plan is a combination of the two: use the thresholds in the Spark job to filter out articles which do not meet a reasonable level of precision for the given topic prediction (T244297#5858557 has more details), then use a similarity score to sort the results (not needed for the Growth use case but probably useful for others).
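A minimal sketch of that two-step plan, assuming per-topic thresholds applied at load time and a plain sort by score at query time. The threshold values, page titles, and scores are invented; the real thresholds are computed in the Spark job and the real ranking is done by Elasticsearch.

```python
from __future__ import annotations

# Hypothetical per-topic thresholds (the real ones are described in T244297).
THRESHOLDS = {"STEM.Science": 0.5, "Culture.Arts": 0.5}


def apply_thresholds(predictions: dict[str, float],
                     thresholds: dict[str, float]) -> dict[str, float]:
    """Load-time step: drop predictions below their topic's threshold."""
    return {t: s for t, s in predictions.items()
            if t in thresholds and s >= thresholds[t]}


def search_topic(index: dict[str, dict[str, float]], topic: str) -> list[str]:
    """Query-time step: return pages indexed under the topic, best score first."""
    hits = [(scores[topic], title) for title, scores in index.items()
            if topic in scores]
    return [title for _, title in sorted(hits, reverse=True)]


# Made-up model output for three pages.
raw = {
    "Page A": {"STEM.Science": 0.9, "Culture.Arts": 0.2},
    "Page B": {"STEM.Science": 0.7},
    "Page C": {"STEM.Science": 0.3},
}
index = {title: apply_thresholds(p, THRESHOLDS) for title, p in raw.items()}
```

Page C never reaches the index for STEM.Science, so the keyword acts as a filter, and the remaining pages are ordered by their stored score.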

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/572135 is using articletopic: as the keyword -- of course we could change it, just noting that here.

abouttopic: would be a decent search keyword, too.

Uh, I meant to say articletopic. As @kostajh noted I am going with that for the initial code, but feel free to tell me to use something else, we can change at any point.
I think it should be something with "topic" in it, though, that makes both the purpose more clear and the connection to ORES more obvious to interested power users.

Change 573432 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/CirrusSearch@master] [WIP] Add articletopic feature

https://gerrit.wikimedia.org/r/573432

Change 573735 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/vagrant@master] Enable ORES articletopic search in cirrussearch role

https://gerrit.wikimedia.org/r/573735

Change 573432 merged by jenkins-bot:
[mediawiki/extensions/CirrusSearch@master] Add articletopic feature

https://gerrit.wikimedia.org/r/573432

Change 573735 merged by jenkins-bot:
[mediawiki/vagrant@master] Enable ORES articletopic search in cirrussearch role

https://gerrit.wikimedia.org/r/573735

Change 574634 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[operations/mediawiki-config@master] Enable articletopic: search keyword in CirrusSearch

https://gerrit.wikimedia.org/r/574634

Change 574634 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable articletopic: search keyword in CirrusSearch

https://gerrit.wikimedia.org/r/574634

Mentioned in SAL (#wikimedia-operations) [2020-02-27T19:20:32Z] <tgr@deploy1001> Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:574634|Enable articletopic: search keyword in CirrusSearch (T240559)]] (duration: 01m 05s)

This is done. The functionality is not available on all wikis since indexing and data loading is still ongoing, but a few work already: https://cs.wikipedia.org/w/index.php?search=articletopic%3Aphysics
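For anyone scripting checks against several wikis, a search URL like the one above can be assembled with the standard library. The helper name is hypothetical; the path and the `search` parameter are taken from the link above.

```python
from urllib.parse import urlencode


def articletopic_url(host: str, topic: str) -> str:
    """Build a search URL for articletopic:<topic> on the given wiki host."""
    query = urlencode({"search": f"articletopic:{topic}"})
    return f"https://{host}/w/index.php?{query}"


url = articletopic_url("cs.wikipedia.org", "physics")
```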

For tech news:

A new search keyword, articletopic:, is available in search for Arabic, Czech, English, Korean and Vietnamese Wikipedias, which allows searching for articles in a given topic. More documentation is available at https://wikitech.wikimedia.org/wiki/Search/articletopic and https://www.mediawiki.org/wiki/ORES/Articletopic

(we might want to adjust that wording; feedback welcome)

@kostajh -- maybe say that there are only models for certain languages?

@Johan - Arabic, Czech, Korean, Vietnamese, and English.

We could just wait until it's available in all languages, which is probably just a week. By that point we'll have user documentation too (I didn't want to add it while it's not working on most wikis).

The data will be in place this week, but the reindexing process is taking much longer than expected. The reindex is progressing alphabetically through wikidb names, and is only up to gomwiki right now (after about 14 days). Some confusion might be because kowiki and viwiki have recently become available. Since these languages were important for rollout and it was looking like perhaps two more weeks before we got to viwiki, I separately started the reindex on these yesterday.

tl;dr: Available everywhere is probably about two more weeks from today, unfortunately. It may be worthwhile to wait for documentation, but a note that it's deployed in many places and will be available in more in the coming weeks might be fine.

@Tgr @EBernhardson -- but I thought that local models were only available in the five languages that @kostajh listed. Is it that we are putting the crosswalk models in for the other languages?

@MMiller_WMF These are enwiki scores propagated to other wikis (only ones that don't have their own models) via the wikibase_item page property
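Roughly, the propagation can be pictured as a join on the Wikidata item ID. The sketch below assumes simple in-memory dictionaries; the function name, page titles, item IDs, and scores are all made up, and the real process runs as part of the data-loading pipeline.

```python
from __future__ import annotations


def propagate_scores(enwiki_scores: dict[str, dict[str, float]],
                     enwiki_items: dict[str, str],
                     local_items: dict[str, str]) -> dict[str, dict[str, float]]:
    """Copy enwiki topic scores to local pages that share a wikibase_item."""
    by_item = {enwiki_items[title]: topics
               for title, topics in enwiki_scores.items()
               if title in enwiki_items}
    return {title: by_item[item]
            for title, item in local_items.items()
            if item in by_item}


# Made-up example data; the item IDs are placeholders, not real Wikidata items.
enwiki_scores = {"Example article": {"STEM.Science": 0.8}}
enwiki_items = {"Example article": "Q900001"}
local_items = {"Local title": "Q900001", "Local-only page": "Q900002"}

local_scores = propagate_scores(enwiki_scores, enwiki_items, local_items)
```

Pages without an enwiki counterpart (here "Local-only page") simply get no topic scores.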

I've added an item to https://meta.wikimedia.org/wiki/Tech/News/2020/11 (feel free to sanity check). I think we can add another item when all wikis have it, including a link to user documentation.

@kostajh @Johan @EBernhardson -- I don't think it's quite right, because the enwiki scores are propagated to other wikis. So all Wikipedias have some topic scores, but only the smaller set have native models. Maybe the right way to say it is like "Models ported from English Wikipedia are available in all Wikipedias, with native-language models available in Arabic, Czech, Korean and Vietnamese Wikipedias."

@MMiller_WMF The complication is that the process that makes these searchable has only processed 378 wikis. So the data is there, but wikis with a dbname after i (except kowiki and viwiki that were run out-of-turn) don't get any results yet. Over the coming weeks those will become available. I don't know how complex we want to make a single sentence though :) I suppose my worry is someone at, say, ukwiki will read it, try it, and find it doesn't work.

Etonkovidova subscribed.

Moving to PM review column to indicate the current state of the task; no issues were found.

Checked in arwiki, kowiki, cswiki, and viwiki - search for topics can be done with searches such as articletopic:physics or articletopic:medicine-and-health (according to the Search keywords column in the ORES topic mapping for newcomer tasks spreadsheet; thanks, @Tgr!).
Such searches can also be done in enwiki, but not in kowiki that currently doesn't have it.

@MMiller_WMF @EBernhardson Let's include a new Tech News item once it's available in a couple of weeks, and we'll clarify then.

@Johan added user docs at mw:Help:CirrusSearch#Articletopic as this is now available in all Wikipedias. Tech News could include something like

The articletopic search word for searching articles by their topic is now available on all Wikipedias. (1)(2)

I'm resolving this, now that we're seeing users using the local language models, and since we've deployed to several more wikis that are using the crosswalk models.