
Expose ORES drafttopic data in ElasticSearch via a custom CirrusSearch keyword
Open, Needs Triage · Public

Description

Once ORES drafttopic data is in ElasticSearch (via some combination of T240556: Load ORES articletopic data into ElasticSearch via the weekly bulk update and T240558: Update ORES drafttopic data score in ElasticSearch when an article gets edited), expose it to the search field via some keyword such as topic:<topics> (e.g. topic:Culture.Arts|Geography.Maps|STEM.Science; or maybe we'd want some user-friendlier topic names, maybe even localizable ones). This should probably happen via the CirrusSearchAddQueryFeatures hook in the ORES extension.
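For illustration only, a minimal Python sketch of how a pipe-separated keyword value such as the one above could be split into topic names. The real feature would live in PHP behind the CirrusSearchAddQueryFeatures hook; the function name `parse_topic_value` is hypothetical and not part of any extension.

```python
def parse_topic_value(value):
    """Split a pipe-separated keyword value into a list of topic names."""
    value = value.strip()
    # Drop one pair of surrounding double quotes, if present.
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        value = value[1:-1]
    # Split on "|" and discard empty segments such as a trailing pipe.
    return [part.strip() for part in value.split("|") if part.strip()]


topics = parse_topic_value("Culture.Arts|Geography.Maps|STEM.Science")
```

Each returned name would then be matched against the stored topic data; how that match is expressed in ElasticSearch is outside this sketch.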

Details

Related Gerrit Patches:
mediawiki/extensions/CirrusSearch : master · Add articletopic feature
mediawiki/vagrant : master · Enable ORES articletopic search in cirrussearch role

Event Timeline

Tgr created this task. · Dec 12 2019, 11:30 AM
Restricted Application added a subscriber: Aklapper. · Dec 12 2019, 11:30 AM
Restricted Application added a project: Scoring-platform-team. · Dec 12 2019, 11:31 AM
Tgr updated the task description. · Dec 12 2019, 11:34 AM
Tgr added a comment. · Dec 18 2019, 11:43 PM

Two things to consider here which might affect other parts of the pipeline:

  • do we want to support searching for larger groups? E.g. should there be a way to find all Culture.* topics, not just one specific subtopic such as Culture.Language and Literature? (This is probably not relevant for Newcomer tasks but relevant for the UX if we want to expose topic search as a generic reader feature.)
  • what should the topic values look like? topic:"Culture.Language and Literature" is not exactly user-friendly, so there should probably exist some mapping from user-friendly (maybe even localizable) keywords so I can type topic:literature and the extension can convert it.
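Both points can be sketched together in Python, purely as an illustration. The topic list, the alias table, and the `Culture.*` wildcard convention below are assumptions made for the example; nothing here reflects an agreed syntax or an existing mapping.

```python
# Hypothetical data: a real deployment would take the topic names from the
# ORES model and the aliases from (possibly localized) configuration.
KNOWN_TOPICS = [
    "Culture.Arts",
    "Culture.Language and Literature",
    "Geography.Maps",
    "STEM.Science",
]
ALIASES = {
    "literature": "Culture.Language and Literature",
    "arts": "Culture.Arts",
}


def resolve_topic(term):
    """Map a user-typed term to the list of canonical topic names it covers."""
    term = term.strip()
    if term.endswith(".*"):
        # Group search: "Culture.*" keeps the prefix "Culture." and matches
        # every known topic that starts with it.
        prefix = term[:-1]
        return [t for t in KNOWN_TOPICS if t.startswith(prefix)]
    # Friendly alias lookup is case-insensitive; unknown terms pass through
    # unchanged and are accepted only if they are canonical names already.
    canonical = ALIASES.get(term.lower(), term)
    return [canonical] if canonical in KNOWN_TOPICS else []
```

With this table, `resolve_topic("literature")` yields the single canonical name, while `resolve_topic("Culture.*")` yields both Culture subtopics.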
dcausse added a subscriber: dcausse. · Jan 2 2020, 9:40 AM

I suggest a keyword slightly less ambiguous such as hastopic or hasdrafttopic.
I agree that there should be a mapping: if this keyword is going to be used directly by users, it might be helpful to let them search for a topic translated into the wiki language instead of using English.

For searching, I think the keyword must act as a scoring keyword similar to what morelikethis does; the syntax could even allow a way to boost a particular topic:
hastopic:"Culture.Language and Literature^2|Culture.Arts".
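A minimal Python sketch of how such a value could be parsed, assuming that a trailing `^number` marks a boost and that topics without one default to a weight of 1.0. Both the default and the function name `parse_boosted_topics` are assumptions for illustration, not a proposal for the actual parser.

```python
def parse_boosted_topics(value):
    """Parse 'Topic A^2|Topic B' into a {topic name: boost weight} mapping."""
    boosts = {}
    for part in value.strip().strip('"').split("|"):
        part = part.strip()
        if not part:
            continue
        # Split on the last "^" so that topic names may contain other symbols.
        name, sep, weight = part.rpartition("^")
        if sep and name.strip():
            try:
                boosts[name.strip()] = float(weight)
                continue
            except ValueError:
                pass  # Not a numeric boost: treat the whole part as a name.
        boosts[part] = 1.0  # Assumed default weight.
    return boosts


boosts = parse_boosted_topics('"Culture.Language and Literature^2|Culture.Arts"')
```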

As for which extension should hook into CirrusSearchAddQueryFeatures, I suggest CirrusSearch instead of ORES, since that is where we already added the mapping.

Tgr added a subscriber: Halfak. · Jan 10 2020, 9:55 PM

I suggest a keyword slightly less ambiguous such as hastopic or hasdrafttopic.

AIUI the name is from the original use case of categorizing drafts (since if it's not a draft it's probably already categorized manually, which is better quality) but there isn't anything draft-specific about the process, so having "draft" in the search keyword would just confuse people.
@Halfak any thoughts?

hasXXX would to me suggest that this keyword would filter out articles which do not have XXX but otherwise not affect scoring much - that's how for example hastemplate works, if I understand correctly. A morelike-based topic search would influence scoring, right? That is, if I search for foo topic:bar then an article that has relatively few occurrences of foo but a very high bar score could be sorted before another one with lots of foo but a low bar score.
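The reordering effect can be shown with a toy Python model. The additive formula and all the numbers are invented for illustration; ElasticSearch combines scores differently, so this only demonstrates the principle that a topic score contributing to the ranking can change the order.

```python
def combined_score(text_score, topic_score, topic_weight=1.0):
    """Toy ranking score: text relevance plus a weighted topic score."""
    return text_score + topic_weight * topic_score


# Invented example documents: A mentions "foo" less often than B,
# but has a much higher score for topic "bar".
docs = [
    {"title": "A", "text_score": 2.0, "topic_score": 0.9},
    {"title": "B", "text_score": 2.5, "topic_score": 0.1},
]

by_text_only = sorted(docs, key=lambda d: d["text_score"], reverse=True)
by_combined = sorted(
    docs,
    key=lambda d: combined_score(d["text_score"], d["topic_score"]),
    reverse=True,
)
```

Sorting by text score alone puts B first; with the topic score added, A moves ahead of B. A pure filter keyword would never cause that swap.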

dcausse added a comment. · Edited · Jan 11 2020, 10:37 AM

Perhaps prefer-topic:something then?
My main concern is that the special syntax should not reuse ordinary words, so that it does not swallow queries that are valid sentences. For instance, when I copy/paste some text and search for it, e.g. Special topic: Electric aircraft, I probably don't mean the keyword.
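A naive Python sketch of the problem, assuming a parser that treats any registered word followed by a colon as a keyword. This is not how CirrusSearch actually parses queries; `extract_keyword` is a hypothetical helper that only shows why a common word such as topic is riskier than a longer, rarer one.

```python
import re


def extract_keyword(query, keywords):
    """Return (keyword, value) for the first 'keyword:' in query, else None."""
    # Longest names first, so a longer keyword wins over one it contains.
    names = sorted(keywords, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    # The lookbehind stops "topic" from matching inside "articletopic".
    pattern = r"(?<![\w-])(" + alternatives + r"):\s*(.*)"
    match = re.search(pattern, query)
    return (match.group(1), match.group(2)) if match else None


pasted = "Special topic: Electric aircraft"
swallowed = extract_keyword(pasted, {"topic"})
untouched = extract_keyword(pasted, {"articletopic"})
```

With topic registered, the pasted sentence is interpreted as a keyword query with the value Electric aircraft; with a less common keyword name it stays a plain text search.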

Tgr added a comment. · Jan 13 2020, 8:46 PM

Yeah, that's a good point.

AIUI none of the existing keywords have similar semantics: inXXX and hasXXX just filter without affecting the scoring, and prefer-XXX and boost-XXX affect the scoring but do not reduce the result set. Admittedly my understanding of search UX is vague, but as a user avoiding those naming schemes would probably be less confusing for me.

Maybe something like about-topic:? Or by-topic:?

dcausse added a subscriber: TJones. · Jan 16 2020, 8:54 AM

Indeed, the only keyword that does some filtering but also affects ranking is morelike, but I'm not sure we can base any naming pattern on it. about-topic: sounds fine to me (@TJones might have some suggestions perhaps?).

My go-to answer for naming things is always Norse mythology... so vafþrúðnir: seems like a good keyword! He knew a lot, though he lost a contest of knowledge to Odin (but Odin cheated!). And it's probably equally horrible to type in all languages except Icelandic. ;)

But seriously... from the discussion so far, I'm not sure whether we're looking for a filter, a booster, or a combination of the two. And while anyone can learn to associate an arbitrary function with an arbitrary keyword, it helps when the keyword gives good hints about its function.

I don't have strong feelings about hyphens in keywords, other than maintaining consistency. If all the existing boost keywords use a hyphen, a new one should, too. If none of the has keywords have a hyphen, a new one should not. I also agree with David that we don't want something that would be too easy to accidentally cut-n-paste from somewhere else.

With all that in mind:

  • For a filter, hastopic: sounds good.
  • For a booster, I strongly prefer boost-topic: over about-topic:. about sounds a little like a filter to me. I like boost over prefer but it might be harder to localize—though both already exist, so any added translation burden is likely small.
  • A combination keyword is harder. about-topic: is okay, but does sound more like a filter. I kind of like apropos:. Like morelike: it stands outside the existing naming patterns.
    • Hmm. moreon: is too much like moron, but moreabout: would establish a moreXXX: pattern of boosting a bit and filtering a bit, and it isn't terrible. (And more somehow mitigates the filterishness of about.)

I hope that helps some, or at least that vafþrúðnir: was mildly amusing! Naming things is hard.

Hey folks, it looks like articletopic (a slightly different model that we now have in production) is a better option than the drafttopic model. Once we release the native models for ar, cs, ko, and viwiki, they'll only have an articletopic model for use.

Tgr added a comment. · Jan 22 2020, 10:08 PM

abouttopic: would be a decent search keyword, too.

Tgr claimed this task. · Jan 23 2020, 2:31 AM
Tgr removed Tgr as the assignee of this task. · Tue, Feb 11, 12:21 AM
Tgr updated the task description. · Thu, Feb 13, 12:16 AM
Tgr added a comment. · Thu, Feb 13, 2:19 AM

Open questions:

  • what should be the exact search logic? Should it act like a filter (limit the result set to pages for which all the specified topics are above some (what?) threshold), or more like a scoring function (hits where the specified topics have higher ORES scores get scored higher by ES)? Since ORES scores will be stored as a text field, the latter seems easier and more natural.

The current plan is a combination of the two: use the thresholds in the Spark job to filter out articles which do not meet a reasonable level of precision for the given topic prediction (T244297#5858557 has more details), then use a similarity score to sort the results (not needed for the Growth use case but probably useful for others).
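The two-stage plan can be sketched in Python as follows. The threshold values, the default, and the page scores are all made up for the example, and the real filtering step would run in the Spark job rather than in code like this.

```python
# Hypothetical per-topic thresholds; the real values would come from the
# precision analysis referenced above.
THRESHOLDS = {"Culture.Arts": 0.5, "STEM.Science": 0.7}


def topics_to_index(scores, thresholds, default=0.5):
    """Stage 1 (index time): keep only predictions at or above the threshold."""
    return {t: s for t, s in scores.items() if s >= thresholds.get(t, default)}


def search(indexed_pages, topic):
    """Stage 2 (query time): pages that kept the topic, best score first."""
    hits = [(title, kept[topic]) for title, kept in indexed_pages.items() if topic in kept]
    return [title for title, _ in sorted(hits, key=lambda h: h[1], reverse=True)]


raw_scores = {
    "P1": {"Culture.Arts": 0.9, "STEM.Science": 0.6},
    "P2": {"Culture.Arts": 0.55},
    "P3": {"Culture.Arts": 0.3},
}
indexed = {title: topics_to_index(s, THRESHOLDS) for title, s in raw_scores.items()}
arts_results = search(indexed, "Culture.Arts")
```

Here P3 never reaches the index for Culture.Arts, and P1's STEM.Science prediction is dropped for missing its threshold, so only the surviving predictions take part in sorting.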

https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/572135 is using articletopic: as the keyword -- of course we could change it, just noting that here.

Tgr claimed this task. · Tue, Feb 18, 6:12 PM
Tgr added a comment. · Tue, Feb 18, 10:33 PM

abouttopic: would be a decent search keyword, too.

Uh, I meant to say articletopic. As @kostajh noted I am going with that for the initial code, but feel free to tell me to use something else, we can change at any point.
I think it should be something with "topic" in it, though, that makes both the purpose more clear and the connection to ORES more obvious to interested power users.

Change 573432 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/extensions/CirrusSearch@master] [WIP] Add articletopic feature

https://gerrit.wikimedia.org/r/573432

Change 573735 had a related patch set uploaded (by Gergő Tisza; owner: Gergő Tisza):
[mediawiki/vagrant@master] Enable ORES articletopic search in cirrussearch role

https://gerrit.wikimedia.org/r/573735