
Boost recent documents in search results
AbandonedPublic

Authored by mmodell on Oct 20 2017, 12:13 PM.

Details

Maniphest Tasks
T180706: Phabricator search hugely degraded in quality
Reviewers
Paladox
demon
EBernhardson
Commits
rPHABe374b1a5a50a: Boost recent documents in search results
Patch without arc
git checkout -b D830 && curl -L https://phabricator.wikimedia.org/D830?download=true | git apply
Summary

Use multiple decay functions to boost recently created and
(more importantly) recently modified documents.

This actually works in reverse, so objects older than a certain
offset get a negative boost based on a decay function.

age > 1 day: reduce score by 20% every 10 days
age > 10 days: reduce score by 20% every 90 days
age > 90 days: reduce score by 20% every 180 days
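
As a rough illustration of how the three tiers above might combine, here is a minimal Python sketch. It assumes each tier uses the exponential (`exp`) decay shape documented for Elasticsearch, with `decay=0.8` standing in for "reduce by 20%", and that the tiers are multiplied together. The patch itself may use a different decay function or combine mode, and the names `exp_decay`, `TIERS`, and `recency_multiplier` are hypothetical.

```python
import math

def exp_decay(age_days, offset_days, scale_days, decay=0.8):
    """Multiplier with the shape of Elasticsearch's `exp` decay function.

    No penalty is applied within `offset_days`; beyond that, the multiplier
    falls to `decay` once the age reaches offset + scale.
    """
    distance = max(0.0, age_days - offset_days)
    return math.exp(math.log(decay) / scale_days * distance)

# (offset, scale) pairs in days, taken from the summary above.
TIERS = [(1, 10), (10, 90), (90, 180)]

def recency_multiplier(age_days):
    """Assumption: the tiers combine by multiplication."""
    multiplier = 1.0
    for offset_days, scale_days in TIERS:
        multiplier *= exp_decay(age_days, offset_days, scale_days)
    return multiplier

if __name__ == "__main__":
    for age in (0, 1, 11, 100, 365):
        print(age, round(recency_multiplier(age), 3))
```

Under these assumptions a document younger than one day keeps its full score, and the penalty compounds as each tier's offset is passed.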

Diff Detail

Repository
rPHAB Phabricator
Branch
wmf/stable
Lint
Lint OK
Unit
Unit Tests OK
Build Status
Buildable 2382
Build 3888: differential-jessieJenkins
Build 3887: arc lint + arc unit

Event Timeline

mmodell created this revision. Oct 20 2017, 12:13 PM
mmodell updated this revision to Diff 2194. Oct 20 2017, 12:16 PM

apply lint advice

greg added a comment. Oct 25 2017, 9:57 PM

@EBernhardson what do you think of doing this for phab search results? Is this a bad idea? What unintended bad consequences didn't we think of? :)

EBernhardson added a comment. Edited Oct 25 2017, 10:32 PM

At least in CirrusSearch we would ensure something like this goes in the rescore phase, as opposed to the main query. It looks like phabricator has ~2M documents in the index so it is possibly worthwhile here as well.

It would look something more like:

[
    'query' => [ 'bool' => $q->toArray() ],
    'rescore' => [
        [
            'window_size' => 8192,
            'query' => [
                'query_weight' => 1,
                'rescore_query_weight' => 1,
                'rescore_query' => [
                    'function_score' => [ ... ]
                ]
            ]
        ]
    ]
]

The difference here is that Elasticsearch only runs the additional sorting against the top 8k results per shard. One way to think of this is that the main query is your filtering and fast-scoring phase, and the rescore then applies the more expensive sorting logic. Phabricator has 5 shards on the cluster, so that's approximately the top 40k results. This avoids running the more expensive query on up to 2M matching docs.

We do something similar for wikinews (or with the prefer-recent: keyword). See the final rescore: https://en.wikinews.org/wiki/?search=~foo&cirrusDumpQuery Although it looks like nik/chad went with a custom script score to do the decay instead of the builtin. Not sure why in particular; possibly gauss wasn't available when it was written, or there was some characteristic they were trying to get.
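
To make the shape of that request and the window arithmetic concrete, here is a minimal Python sketch that builds the same body as a plain dict. The helper names and the `dateModified` field are hypothetical and not taken from the patch; the decay parameters simply reuse the first tier from the summary.

```python
def build_search_body(main_bool_query, decay_functions, window_size=8192):
    """Build a search body whose decay scoring runs only in the rescore phase."""
    return {
        # Main query: filtering and fast scoring over all matching documents.
        "query": {"bool": main_bool_query},
        "rescore": [
            {
                # Only the top `window_size` hits on each shard are rescored.
                "window_size": window_size,
                "query": {
                    "query_weight": 1,
                    "rescore_query_weight": 1,
                    "rescore_query": {
                        "function_score": {"functions": decay_functions}
                    },
                },
            }
        ],
    }

def max_rescored_docs(window_size, shard_count):
    """Upper bound on documents touched by the rescore across all shards."""
    return window_size * shard_count

# Hypothetical field name; parameters mirror the first tier in the summary.
decay = [{"exp": {"dateModified": {"origin": "now", "offset": "1d",
                                   "scale": "10d", "decay": 0.8}}}]
body = build_search_body({"must": []}, decay)
print(max_rescored_docs(8192, 5))  # 40960, i.e. roughly the "top 40k"
```

One design point to check before relying on this: as far as I know, the rescore `score_mode` defaults to `total`, which adds the two weighted scores, so a decay meant to scale the original score would likely need `"score_mode": "multiply"` inside the rescore `query` block.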

The exact weighting is never easy to get right. Without some sort of "ground truth" dataset to run queries against and evaluate the results, the best you can do is tweak and see what happens.

Oh, I didn't realise I was added to this. Will review.

Paladox accepted this revision. Oct 25 2017, 11:01 PM
This revision is now accepted and ready to land. Oct 25 2017, 11:01 PM

Although it looks like nik/chad went with a custom script score to do the decay instead of the builtin. Not sure particularly why, possibly gauss wasn't available when it was written or there was some characteristic they were trying to get.

We were trying to duplicate the existing decay logic for wikinews from lsearchd. I think a script was the quickest and easiest way at the time -- builtin decay may have been lacking then?

mmodell added a comment. Edited Oct 27 2017, 1:20 PM

At least in CirrusSearch we would ensure something like this goes in the rescore phase, as opposed to the main query. It looks like phabricator has ~2M documents in the index so it is possibly worthwhile here as well.

Thanks for the very helpful review, @EBernhardson! I'll try it with rescore...

BTW, that cirrusDumpQuery param is something I wasn't aware of. Thanks for sharing that tidbit, I will find it very helpful. I really have been needing some real world example queries to help understand elasticsearch query format a bit better. Even after reading all of the documentation multiple times, the elastic query DSL is still just about as clear as mud.

In D830#16800, @greg wrote:

@EBernhardson what do you think of doing this for phab search results? Is this a bad idea? What unintended bad consequences didn't we think of? :)

To answer that, it might help to know about the bad consequences that we DID think of. The only major one I could think of is that the effect might be too strong and over-prioritize new content. I hadn't thought much about performance; fortunately, @EBernhardson has helpfully suggested a way to improve that.

mmodell abandoned this revision. Nov 17 2017, 4:40 AM

This turned out to be a terrible idea (see T180706: Phabricator search hugely degraded in quality).