Wikidata autocomplete (wbsearchentities) results with score <= 0
Closed, Resolved | Public | 5 Estimated Story Points

Description

When running Wikidata autocomplete (wbsearchentities) queries, results can end up with a score of 0 because of the way the query is constructed. This is not currently a critical problem, but future versions of Elasticsearch disallow negative scores.

https://www.wikidata.org//w/api.php?action=wbsearchentities&format=json&search=abstract+art&language=en&cirrusDumpQuery

{
    "bool": {
      "should": [
        {
          "bool": {
            "filter": [ { "match": { "labels_all.prefix": "albert" } } ],
            "should": [
              {
                "dis_max": {
                  "tie_breaker": 0,
                  "queries": [
                    { "constant_score": { "filter": { "match": { "labels.en.near_match": "albert" } }, "boost": 2 } },
                    { "constant_score": { "filter": { "match": { "labels.en.near_match_folded": "albert" } }, "boost": 1.6 } },
                    { "constant_score": { "filter": { "match": { "labels.en.prefix": "albert" } }, "boost": 1.1 } },
                    { "constant_score": { "filter": { "match": { "labels_all.near_match_folded": "albert" } }, "boost": 0.001 } }
                  ]
                }
              }
            ]
          }
        },
        { "term": { "title.keyword": "albert" } }
      ],
      "minimum_should_match": 1,
      "filter": [ { "term": { "content_model": "wikibase-item" } } ]
    }
  }

https://www.wikidata.org//w/api.php?action=wbsearchentities&format=json&search=abstract+art&language=en&cirrusDumpResult
The 6th result has a score of 0:

{
    _index: "wikidatawiki_content_1537536135",
    _type: "page",
    _id: "55400981",
    _score: 0,
    _source: {
        namespace: 0,
        title: "Q55370741",
        descriptions: {
            en: "exhibition"
        }
    },
    highlight: {
        labels.nl.prefix: [
            "0:0-12:40|Abstract art, Befreiung, Stil und Ironie"
        ]
    }
}

In this particular case there are only 7 results, so a non-zero score wouldn't change anything, but this likely occurs elsewhere. Items can match the bool filter but nothing else, resulting in a score of 0. Negative rescore boosts take the 0 and turn it negative.

Converting that filter into a must with a tiny boost should ensure we always have some sort of score to apply basic ordering:

{
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
                { "constant_score": { "filter": { "match": { "labels_all.prefix": "albert" }}, "boost": 0.001} }
            ],
            "should": [
              {
                "dis_max": {
                  "tie_breaker": 0,
                  "queries": [
                    { "constant_score": { "filter": { "match": { "labels.en.near_match": "albert" } }, "boost": 2 } },
                    { "constant_score": { "filter": { "match": { "labels.en.near_match_folded": "albert" } }, "boost": 1.6 } },
                    { "constant_score": { "filter": { "match": { "labels.en.prefix": "albert" } }, "boost": 1.1 } },
                    { "constant_score": { "filter": { "match": { "labels_all.near_match_folded": "albert" } }, "boost": 0.001 } }
                  ]
                }
              }
            ]
          }
        },
        { "term": { "title.keyword": "{{QUERY_STRING}}" } }
      ],
      "minimum_should_match": 1,
      "filter": [ { "term": { "content_model": "wikibase-item" } } ]
    }
  }
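
To make the difference concrete, here is a minimal Python sketch of how the inner bool query above would be scored. It is a toy model, not CirrusSearch or Elasticsearch code, and it rests on these assumptions: a filter clause contributes 0 to the score, a constant_score clause contributes its boost, dis_max with tie_breaker 0 takes the best matching clause, and bool sums its matching must/should clauses. The outer title.keyword term and the content_model filter are left out. The function and constant names are made up for illustration.

```python
# Toy scoring model for the inner bool query (illustrative names only).
DIS_MAX_BOOSTS = {
    "labels.en.near_match": 2.0,
    "labels.en.near_match_folded": 1.6,
    "labels.en.prefix": 1.1,
    "labels_all.near_match_folded": 0.001,
}
PREFIX_FIELD = "labels_all.prefix"
PREFIX_BOOST = 0.001


def toy_score(matched_fields, prefix_as_must):
    """Return the toy score, or None if the document does not match at all."""
    if PREFIX_FIELD not in matched_fields:
        return None  # the prefix clause is required in both variants
    # As a filter the prefix match adds nothing; as a must it adds its boost.
    base = PREFIX_BOOST if prefix_as_must else 0.0
    # dis_max with tie_breaker 0: only the best matching clause counts.
    best = max(
        (boost for field, boost in DIS_MAX_BOOSTS.items() if field in matched_fields),
        default=0.0,
    )
    return base + best


# A document that only matches through a non-English label prefix.
only_prefix = {"labels_all.prefix"}
old_score = toy_score(only_prefix, prefix_as_must=False)
new_score = toy_score(only_prefix, prefix_as_must=True)
print(old_score, new_score)
```

Under these assumptions the prefix-only document scores 0 with the filter variant and a small positive value with the must variant, which is the property the proposal relies on.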

For the negative boosts, perhaps we can come up with a way to switch them from sums to products. A product with a value < 1 will de-boost things without going negative.
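
A small arithmetic sketch of the sum-versus-product idea, in Python. The weights (-0.5 and 0.5) and the sample scores are invented for illustration and are not taken from any Cirrus rescore profile.

```python
def deboost_sum(score, weight):
    """Additive rescore: a negative weight can push the score below zero."""
    return score + weight


def deboost_product(score, factor):
    """Multiplicative rescore: a factor in (0, 1] shrinks the score but keeps its sign."""
    if not 0 < factor <= 1:
        raise ValueError("factor must be in (0, 1]")
    return score * factor


scores = [0.0, 0.001, 1.1, 2.0]
summed = [deboost_sum(s, -0.5) for s in scores]
multiplied = [deboost_product(s, 0.5) for s in scores]
print(summed)
print(multiplied)
```

Note that a score of exactly 0 stays 0 under a product, so a multiplicative de-boost only has an effect on documents that already have a positive score.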

Event Timeline

I suggest converting the negative boosts to positive boosts and flipping the filter condition to MUST_NOT; I think we can do this automatically within Cirrus.
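
One way to read this suggestion, sketched in Python on plain dicts. This is not the merged CirrusSearch patch; the boost shape ({"filter": ..., "weight": ...}), the function names, and the example_field filter are all hypothetical. The idea is that subtracting w from documents matching F gives the same ordering as adding w to documents that do not match F, because every score shifts by the same constant.

```python
def flip_negative_boost(boost):
    """Turn a negative-weight boost into a positive one on the negated filter."""
    weight = boost["weight"]
    if weight >= 0:
        return boost  # nothing to do for non-negative weights
    return {
        "filter": {"bool": {"must_not": [boost["filter"]]}},
        "weight": -weight,
    }


def rank(docs, weight, boost_matching):
    """Order doc ids by base score plus weight, applied to matching or non-matching docs."""
    scored = {
        doc_id: base + (weight if matches == boost_matching else 0.0)
        for doc_id, base, matches in docs
    }
    return sorted(scored, key=scored.get, reverse=True), scored


# Hypothetical negative boost on a made-up field.
original = {"filter": {"term": {"example_field": "demote-me"}}, "weight": -0.5}
flipped = flip_negative_boost(original)

# (doc id, base score, matches the original filter?)
docs = [("a", 1.0, True), ("b", 0.8, False), ("c", 0.2, True)]
order_before, scores_before = rank(docs, original["weight"], boost_matching=True)
order_after, scores_after = rank(docs, flipped["weight"], boost_matching=False)
print(order_before, order_after)
```

In this toy example both variants produce the same ordering, and only the flipped variant keeps every score non-negative.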

Smalyshev triaged this task as Medium priority. Jan 29 2019, 6:48 PM
CBogen raised the priority of this task from Medium to High. Aug 27 2020, 8:10 PM
dcausse removed EJoseph as the assignee of this task.
dcausse moved this task from Wikibase Search to needs triage on the Discovery-Search board.
dcausse added a subscriber: EJoseph.
MPhamWMF set the point value for this task to 5. Apr 4 2022, 3:50 PM

Change 784646 had a related patch set uploaded (by EJoseph; author: EJoseph):

[mediawiki/extensions/CirrusSearch@master] Prevent negative weights on BoostedQueriesFunction

https://gerrit.wikimedia.org/r/784646

Change 784646 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Prevent negative weights on BoostedQueriesFunction

https://gerrit.wikimedia.org/r/784646

Change 786267 had a related patch set uploaded (by DCausse; author: EJoseph):

[mediawiki/extensions/CirrusSearch@es68] Prevent negative weights on BoostedQueriesFunction

https://gerrit.wikimedia.org/r/786267

Change 786267 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@es68] Prevent negative weights on BoostedQueriesFunction

https://gerrit.wikimedia.org/r/786267

Do you think there’s any chance that this change (which ended up in wmf.10) caused T307586: wbsearchentities produces no results on 1.39.0-wmf.10?

(Edit: I quoted the wrong version of the change – the commit on master, rECIRd5cf710f34ee: Prevent negative weights on BoostedQueriesFunction, is the one that ended up in wmf.10. I think.)

Nope, this would have been caused by c9c499fe19ec14e939f755e50b9f1c66805c79f4, or more generally by the in-progress upgrade to Elasticsearch 7.10.