
[Bug] high-ranking items seemed to have dropped significantly in Special:Search results for wikidata
Closed, Resolved · Public


It seems recently a number of high-ranking items have dropped off significantly on Wikidata's search results. One example is life (Q3). It is now listed at around 750 but was among the first results in the past and still should be.

(Reported on the contact the dev team page on Wikidata)



Event Timeline

Lydia_Pintscher raised the priority of this task from to Needs Triage.
Lydia_Pintscher updated the task description. (Show Details)
Lydia_Pintscher added a subscriber: Lydia_Pintscher.
Restricted Application added a subscriber: Aklapper.
daniel added a subscriber: daniel.

Confirmed that Special:Search doesn't boost high-profile items like Q3. Cirrus can boost based on the number of incoming links, and already tracks them, so it should be easy enough to make this work.

daniel triaged this task as High priority. Sep 10 2015, 3:22 PM

This makes for a rather bad and confusing user experience on Special:Search

@Deskana: Is there anything you changed that might have caused this middle/end of August? I am at a loss what's happening here.

We should also boost matches on labels and aliases. That's probably a little harder to do, but should not be terrible.

Adding this to Discovery-Search (Current work) to investigate whether there's a cause to this that we are aware of.

First of all: sorry for all the low-level details in this comment, but it's always complex to tackle such relevance issues.

I assume that life is the query.

Wikidata already uses incoming_link to boost the top-N results (8196 docs per shard).

The way cirrus scores documents for wikidata is:

  1. The lucene score (applied to all docs). NOTE: when I talk about top-N docs below, this is according to this ranking.
  2. The phrase rescore: if the query has more than one word, the doc is overboosted if it contains the same sequence of adjacent words. Only the top-N docs are analyzed (N=512 per shard here because it's very costly). This does not apply here because the query is one word.
  3. Special:Search on wikidata is configured to query 2 namespaces (0 and 120). The boost for ns 0 is 0.05 and for ns 120 it is 0.2 (top-8196 docs per shard analyzed). I assume this is not related to our problem because there are only 10 properties related to life.
  4. The number of incoming links (top-8196 docs per shard analyzed).
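As a rough illustration, the steps above combine multiplicatively for a one-word query (a sketch, not actual Cirrus code; the function name, the boost table, and the log base are assumptions based on the numbers quoted later in this thread):

```python
import math

def wikidata_score(lucene_score, incoming_links, namespace):
    """Sketch of how steps 1, 3 and 4 combine for a one-word query
    (the phrase rescore, step 2, does not apply here)."""
    ns_boost = {0: 0.05, 120: 0.2}[namespace]          # step 3
    link_boost = math.log10(2 + incoming_links)        # step 4
    return lucene_score * link_boost * ns_boost        # step 1 times boosts
```

For the same lucene score, a property (ns 120) gets four times the namespace boost of an item (ns 0).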

A small note on the lucene score:
Lucene scores docs using a tf.idf formula; this formula also includes a normalization based on document size. Large documents tend to be ranked lower. This is understandable: large docs may have higher term frequencies and thus higher raw tf.idf scores, and normalization on size helps to mitigate this problem.
Why does it affect wikidata?
Because we flatten all the data into the same field, a wikibase entity with a lot of labels in many different languages (likely to happen for high profile items) will be larger than less important items and thus have a lower lucene score.
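The length-norm effect can be sketched with two ingredients of Lucene's classic TF/IDF similarity (idf and other factors omitted for clarity; the document sizes below are illustrative, not measured):

```python
import math

def tf_weight(freq):
    # Dampened term frequency, as in Lucene's classic similarity
    return math.sqrt(freq)

def length_norm(num_terms):
    # Larger documents are penalized proportionally to sqrt(size)
    return 1.0 / math.sqrt(num_terms)

# Same term frequency, but the entity flattened with many labels is much larger:
compact_doc = tf_weight(10) * length_norm(90)
flattened_doc = tf_weight(10) * length_norm(900)
assert compact_doc > flattened_doc
```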

Because of the current cirrus<->wikidata mapping problems we're trying to address (everything is in the same field, so no boosts on title/redirects can be applied), it's very likely that the incoming_link boost will take precedence over the lucene score. From what I see, life has a low number of incoming links (53) compared to Encyclopedia of Life, which has 1,081,079 incoming links.
On the other hand, the third result has only 32 incoming_links.

Why does Q3 have a bad lucene score?
Let's compare Q3 (ranked ~700) and x (ranked 4)

  • Q3 lucene score is 0.5476983
  • x lucene score is 0.85728467

This is because there are only 10 occurrences of the word life in the content for Q3 versus 64 for x, and Q3 is larger (length-norm effect).

The boost on incoming links is:

  • Q3: should be something like log(2+53), but it's 0.69897 <- completely wrong
    • it looks like it's log(2+3)
  • x: should be something like log(2+32), and it's 1.5314789, which is good.

So it looks like the problem is that the number of incoming links stored in elasticsearch does not reflect the actual number.
This is normal in certain conditions: we have an optimization to not update docs too frequently, so if the number of incoming links does not change by more than 20% we ignore the update.
But here it's way more than 20%: it's a 1700% difference...
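The update-skipping optimization described above can be sketched like this (hypothetical names, not the actual Cirrus implementation):

```python
def should_reindex(stored_links, actual_links, threshold=0.2):
    """Skip the costly document update when incoming_links moved by
    less than ~20% relative to the stored value (hypothetical sketch)."""
    if stored_links == 0:
        return actual_links > 0
    return abs(actual_links - stored_links) / stored_links > threshold
```

With 3 stored links against 53 actual ones the relative change is far beyond the threshold, so the document should have been updated.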

I'm not sure what's happened here...

Would it be possible to update Q3 to force a re-index of this entity and see if it fixes the issue?
If yes then we will certainly have to write a maintenance script to check this incoming_link consistency.

Side note: as you can see, the lucene score is rather bad for Q3, so scoring is very fragile on wikidata. This cannot be addressed without all the work planned for a better cirrus<>wikidata integration.

shows 57 incoming links which is the same as what says.

the item has been edited recently and I think cirrus is up-to-date.

Encyclopedia of Life has a lot of incoming links because I think we have a property for it and statements for it on a lot of items.

one problem is that title matches are not useful as-is: "life" is an exact match for a label, but not "Q3" (the title). We would want to be able to identify that "life" is an exact label match (in en = my user language) and boost that a bit.

If my user / search language is not English (maybe I am using Wikidata in German) and "life" is not an exact label match in my language, but is still an exact label match in a different language, it could perhaps also get some boost but not as much as a match in my language.

in our own "search" based on the wb_terms table, we use term_weight which considers the number of site links an item has and the number of labels it has. we should probably add these as properties in the cirrus index.
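A weight in the spirit of wb_terms' term_weight could look roughly like this (a hypothetical formula for illustration only; the actual computation in Wikibase differs):

```python
import math

def term_weight(sitelink_count, label_count):
    """Illustrative sketch: more sitelinks and more labels push an
    entity's weight up, with diminishing returns via the log."""
    return math.log(1 + sitelink_count) + 0.5 * math.log(1 + label_count)
```

A prominent item with hundreds of sitelinks would clearly outrank an obscure one under any formula of this shape.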

that might be the easiest/quickest thing to do and maybe we want this anyway in the longer term as part of the scoring.

Yes if you have numeric properties that are ready to use, we might be able to use them soon.
With we should be able to write a custom rescore profile for wikidata and try to workaround the poor lucene scores we have today.

PS. I'll continue to investigate why incoming_links is wrong for Q3 and certainly for other entities :(

Thanks so much for looking into this!

Sorry... I was completely wrong when analyzing the lucene explain output for Q3 (it's a pain to debug scoring issues).
I think I've read another entity.

Q3 lucene score is 0.1824194
Link boost score is 1.763428 ~= log(2+53), so it's OK
Namespace boost: 0.05
Final score will be: 0.1824194 * 1.763428 * 0.05 => 0.016084173
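The arithmetic is easy to verify (the observed link boost 1.763428 is close to, but not exactly, log10(2+53)):

```python
import math

lucene = 0.1824194
link_boost = 1.763428   # observed; log10(2 + 53) is roughly 1.74
ns_boost = 0.05         # namespace 0

final = lucene * link_boost * ns_boost
assert abs(final - 0.016084173) < 1e-6
```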

Here are a few examples:

| entity | number of words (lower is better) | life freq (higher is better) | lucene | links | ns | final | rank | desc |
| Q3 | 830 | 9 | 0.1824194 | 1.763428 | 0.05 | 0.01608417 | ~800 | The lucene score is very bad |
| x | 280 | 64 | 0.8565265 | 1.5314789 | 0.05 | 0.06558761 | 4 | The lucene score is good and incoming_link is OK |
| x | 89 | 34 | 1.075165 | 0.7781513 | 0.05 | 0.04183205 | 220 | Incoming link is bad, but the lucene score is good even with only 34 occurrences, because of the size norm (89 words vs 280 for x) |

So clearly it's because of the bad lucene score.
So I was wrong: incoming links won't take precedence.
Note that the lowest value for the incoming-link boost is log(2+0), so ~0.3, but there's no lower bound for the lucene score.

But I can't explain why this has changed in August... :(
My only explanation so far would be:

  • New labels have been added to Q3, increasing the number of words; the lucene score decreased as a direct consequence
  • Many entities with an occurrence of life have been added during that period

I'm not aware of an issue that could have affected the lucene score during August because it's internal to lucene/elasticsearch and something we never touched.

How to fix this problem:

  • fixing the bad lucene score will require a better cirrus <> wikidata integration to allow more complex queries with dedicated fields and boosts. This is hard work that I don't think will be resolved in the short term.
  • a workaround could be to write a custom rescore profile with a new numeric field, or to overboost incoming links (maybe completely inhibiting the lucene score for now). Could be addressed by

Thanks for the investigation @dcausse!

I've asked @EBernhardson to take a look at this and recommend a course of action.

@EBernhardson @Deskana thanks for helping look at this!

regarding better cirrus - wikidata integration, I am looking into how we might do that: T117548 (still WIP, but feedback welcome)

to help fix ranking, probably the easiest thing we can start with is to add extra fields like site link count, maybe statement count and label count. I am not yet sure how we would work this into the scoring, but surely it must be possible

What do folks think of adding a field that contains sitelink count as next step? We consider site link count (+ label count) in the wikibase entity suggester (which uses db backend now) and it does decently okay considering it is a crude measure.

I think we could then experiment with a rescoring profile / config for Wikidata that takes the field into consideration.

A big +1.
As far as I know it should be pretty straightforward: you just need to implement 2 hooks (CirrusSearchMappingConfig and CirrusSearchBuildDocumentParse).
The profiles (we may want to create multiple profiles with different weights for testing purposes) can be added to wmf-config.

Then we will have to re-index (exactly what you have done for geo coordinates recently).

This task seems to have dropped off the radar, somewhat; is further discussion needed, or is this awaiting action from someone (either Discovery or Wikidata)?

@Deskana I am working on T119066 which should help some. Suppose I need to know if it would be ok to deploy this next week, if it gets merged before the branch cut?

this doesn't help, though, with searching for "life" not ranking the exact-match english label "life" above non-exact matches. That requires putting labels into cirrus in a more structured way and then further adjusting how search and scoring work for wikibase content.

For indexing labels, I would like feedback on T117548. I have already chatted some with David about this, and probably will more at the dev summit.

@aude I can help to write the rescore profiles when you are ready.

Also, I realized that the example profiles I wrote in Cirrus are wrong: they use "multiply" to combine the scores, but that makes no sense: (weight1 * score1) * (weight2 * score2). We might prefer weighted means/sums and use "add" or "avg".
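A toy comparison of the two combination modes (weights and scores below are made-up values, purely to show the shapes of the formulas):

```python
# Main query (1) and rescore function (2), each with a weight and a score:
w1, s1 = 0.6, 0.8
w2, s2 = 0.4, 0.5

multiplied   = (w1 * s1) * (w2 * s2)   # "multiply": the weights compound and shrink the score
weighted_sum = w1 * s1 + w2 * s2       # "add": a proper weighted sum
weighted_avg = weighted_sum / (w1 + w2)
```

With "multiply", raising either weight scales the whole product, so the weights stop acting as relative importances; the weighted sum keeps them meaningful.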

We would also like to create multiple profiles with various weights and experiment.

If we are blocked on deployment during December we could try to use our hypothesis cluster and index wikidata there to play with it and adjust the weights.

@dcausse: If the scores are comparable, I suggest we use max, not avg or sum. If they are not comparable, then we can't use sum/avg/max, we'll have to use some sort of product. We could play with log scaling and see if it helps.

In any case, I believe we should not use tf/idf scoring at all. It makes no sense for wikidata. Term frequency isn't indicative of anything in this context. Inverse document frequency is really only important to mitigate the impact of irrelevant terms in the search query - probably also not very helpful here, assuming that most searches are for phrases, not intersections of multiple keywords.

In-degree (ideally only counting "main snak" links), number of sitelinks, and number of statements are probably better indicators. I'd also love to see how Page Rank on main snak links would perform. My gut tells me that that should work pretty well.

We can inhibit tf/idf by setting the weight of the main query to 0 and using either "max" or "add". Note that tf/idf will still play a role in extracting the top-N results that will be rescored. N is 8196*7 (number of shards), so if shards are well balanced we should cover queries that return fewer than 57372 entities (with more results we risk that the interesting entity falls outside the window). We can increase this number, but at some perf cost. I'll try to extract (from cirrus logs) the number of queries that return more than this number and see if we should worry about that.
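The window arithmetic above is simple to check:

```python
# Rescore windows described in this thread: a per-shard window times 7 shards.
link_window_per_shard = 8196     # incoming-links / namespace rescore
phrase_window_per_shard = 512    # phrase rescore (much more costly)
shards = 7

link_window = link_window_per_shard * shards       # 57372 docs
phrase_window = phrase_window_per_shard * shards   # 3584 docs
assert link_window == 57372
```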

Adjusting all these weights and finding the proper formula is not an easy task; we should find a way to evaluate the performance. We could maybe take Q1 to x, run a query with the english label for each, and count the number of times the entity is in the top 10. But I don't know the wikidata content very well, so there are certainly better tests to run.
We are building a set of tools to run those perf evaluations that could be useful in this case.

Concerning phrases, we have a rescore function with a strong weight, but it is applied only to the top-512*7 docs because it's very costly. There are techniques to optimize this process (word n-grams). If it makes sense for wikidata, we should probably investigate in this direction as well.

Concerning PageRank, I think you're right: @EBernhardson ran a test on enwiki and the results are promising. We are building the tools needed to inject such data into the indices (hadoop <-> elastic). If the wikidata link graph is easy to extract, it should be "easy" to do.

one approach for main snak links could be to extract them when generating parser output and then store them in yet another table, similar to page links

alternatively, generating this data using hadoop could work

We are making a new deployment branch of Wikibase on Monday or Tuesday, for deploy next week. Suppose I need to check with Erik + Greg etc if poking at cirrus on Wikidata is okay next week. (after all, we have train deployment next week and week after, so maybe)

I pulled in a wikidata dump to our hypothesis testing cluster a couple weeks ago, but haven't done anything with it. It contains 18.8M documents so should be pretty much the whole thing.

With the train rolling forward I don't see any reason we can't push things into prod. For figuring out what options should be used in rescoring, the hypothesis-testing cluster might be a better bet.

The cluster is estest100{1,2,3,4}.search.eqiad.wmflabs

Change 257607 had a related patch set uploaded (by DCausse):
Add initial rescore profiles for wikidata

EBernhardson renamed this task from [Bug] high-ranking items seemed to have dropped significantly in Special:Search results to [Bug] high-ranking items seemed to have dropped significantly in Special:Search results for wikidata. Dec 8 2015, 5:41 PM

moving back to needs-review as all patches needed in wikidata have been merged.

main patch is now reviewed and merged. The change to operations/mediawiki-config to make the new profiles available is mergable but needs to be done after the deployment freeze is over.

deployment freeze is over, but it looks like @aude is still working out some adjustments to the profiles before we go live with this.

yes, I am experimenting with different configs to see what might be best.

Reassigning to @aude, since she's actively working on this. Keeping this on the Discovery-Search (Current work) board though, since we'll probably need to review the code.

Since there's been no activity here in the past week, I'm going to take this out of Discovery-Search (Current work). @aude, please ping us when you need code review, and we'll be there to help you out. :-)

I'm working in the current wikidata sprint to add labels to cirrus and experiment with rescoring.

@aude @dcausse If you still wish to test patch 257607, I've moved the settings to -labs, as suggested in Gerrit by @EBernhardson before the code freeze.

Change 257607 abandoned by DCausse:
Add initial rescore profiles for wikidata

this is not going to happen like that; we need to experiment on the relforge servers.