Related pages returns same article
Closed, Resolved · Public

Description

Look at related pages for Kathryn Borel and you will see that one of the proposed related pages is the article itself:
https://en.wikipedia.org/wiki/Kathryn_Borel

Event Timeline

JKatzWMF created this task. May 23 2016, 9:16 PM
Restricted Application added projects: Discovery, Discovery-Search. May 23 2016, 9:16 PM
Restricted Application added subscribers: Zppix, Aklapper.

What query is being issued? The direct search results don't include the article itself:

https://en.wikipedia.org/w/index.php?title=Special:Search&profile=default&fulltext=Search&search=morelike%3AKathryn_Borel

  1. Jian Ghomeshi
  2. Canadian Broadcasting Corporation
  3. Trial of Jian Ghomeshi
  4. Moxy Früvous
  5. Q (radio show)
  6. Lucy DeCoutere
  7. Marie Henein
  8. The New York Times
  9. Elizabeth May
  10. Mitt Romney
  11. Kanye West
  12. Michael Jackson
  13. Antonia Zerbisias
  14. Degrassi: The Next Generation
  15. Moors murders
  16. 1982 (book)
  17. Assata Shakur
  18. Ann Coulter
  19. M.I.A. (rapper)
  20. Barack Obama
EBernhardson added a subscriber: dcausse. May 23 2016, 9:44 PM (edited)

Quite odd. That result has certainly made it into the cache, but re-running the query without the cache (manually, from the cirrusDumpQuery output) doesn't return it. Will check with @dcausse to see if this is somehow expected behavior.

Looks like it isn't isolated to this case. A couple of Hive queries:

select requests[0].query, hits.title from cirrussearchrequestset where year=2016 and month=5 and day=22 and requests[size(requests)-1].querytype = 'more_like' and array_contains(array_lower(hits.title), lower(requests[size(requests)-1].query)) limit 10;
query | titles
Gundermann | ["Gundermann","Gundermann (Gattung)","Katzenminzen","Lippenblütler"]
Florence Foster Jenkins (film) | ["Florence Foster Jenkins (film)","Meryl Streep","Rebecca Ferguson","Simon Helberg"]
Florence Foster Jenkins (film) | ["Florence Foster Jenkins (film)","Meryl Streep","Rebecca Ferguson","Simon Helberg","Hugh Grant","Stephen Frears"]
VARTA | ["Varta","Vendula Vartová-Eliášová","Jickovice","Kaple svatého Františka Serafinského (Velká Jesenice)","Josef Varta","Varta (Jickovice)"]
Illegal immigration to India | ["Illegal immigration to India","West Bengal","India","Tripura","Dhaka","Bangladesh"]
Ciclismo en ruta | ["Ciclismo en ruta","Grandes Vueltas","Ciclismo","Circuitos Continentales UCI","Contrarreloj (ciclismo)","UCI ProTour"]
Torneo Esperanzas de Toulon de 2016 | ["Torneo Esperanzas de Toulon de 2016","Selección de fútbol de Argentina","Torneo Esperanzas de Toulon de 2012","Selección de fútbol de Hungría","Torneo Esperanzas de Toulon de 2015","Selección de fútbol de Italia"]
JIT | ["Zimbabwe","Jit","List of Zimbabwean films","Jit (film)"]
Bios | ["DOS","MS-DOS","BIOS","Multiuser DOS"]
ボニーとクライド | ["ボニーとクライド","俺たちに明日はない","ジョン・デリンジャー","ボニーとクライド (アルバム)","クライド・ライト","クライド・クサツ"]

(Note: size(hits.title) > 0 filters out cached results, which don't report this information.)

select count(1) as queries, sum(if(array_contains(array_lower(hits.title), lower(requests[size(requests)-1].query)),1,0)) as query_in_results from cirrussearchrequestset where year=2016 and month=5 and day=22 and requests[size(requests)-1].querytype = 'more_like' and size(hits.title) > 0
queries | query_in_results
8761294 | 2344

Some of these are probably false positives: for example, BIOS and Bios are different pages (though still probably not a great result). Restricting to full title matches, we get:

select count(1) as queries, sum(if(array_contains(hits.title, requests[size(requests)-1].query),1,0)) as query_in_results from cirrussearchrequestset where year=2016 and month=5 and day=22 and requests[size(requests)-1].querytype = 'more_like' and size(hits.title) > 0
queries | query_in_results
8761294 | 1195

So, this problem affects ~1000 unique queries per day. Extrapolating from the cache hit rate, this is perhaps 3-4k requests/day.
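To illustrate the difference between the two counts above, here is a minimal Python sketch of the same containment check, run on three rows copied from the Hive output in this task. The function name `query_in_results` is only an illustration; the real check is the `array_contains` expression in the Hive queries.

```python
def query_in_results(query, titles, case_sensitive=True):
    """Return True if the queried title appears among the result titles."""
    if case_sensitive:
        # Mirrors array_contains(hits.title, query): exact title match only.
        return query in titles
    # Mirrors array_contains(array_lower(hits.title), lower(query)).
    lowered = [t.lower() for t in titles]
    return query.lower() in lowered


# Rows copied from the Hive output earlier in this task.
rows = [
    ("Gundermann", ["Gundermann", "Gundermann (Gattung)", "Katzenminzen", "Lippenblütler"]),
    ("JIT", ["Zimbabwe", "Jit", "List of Zimbabwean films", "Jit (film)"]),
    ("Bios", ["DOS", "MS-DOS", "BIOS", "Multiuser DOS"]),
]

loose = sum(query_in_results(q, t, case_sensitive=False) for q, t in rows)
strict = sum(query_in_results(q, t, case_sensitive=True) for q, t in rows)
print(loose, strict)  # prints: 3 1
```

Only the Gundermann row survives the exact match, which is why the stricter query reports fewer hits than the case-insensitive one.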

@dcausse Seems the easiest way to fix this would be to add a query filter, or are there better options?

Yes, but I think that's what we already do... A query filter should be added when we use setLikeText instead of setIds. IIRC, with the allfield we use setLikeText by default, so this code was probably broken somehow...

Probably my fault: https://gerrit.wikimedia.org/r/#/c/220825/7/includes/Searcher.php
The filter was removed and setParam( 'ids', $pageIds ) is called in all cases. Maybe I thought that the ids would have acted as a filter?
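The idea discussed above, pairing the morelike query with an explicit filter that excludes the source page, could look roughly like the following. This is a hedged sketch written as a plain Python dict, not the CirrusSearch code (which is PHP and uses Elastica). The function name `build_morelike_query` and the field name `text` are assumptions, and the exact `more_like_this` parameters depend on the Elasticsearch version in use.

```python
import json


def build_morelike_query(source_ids, fields):
    """Build a search body that finds similar docs but never the sources.

    Assumption: this follows the general shape of the Elasticsearch
    bool / more_like_this / ids queries; parameter names may differ
    between Elasticsearch versions.
    """
    source_ids = [str(pid) for pid in source_ids]
    return {
        "query": {
            "bool": {
                # Find documents similar to the source pages...
                "must": [
                    {
                        "more_like_this": {
                            "fields": list(fields),
                            "like": [{"_id": pid} for pid in source_ids],
                        }
                    }
                ],
                # ...and explicitly drop the source pages themselves.
                "must_not": [{"ids": {"values": source_ids}}],
            }
        }
    }


# Hypothetical page id and field name, for illustration only.
body = build_morelike_query([12345], ["text"])
print(json.dumps(body, indent=2))
```

The point of the sketch is that passing the ids as the "like" input does not, by itself, remove them from the results; the exclusion has to be stated separately.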

Deskana triaged this task as High priority. May 24 2016, 5:48 PM
Deskana added a subscriber: Deskana.

Let's get this fixed.

I think this bug is not directly related to the morelike feature but most probably a consequence of a problem in the page update process.

For example, the same result, Porsche 911, appears twice on the results page:
https://en.wikipedia.org/w/index.php?title=Special:Search&profile=default&fulltext=Search&search=intitle%3A%22Porsche+911%22&searchToken=6iio5myj4tglvr23x99hf68of

The elasticsearch index seems to contain duplicate pages with different ids, in this case 48345830 and 18300273.

According to an API query with these pageids, 48345830 seems to be valid and 18300273 does not exist.
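A check like this can be scripted. The sketch below parses a response shaped like the MediaWiki Action API's default JSON output for `action=query&pageids=...`, where nonexistent pages carry a `missing` key. The `sample` response is hand-written for illustration, not captured output, and the helper name `split_existing_and_missing` is hypothetical.

```python
def split_existing_and_missing(api_response):
    """Split page ids in an action=query response into existing and missing."""
    pages = api_response.get("query", {}).get("pages", {})
    existing, missing = [], []
    for page in pages.values():
        # Assumption: nonexistent page ids are flagged with a "missing" key.
        if "missing" in page:
            missing.append(page["pageid"])
        else:
            existing.append(page["pageid"])
    return sorted(existing), sorted(missing)


# Hand-written sample modeled on the two ids mentioned in this task.
sample = {
    "query": {
        "pages": {
            "48345830": {"pageid": 48345830, "ns": 0, "title": "Porsche 911"},
            "18300273": {"pageid": 18300273, "missing": ""},
        }
    }
}

existing, missing = split_existing_and_missing(sample)
print("existing:", existing)
print("missing:", missing)
```

Any id that shows up in the search index but lands in `missing` here is a candidate stale document.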

I checked a few other examples:

Maybe this is a good reason to look into the saneitize script, which was designed to track and fix these index issues?

The saneitization process that we think will solve this problem has been deployed. It will take two weeks to run through all the indices, though, before we can confirm the problem is actually resolved. After the first loop through, we should be able to check Hive to ensure we are no longer returning duplicates.

The process is slower than we anticipated: 38362500 ids have been checked so far for enwiki. I think we'll have to wait one more week.

dcausse moved this task from Needs review to Done on the Discovery-Search (Current work) board. Aug 8 2016, 6:53 PM (edited)

The loop is done for enwiki and running the same hive query for today returns:

queriesquery_in_results
50861496

I think the sanitize process had a positive effect on this bug. While it has not fixed 100% of the duplicates, the number is now significantly lower (1000 initially). Getting to 0 duplicates is unfortunately very hard to achieve, but we can control the aggressiveness of the sanitize process to make sure it's always close to 0.

We should continue to monitor this hive query from time to time.

select count(1) as queries, sum(if(array_contains(hits.title, requests[size(requests)-1].query),1,0)) as query_in_results from cirrussearchrequestset where year=2016 and month=8 and day=8 and requests[size(requests)-1].querytype = 'more_like' and size(hits.title) > 0;
JKatzWMF closed this task as Resolved. Aug 16 2016, 12:18 AM
JKatzWMF claimed this task.

@dcausse thank you for describing the process, David! I would say that this is a satisfying resolution for our needs.

Took another peek at this since the rollout of RelatedArticles is moving forward. The issue looks to have pretty much stayed solved.

datequeriesquery_in_results
feb 1 2017711356316
feb 2 2017701075610
debt awarded a token. Feb 4 2017, 12:45 AM