
"Morelike" queries on titles with spaces are returning no results
Closed, Resolved · Public

Event Timeline

Restricted Application added a subscriber: Aklapper.

For the specific search against Donald Trump, it looks like the page was vandalized to contain only 'Trump indeed', which isn't returning any more like results (although it probably still should?). I'm not sure yet why the page hasn't been properly reindexed into Elasticsearch after that vandalism edit was undone.

I still haven't figured out whether this is a wider problem.

I posted these two messages to the wrong ticket ... copying to here:

This problem may be exacerbated by the 24-hour cache we have on more like queries: an existing result lives for 24 hours after it was initially retrieved, after which we issue a new query to the search cluster.
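To illustrate why a cached result can outlive the fix, here is a minimal sketch of a time-to-live cache. The `TTLCache` class, the cache key, and the placeholder result titles are all hypothetical and are not CirrusSearch's actual caching code; only the 24-hour lifetime comes from the comment above.

```python
import time


class TTLCache:
    """Tiny time-to-live cache (hypothetical, for illustration only)."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (stored_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]  # still fresh: the backend is not queried
        value = compute()
        self._entries[key] = (now, value)
        return value


# Simulated clock so the example runs instantly.
fake_now = [0.0]
cache = TTLCache(ttl_seconds=24 * 3600, clock=lambda: fake_now[0])

# First lookup happens while the index still holds the vandalised document.
first = cache.get_or_compute("morelike:Donald_Trump", lambda: [])

# One hour later the index is fixed, but the cached empty result is served.
fake_now[0] = 3600.0
second = cache.get_or_compute("morelike:Donald_Trump", lambda: ["Page A", "Page B"])

# Once 24 hours have passed, the entry expires and a new query is issued.
fake_now[0] = 24 * 3600.0
third = cache.get_or_compute("morelike:Donald_Trump", lambda: ["Page A", "Page B"])
```

In this sketch the second lookup still returns the stale empty list, which is why a fix to the index cannot be confirmed through the cached endpoint until the entry expires.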

I've manually re-issued the correct query against the cluster though and it's currently returning zero results (but we still have the vandalised document in the search index):

curl -XGET localhost:9200/enwiki_content/page/_search -d '{"_source":["id","title","namespace","redirect.*","timestamp","text_bytes"],"fields":"text.word_count","query":{"more_like_this":{"min_doc_freq":2,"max_doc_freq":null,"max_query_terms":25,"min_term_freq":2,"min_word_len":0,"max_word_len":0,"percent_terms_to_match":0.3,"fields":["text"],"ids":[4848272]}},"highlight":{"pre_tags":["<span class=\"searchmatch\">"],"post_tags":["<\/span>"],"fields":{"title":{"type":"experimental","fragmenter":"none","number_of_fragments":1,"matched_fields":["title","title.plain"]},"redirect.title":{"type":"experimental","fragmenter":"none","order":"score","number_of_fragments":1,"options":{"skip_if_last_matched":true},"matched_fields":["redirect.title","redirect.title.plain"]},"category":{"type":"experimental","fragmenter":"none","order":"score","number_of_fragments":1,"options":{"skip_if_last_matched":true},"matched_fields":["category","category.plain"]},"heading":{"type":"experimental","fragmenter":"none","order":"score","number_of_fragments":1,"options":{"skip_if_last_matched":true},"matched_fields":["heading","heading.plain"]},"text":{"type":"experimental","number_of_fragments":1,"fragmenter":"scan","fragment_size":150,"options":{"top_scoring":true,"boost_before":{"20":2,"50":1.8,"200":1.5,"1000":1.2},"max_fragments_scored":5000},"no_match_size":150,"matched_fields":["text","text.plain"]},"auxiliary_text":{"type":"experimental","number_of_fragments":1,"fragmenter":"scan","fragment_size":150,"options":{"top_scoring":true,"boost_before":{"20":2,"50":1.8,"200":1.5,"1000":1.2},"max_fragments_scored":5000,"skip_if_last_matched":true},"matched_fields":["auxiliary_text","auxiliary_text.plain"]},"file_text":{"type":"experimental","number_of_fragments":1,"fragmenter":"scan","fragment_size":150,"options":{"top_scoring":true,"boost_before":{"20":2,"50":1.8,"200":1.5,"1000":1.2},"max_fragments_scored":5000,"skip_if_last_matched":true},"matched_fields":["file_text","file_text.plain"]}},"highlight_query":{"match_all":{}}},"size":20,"rescore":[{"window_size":8192,"query":{"query_weight":1,"rescore_query_weight":1,"score_mode":"multiply","rescore_query":{"function_score":{"functions":[{"field_value_factor":{"field":"incoming_links","modifier":"log2p","missing":0}},{"weight":2,"filter":{"fquery":{"_cache":true,"query":{"match":{"template":{"query":"Template:Featured article"}}}}}},{"weight":2,"filter":{"fquery":{"_cache":true,"query":{"match":{"template":{"query":"Template:Featured picture"}}}}}},{"weight":2,"filter":{"fquery":{"_cache":true,"query":{"match":{"template":{"query":"Template:Featured sound"}}}}}},{"weight":1.75,"filter":{"fquery":{"_cache":true,"query":{"match":{"template":{"query":"Template:Featured list"}}}}}},{"weight":1.5,"filter":{"fquery":{"_cache":true,"query":{"match":{"template":{"query":"Template:Good article"}}}}}}]}}}}],"stats":["more_like"]}'
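For readability, the core of that request can be pulled out on its own. The sketch below keeps only the more_like_this clause and the result size, with values copied from the command above; the highlighting and rescoring sections are dropped, as are `max_doc_freq`, `min_word_len`, and `max_word_len`. Whether this reduced query behaves the same as the full one against the cluster is an assumption and has not been checked.

```python
import json

# Values copied from the curl command above; everything else is omitted.
core_query = {
    "query": {
        "more_like_this": {
            "min_doc_freq": 2,
            "max_query_terms": 25,
            "min_term_freq": 2,
            "percent_terms_to_match": 0.3,
            "fields": ["text"],
            "ids": [4848272],
        }
    },
    "size": 20,
}

# Serialize to the JSON body that would be passed to curl with -d.
body = json.dumps(core_query)
print(body)
```

One plausible reading, not confirmed in this thread, is that `min_term_freq` of 2 leaves no usable terms when the indexed text is only two distinct words, which would fit the zero results seen for the vandalised document.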

Two other interesting things that are wrong:

  • The elasticsearch index reports it has revision 715300453, but the content is from a revision several edits behind: 715144349
  • I performed a null edit to the page which should have triggered a reindex. The cirrus update job ran as expected, but the index is still wrong:
2016-04-15 02:06:41 [bee626f7453ef00c8e6d3b1e] mw1008 enwiki 1.27.0-wmf.21 runJobs DEBUG: cirrusSearchLinksUpdatePrioritized Donald_Trump addedLinks=[] removedLinks=[] prioritize=1 cluster= (uuid=817570bb218142b5bd930c08438d94d5,timestamp=1460686001,QueuePartition=rdb3-6380) STARTING
2016-04-15 02:06:42 [bee626f7453ef00c8e6d3b1e] mw1008 enwiki 1.27.0-wmf.21 runJobs INFO: cirrusSearchLinksUpdatePrioritized Donald_Trump addedLinks=[] removedLinks=[] prioritize=1 cluster= (uuid=817570bb218142b5bd930c08438d94d5,timestamp=1460686001,QueuePartition=rdb3-6380) t=929 good
  • This doesn't seem to be a case of Elasticsearch rejecting the update due to versioning: the index has version 715300453, and we are updating to my null edit, which is a brand new, higher version, 715322090.
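The versioning argument in the last bullet can be spelled out with the revision numbers quoted above. This is a sketch that assumes the index behaves like Elasticsearch's external versioning, where a write is accepted only if its version is strictly greater than the stored one; `accepts_update` is a hypothetical helper, not a CirrusSearch or Elasticsearch function.

```python
def accepts_update(stored_version, incoming_version):
    """Mimic an external-versioning check: only strictly newer versions win."""
    return incoming_version > stored_version


stored_version = 715300453      # version the index reports
null_edit_version = 715322090   # revision created by the null edit
content_revision = 715144349    # revision the indexed content actually matches

# The null edit is newer than what the index claims to hold, so a
# version conflict should not be the reason the update is lost.
null_edit_accepted = accepts_update(stored_version, null_edit_version)

# Under the same rule, the older content revision would be rejected.
old_content_accepted = accepts_update(stored_version, content_revision)

print(null_edit_accepted, old_content_accepted)
```

Under that assumption the null edit passes the check, which points to the update being lost somewhere other than the version comparison.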

It's getting late and I've run out of time to debug this today, but I'll run a manual reindex from mwrepl tomorrow and see if I can figure out what's going on. It's unlikely this problem is restricted to the Donald Trump article; this just happens to be the first place we are seeing this indexing problem.

@dcausse Since you only had code review plans for Friday, this might be a fun one to dig into :) Otherwise I'll also be spending my Friday looking into why the indexing isn't working right.

@EBernhardson While debugging with mwrepl I found that nginx is rejecting our request due to a size limit (413 Request Entity Too Large). According to @Gehel the default entity size limit is 1 MB for nginx but 100 MB for Elasticsearch. A patch will be deployed soon to fix the issue. However, I don't yet understand why the version was updated in Elasticsearch if the request failed.
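A rough way to see which layer would reject an oversized update is to compare the serialized body against each limit in the order the request passes through them. This is a sketch under the assumption that nginx sits in front of Elasticsearch with the defaults quoted above (1m for nginx's `client_max_body_size`, 100mb for Elasticsearch's `http.max_content_length`); the helper names and the sample documents are hypothetical.

```python
import json

# Limits in request order: nginx is hit first, then Elasticsearch.
LIMITS = [
    ("nginx", 1 * 1024 * 1024),            # default of 1m
    ("elasticsearch", 100 * 1024 * 1024),  # default of 100mb
]


def body_size(document):
    """Size in bytes of the JSON body that would be sent."""
    return len(json.dumps(document).encode("utf-8"))


def first_rejecting_layer(size, limits=LIMITS):
    """Return the first layer whose limit the body exceeds, or None."""
    for name, limit in limits:
        if size > limit:
            return name
    return None


# Hypothetical documents: a short stub and a page with about 2 MiB of text.
small_doc = {"title": "Stub", "text": "Trump indeed"}
large_doc = {"title": "Long article", "text": "x" * (2 * 1024 * 1024)}

small_result = first_rejecting_layer(body_size(small_doc))
large_result = first_rejecting_layer(body_size(large_doc))
print(small_result, large_result)
```

In this sketch the large document is stopped at the nginx layer even though it is far below the Elasticsearch limit, which matches the 413 response described above.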

The nginx fix is deployed and the problem seems to be fixed: the Donald Trump page is now properly indexed. As for this specific issue (morelike), unfortunately the result is cached for 24h, so we will have to wait a few more hours to confirm that the problem was due to index refresh problems.

Deskana assigned this task to dcausse.
Deskana triaged this task as Medium priority.
Deskana subscribed.

The query in the description now returns results, so this problem appears to be fixed.