
Reindexing search index wikidatawiki for eqiad fails
Closed, ResolvedPublic

Description

After launching reindex on wikidatawiki for eqiad, I get this in the logs:

       Task: hCP8oBWATB-ow4_46CiODw:1486068450 Search Retries: 0 Bulk Retries: 0 Indexed: 57280514 / 57735703
        Task: hCP8oBWATB-ow4_46CiODw:1486068450 Search Retries: 0 Bulk Retries: 0 Indexed: 57282714 / 57735703
        Task: hCP8oBWATB-ow4_46CiODw:1486068450 Search Retries: 0 Bulk Retries: 0 Indexed: 57285514 / 57735703
        Task: hCP8oBWATB-ow4_46CiODw:1486068450 Search Retries: 0 Bulk Retries: 0 Indexed: 57288614 / 57735703
        Task: hCP8oBWATB-ow4_46CiODw:1486068450 Search Retries: 0 Bulk Retries: 0 Indexed: 57290132 / 57735703
Reindex task was not successfull: Failed: [{"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id
[654279810]"}},{"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [890751724]"}}]

and the new index doesn't seem to be there. It looks like we have some problem with wikidatawiki indexing. The codfw indexing went fine, so maybe some timing-related issues are in play.

Event Timeline

Smalyshev updated the task description. (Show Details)

Another error I've got is:

                Validating mapping...different...failed!                                                       
Couldn't update existing mappings. You may need to reindex.       
Here is elasticsearch's error message: process_cluster_event_timeout_exception: failed to process cluster event (put-mapping) within 30s

What is worse, the script does not seem to return an error code there, so the reindex script proceeds after this failure as if nothing happened. This should be fixed too.
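A minimal sketch of the missing error propagation (Python for illustration only; the real reindex scripts are PHP maintenance scripts, and the function name is hypothetical):

```python
import sys

def reindex_exit_status(task_result: dict) -> int:
    """Map a completed reindex task result to a process exit status.

    `task_result` is assumed to be the parsed task status, including
    the "failures" list that the script currently only prints.
    """
    failures = task_result.get("failures", [])
    if failures:
        # Propagate the failure instead of continuing as if nothing happened.
        print("Reindex task was not successful: Failed: %s" % failures,
              file=sys.stderr)
        return 1
    return 0
```

The caller would pass this value to `sys.exit()` so downstream tooling can detect the failure.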

Change 520446 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/mediawiki-config@master] [cirrus] Increase elastic master timeout to 5m

https://gerrit.wikimedia.org/r/520446

The process_cluster_event_timeout_exception is something that can be addressed by increasing the masterTimeout. We had reverted it back to 30s hoping that the multi-instance cluster setup would help here, but apparently we're still seeing this kind of issue. Let's bump this value up to 5m.
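For reference, the knob on the Elasticsearch side is the `master_timeout` request parameter on put-mapping, which bounds how long the master may take to process the cluster event. A sketch of building such a request (the index name is a placeholder, not the real Cirrus index):

```python
def put_mapping_request(index: str, mapping: dict,
                        master_timeout: str = "5m") -> dict:
    """Build the pieces of a put-mapping call that waits up to
    `master_timeout` for the master to apply the mapping change."""
    return {
        "method": "PUT",
        "url": "/%s/_mapping" % index,
        "params": {"master_timeout": master_timeout},
        "body": mapping,
    }

req = put_mapping_request("wikidatawiki_content", {"properties": {}})
```

With the previous 30s value, a busy master's pending-task queue could easily exceed the timeout, producing exactly the error quoted above.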

As for the SearchContext issue, it's hard to do something about it other than stabilizing the cluster. Since no nodes went away during the reindex, I suspect that a shard rebalance caused a shard holding the scroll context to be moved from one node to another. We experience a lot more rebalances in eqiad because we are a bit more constrained on disk space, therefore increasing the chances of shards being moved (esp. for very large indices like wikidata).
In other words I think we need to stabilize this graph: https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=34&fullscreen&orgId=1
Hopefully replacing old nodes in this cluster this year will help a lot.

I tested this by starting up a two-node elasticsearch 6.5.4 cluster, indexing documents, and then scrolling over the documents. After starting the scroll, rerouting the shard from node A -> B works and doesn't break the scroll. Attempting to reroute the shard from B back to A fails to acquire a lock until the scroll is closed. Once the scroll is closed, the shard is free to move back to A again.

It could still have something to do with the shard movement, but i wasn't able to convince elasticsearch to reproduce this error using the cluster reroute api.
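The reroute attempts described above use the cluster reroute API (`POST /_cluster/reroute`). A sketch of building the request body for one shard move (index and node names are placeholders):

```python
def move_shard(index: str, shard: int, from_node: str, to_node: str) -> dict:
    """Build a cluster-reroute body that explicitly moves one shard
    copy between nodes, mimicking what a rebalance would do."""
    return {
        "commands": [
            {"move": {
                "index": index,
                "shard": shard,
                "from_node": from_node,
                "to_node": to_node,
            }}
        ]
    }

body = move_shard("test_index", 0, "node-a", "node-b")
```

Posting this body while a scroll is open over the shard is the reproduction attempt described in the comment; it triggered a lock failure rather than the search_context_missing_exception seen in production.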

Change 520446 merged by jenkins-bot:
[operations/mediawiki-config@master] [cirrus] Increase elastic master timeout to 5m

https://gerrit.wikimedia.org/r/520446

Mentioned in SAL (#wikimedia-operations) [2019-07-08T18:06:37Z] <urbanecm@deploy1001> Synchronized wmf-config/CirrusSearch-production.php: SWAT: [[:gerrit:520446|[cirrus] Increase elastic master timeout to 5m]] (T227136) (duration: 00m 49s)

OK, process_cluster_event_timeout_exception seems to be gone now, let's see what a full reindex brings.

Reindex failed again with:

        Task: 2RzPRaQORnWFGJZERdgd9A:1072910896 Search Retries: 0 Bulk Retries: 0 Indexed: 47815109 / 57882498
Reindex task was not successfull: Failed: [{"shard":-1,"reason":{"type":"es_rejected_execution_exception","reason":"rejected execution of org.elastics
earch.common.util.concurrent.TimedRunnable@149d19f0 on QueueResizingEsThreadPoolExecutor[name = elastic1051-production-search-eqiad\/search, queue cap
acity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 2.9s, adjust
ment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@6c0d056[Running, pool size = 61, active threads = 61, que
ued tasks = 1000, completed tasks = 1736446094]]"}},{"shard":-1,"reason":{"type":"es_rejected_execution_exception","reason":"rejected execution of org
.elasticsearch.common.util.concurrent.TimedRunnable@376ac7fc on QueueResizingEsThreadPoolExecutor[name = elastic1051-production-search-eqiad\/search,
queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 876
.8ms, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@6c0d056[Running, pool size = 61, active threa
ds = 61, queued tasks = 1026, completed tasks = 1739550836]]"}},{"shard":-1,"reason":{"type":"es_rejected_execution_exception","reason":"rejected exec
ution of org.elasticsearch.common.util.concurrent.TimedRunnable@666bda88 on QueueResizingEsThreadPoolExecutor[name = elastic1051-production-search-eqi
ad\/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task executio
n EWMA = 1.3s, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@6c0d056[Running, pool size = 61, act
ive threads = 61, queued tasks = 1002, completed tasks = 1739532977]]"}},{"shard":-1,"reason":{"type":"es_rejected_execution_exception","reason":"reje
cted execution of org.elasticsearch.common.util.concurrent.TimedRunnable@1509003f on QueueResizingEsThreadPoolExecutor[name = elastic1051-production-s
earch-eqiad\/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task
 execution EWMA = 2.9s, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@6c0d056[Running, pool size
= 61, active threads = 61, queued tasks = 1000, completed tasks = 1736446094]]"}},{"shard":-1,"reason":{"type":"es_rejected_execution_exception","reas
on":"rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@1c557685 on QueueResizingEsThreadPoolExecutor[name = elastic1051-pro
duction-search-eqiad\/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate =
 1s, task execution EWMA = 90.4ms, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@6c0d056[Running,
 pool size = 61, active threads = 61, queued tasks = 1005, completed tasks = 1739514105]]"}},{"shard":-1,"reason":{"type":"es_rejected_execution_excep
tion","reason":"rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@6d7b5dbb on QueueResizingEsThreadPoolExecutor[name = elastic1051-production-search-eqiad\/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 1.1s, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@6c0d056[Running, pool size = 61, active threads = 61, queued tasks = 1005, completed tasks = 1739514106]]"}},{"shard":-1,"reason":{"type":"es_rejected_execution_exception","reason":"rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@16d8b36b on QueueResizingEsThreadPoolExecutor[name = elastic1051-production-search-eqiad\/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 1.9s, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@6c0d056[Running, pool size = 61, active threads = 61, queued tasks = 1000, completed tasks = 1736446093]]"}}]

Hmm, that is quite unexpected. Basically the error being returned says that the search queue on a single node filled up and rejected a search. Reindex is supposed to handle retries natively, and we can see from the status line that it reports 0 retries so far.

This could be related to: https://github.com/elastic/elasticsearch/issues/26153

It's not 100% the same situation, but it seems close.

Summary:

When a scroll fails on some shards and returns partial results, there's currently no way to rewind so that the reader is sure to get those missing documents. In the context of the reindex API, losing documents is clearly very incorrect behavior, so the right thing to do is fail when this happens, even though it's unfortunately very inconvenient.

One option is we can simply restart the same reindex into the same target index. Anything previously imported (and not updated since) will get a noop, which speeds the reindex task along. I don't think our scripts currently have an option to reindex into an existing index though. I'm not sure what else we can do from outside elasticsearch on this one. The summary of the bug seems to imply that even if we managed the scroll ourselves, it's the scroll itself that fails, and we can't rewind or resume a scroll.
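A sketch of what such a restart could look like as a reindex request body (`POST /_reindex`). Whether already-copied documents are actually skipped as noops depends on how versions are assigned during the copy, so treat this as an assumption rather than a confirmed plan:

```python
def resume_reindex_body(source_index: str, dest_index: str) -> dict:
    """Reindex into an existing target, proceeding past version
    conflicts so documents already present in the destination are
    skipped instead of aborting the whole task."""
    return {
        "conflicts": "proceed",
        "source": {"index": source_index},
        "dest": {"index": dest_index, "version_type": "external"},
    }
```

With `conflicts: proceed`, version conflicts on the destination are counted rather than treated as fatal, which is what lets a rerun skip over the portion that already succeeded.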

One option is we can simply restart the same reindex into the same target index.

Can we make the script do this if the reindex fails? Or is that too hard to detect? If it is, we should at least produce a failure code, I think...

I don't see why not. We might want to try and define what a retry-able error is, but even retrying something that doesn't work a few times would likely only lead to some extra log spam.
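A sketch of that retry wrapper, with a hypothetical classification of retry-able errors (the type names are the ones seen in this task; the real script would need its own judgment about which failures are transient):

```python
# Failure types observed in this task that look transient.
RETRYABLE_TYPES = {
    "search_context_missing_exception",
    "es_rejected_execution_exception",
    "process_cluster_event_timeout_exception",
}

def is_retryable(failures: list) -> bool:
    """True if there are failures and all of them look transient."""
    return bool(failures) and all(
        f.get("reason", {}).get("type") in RETRYABLE_TYPES
        for f in failures
    )

def run_with_retries(reindex_once, max_attempts: int = 3) -> bool:
    """Re-run `reindex_once` (which returns the task's failure list,
    empty on success) until it succeeds or retries are exhausted."""
    for _ in range(max_attempts):
        failures = reindex_once()
        if not failures:
            return True
        if not is_retryable(failures):
            break  # a non-transient failure: give up immediately
    return False
```

The worst case of an over-broad `RETRYABLE_TYPES` set is, as noted above, a few extra failed attempts and some log spam before the script finally reports failure.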

Mentioned in SAL (#wikimedia-operations) [2019-07-16T12:42:40Z] <dcausse> deleting stale wikidata indices (elastic@eqiad) T227136

Mentioned in SAL (#wikimedia-operations) [2019-07-16T16:28:07Z] <dcausse> reindexing wikidata (elastic@eqiad) T227136

Failed again (56850851 / 58244355) with:

Reindex task was not successfull: Failed: [{"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [803795624]"}},{"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [854883345]"}},{"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [562256487]"}},{"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [894521381]"}},{"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [825290081]"}},{"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [894521383]"}},{"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [825290085]"}},{"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [894521384]"}},{"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [854883344]"}},{"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [854883343]"}},{"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [803795626]"}},{"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [803795625]"}},{"shard":-1,"reason":{"type":"search_context_missing_exception","reason":"No search context found for id [825290078]"}}]

I'll switch everything but completion to codfw and test again.

Change 523860 had a related patch set uploaded (by DCausse; owner: DCausse):
[operations/mediawiki-config@master] [cirrus] switch search traffic (except completion) to codfw

https://gerrit.wikimedia.org/r/523860

Change 523860 merged by jenkins-bot:
[operations/mediawiki-config@master] [cirrus] switch search traffic (except completion) to codfw

https://gerrit.wikimedia.org/r/523860

Mentioned in SAL (#wikimedia-operations) [2019-07-17T11:08:29Z] <dcausse@deploy1001> Synchronized wmf-config/InitialiseSettings.php: T227136: [cirrus] switch search traffic (except completion) to codfw (duration: 00m 54s)

Mentioned in SAL (#wikimedia-operations) [2019-07-17T11:16:24Z] <dcausse> reindexing wikidata (elastic@eqiad) T227136

debt claimed this task.