Page MenuHomePhabricator

Elasticsearch cluster issue / search outage
Closed, ResolvedPublic

Description

We have had an elasticsearch issue causing problems with creating search indices for new wikis and also with searching on existing wikis since around 2023-12-25.

impact

Since 2023-12-25 newly created Wikibases will have had failing search.
Since 2023-12-27 existing Wikibases may have slow, unreliable or outdated search results

timeline

2023-12-25

first wbstackElasticSearchInit error occurs

2023-12-26

more instances occur

2023-12-27

team investigates the situation
Prometheus metric exporter is deployed
Shard limit per-node is increased to 850 from 800

2023-12-28

Attempt is made to restart failing master node
It's observed that all master nodes are now logging errors relating to slow garbage collection.

2024-01-08

Memory of masters is increased from 8->10GB and heap from 4-5GB PR
Memory is further increased to be 25% of the datanode memory PR

details

The initially noticed error

RuntimeException: wbstackElasticSearchInit call for 816 was not successful:{"wbstackElasticSearchInit":{"script":"extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php","return":1,"output":["Updating cluster default...","indexing namespaces...","\tIndexing namespaces...done","content index...","\tFetching Elasticsearch version...7.10.2...ok","\tScanning available plugins...","\t\texperimental-highlighter, extra","\tPicking analyzer...english","\tInferring index identifier...mwdb_6c0c916014_content_first","\tCreating index..."]}}:

at .App\Jobs\CirrusSearch\ElasticSearchIndexInit->handleResponse ( /var/www/html/app/Jobs/CirrusSearch/CirrusSearchJob.php:107 )
at .App\Jobs\CirrusSearch\CirrusSearchJob->handle ( /var/www/html/app/Jobs/CirrusSearch/CirrusSearchJob.php:84 )
at .Illuminate\Container\BoundMethod::Illuminate\Container\{closure} ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php:36 )
at .Illuminate\Container\Util::unwrapIfClosure ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/Util.php:40 )
at .Illuminate\Container\BoundMethod::callBoundMethod ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php:93 )
at .Illuminate\Container\BoundMethod::call ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php:37 )
at .Illuminate\Container\Container->call ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/Container.php:653 )
at .Illuminate\Bus\Dispatcher->Illuminate\Bus\{closure} ( /var/www/html/vendor/laravel/framework/src/Illuminate/Bus/Dispatcher.php:128 )
at .Illuminate\Pipeline\Pipeline->Illuminate\Pipeline\{closure} ( /var/www/html/vendor/laravel/framework/src/Illuminate/Pipeline/Pipeline.php:128 )
at .Illuminate\Pipeline\Pipeline->then ( /var/www/html/vendor/laravel/framework/src/Illuminate/Pipeline/Pipeline.php:103 )
at .Illuminate\Bus\Dispatcher->dispatchNow ( /var/www/html/vendor/laravel/framework/src/Illuminate/Bus/Dispatcher.php:132 )
at .Illuminate\Queue\CallQueuedHandler->Illuminate\Queue\{closure} ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/CallQueuedHandler.php:120 )
at .Illuminate\Pipeline\Pipeline->Illuminate\Pipeline\{closure} ( /var/www/html/vendor/laravel/framework/src/Illuminate/Pipeline/Pipeline.php:128 )
at .Illuminate\Pipeline\Pipeline->then ( /var/www/html/vendor/laravel/framework/src/Illuminate/Pipeline/Pipeline.php:103 )
at .Illuminate\Queue\CallQueuedHandler->dispatchThroughMiddleware ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/CallQueuedHandler.php:122 )
at .Illuminate\Queue\CallQueuedHandler->call ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/CallQueuedHandler.php:70 )
at .Illuminate\Queue\Jobs\Job->fire ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Jobs/Job.php:98 )
at .Illuminate\Queue\Worker->process ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:428 )
at .Illuminate\Queue\Worker->runJob ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:378 )
at .Illuminate\Queue\Worker->daemon ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:172 )
at .Illuminate\Queue\Console\WorkCommand->runWorker ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php:117 )
at .Illuminate\Queue\Console\WorkCommand->handle ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php:101 )
at .Illuminate\Container\BoundMethod::Illuminate\Container\{closure} ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php:36 )
at .Illuminate\Container\Util::unwrapIfClosure ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/Util.php:40 )
at .Illuminate\Container\BoundMethod::callBoundMethod ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php:93 )
at .Illuminate\Container\BoundMethod::call ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php:37 )
at .Illuminate\Container\Container->call ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/Container.php:653 )
at .Illuminate\Console\Command->execute ( /var/www/html/vendor/laravel/framework/src/Illuminate/Console/Command.php:136 )
at .Symfony\Component\Console\Command\Command->run ( /var/www/html/vendor/symfony/console/Command/Command.php:298 )
at .Illuminate\Console\Command->run ( /var/www/html/vendor/laravel/framework/src/Illuminate/Console/Command.php:121 )
at .Symfony\Component\Console\Application->doRunCommand ( /var/www/html/vendor/symfony/console/Application.php:1040 )
at .Symfony\Component\Console\Application->doRun ( /var/www/html/vendor/symfony/console/Application.php:301 )
at .Symfony\Component\Console\Application->run ( /var/www/html/vendor/symfony/console/Application.php:171 )
at .Illuminate\Console\Application->run ( /var/www/html/vendor/laravel/framework/src/Illuminate/Console/Application.php:94 )
at .Illuminate\Foundation\Console\Kernel->handle ( /var/www/html/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php:129 )
at .{main} ( /var/www/html/artisan:37 )
at .App\Providers\AppServiceProvider->App\Providers\{closure}
at .Illuminate\Events\Dispatcher->Illuminate\Events\{closure} ( /var/www/html/vendor/laravel/framework/src/Illuminate/Events/Dispatcher.php:404 )
at .Illuminate\Events\Dispatcher->dispatch ( /var/www/html/vendor/laravel/framework/src/Illuminate/Events/Dispatcher.php:249 )
at .Illuminate\Queue\Jobs\Job->fail ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Jobs/Job.php:195 )
at .App\Jobs\Job->fail ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/InteractsWithQueue.php:47 )
at .App\Jobs\CirrusSearch\CirrusSearchJob->validateSuccess ( /var/www/html/app/Jobs/CirrusSearch/CirrusSearchJob.php:107 )
at .App\Jobs\CirrusSearch\ElasticSearchIndexInit->handleResponse ( /var/www/html/app/Jobs/CirrusSearch/ElasticSearchIndexInit.php:26 )
at .App\Jobs\CirrusSearch\CirrusSearchJob->handle ( /var/www/html/app/Jobs/CirrusSearch/CirrusSearchJob.php:84 )
at .Illuminate\Container\BoundMethod::Illuminate\Container\{closure} ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php:36 )
at .Illuminate\Container\Util::unwrapIfClosure ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/Util.php:40 )
at .Illuminate\Container\BoundMethod::callBoundMethod ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php:93 )
at .Illuminate\Container\BoundMethod::call ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php:37 )
at .Illuminate\Container\Container->call ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/Container.php:653 )
at .Illuminate\Bus\Dispatcher->Illuminate\Bus\{closure} ( /var/www/html/vendor/laravel/framework/src/Illuminate/Bus/Dispatcher.php:128 )
at .Illuminate\Pipeline\Pipeline->Illuminate\Pipeline\{closure} ( /var/www/html/vendor/laravel/framework/src/Illuminate/Pipeline/Pipeline.php:128 )
at .Illuminate\Pipeline\Pipeline->then ( /var/www/html/vendor/laravel/framework/src/Illuminate/Pipeline/Pipeline.php:103 )
at .Illuminate\Bus\Dispatcher->dispatchNow ( /var/www/html/vendor/laravel/framework/src/Illuminate/Bus/Dispatcher.php:132 )
at .Illuminate\Queue\CallQueuedHandler->Illuminate\Queue\{closure} ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/CallQueuedHandler.php:120 )
at .Illuminate\Pipeline\Pipeline->Illuminate\Pipeline\{closure} ( /var/www/html/vendor/laravel/framework/src/Illuminate/Pipeline/Pipeline.php:128 )
at .Illuminate\Pipeline\Pipeline->then ( /var/www/html/vendor/laravel/framework/src/Illuminate/Pipeline/Pipeline.php:103 )
at .Illuminate\Queue\CallQueuedHandler->dispatchThroughMiddleware ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/CallQueuedHandler.php:122 )
at .Illuminate\Queue\CallQueuedHandler->call ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/CallQueuedHandler.php:70 )
at .Illuminate\Queue\Jobs\Job->fire ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Jobs/Job.php:98 )
at .Illuminate\Queue\Worker->process ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:428 )
at .Illuminate\Queue\Worker->runJob ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:378 )
at .Illuminate\Queue\Worker->daemon ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:172 )
at .Illuminate\Queue\Console\WorkCommand->runWorker ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php:117 )
at .Illuminate\Queue\Console\WorkCommand->handle ( /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php:101 )
at .Illuminate\Container\BoundMethod::Illuminate\Container\{closure} ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php:36 )
at .Illuminate\Container\Util::unwrapIfClosure ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/Util.php:40 )
at .Illuminate\Container\BoundMethod::callBoundMethod ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php:93 )
at .Illuminate\Container\BoundMethod::call ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php:37 )
at .Illuminate\Container\Container->call ( /var/www/html/vendor/laravel/framework/src/Illuminate/Container/Container.php:653 )
at .Illuminate\Console\Command->execute ( /var/www/html/vendor/laravel/framework/src/Illuminate/Console/Command.php:136 )
at .Symfony\Component\Console\Command\Command->run ( /var/www/html/vendor/symfony/console/Command/Command.php:298 )
at .Illuminate\Console\Command->run ( /var/www/html/vendor/laravel/framework/src/Illuminate/Console/Command.php:121 )
at .Symfony\Component\Console\Application->doRunCommand ( /var/www/html/vendor/symfony/console/Application.php:1040 )
at .Symfony\Component\Console\Application->doRun ( /var/www/html/vendor/symfony/console/Application.php:301 )
at .Symfony\Component\Console\Application->run ( /var/www/html/vendor/symfony/console/Application.php:171 )
at .Illuminate\Console\Application->run ( /var/www/html/vendor/laravel/framework/src/Illuminate/Console/Application.php:94 )
at .Illuminate\Foundation\Console\Kernel->handle ( /var/www/html/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php:129 )
at .{main} ( /var/www/html/artisan:37 )

The first meaningful errors seen in the elasticsearch logs were:
on the second master elasticsearch-1-master-1 there is flurry off slow cluster state computing and the corresponding flapping in health.

[2023-12-24T06:37:10,628][INFO ][o.e.c.m.MetadataCreateIndexService] [elasticsearch-1-master-1] [mwdb_e94e6a3fa5_content_first] creating index, cause [api], templates [mwdb], shards [1]/[1]                                      [0/1994]
[2023-12-24T06:37:10,679][WARN ][o.e.c.s.MasterService    ] [elasticsearch-1-master-1] took [13.4s], which is over [10s], to compute cluster state update for [create-index [mwdb_e94e6a3fa5_content_first], cause [api]]                  
[2023-12-24T06:37:25,056][INFO ][o.e.c.m.MetadataMappingService] [elasticsearch-1-master-1] [mwdb_e94e6a3fa5_content_first/7q_O3tHqTsOSEUpMKUA9ag] create_mapping [_doc]                                                                   
[2023-12-24T06:37:25,202][INFO ][o.e.c.r.a.AllocationService] [elasticsearch-1-master-1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[mwdb_e94e6a3fa5_content_first][0]]]).                           
[2023-12-24T06:37:45,299][INFO ][o.e.c.m.MetadataCreateIndexService] [elasticsearch-1-master-1] [mwdb_e94e6a3fa5_general_first] creating index, cause [api], templates [mwdb], shards [1]/[1]                                              
[2023-12-24T06:37:45,341][WARN ][o.e.c.s.MasterService    ] [elasticsearch-1-master-1] took [13.2s], which is over [10s], to compute cluster state update for [create-index [mwdb_e94e6a3fa5_general_first], cause [api]]                  
[2023-12-24T06:37:59,459][INFO ][o.e.c.m.MetadataMappingService] [elasticsearch-1-master-1] [mwdb_e94e6a3fa5_general_first/8Knb2WMMRkyUd__AxQ6llA] create_mapping [_doc]                                                                   
[2023-12-24T06:37:59,604][INFO ][o.e.c.r.a.AllocationService] [elasticsearch-1-master-1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[mwdb_e94e6a3fa5_general_first][0]]]).                           
[2023-12-24T23:34:20,031][INFO ][o.e.c.m.MetadataCreateIndexService] [elasticsearch-1-master-1] [mwdb_cf06321a6a_content_first] creating index, cause [api], templates [mwdb], shards [1]/[1]                                              
[2023-12-24T23:34:20,082][WARN ][o.e.c.s.MasterService    ] [elasticsearch-1-master-1] took [13.8s], which is over [10s], to compute cluster state update for [create-index [mwdb_cf06321a6a_content_first], cause [api]]                  
[2023-12-24T23:34:37,149][INFO ][o.e.c.m.MetadataMappingService] [elasticsearch-1-master-1] [mwdb_cf06321a6a_content_first/zEVVMI2ZQoC9j5DpUzXwLg] create_mapping [_doc]                                                                   
[2023-12-24T23:34:37,294][INFO ][o.e.c.r.a.AllocationService] [elasticsearch-1-master-1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[mwdb_cf06321a6a_content_first][0]]]).                           
[2023-12-24T23:34:57,862][INFO ][o.e.c.m.MetadataCreateIndexService] [elasticsearch-1-master-1] [mwdb_cf06321a6a_general_first] creating index, cause [api], templates [mwdb], shards [1]/[1]                                              
[2023-12-24T23:34:57,905][WARN ][o.e.c.s.MasterService    ] [elasticsearch-1-master-1] took [13.6s], which is over [10s], to compute cluster state update for [create-index [mwdb_cf06321a6a_general_first], cause [api]]                  
[2023-12-24T23:35:12,614][INFO ][o.e.c.m.MetadataMappingService] [elasticsearch-1-master-1] [mwdb_cf06321a6a_general_first/ScAe44OHTjqT3kzJoTwvRQ] create_mapping [_doc]                                                                   
[2023-12-24T23:35:12,766][INFO ][o.e.c.r.a.AllocationService] [elasticsearch-1-master-1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[mwdb_cf06321a6a_general_first][0]]]).                           
[2023-12-25T22:22:15,885][WARN ][o.e.t.InboundHandler     ] [elasticsearch-1-master-1] handling inbound transport message [InboundMessage{Header{268644}{7.10.2}{4883636}{true}{false}{false}{false}{indices:admin/create}}] took [6803ms] 
which is above the warn threshold of [5000ms]                                                                                                                                                                                              
[2023-12-26T11:04:50,542][WARN ][o.e.t.InboundHandler     ] [elasticsearch-1-master-1] handling inbound transport message [InboundMessage{Header{268644}{7.10.2}{5425478}{true}{false}{false}{false}{indices:admin/create}}] took [6603ms] 
which is above the warn threshold of [5000ms]                                                                                                                                                                                              
[2023-12-26T17:06:54,513][WARN ][o.e.t.InboundHandler     ] [elasticsearch-1-master-1] handling inbound transport message [InboundMessage{Header{268644}{7.10.2}{4954606}{true}{false}{false}{false}{indices:admin/create}}] took [6602ms] 
which is above the warn threshold of [5000ms]

Event Timeline

This error was previously grouped and caused by T350404

After trying to dig through the Prometheus I found some confusing data, for example that our shard count was around 214.

We appear to be lacking Prometheus backed metrics for elasticsearch on wbaas-3 (they're only there on wbaas-2). I'll now look at fixing this

https://github.com/wmde/wbaas-deploy/pull/1336 - given I am the only one here I shall go ahead and merge this.

Trying to get this deployed and working required tweaking the limits somewhat. Something very noticeable is that the lack of cpu seems to have caused excess memory consumption:

image.png (323×1 px, 31 KB)

notice how memory kept growing until being killed until cpu was increased. Then memory stayed around the old limit

Well, looking at the nice metrics it's pretty obvious we're approaching shards per node exhaustion (same issue as T350404). With a limit of 800 shards per node and apparently a 2399 shard count this is probably the issue. Its surprising to me that we don't have an error reflecting this in the logs but instead just have more timeout like messages

Peaking around 27GB of heap usage we have a little head room. I shall increase the shard per node count to 850 to buy us a little time and we can then think about this more carefully when more people are around

Ran echo '{"persistent":{"cluster.max_shards_per_node": "850" }}' | http put localhost:9200/_cluster/settings after forwarding the remote master to a local port using kubectl port-forward pod/elasticsearch-1-master-0 9200:9200

This doesn't appear to help and neither did crudely restarting the misbehaving ES node with delete pod

sadly the "spending ages running GC" is back on on all nodes; given nothing is properly borked I'm not going to do any heavy engineering until more people are around. There would be no harm in looking at this Friday or Monday while I'm off though

Tarrow moved this task from Doing to To do on the Wikibase Cloud (Kanban board Q4 2023) board.
Tarrow renamed this task from wbstackElasticSearchInit is failing to Elasticsearch cluster issue / search outage.Jan 8 2024, 3:35 PM
Tarrow claimed this task.
Tarrow updated Other Assignee, added: Andrew-WMDE.
Tarrow updated the task description. (Show Details)
Tarrow updated the task description. (Show Details)

Recap: We hit our shard limit set in T350404#9340256 on 25th of December. This limit was then bumped to 850 as we still had enough heap left on our data nodes to accommodate these extra shards. Then on the 28th the master nodes started to OOM.

While Elasticsearch's official documentation doesn't provide much on memory requirements for master-only nodes, allocating 25% of the memory for data nodes is a reasonable number floating around. The cluster is healthy again after following this recommendation.

  • Bump up memory for master-nodes
  • Create initial indices for new Wikis that failed since Dec. 24th
  • Force re-index all existing Wikis from Dec. 24th onwards
Tarrow updated Other Assignee, removed: Andrew-WMDE.

Steps taken to stabilize the cluster:

  • Added an additional Elasticsearch data node on production
  • Reduced max shards per node back down to 800
  • Increased startup probe timeout to 6hrs (it now takes slightly more than 4hrs for a data node to fully rejoin)
  • Rebalanced all shards

Steps to restore search data integrity:

  • Created initial indices for new Wikis that failed since Dec. 24th
  • Force re-indexed all Wikis from Dec. 24th onwards

wmde/wbaas-deploy/pull/1350 | chore(tf): increase number of production nodes to 4
wmde/wbaas-deploy/pull/1351 | chore(elasticsearch): add 4th data node on production
wmde/wbaas-deploy/pull/1353 | chore(elasticsearch): set startup probe timeout to 6hrs

Elasticsearch shards can be rebalanced using:

POST /_cluster/reroute
Tarrow claimed this task.