
Restart cirrus elasticsearch servers for java upgrade
Closed, Resolved (Public)

Description

  • relforge
  • cloudelastic
  • eqiad
  • codfw

Event Timeline

Note: Relforge was restarted yesterday

Mentioned in SAL (#wikimedia-operations) [2020-11-25T17:05:43Z] <ryankemper> T268770 Begin rolling restart of eqiad cirrus elasticsearch, 3 nodes at a time

Mentioned in SAL (#wikimedia-operations) [2020-11-25T17:09:06Z] <ryankemper> T268770 [cloudelastic] Downtimed cloudelastic100[1-6] in icinga in preparation for cloudelastic search elasticsearch cluster restart

Mentioned in SAL (#wikimedia-operations) [2020-11-25T17:22:33Z] <ryankemper> T268770 Freezing writes to cloudelastic in preparation for restarts: /usr/local/bin/mwscript extensions/CirrusSearch/maintenance/FreezeWritesToCluster.php --wiki=enwiki --cluster=cloudelastic on mwmaint1002

Mentioned in SAL (#wikimedia-operations) [2020-11-25T17:28:48Z] <ryankemper> T268770 [cloudelastic] restarts on cloudelastic1001 complete and all 3 elasticsearch clusters are green, proceeding to next instance

Mentioned in SAL (#wikimedia-operations) [2020-11-25T17:39:13Z] <ryankemper> T268770 [cloudelastic] restarts on cloudelastic1002 complete and all 3 elasticsearch clusters are green, proceeding to next instance

Mentioned in SAL (#wikimedia-operations) [2020-11-25T17:39:23Z] <ryankemper> T268770 [cloudelastic] restarts on cloudelastic1003 complete and all 3 elasticsearch clusters are green, proceeding to next instance

Mentioned in SAL (#wikimedia-operations) [2020-11-25T17:44:27Z] <ryankemper> T268770 [cloudelastic] restarts on cloudelastic1004 complete and all 3 elasticsearch clusters are green, proceeding to next instance

Mentioned in SAL (#wikimedia-operations) [2020-11-25T17:49:42Z] <ryankemper> T268770 [cloudelastic] restarts on cloudelastic1005 complete and all 3 elasticsearch clusters are green, proceeding to next instance

Mentioned in SAL (#wikimedia-operations) [2020-11-25T17:55:47Z] <ryankemper> T268770 [cloudelastic] restarts on cloudelastic1006 complete and all 3 elasticsearch clusters are green, all cloudelastic instances are now complete

Mentioned in SAL (#wikimedia-operations) [2020-11-25T17:58:00Z] <ryankemper> T268770 [cloudelastic] restarts complete, service is healthy. This is done.

RKemper triaged this task as High priority. (Nov 25 2020, 6:04 PM)
RKemper updated the task description.
RKemper moved this task from Incoming to In Progress on the Discovery-Search (Current work) board.

Mentioned in SAL (#wikimedia-operations) [2020-11-25T18:05:11Z] <ryankemper> T268770 [cloudelastic] Thawed writes to cloudelastic cluster following restarts: /usr/local/bin/mwscript extensions/CirrusSearch/maintenance/FreezeWritesToCluster.php --wiki=enwiki --cluster=cloudelastic --thaw on mwmaint1002

Enabling Puppet with reason "eqiad cluster restart - ryankemper@cumin1001 - T268770" on 3 hosts: elastic[1033,1044,1054].eqiad.wmnet
Wait for green in search_eqiad before fetching next set of nodes
waiting for clusters to be green
Allow time to consume write queue
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [1/60, retrying in 60.00s]: Write queue not empty (had value of 29885) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [2/60, retrying in 60.00s]: Write queue not empty (had value of 43562) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [3/60, retrying in 60.00s]: Write queue not empty (had value of 61351) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [4/60, retrying in 60.00s]: Write queue not empty (had value of 71522) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [5/60, retrying in 60.00s]: Write queue not empty (had value of 93973) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [6/60, retrying in 60.00s]: Write queue not empty (had value of 92583) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [7/60, retrying in 60.00s]: Write queue not empty (had value of 85654) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [8/60, retrying in 60.00s]: Write queue not empty (had value of 89902) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [9/60, retrying in 60.00s]: Write queue not empty (had value of 99068) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [10/60, retrying in 60.00s]: Write queue not empty (had value of 118867) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [11/60, retrying in 60.00s]: Write queue not empty (had value of 140057) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [12/60, retrying in 60.00s]: Write queue not empty (had value of 134659) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [13/60, retrying in 60.00s]: Write queue not empty (had value of 133429) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [14/60, retrying in 60.00s]: Write queue not empty (had value of 115158) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [15/60, retrying in 60.00s]: Write queue not empty (had value of 122665) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [16/60, retrying in 60.00s]: Write queue not empty (had value of 144766) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [17/60, retrying in 60.00s]: Write queue not empty (had value of 154368) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [18/60, retrying in 60.00s]: Write queue not empty (had value of 149773) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [19/60, retrying in 60.00s]: Write queue not empty (had value of 157820) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [20/60, retrying in 60.00s]: Write queue not empty (had value of 176395) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [21/60, retrying in 60.00s]: Write queue not empty (had value of 189585) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [22/60, retrying in 60.00s]: Write queue not empty (had value of 190471) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [23/60, retrying in 60.00s]: Write queue not empty (had value of 182710) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [24/60, retrying in 60.00s]: Write queue not empty (had value of 187793) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [25/60, retrying in 60.00s]: Write queue not empty (had value of 191431) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [26/60, retrying in 60.00s]: Write queue not empty (had value of 194832) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [27/60, retrying in 60.00s]: Write queue not empty (had value of 198215) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [28/60, retrying in 60.00s]: Write queue not empty (had value of 195794) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [29/60, retrying in 60.00s]: Write queue not empty (had value of 202599) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [30/60, retrying in 60.00s]: Write queue not empty (had value of 209374) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [31/60, retrying in 60.00s]: Write queue not empty (had value of 206416) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [32/60, retrying in 60.00s]: Write queue not empty (had value of 203025) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [33/60, retrying in 60.00s]: Write queue not empty (had value of 203233) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [34/60, retrying in 60.00s]: Write queue not empty (had value of 208650) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [35/60, retrying in 60.00s]: Write queue not empty (had value of 191199) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [36/60, retrying in 60.00s]: Write queue not empty (had value of 170560) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [37/60, retrying in 60.00s]: Write queue not empty (had value of 150363) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [38/60, retrying in 60.00s]: Write queue not empty (had value of 124900) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [39/60, retrying in 60.00s]: Write queue not empty (had value of 99332) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [40/60, retrying in 60.00s]: Write queue not empty (had value of 73467) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [41/60, retrying in 60.00s]: Write queue not empty (had value of 60503) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [42/60, retrying in 60.00s]: Write queue not empty (had value of 50072) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [43/60, retrying in 60.00s]: Write queue not empty (had value of 36030) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [44/60, retrying in 60.00s]: Write queue not empty (had value of 30911) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [45/60, retrying in 60.00s]: Write queue not empty (had value of 46079) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [46/60, retrying in 60.00s]: Write queue not empty (had value of 61242) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [47/60, retrying in 60.00s]: Write queue not empty (had value of 77687) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [48/60, retrying in 60.00s]: Write queue not empty (had value of 96813) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [49/60, retrying in 60.00s]: Write queue not empty (had value of 109100) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [50/60, retrying in 60.00s]: Write queue not empty (had value of 121914) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [51/60, retrying in 60.00s]: Write queue not empty (had value of 120279) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [52/60, retrying in 60.00s]: Write queue not empty (had value of 118607) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [53/60, retrying in 60.00s]: Write queue not empty (had value of 114313) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [54/60, retrying in 60.00s]: Write queue not empty (had value of 107770) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [55/60, retrying in 60.00s]: Write queue not empty (had value of 100950) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [56/60, retrying in 60.00s]: Write queue not empty (had value of 95492) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [57/60, retrying in 60.00s]: Write queue not empty (had value of 90017) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [58/60, retrying in 60.00s]: Write queue not empty (had value of 86316) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [59/60, retrying in 60.00s]: Write queue not empty (had value of 79347) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
Exception raised while executing cookbook sre.elasticsearch.rolling-restart:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/cookbook.py", line 410, in _run
    ret = self.module.run(args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/elasticsearch/rolling-restart.py", line 30, in run
    restart_elasticsearch
  File "/srv/deployment/spicerack/cookbooks/sre/elasticsearch/__init__.py", line 123, in execute_on_clusters
    elasticsearch_clusters.wait_for_all_write_queues_empty()
  File "/usr/lib/python3/dist-packages/spicerack/decorators.py", line 103, in wrapper
    return func(*args, **kwargs)  # type: ignore
  File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 357, in wait_for_all_write_queues_empty
    "of topic {}.".format(queue_size, partition, topic))
spicerack.elasticsearch_cluster.ElasticsearchClusterCheckError: Write queue not empty (had value of 69928) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.
END (FAIL) - Cookbook sre.elasticsearch.rolling-restart (exit_code=99)
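The output above shows a fixed-delay polling loop: 59 retry messages 60 seconds apart, then a hard failure on the 60th check. A minimal sketch of that pattern follows. It is not spicerack's actual implementation; `wait_for_empty_queue`, `WriteQueueNotEmpty`, and the injected `get_queue_size` and `sleep` callables are hypothetical names chosen for illustration.

```python
import time
from typing import Callable


class WriteQueueNotEmpty(Exception):
    """Raised when the queue still holds messages on the final attempt (hypothetical)."""


def wait_for_empty_queue(
    get_queue_size: Callable[[], int],
    tries: int = 60,
    delay_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll get_queue_size until it returns 0 and return the number of polls used.

    Raises WriteQueueNotEmpty if the queue is still non-empty on the last try.
    """
    if tries < 1:
        raise ValueError("tries must be at least 1")
    for attempt in range(1, tries + 1):
        size = get_queue_size()
        if size == 0:
            return attempt
        if attempt == tries:
            # Final attempt: give up instead of sleeping again.
            raise WriteQueueNotEmpty(
                f"Write queue not empty (had value of {size})"
            )
        print(
            f"[{attempt}/{tries}, retrying in {delay_seconds:.2f}s]: "
            f"Write queue not empty (had value of {size})"
        )
        sleep(delay_seconds)
    raise AssertionError("unreachable")  # the loop always returns or raises
```

With the defaults shown, a queue that never drains fails only after 59 sleeps of 60 seconds each, which is consistent with the roughly one-hour wait seen in the log before the cookbook exited.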

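The eqiad restart just failed: the cookbook gave up after 60 attempts because the cirrusSearchElasticaWrite write queue never emptied. I'm guessing this is just a consequence of performing the rolling restart during peak load. Will dig in a bit to confirm.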
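One quick way to look at the failure is to pull the queue sizes out of the retry messages and see whether the backlog was draining or growing. The sketch below is only an illustration: the regular expression is written against the message format seen in the cookbook output on this task, and `queue_samples` and `summarize` are hypothetical helper names. It shows how the backlog moved; it does not by itself confirm the peak-load explanation.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Matches the "[attempt/total, retrying in Ns]: ... (had value of SIZE)" retry messages.
RETRY_LINE = re.compile(
    r"\[(\d+)/(\d+), retrying in [\d.]+s\]: "
    r"Write queue not empty \(had value of (\d+)\)"
)


def queue_samples(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Return (attempt, queue_size) pairs for every retry message found."""
    samples = []
    for line in lines:
        match = RETRY_LINE.search(line)
        if match:
            samples.append((int(match.group(1)), int(match.group(3))))
    return samples


def summarize(samples: List[Tuple[int, int]]) -> Dict[str, int]:
    """Report the first, peak, and last queue sizes, plus the attempt at the peak."""
    peak_attempt, peak_size = max(samples, key=lambda sample: sample[1])
    return {
        "first": samples[0][1],
        "peak": peak_size,
        "peak_attempt": peak_attempt,
        "last": samples[-1][1],
    }


# Three retry messages copied from the cookbook output (attempts 1, 30 and 59).
SAMPLE_LINES = [
    "Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [1/60, retrying in 60.00s]: Write queue not empty (had value of 29885) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.",
    "Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [30/60, retrying in 60.00s]: Write queue not empty (had value of 209374) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.",
    "Failed to call 'spicerack.elasticsearch_cluster.wait_for_all_write_queues_empty' [59/60, retrying in 60.00s]: Write queue not empty (had value of 79347) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.",
]

if __name__ == "__main__":
    print(summarize(queue_samples(SAMPLE_LINES)))
```

For these three lines the queue starts at 29885, reaches 209374 at attempt 30, and is still at 79347 at attempt 59, so the backlog grew for the first half hour of the wait rather than draining.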

Since we're in peak(ish) usage hours right now, we're going to do codfw now and then eqiad later tonight.

Mentioned in SAL (#wikimedia-operations) [2020-11-26T04:24:11Z] <ryankemper> T268770 [eqiad] Begin rolling restart of cirrus eqiad, 3 nodes at a time

Mentioned in SAL (#wikimedia-operations) [2020-11-26T06:08:49Z] <ryankemper> T268770 [eqiad] Finished rolling restart of cirrus eqiad. All cirrus elasticsearch restarts are now complete (cloudelastic, relforge, eqiad, codfw)

(Please add project tags as project tags instead of subscribers - thanks!)