
Reboot (restart) Elasticsearch nodes
Closed, Resolved · Public · 5 Estimated Story Points

Description

We need to do an Elasticsearch restart to unblock plugin upgrades, among a few other things. Restarts are currently blocked because the Prometheus query that verifies all relevant write queues are empty is out of date, and is therefore failing the restart procedure.

There are also some Java 8 security updates that have been rolled out and require a full reboot to take effect.

AC:

  • Spicerack & cookbooks repo updated to unblock restarts by fixing the broken prometheus query
  • Full reboot performed on all relevant Elasticsearch nodes (relforge, cloudelastic, eqiad, codfw)
    • relforge
    • cloudelastic
    • eqiad
    • codfw

DEPLOY NOTES:

We'll deploy in order of ascending criticality, giving us this:

relforge->cloudelastic->codfw->eqiad

We'll want to use relforge to verify that the new Java version doesn't cause any problems before proceeding (see the sketch below).
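Once relforge is back up, a quick spot check of the running Java version could look roughly like this. This is a rough sketch reusing the spicerack API from the interactive sessions further down; the 'relforge*' host pattern and the remote/run_sync usage are illustrative assumptions, not the actual cookbook:

from spicerack import Spicerack

s = Spicerack(verbose=True, dry_run=False)
relforge = s.remote().query('relforge*')  # hypothetical cumin host pattern
relforge.run_sync('java -version')        # confirm the upgraded JVM is what's actually running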

Status

Done with all restarts

Event Timeline

Change 619781 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/software/spicerack@master] elasticsearch: verify all write queues are empty

https://gerrit.wikimedia.org/r/619781

Change 603731 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/cookbooks@master] elasticsearch: Let spicerack handle wait for all write queues to clear

https://gerrit.wikimedia.org/r/603731

RKemper triaged this task as High priority. Aug 26 2020, 4:33 PM
RKemper updated the task description.
RKemper set the point value for this task to 5. Aug 26 2020, 5:19 PM

Change 619781 merged by Gehel:
[operations/software/spicerack@master] elasticsearch: verify all write queues are empty

https://gerrit.wikimedia.org/r/619781

sudo -i python3
import logging
logging.basicConfig(level=logging.DEBUG)
from spicerack import Spicerack
s = Spicerack(verbose=True, dry_run=True)       # dry-run instance for testing
sreal = Spicerack(verbose=True, dry_run=False)  # real instance, for when we're ready to act


CLUSTERGROUPS = ('search_eqiad', 'search_codfw', 'relforge')

ecs = s.elasticsearch_clusters(CLUSTERGROUPS[0])
ecs.wait_for_all_write_queues_empty()

The above is what we need to run on cumin1001 (or similar) to test the new functionality, but the new method wasn't present there, so I may need to do something to pull in the latest version of Spicerack. Will figure it out on Monday.

(Forgot to press submit on this comment a couple days ago)

Realized the issue above was that I needed to run sudo apt-get install --only-upgrade spicerack on cumin2001 to get access to the new function.

import logging
logging.basicConfig(level=logging.DEBUG)
from spicerack import Spicerack
s = Spicerack(verbose=True, dry_run=True)
ecs = s.elasticsearch_clusters('search_codfw')
ecs.wait_for_all_write_queues_empty()

WARNING:spicerack.decorators:Reduce tries from 60 to 1 in DRY-RUN mode
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): prometheus.svc.eqiad.wmnet:80
DEBUG:urllib3.connectionpool:http://prometheus.svc.eqiad.wmnet:80 "GET /ops/api/v1/query?query=kafka_burrow_partition_lag%7B%7B++++group%3D%22cpjobqueue-cirrusSearchElasticaWrite%22%2C++++topic%3D~%22%5B%5B%3Aalpha%3A%5D%5D%2A.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite%22%7D%7D HTTP/1.1" 400 114
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/spicerack/decorators.py", line 103, in wrapper
    return func(*args, **kwargs)  # type: ignore
  File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 337, in wait_for_all_write_queues_empty
    results = self._prometheus.query(query, dc)
  File "/usr/lib/python3/dist-packages/spicerack/prometheus.py", line 54, in query
    raise PrometheusError(msg)
spicerack.prometheus.PrometheusError: Unable to get metric: HTTP 400: {"status":"error","errorType":"bad_data","error":"parse error at char 28: unexpected left brace '{'"}

I think the issue might be that a given cumin host can only talk to the Prometheus instance in its own DC, as opposed to being able to query any DC. Will figure out if that's the case on Monday.

(Note that if I'd used cumin1001, the first query would probably have worked, but the subsequent one against codfw would have failed.)

EDIT: Actually, at second glance, since it's a 400 response complaining about a parse error, it might have nothing to do with DCs and instead be a problem with the query construction itself.

Here's how we build the query:

>>> query = ('kafka_burrow_partition_lag{{'
...                      '    group="cpjobqueue-cirrusSearchElasticaWrite",'
...                      '    topic=~"[[:alpha:]]*.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite"'
...                      '}}')
>>> query
'kafka_burrow_partition_lag{{    group="cpjobqueue-cirrusSearchElasticaWrite",    topic=~"[[:alpha:]]*.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite"}}'

So the query actually sent to Prometheus is kafka_burrow_partition_lag{{ group="cpjobqueue-cirrusSearchElasticaWrite", topic=~"[[:alpha:]]*.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite"}}; the doubled braces reach Prometheus verbatim, which is exactly the "unexpected left brace" it complains about.
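The doubled braces are str.format() escapes: '{{' only collapses to '{' when the string is passed through .format(), which never happens here, so Prometheus receives the braces doubled and fails to parse the query. A minimal sketch of the corrected construction, matching the single-brace query that succeeds in the session further down (assuming the string is meant to be used as-is rather than as a format template):

query = ('kafka_burrow_partition_lag{'
         '    group="cpjobqueue-cirrusSearchElasticaWrite",'
         '    topic=~"[[:alpha:]]*.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite"'
         '}')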

ryankemper@cumin2001:~$ sudo -i python3
Python 3.7.3 (default, Dec 20 2019, 18:57:59)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import logging
>>> logging.basicConfig(level=logging.DEBUG)
>>> from spicerack import Spicerack
>>> s = Spicerack(verbose=True, dry_run=True)
>>>
>>>
>>> CLUSTERGROUPS = ('search_eqiad', 'search_codfw', 'relforge')
>>>
>>> ecs = s.elasticsearch_clusters(CLUSTERGROUPS[1])
>>> ecs.wait_for_all_write_queues_empty()
WARNING:spicerack.decorators:Reduce tries from 60 to 1 in DRY-RUN mode
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): prometheus.svc.eqiad.wmnet:80
DEBUG:urllib3.connectionpool:http://prometheus.svc.eqiad.wmnet:80 "GET /ops/api/v1/query?query=kafka_burrow_partition_lag%7B++++group%3D%22cpjobqueue-cirrusSearchElasticaWrite%22%2C++++topic%3D~%22%5B%5B%3Aalpha%3A%5D%5D%2A.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite%22%7D HTTP/1.1" 200 303
DEBUG:spicerack.elasticsearch_cluster:Prom query kafka_burrow_partition_lag{    group="cpjobqueue-cirrusSearchElasticaWrite",    topic=~"[[:alpha:]]*.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite"} returned queue_results of [('eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite', '0', 4699), ('eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite', '1', 4640), ('eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite', '2', 5993)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/spicerack/decorators.py", line 103, in wrapper
    return func(*args, **kwargs)  # type: ignore
  File "/usr/lib/python3/dist-packages/spicerack/elasticsearch_cluster.py", line 355, in wait_for_all_write_queues_empty
    "of topic {}.".format(queue_size, partition, topic))
spicerack.elasticsearch_cluster.ElasticsearchClusterCheckError: Write queue not empty (had value of 4699) for partition 0of topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite.

The "happily unhappy path" (prometheus query works but queues aren't empty) is behaving as expected.

I'll get the cookbooks repo change merged, and then I think we're ready to start the relforge restarts tomorrow.
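For reference, here's roughly what the per-cluster pre-restart check looks like when looped over all of the cluster groups from the sessions above. This is a sketch of the manual verification, not the cookbook code itself; outside DRY-RUN mode the retry decorator polls up to 60 times (per the "Reduce tries from 60 to 1" log line) before giving up:

import logging
from spicerack import Spicerack

logging.basicConfig(level=logging.DEBUG)
s = Spicerack(verbose=True, dry_run=True)  # dry_run=False for a real check

for clustergroup in ('search_eqiad', 'search_codfw', 'relforge'):
    ecs = s.elasticsearch_clusters(clustergroup)
    # Raises ElasticsearchClusterCheckError if any write queue is still
    # non-empty once retries are exhausted.
    ecs.wait_for_all_write_queues_empty()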

Change 626240 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/software/spicerack@master] elasticsearch: new --write-queue-datacenter flag

https://gerrit.wikimedia.org/r/626240

Change 626240 merged by Gehel:
[operations/software/spicerack@master] elasticsearch: Store which dcs to query in class

https://gerrit.wikimedia.org/r/626240

Change 603731 merged by Ryan Kemper:
[operations/cookbooks@master] elasticsearch: Let spicerack handle wait for all write queues to clear

https://gerrit.wikimedia.org/r/603731

Mentioned in SAL (#wikimedia-operations) [2021-04-30T04:42:17Z] <ryankemper> T261239 elastic2033, which is known to be in a state of hardware failure (we have a ticket open), is holding up the reboot of codfw. I don't think we have a good way to exclude a node currently. Going to just proceed to eqiad for now