Service implementation for relforge100[34]
Open, High · Public · 3 Estimated Story Points

Description

Now that T241791 is fixed, we need to configure those servers. See T241791 for details.

  • relforge100[3,4] moved from staged to active once the service implementation is done

Event Timeline

Gehel created this task. Sep 7 2020, 3:04 PM
Restricted Application added a subscriber: Aklapper. Sep 7 2020, 3:04 PM
Gehel triaged this task as High priority. Sep 7 2020, 3:17 PM
Gehel moved this task from needs triage to Ops / SRE on the Discovery-Search board.
RKemper updated the task description. Oct 5 2020, 5:46 PM
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
CBogen set the point value for this task to 3. Nov 16 2020, 6:33 PM
RKemper claimed this task. Jan 15 2021, 7:44 AM

Change 656369 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] T262211: bring "new" relforge hosts into service

https://gerrit.wikimedia.org/r/656369

Change 656369 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] search: bring "new" relforge hosts into service

https://gerrit.wikimedia.org/r/656369

Change 661229 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] relforge: service impl of relforge100[3,4]

https://gerrit.wikimedia.org/r/661229

Change 656369 abandoned by Ryan Kemper:
[operations/puppet@production] search: bring "new" relforge hosts into service

Reason:
https://gerrit.wikimedia.org/r/c/operations/puppet/+/661229 will be the new patch to commission relforge100[3,4]

https://gerrit.wikimedia.org/r/656369

Change 661229 merged by Ryan Kemper:
[operations/puppet@production] relforge: service impl of relforge100[3,4]

https://gerrit.wikimedia.org/r/661229

Mentioned in SAL (#wikimedia-operations) [2021-02-09T19:19:53Z] <ryankemper> T262211 Attempting to bring relforge100[3,4] into service; merging https://gerrit.wikimedia.org/r/661229

Mentioned in SAL (#wikimedia-operations) [2021-02-09T19:21:14Z] <ryankemper> T262211 sudo cumin 'P{relforge*}' 'sudo run-puppet-agent' on ryankemper@cumin1001

Prometheus exporters are having trouble:

ryankemper@relforge1003:~$ sudo systemctl list-units --failed
  UNIT                                               LOAD   ACTIVE SUB    DESCRIPTION
● prometheus-wmf-elasticsearch-exporter-9200.service loaded failed failed Prometheus WMF Elasticsearch Exporter
● prometheus-wmf-elasticsearch-exporter-9400.service loaded failed failed Prometheus WMF Elasticsearch Exporter

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.
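
(For reference, the log lines below can be pulled straight from the journal; this is a standard journalctl invocation, shown here for reproducibility:)

sudo journalctl -u prometheus-wmf-elasticsearch-exporter-9200.service -n 50 --no-pager   # same approach works for the 9400 unit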

Here are some recent log lines:

Feb 09 19:22:16 relforge1003 systemd[1]: Stopped Prometheus WMF Elasticsearch Exporter.
Feb 09 19:22:16 relforge1003 systemd[1]: Started Prometheus WMF Elasticsearch Exporter.
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]: ERROR:__main__:Encountered HTTPConnectionPool(host='localhost', port=9200): Max retries exceeded with url: /_nodes/latencyStats (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f4724185860>: Failed to establish a new connection: [Errno 111
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]: Traceback (most recent call last):
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 138, in _new_conn
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     (self.host, self.port), self.timeout, **extra_kw)
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 98, in create_connection
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     raise err
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 88, in create_connection
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     sock.connect(sa)
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]: ConnectionRefusedError: [Errno 111] Connection refused
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]: During handling of the above exception, another exception occurred:
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]: Traceback (most recent call last):
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 594, in urlopen
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     chunked=chunked)
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 361, in _make_request
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     conn.request(method, url, **httplib_request_kw)
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3.5/http/client.py", line 1161, in request
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     self._send_request(method, url, body, headers)
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3.5/http/client.py", line 1206, in _send_request
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     self.endheaders(body)
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3.5/http/client.py", line 1157, in endheaders
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     self._send_output(message_body)
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3.5/http/client.py", line 949, in _send_output
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     self.send(msg)
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3.5/http/client.py", line 892, in send
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     self.connect()
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 163, in connect
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     conn = self._new_conn()
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 147, in _new_conn
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     self, "Failed to establish a new connection: %s" % e)
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]: requests.packages.urllib3.exceptions.NewConnectionError: <requests.packages.urllib3.connection.HTTPConnection object at 0x7f4724185860>: Failed to establish a new connection: [Errno 111] Connection refused
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]: During handling of the above exception, another exception occurred:
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]: Traceback (most recent call last):
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/requests/adapters.py", line 423, in send
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     timeout=timeout
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 643, in urlopen
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     _stacktrace=sys.exc_info()[2])
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 363, in increment
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     raise MaxRetryError(_pool, url, error or ResponseError(cause))
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]: requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='localhost', port=9200): Max retries exceeded with url: /_nodes/latencyStats (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f4724185860>: Failed to establish a n
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]: During handling of the above exception, another exception occurred:
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]: Traceback (most recent call last):
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/local/bin/prometheus-wmf-elasticsearch-exporter", line 629, in <module>
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     sys.exit(main())
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/local/bin/prometheus-wmf-elasticsearch-exporter", line 618, in main
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     args.port, args.indices))
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/prometheus_client/core.py", line 50, in register
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     names = self._get_names(collector)
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/prometheus_client/core.py", line 86, in _get_names
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     for metric in desc_func():
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/local/bin/prometheus-wmf-elasticsearch-exporter", line 105, in collect
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     yield from self.collect_latency(connection)
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/local/bin/prometheus-wmf-elasticsearch-exporter", line 576, in collect_latency
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     nodes = conn.request(self.latency_path)['nodes']
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/local/bin/prometheus-wmf-elasticsearch-exporter", line 68, in request
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     'Accept': 'application/json',
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/requests/api.py", line 70, in get
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     return request('get', url, params=params, **kwargs)
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/requests/api.py", line 56, in request
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     return session.request(method=method, url=url, **kwargs)
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/requests/sessions.py", line 488, in request
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     resp = self.send(prep, **send_kwargs)
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/requests/sessions.py", line 609, in send
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     r = adapter.send(request, **kwargs)
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:   File "/usr/lib/python3/dist-packages/requests/adapters.py", line 487, in send
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]:     raise ConnectionError(e, request=request)
Feb 09 19:22:16 relforge1003 prometheus-wmf-elasticsearch-exporter[24673]: requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=9200): Max retries exceeded with url: /_nodes/latencyStats (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f4724185860>: Failed to establish a new connection:
Feb 09 19:22:16 relforge1003 systemd[1]: prometheus-wmf-elasticsearch-exporter-9200.service: Main process exited, code=exited, status=1/FAILURE
Feb 09 19:22:16 relforge1003 systemd[1]: prometheus-wmf-elasticsearch-exporter-9200.service: Unit entered failed state.
Feb 09 19:22:16 relforge1003 systemd[1]: prometheus-wmf-elasticsearch-exporter-9200.service: Failed with result 'exit-code'.
Feb 09 19:22:16 relforge1003 systemd[1]: prometheus-wmf-elasticsearch-exporter-9200.service: Service hold-off time over, scheduling restart.
Feb 09 19:22:16 relforge1003 systemd[1]: Stopped Prometheus WMF Elasticsearch Exporter.
Feb 09 19:22:16 relforge1003 systemd[1]: prometheus-wmf-elasticsearch-exporter-9200.service: Start request repeated too quickly.
Feb 09 19:22:16 relforge1003 systemd[1]: Failed to start Prometheus WMF Elasticsearch Exporter.
Feb 09 19:22:16 relforge1003 systemd[1]: prometheus-wmf-elasticsearch-exporter-9200.service: Unit entered failed state.
Feb 09 19:22:16 relforge1003 systemd[1]: prometheus-wmf-elasticsearch-exporter-9200.service: Failed with result 'exit-code'.
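
The Errno 111 above is a plain connection-refused on localhost:9200, i.e. nothing was listening on the Elasticsearch HTTP port when the exporter started. A quick way to confirm that from the host (standard curl/ss invocations, shown for illustration):

curl -s 'http://localhost:9200/_cluster/health?pretty'   # fails while elasticsearch is down
sudo ss -tlnp | grep -E ':(9200|9400)'                   # check whether anything is bound to the ES ports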

The above issue is resolved. Our order of operations is a bit flawed and results in puppet trying to install packages such as elasticsearch-oss before apt can "see" them (presumably because an apt-get update hasn't been run in time). The issue self-healed in ~30 minutes; I manually restarted the failing services once the state in puppet-land had resolved itself.

It looks like running puppet twice in a row would have made things work, unless the self-healing was due exclusively to a cronjob. But I think it was the former.
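
For posterity, the manual remediation amounted to something like the following; the unit names are from the failed-units listing above, and the exact sequence is a sketch rather than a transcript:

sudo apt-get update        # make elasticsearch-oss visible to apt before puppet tries to install it
sudo run-puppet-agent      # a second puppet run converges now that the package can be found
sudo systemctl restart prometheus-wmf-elasticsearch-exporter-9200.service
sudo systemctl restart prometheus-wmf-elasticsearch-exporter-9400.service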

RKemper updated the task description. Tue, Feb 9, 7:52 PM

Updated netbox entries to mark the servers as active.

Moving back to in-progress: I'd thought that all systemd units were working properly now, but kibana.service is still failing on relforge100[3,4].

Feb 09 23:38:03 relforge1004 systemd[1]: Started Kibana.
Feb 09 23:38:05 relforge1004 kibana[23539]: {"type":"error","@timestamp":"2021-02-09T23:38:05Z","tags":["fatal","root"],"pid":23539,"level":"fatal","error":{"message":"\"newsfeed.e
Feb 09 23:38:05 relforge1004 kibana[23539]:  FATAL  Error: "newsfeed.enabled" and "telemetry.enabled" settings were not applied. Check for spelling errors and ensure that expected
Feb 09 23:38:05 relforge1004 systemd[1]: kibana.service: Main process exited, code=exited, status=64/n/a
Feb 09 23:38:05 relforge1004 systemd[1]: kibana.service: Unit entered failed state.
Feb 09 23:38:05 relforge1004 systemd[1]: kibana.service: Failed with result 'exit-code'.
Feb 09 23:38:05 relforge1004 systemd[1]: kibana.service: Service hold-off time over, scheduling restart.
Feb 09 23:38:05 relforge1004 systemd[1]: Stopped Kibana.
Feb 09 23:38:05 relforge1004 systemd[1]: Started Kibana.
Feb 09 23:38:07 relforge1004 kibana[23558]: {"type":"error","@timestamp":"2021-02-09T23:38:07Z","tags":["fatal","root"],"pid":23558,"level":"fatal","error":{"message":"\"newsfeed.e
Feb 09 23:38:07 relforge1004 kibana[23558]:  FATAL  Error: "newsfeed.enabled" and "telemetry.enabled" settings were not applied. Check for spelling errors and ensure that expected
Feb 09 23:38:07 relforge1004 systemd[1]: kibana.service: Main process exited, code=exited, status=64/n/a
Feb 09 23:38:07 relforge1004 systemd[1]: kibana.service: Unit entered failed state.
Feb 09 23:38:07 relforge1004 systemd[1]: kibana.service: Failed with result 'exit-code'.
Feb 09 23:38:07 relforge1004 systemd[1]: kibana.service: Service hold-off time over, scheduling restart.
Feb 09 23:38:07 relforge1004 systemd[1]: Stopped Kibana.
Feb 09 23:38:07 relforge1004 systemd[1]: Started Kibana.
Feb 09 23:38:09 relforge1004 kibana[24392]: {"type":"error","@timestamp":"2021-02-09T23:38:09Z","tags":["fatal","root"],"pid":24392,"level":"fatal","error":{"message":"\"newsfeed.e
Feb 09 23:38:09 relforge1004 kibana[24392]:  FATAL  Error: "newsfeed.enabled" and "telemetry.enabled" settings were not applied. Check for spelling errors and ensure that expected
Feb 09 23:38:09 relforge1004 systemd[1]: kibana.service: Main process exited, code=exited, status=64/n/a
Feb 09 23:38:09 relforge1004 systemd[1]: kibana.service: Unit entered failed state.
Feb 09 23:38:09 relforge1004 systemd[1]: kibana.service: Failed with result 'exit-code'.
Feb 09 23:38:09 relforge1004 systemd[1]: kibana.service: Service hold-off time over, scheduling restart.
Feb 09 23:38:09 relforge1004 systemd[1]: Stopped Kibana.
Feb 09 23:38:09 relforge1004 systemd[1]: kibana.service: Start request repeated too quickly.
Feb 09 23:38:09 relforge1004 systemd[1]: Failed to start Kibana.
Feb 09 23:38:09 relforge1004 systemd[1]: kibana.service: Unit entered failed state.
Feb 09 23:38:09 relforge1004 systemd[1]: kibana.service: Failed with result 'exit-code'.

Looks like we might be using a deprecated or version-gated setting.

Closing the loop on the above: it looks like newsfeed.enabled exists in Kibana 7 but not in Kibana 6 (these are Kibana settings, not Elasticsearch ones).

For example, newsfeed.enabled is present in https://www.elastic.co/guide/en/kibana/7.x/settings.html but absent from https://www.elastic.co/guide/en/kibana/6.5/settings.html.

(Note that we're running Elasticsearch 6.5.4 currently)
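
(A quick way to double-check the installed versions from the host itself; dpkg-query usage is standard, but the package names here are assumptions based on what puppet installs:)

dpkg-query -W -f='${Package} ${Version}\n' elasticsearch-oss kibana   # confirm we're on the 6.5.x line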

So we either need to bite the bullet and upgrade to Elasticsearch 7.1.0 right now, or we need to figure out what's trying to set these values and override them. The latter seems like the obvious choice for now.

From relforge1004:

ryankemper@relforge1004:~$ sudo systemctl status kibana.service
● kibana.service - Kibana
   Loaded: loaded (/etc/systemd/system/kibana.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2021-02-10 00:38:09 UTC; 2min 41s ago
  Process: 1040 ExecStart=/usr/share/kibana/bin/kibana -c /etc/kibana/kibana.yml (code=exited, status=64)
 Main PID: 1040 (code=exited, status=64)

Feb 10 00:38:08 relforge1004 systemd[1]: kibana.service: Failed with result 'exit-code'.
Feb 10 00:38:09 relforge1004 systemd[1]: kibana.service: Service hold-off time over, scheduling restart.
Feb 10 00:38:09 relforge1004 systemd[1]: Stopped Kibana.
Feb 10 00:38:09 relforge1004 systemd[1]: kibana.service: Start request repeated too quickly.
Feb 10 00:38:09 relforge1004 systemd[1]: Failed to start Kibana.
Feb 10 00:38:09 relforge1004 systemd[1]: kibana.service: Unit entered failed state.
Feb 10 00:38:09 relforge1004 systemd[1]: kibana.service: Failed with result 'exit-code'.

We can see that the systemd unit is /etc/systemd/system/kibana.service:

ryankemper@relforge1004:~$ cat /etc/systemd/system/kibana.service
[Unit]
Description=Kibana

[Service]
Type=simple
User=kibana
Group=kibana
# Load env vars from /etc/default/ and /etc/sysconfig/ if they exist.
# Prefixing the path with '-' makes it try to load, but if the file doesn't
# exist, it continues onward.
EnvironmentFile=-/etc/default/kibana
EnvironmentFile=-/etc/sysconfig/kibana
ExecStart=/usr/share/kibana/bin/kibana "-c /etc/kibana/kibana.yml"
Restart=always
WorkingDirectory=/

[Install]
WantedBy=multi-user.target

which references this file:

ryankemper@relforge1004:~$ cat /etc/kibana/kibana.yml
kibana.defaultAppId: dashboard/default
logging.quiet: false
metrics.enabled: false
newsfeed.enabled: false
server.maxPayloadBytes: 4194304
telemetry.enabled: false
timelion.enabled: false

Searching around in the puppet repo for where that file gets provisioned, I found the patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/618948/ for ticket https://phabricator.wikimedia.org/T259794, which adds these settings and sets them to false.
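
(For reference, a search along these lines from a checkout of operations/puppet is one way to locate where kibana.yml gets rendered:)

git grep -n 'newsfeed.enabled'   # locate the profile/template that renders kibana.yml
git grep -n 'kibana.yml'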

Dzahn added a subscriber: Dzahn. Wed, Feb 10, 9:57 PM

ACKed Icinga unhandled CRIT alerts for relforge1003/1004 (systemd state) with a link to the comment above.

Change 665058 had a related patch set uploaded (by Ryan Kemper; owner: Ryan Kemper):
[operations/puppet@production] kibana: only render if explicitly set

https://gerrit.wikimedia.org/r/665058
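
Once that patch merges, verification would look something like the following, reusing the cumin pattern from earlier in this task (a sketch, not a transcript):

sudo cumin 'P{relforge*}' 'run-puppet-agent'                   # roll out the template change
sudo cumin 'P{relforge*}' 'systemctl restart kibana.service'
sudo cumin 'P{relforge*}' 'systemctl is-active kibana.service' # expect "active" on both hosts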