
Logs and events produced by the WMF are consumed using the Elastic Common Schema by OpenSearch
Open, In Progress, Medium, Public

Description

This is a tracking task for OKR Work for this quarter:

  • T288618 - Deploy OpenSearch for Beta following production observability configurations
  • T288619 - Improve the process to consume and use API.LOG to filter out bad performing queries either by extra tooling within o11y or analytics
  • T288620 - Document path forward for how to Retire all non-Kafka Logstash inputs

Event Timeline

lmata updated Other Assignee, added: lmata.
lmata updated Other Assignee, added: herron; removed: lmata.

Change 742778 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] site: consolidate logstash node definitions

https://gerrit.wikimedia.org/r/742778

Change 742779 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] site: reprovision codfw logging cluster to opensearch

https://gerrit.wikimedia.org/r/742779

Change 742780 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] hiera: add opensearch production configuration

https://gerrit.wikimedia.org/r/742780

Change 742781 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] opensearch-dashboards: enable phatality on beta-logs

https://gerrit.wikimedia.org/r/742781

Change 742781 merged by Cwhite:

[operations/puppet@production] opensearch-dashboards: enable phatality on beta-logs

https://gerrit.wikimedia.org/r/742781

Change 742778 merged by Cwhite:

[operations/puppet@production] site: consolidate logstash node definitions

https://gerrit.wikimedia.org/r/742778

Change 742780 merged by Cwhite:

[operations/puppet@production] hiera: add opensearch production configuration

https://gerrit.wikimedia.org/r/742780

Change 743049 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] site: reprovision codfw logging cluster to opensearch

https://gerrit.wikimedia.org/r/743049

https://gerrit.wikimedia.org/r/743049

This is the last automated step in provisioning OpenSearch.

Before merging this change, Puppet should be disabled on the whole codfw ES cluster.

Optionally, disable shard allocation for the cluster. If it is disabled, it will have to be re-enabled after each node joins so that the node can allocate and update its own shards.
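One way to toggle shard allocation is the cluster settings API, which both Elasticsearch 7 and OpenSearch expose. The following is a minimal sketch, not the procedure used in this migration: the ES_URL default and both function names are assumptions for illustration.

```shell
# Assumed endpoint; point ES_URL at a node of the cluster being migrated.
ES_URL="${ES_URL:-http://localhost:9200}"

# Print the cluster-settings body that sets shard allocation to "$1"
# ("none" to disable, "all" to re-enable).
allocation_body() {
  printf '{"persistent":{"cluster.routing.allocation.enable":"%s"}}' "$1"
}

# Send the setting to the cluster settings API.
set_allocation() {
  curl -sS -X PUT "${ES_URL}/_cluster/settings" \
    -H 'Content-Type: application/json' \
    -d "$(allocation_body "$1")"
}
```

With this sketch, `set_allocation none` would be run before the first node is stopped and `set_allocation all` after each node has rejoined. The setting is written as persistent here so it survives restarts, which also means it stays in effect until it is explicitly changed back.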

After merge, the first nodes to migrate are data nodes. Do serially:

  • Stop elasticsearch
  • Enable and Run Puppet

After Puppet applies this change, ensure OpenSearch is stopped. Prior to joining the cluster, we'll want to put the ES index data into place:

  • mv /srv/elasticsearch/production-elk7-codfw /srv/opensearch/production-elk7-codfw
  • chown -R opensearch:opensearch /srv/opensearch/production-elk7-codfw

Once data is in place, start OpenSearch and watch logs and API endpoints for a successful cluster join and shard provisioning.
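The per-node sequence for data nodes can be written out as a dry-run plan that only prints the commands, so the order can be reviewed before anything is executed. This is a sketch under assumptions: plan_data_node is a hypothetical helper, the OpenSearch unit name is not given in this task and must be supplied by the operator, and plain `puppet agent` calls stand in for whatever Puppet wrapper the hosts actually use. The Elasticsearch unit name is taken from the cleanup list further down.

```shell
# Elasticsearch unit as named in the cleanup steps of this comment.
ES_UNIT="${ES_UNIT:-elasticsearch_7@production-elk7-codfw}"

# Print, in order, the commands for migrating one data node.
#   $1: OpenSearch systemd unit (not named in this task; operator supplies it)
#   $2: existing Elasticsearch data directory
#   $3: new OpenSearch data directory
plan_data_node() {
  cat <<EOF
systemctl stop ${ES_UNIT}
puppet agent --enable
puppet agent --test
systemctl stop $1
mv $2 $3
chown -R opensearch:opensearch $3
systemctl start $1
EOF
}
```

Calling plan_data_node with the unit name and the two directories from the list above prints seven lines. Nothing is run, so the output can be pasted into a review or executed by hand one line at a time while watching the logs.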

After merge, the last nodes to migrate are collector nodes. Do serially:

  • Stop Logstash
  • Stop elasticsearch
  • Enable and Run Puppet
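Because both passes are done serially, it can help to wait for the cluster to settle before touching the next node. The sketch below polls the `_cat/health` API for the one-word cluster status. The ES_URL default, the function names, the retry count, and the 10-second interval are all assumptions chosen for illustration.

```shell
# Assumed endpoint; point ES_URL at a node that has already been migrated.
ES_URL="${ES_URL:-http://localhost:9200}"

# Succeed when status "$1" is at least as healthy as the wanted level "$2".
status_at_least() {
  case "$2:$1" in
    yellow:yellow|yellow:green|green:green) return 0 ;;
    *) return 1 ;;
  esac
}

# Fetch the cluster status (green, yellow, or red) as a bare word.
cluster_status() {
  curl -sS "${ES_URL}/_cat/health?h=status" | tr -d '[:space:]'
}

# Poll until the cluster reaches status "$1" (default green),
# giving up after "$2" attempts (default 60) spaced 10 seconds apart.
wait_for_status() {
  want="${1:-green}"
  tries="${2:-60}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if status_at_least "$(cluster_status)" "$want"; then
      return 0
    fi
    i=$((i + 1))
    sleep 10
  done
  return 1
}
```

If shard allocation is left disabled between nodes, the cluster will not reach green, so `wait_for_status yellow` is the more realistic target in that case.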

Some manual steps once complete:

  • Purge elasticsearch-oss and kibana packages
  • Disable and stop lingering services
    • sudo systemctl disable elasticsearch_7@production-elk7-codfw
    • sudo systemctl disable elasticsearch-production-elk7-codfw-gc-log-cleanup.timer
    • sudo systemctl stop elasticsearch-production-elk7-codfw-gc-log-cleanup.timer
  • Check for and possibly remove lingering files:
    • /etc/logrotate.d/elastic*
    • /etc/elasticsearch
    • /lib/systemd/system/elasticsearch*
    • /var/log/elasticsearch-production-elk7-codfw-gc-log-cleanup
    • /etc/kibana
    • /etc/default/kibana
    • /etc/sudoers.d/kibana-deploy-phatality
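The file check above can be done without deleting anything by listing only the candidates that still exist. A minimal sketch follows; list_lingering is a hypothetical helper name, and it assumes the paths contain no spaces.

```shell
# Print every existing path that matches the given paths or globs.
# Patterns with no match print nothing.
list_lingering() {
  for pattern in "$@"; do
    # Unquoted on purpose so that globs such as elastic* expand.
    for path in $pattern; do
      if [ -e "$path" ]; then
        printf '%s\n' "$path"
      fi
    done
  done
}
```

Passing the entries from the list above as quoted arguments, for example `list_lingering '/etc/logrotate.d/elastic*' /etc/elasticsearch /etc/kibana`, prints whatever remains on the host, and the operator can then decide what to remove.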

Lastly, verify that OpenSearch Dashboards and the ingest pipeline are working. Once both are confirmed functional, we'll restore a Kibana backup and point the UI at codfw.

Mentioned in SAL (#wikimedia-operations) [2021-12-06T20:14:36Z] <cwhite> begin codfw opensearch upgrade T288621

Change 743049 merged by Cwhite:

[operations/puppet@production] site: reprovision codfw logging cluster to opensearch

https://gerrit.wikimedia.org/r/743049

Change 744088 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] hiera: synchronize cluster name

https://gerrit.wikimedia.org/r/744088

Change 744088 merged by Cwhite:

[operations/puppet@production] hiera: synchronize cluster name

https://gerrit.wikimedia.org/r/744088

Mentioned in SAL (#wikimedia-operations) [2021-12-07T00:10:18Z] <cwhite> end codfw opensearch upgrade T288621

Change 744845 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] opensearch_dashboards: allow up to 64mb restore payload

https://gerrit.wikimedia.org/r/744845

Change 745284 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] hiera: map logstash.wm.o to kibana7.codfw

https://gerrit.wikimedia.org/r/745284

Change 744845 merged by Cwhite:

[operations/puppet@production] opensearch_dashboards: allow up to 64mb restore payload

https://gerrit.wikimedia.org/r/744845

colewhite changed the task status from Open to In Progress. Dec 8 2021, 5:22 PM
colewhite triaged this task as Medium priority.

Change 745284 merged by Cwhite:

[operations/puppet@production] hiera: map logstash.wm.o to kibana7.codfw

https://gerrit.wikimedia.org/r/745284

Mentioned in SAL (#wikimedia-operations) [2021-12-09T17:48:38Z] <cwhite> point kibana7 to OpenSearch in codfw T288621

Change 742779 abandoned by Cwhite:

[operations/puppet@production] site: reprovision codfw logging cluster to opensearch

Reason:

https://gerrit.wikimedia.org/r/742779

Change 752755 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] hiera: add opensearch production configuration (eqiad)

https://gerrit.wikimedia.org/r/752755

Change 752756 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] site: reprovision eqiad logging cluster to opensearch

https://gerrit.wikimedia.org/r/752756

Change 752755 merged by Cwhite:

[operations/puppet@production] hiera: add opensearch production configuration (eqiad)

https://gerrit.wikimedia.org/r/752755

Mentioned in SAL (#wikimedia-operations) [2022-01-12T19:25:05Z] <cwhite> begin eqiad opensearch upgrade T288621

Change 752756 merged by Cwhite:

[operations/puppet@production] site: reprovision eqiad logging cluster to opensearch

https://gerrit.wikimedia.org/r/752756

Change 753547 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] hiera: fix opensearch common_settings namespace

https://gerrit.wikimedia.org/r/753547

Change 753547 merged by Cwhite:

[operations/puppet@production] hiera: fix opensearch common_settings namespace

https://gerrit.wikimedia.org/r/753547

Mentioned in SAL (#wikimedia-operations) [2022-01-12T22:48:04Z] <cwhite> end eqiad opensearch upgrade T288621

Change 754035 had a related patch set uploaded (by Cwhite; author: Cwhite):

[operations/puppet@production] hiera: allow logstash1032 through kafka-jumbo firewall

https://gerrit.wikimedia.org/r/754035

Change 754035 merged by Cwhite:

[operations/puppet@production] hiera: allow logstash1032 through kafka-jumbo firewall

https://gerrit.wikimedia.org/r/754035