
Set up opensearch cluster for datahub
Closed, Resolved (Public)

Description

  • 3 Ganeti VMs, so the cluster stays highly available with a 2-server quorum

Event Timeline

I added datahubsearch as a new prefix for servers here: https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions#Servers
So we can call these servers: datahubsearch100[1-3]

Change 762957 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] analytics_cluster::datahub::opensearch: start of puppet role

https://gerrit.wikimedia.org/r/762957

Change 762957 merged by Razzi:

[operations/puppet@production] analytics_cluster::datahub::opensearch: start of puppet role

https://gerrit.wikimedia.org/r/762957

Change 763575 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] datahub::opensearch: Fix sdd typo to be ssd

https://gerrit.wikimedia.org/r/763575

Change 763575 merged by Razzi:

[operations/puppet@production] datahub::opensearch: Fix sdd typo to be ssd

https://gerrit.wikimedia.org/r/763575

Icinga downtime set by razzi@cumin1001 for 7 days, 0:00:00 1 host(s) and their services with reason: Node is being set up for first time and puppet run failed

datahubsearch1001.eqiad.wmnet

I ran puppet and got an error installing the opensearch package.

I ran puppet twice to see if it was missing an implicit dependency, but it failed on the second run as well.

Full paste: https://phabricator.wikimedia.org/P20996

Relevant bits

Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install opensearch' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package opensearch
Error: /Stage[main]/Opensearch::Packages/Package[opensearch]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install opensearch' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package opensearch
Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install liblogstash-gelf-java' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package liblogstash-gelf-java
Error: /Stage[main]/Opensearch::Packages/Package[liblogstash-gelf-java]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install liblogstash-gelf-java' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package liblogstash-gelf-java
Notice: /Stage[main]/Opensearch::Packages/Package[libjson-simple-java]/ensure: created
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/opensearch/lib (file: /etc/puppet/modules/opensearch/manifests/packages.pp, line: 26)
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/opensearch/lib (file: /etc/puppet/modules/opensearch/manifests/packages.pp, line: 26)
Wrapped exception:
No such file or directory @ dir_chdir - /usr/share/opensearch/lib
Error: /Stage[main]/Opensearch::Packages/File[/usr/share/opensearch/lib/logstash-gelf.jar]/ensure: change from 'absent' to 'link' failed: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/opensearch/lib (file: /etc/puppet/modules/opensearch/manifests/packages.pp, line: 26)
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/opensearch/lib (file: /etc/puppet/modules/opensearch/manifests/packages.pp, line: 30)
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/opensearch/lib (file: /etc/puppet/modules/opensearch/manifests/packages.pp, line: 30)
Wrapped exception:
No such file or directory @ dir_chdir - /usr/share/opensearch/lib
Error: /Stage[main]/Opensearch::Packages/File[/usr/share/opensearch/lib/json-simple.jar]/ensure: change from 'absent' to 'link' failed: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/opensearch/lib (file: /etc/puppet/modules/opensearch/manifests/packages.pp, line: 30)
Info: Class[Opensearch::Packages]: Unscheduling all events on Class[Opensearch::Packages]
Error: Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install elasticsearch-curator=5.8.1' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
W: --force-yes is deprecated, use one of the options starting with --allow instead.
E: Version '5.8.1' for 'elasticsearch-curator' was not found
Error: /Stage[main]/Opensearch::Curator/Package[elasticsearch-curator]/ensure: change from 'purged' to '5.8.1' failed: Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install elasticsearch-curator=5.8.1' returned 100: Reading package lists...
Building dependency tree...
Reading state information...
W: --force-yes is deprecated, use one of the options starting with --allow instead.
E: Version '5.8.1' for 'elasticsearch-curator' was not found

elukey@apt1001:/srv/wikimedia$ sudo reprepro lsbycomponent opensearch
opensearch | 1.2.4 | buster-wikimedia | thirdparty/opensearch1 | amd64

The datahub nodes are Bullseye afaics, so there are no packages in apt for it :)
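
One way to confirm this kind of mismatch quickly is to filter the reprepro listing by distribution. This is only a sketch: the function name has_package_for_dist is made up for illustration, and it assumes the pipe-separated column layout shown in the output above (name | version | distribution | component | architecture).

```shell
# Hypothetical helper: succeed if the reprepro listing on stdin contains
# an entry for the given distribution (third pipe-separated column).
has_package_for_dist() {
    dist="$1"
    awk -F'|' -v want="$dist" '
        { d = $3; gsub(/^[ \t]+|[ \t]+$/, "", d) }   # trim padding around the field
        d == want { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}
```

On the apt host this could be used as: sudo reprepro lsbycomponent opensearch | has_package_for_dist bullseye-wikimedia || echo "no bullseye build of opensearch"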

Change 763587 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] datahub::opensearch: Change curator version to 5.8.1-1 for

https://gerrit.wikimedia.org/r/763587

Change 763587 merged by Razzi:

[operations/puppet@production] opensearch: make curator version bullseye compatible

https://gerrit.wikimedia.org/r/763587

Change 763815 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] analytics_cluster::datahub::opensearch: Enable syslog transport

https://gerrit.wikimedia.org/r/763815

Change 763815 merged by Razzi:

[operations/puppet@production] analytics_cluster::datahub::opensearch: Enable syslog transport

https://gerrit.wikimedia.org/r/763815

Change 763844 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] opensearch: change log4j appender.ship_to_logstash.layout

https://gerrit.wikimedia.org/r/763844

Change 763844 merged by Razzi:

[operations/puppet@production] opensearch: change log4j appender.ship_to_logstash.layout

https://gerrit.wikimedia.org/r/763844

Vms have been created and are all running opensearch 1.2.4!

Side note: puppet needed to be run twice for some reason. As I understand it, puppet should apply everything in a single run. If anybody's interested, I saved the output of the 2 puppet runs here: https://phabricator.wikimedia.org/P21613 (visibility is wmf-restricted in case I missed some secret output)

Looks good, but I think that we need to sort out the firewall between these hosts.

btullis@datahubsearch1001:/etc/opensearch/datahub$ curl http://127.0.0.1:9200/_cat/health
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}
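
For scripting this check, the response body can be tested for the exception name rather than read by eye. A minimal sketch, assuming the body looks like the one above; master_discovered is a hypothetical helper name, not part of any existing tooling.

```shell
# Hypothetical helper: read a response body on stdin and fail if it
# reports that no master has been discovered.
master_discovered() {
    ! grep -q 'master_not_discovered_exception'
}
```

Example: curl -s http://127.0.0.1:9200/_cat/health | master_discovered || echo "cluster has no master yet"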

Looking at the existing logstash nodes, they have a ferm configuration fragment present, which is not present on the datahubsearch hosts.


So we're going to need to work out why profile::opensearch::server isn't generating this file.

BTullis triaged this task as High priority. (Mar 2 2022, 10:38 AM)

@colewhite could you perhaps weigh in on how to get the ferm rules accepting traffic between the elasticsearch nodes?

It looks like opensearch instances are filtered at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/opensearch/server.pp#70 and those nodes are the ones that include the ferm::service class.

One thing I've never figured out how to do is see the value a variable has when Puppet evaluates it. Is there an easy way to see the value of $filtered_instances, to check whether it's missing the datahubsearch100x nodes? Maybe something like this: https://www.puppetcookbook.com/posts/see-all-client-variables.html

> @colewhite could you perhaps weigh in on how to get the ferm rules accepting traffic between the elasticsearch nodes?

The role probably needs profile::base::firewall. I think the ferm rules on those hosts may have been left over from time spent in role::insetup.

The presence of the old jvm gc checks in icinga indicates that the ferm rule should be managed too, given their proximity.

Change 767611 had a related patch set uploaded (by Razzi; author: Razzi):

[operations/puppet@production] analytics_cluster::datahub::opensearch: add firewall and base_checks

https://gerrit.wikimedia.org/r/767611

Cool, thanks @colewhite. I also threw in the ::profile::opensearch::monitoring::base_checks since it seems to implement a simple, useful http check.

Change 767611 merged by Razzi:

[operations/puppet@production] analytics_cluster::datahub::opensearch: add firewall and base_checks

https://gerrit.wikimedia.org/r/767611

Icinga downtime set by razzi@cumin1001 for 7 days, 0:00:00 3 host(s) and their services with reason: Still having errors setting up opensearch

datahubsearch[1001-1003].eqiad.wmnet

The firewall changes appear to be successful:

razzi@datahubsearch1001:/etc/ferm$ nc -z datahubsearch1002 9300
razzi@datahubsearch1001:/etc/ferm$ echo $?
0
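
The same probe can be repeated against every peer in one go. A rough sketch, assuming 9300 is the only port that needs checking between the nodes; unreachable_peers is a hypothetical name, and the 2-second timeout is an arbitrary choice.

```shell
# Hypothetical helper: print each peer whose port is NOT reachable.
# Usage: unreachable_peers PORT HOST...
unreachable_peers() {
    port="$1"; shift
    for host in "$@"; do
        # -z: only probe the port, send no data; -w 2: give up after 2 seconds
        nc -z -w 2 "$host" "$port" || echo "$host"
    done
}
```

Example: unreachable_peers 9300 datahubsearch1002 datahubsearch1003 prints nothing when both peers accept connections.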

However the health check is still coming up 503:

razzi@datahubsearch1001:/etc/ferm$ curl localhost:9200/_cat/health
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}

I thought maybe restarting opensearch would cause it to properly discover nodes, but strangely got a different error of missing files:

Mar 03 01:37:27 datahubsearch1001 systemd[1]: Starting OpenSearch...
Mar 03 01:37:27 datahubsearch1001 systemd-entrypoint[1856577]: /usr/share/opensearch/bin/opensearch-env: line 89: cd: ${OPENSEARCH_PATH_CONF-/etc/opensearch}: No such file or directory
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]: Exception in thread "main" java.nio.file.NoSuchFileException: /usr/share/opensearch/jvm.options
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]:         at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]:         at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]:         at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]:         at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:218)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]:         at java.base/java.nio.file.Files.newByteChannel(Files.java:375)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]:         at java.base/java.nio.file.Files.newByteChannel(Files.java:426)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]:         at java.base/java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:420)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]:         at java.base/java.nio.file.Files.newInputStream(Files.java:160)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]:         at org.opensearch.tools.launchers.JvmOptionsParser.readJvmOptionsFiles(JvmOptionsParser.java:182)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]:         at org.opensearch.tools.launchers.JvmOptionsParser.jvmOptions(JvmOptionsParser.java:143)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]:         at org.opensearch.tools.launchers.JvmOptionsParser.main(JvmOptionsParser.java:110)
Mar 03 01:37:29 datahubsearch1001 systemd[1]: opensearch.service: Main process exited, code=exited, status=1/FAILURE
Mar 03 01:37:29 datahubsearch1001 systemd[1]: opensearch.service: Failed with result 'exit-code'.
Mar 03 01:37:29 datahubsearch1001 systemd[1]: Failed to start OpenSearch.
Mar 03 01:37:29 datahubsearch1001 systemd[1]: opensearch.service: Consumed 2.772s CPU time.

Ok, at least the missing jvm.options is that I did a sudo systemctl restart opensearch when I should have used the service opensearch_1@datahub.service.

There are a bunch of different errors in the /var/log/opensearch/datahub.log, such as

java.nio.file.NoSuchFileException: /usr/share/opensearch/lib/logstash-gelf.jar

which I wouldn't expect, since we disabled gelf, and

[2022-03-03T01:37:33,175][WARN ][stderr                   ] [datahubsearch1001-datahub] Exception in thread "main" org.opensearch.bootstrap.BootstrapException: java.nio.file.AccessDeniedException: /var/run/opensearch
[2022-03-03T01:37:33,177][WARN ][stderr                   ] [datahubsearch1001-datahub] Likely root cause: java.nio.file.AccessDeniedException: /var/run/opensearch

This is all on datahubsearch1001; I haven't touched the others.

I see there's an extra unit on datahubsearch1001:

  UNIT                                    LOAD   ACTIVE SUB     DESCRIPTION
● opensearch.service                      loaded failed failed  OpenSearch
● opensearch_1@datahub.service            loaded failed failed  OpenSearch (cluster datahub)
  opensearch-datahub-gc-log-cleanup.timer loaded active waiting Periodic execution of opensearch-datahub-gc-log-cleanup.service

versus 1002:

razzi@datahubsearch1002:~$ systemctl list-units opensearch
  UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
razzi@datahubsearch1002:~$ systemctl list-units opensearch'*'
  UNIT                                    LOAD   ACTIVE SUB     DESCRIPTION
  opensearch_1@datahub.service            loaded active running OpenSearch (cluster datahub)
  opensearch-datahub-gc-log-cleanup.timer loaded active waiting Periodic execution of opensearch-datahub-gc-log-cleanup.service

I wonder if that extra unit was created when I did restart on a nonexistent unit; if so systemctl continues to surprise...

I downtimed the node for the next week so hopefully this messiness won't bother anybody unless they choose to look into it...

> Ok, at least the missing jvm.options is that I did a sudo systemctl restart opensearch when I should have used the service opensearch_1@datahub.service.
>
> There are a bunch of different errors in the /var/log/opensearch/datahub.log, such as
>
> java.nio.file.NoSuchFileException: /usr/share/opensearch/lib/logstash-gelf.jar
>
> which I wouldn't expect, since we disabled gelf

The symlink to the gelf jar is not managed by puppet anymore and has to be removed manually. I did this on datahubsearch1001 and restarted opensearch_1@datahub and observed the logs. The service now starts.
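
To spot leftovers like this on the other hosts, something along these lines could help. It is a sketch under an assumption: that the leftover link points at a jar which is no longer installed (a dangling symlink), which would match the NoSuchFileException above. list_dangling_symlinks is a hypothetical helper name.

```shell
# Hypothetical helper: list symlinks directly under a directory whose
# targets no longer exist.
list_dangling_symlinks() {
    dir="$1"
    for path in "$dir"/*; do
        # -L: path is a symlink; ! -e: its target does not resolve
        if [ -L "$path" ] && [ ! -e "$path" ]; then
            echo "$path"
        fi
    done
}
```

Example: list_dangling_symlinks /usr/share/opensearch/lib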

> I wonder if that extra unit was created when I did restart on a nonexistent unit; if so systemctl continues to surprise...
>
> I downtimed the node for the next week so hopefully this messiness won't bother anybody unless they choose to look into it...

The opensearch debian package creates opensearch.service and Puppet creates opensearch_1@datahub.service. Puppet configures opensearch nodes to be "multi-instance capable" (T198351), mirroring what the elasticsearch module does, but does not remove the broken unit. Given this is confusing, it may be worthwhile to have Puppet remove opensearch.service.

Hi Cole!

> Ok, at least the missing jvm.options is that I did a sudo systemctl restart opensearch when I should have used the service opensearch_1@datahub.service.
>
> There are a bunch of different errors in the /var/log/opensearch/datahub.log, such as
>
> java.nio.file.NoSuchFileException: /usr/share/opensearch/lib/logstash-gelf.jar
>
> which I wouldn't expect, since we disabled gelf
>
> The symlink to the gelf jar is not managed by puppet anymore and has to be removed manually. I did this on datahubsearch1001 and restarted opensearch_1@datahub and observed the logs. The service now starts.

Which symlink was removed? I checked today and didn't find anything in /usr/share/opensearch/lib (curious)

> I wonder if that extra unit was created when I did restart on a nonexistent unit; if so systemctl continues to surprise...
>
> I downtimed the node for the next week so hopefully this messiness won't bother anybody unless they choose to look into it...
>
> The opensearch debian package creates opensearch.service and Puppet creates opensearch_1@datahub.service. Puppet configures opensearch nodes to be "multi-instance capable" (T198351), mirroring what the elasticsearch module does, but does not remove the broken unit. Given this is confusing, it may be worthwhile to have Puppet remove opensearch.service.

This could be a good candidate for systemctl mask (there are some examples in the puppet repo, easy and convenient to avoid headaches in my opinion).
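
Done by hand, the masking would look roughly like the sketch below. In practice this belongs in Puppet; the shell version is only to show the effect. mask_packaged_unit is a hypothetical name, and defaulting to opensearch.service is an assumption taken from the discussion above.

```shell
# Hypothetical helper: mask the package-provided unit so it cannot be
# started by accident; the per-cluster instance unit is left untouched.
mask_packaged_unit() {
    unit="${1:-opensearch.service}"
    # Stop it first; ignore the error if it is already stopped or failed.
    sudo systemctl stop "$unit" 2>/dev/null || true
    sudo systemctl mask "$unit"
}
```

After masking, a stray sudo systemctl restart opensearch should be refused instead of starting a second, misconfigured instance.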

The log file for the service had lots of messages like this:

[2022-03-04T00:00:17,947][WARN ][o.o.c.c.ClusterFormationFailureHelper] [datahubsearch1001-datahub] master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node:
<snip>

Although it's not specific to OpenSearch, this is the most useful reference: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/modules-discovery-bootstrap-cluster.html
The key item is that we haven't defined any initial_master_nodes so the cluster is refusing to bootstrap itself.
Our current manifests don't include this as a configurable variable (ref), so I'm just going to bootstrap the cluster manually on this occasion.

  • I stopped puppet
  • I modified the file /etc/opensearch/datahub/opensearch.yml to add the parameter.
btullis@datahubsearch1001:/var/log/opensearch$ grep initial_master_nodes /etc/opensearch/datahub/opensearch.yml
#cluster.initial_master_nodes: ["node-1", "node-2"]
cluster.initial_master_nodes: ["datahubsearch1001-datahub","datahubsearch1002-datahub","datahubsearch1003-datahub"]
  • I started the service.
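
The line added in the steps above follows a simple pattern, so it can be generated rather than typed. A minimal sketch, assuming node names are always "<hostname>-<cluster>" as in the edited file; initial_master_nodes_line is a hypothetical helper, not something in the puppet repo.

```shell
# Hypothetical helper: build the cluster.initial_master_nodes line from a
# cluster name and a list of hostnames.
# Usage: initial_master_nodes_line CLUSTER HOST...
initial_master_nodes_line() {
    cluster="$1"; shift
    names=""
    for host in "$@"; do
        # Append a comma only when the list already has an entry.
        names="${names:+$names,}\"${host}-${cluster}\""
    done
    printf 'cluster.initial_master_nodes: [%s]\n' "$names"
}
```

Calling initial_master_nodes_line datahub datahubsearch1001 datahubsearch1002 datahubsearch1003 prints the same line that was added to opensearch.yml by hand.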

Now this node has been elected as master.

[2022-03-04T16:07:37,540][INFO ][o.o.c.s.MasterService    ] [datahubsearch1001-datahub] elected-as-master ([3] nodes joined)
<snip>
cluster UUID set to [XD2SLPtvRKqdo0wphCaXRQ]

The CAT API calls now return as expected.

btullis@datahubsearch1001:~$ curl http://127.0.0.1:9200/_cat/nodes
10.64.16.45  2 96 0 0.00 0.00 0.00 dimr - datahubsearch1002-datahub
10.64.32.27 13 96 0 0.00 0.00 0.00 dimr - datahubsearch1003-datahub
10.64.0.85   1 98 0 0.03 0.15 0.09 dimr * datahubsearch1001-datahub
btullis@datahubsearch1001:~$ curl http://127.0.0.1:9200/_cat/health
1646410362 16:12:42 datahub green 3 3 true 0 0 0 0 0 0 - 100.0%
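
Both checks can be turned into pass/fail tests for later use. This is a sketch only: the helper names are hypothetical, and the column positions are read off the sample output above rather than taken from the API documentation.

```shell
# Hypothetical helper: succeed only if the _cat/health line on stdin reports
# status "green" (column 4) and the expected node total (column 5).
cluster_is_healthy() {
    expected_nodes="$1"
    awk -v n="$expected_nodes" '
        $4 == "green" && $5 == n { ok = 1 }
        END { exit(ok ? 0 : 1) }
    '
}

# Hypothetical helper: print the name of the elected master from _cat/nodes
# output, i.e. the row whose second-to-last column is "*".
elected_master() {
    awk '$(NF-1) == "*" { print $NF }'
}
```

Example: curl -s http://127.0.0.1:9200/_cat/health | cluster_is_healthy 3 && curl -s http://127.0.0.1:9200/_cat/nodes | elected_master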

Re-enabled puppet and ran it, which reverted the changes to /etc/opensearch/datahub/opensearch.yml.
Restarted the service with sudo systemctl restart opensearch_1@datahub.service to make sure that the settings were picked up.
Now most of the icinga alerts associated with port 9200 have resolved as well.

We still have the JVM related alert that was mentioned in T302818 but we can come back to that once we have the prometheus exporter running.

Change 768702 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Move some common resources to the opensearch::server profile

https://gerrit.wikimedia.org/r/768702

Change 768736 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add a profile specific to datahubsearch servers

https://gerrit.wikimedia.org/r/768736

Change 768736 merged by Btullis:

[operations/puppet@production] Add a profile specific to datahubsearch servers

https://gerrit.wikimedia.org/r/768736

Change 768702 merged by Btullis:

[operations/puppet@production] Move some common resources to the opensearch::server profile

https://gerrit.wikimedia.org/r/768702

I've added the firewall rule, so port 9200 is now open to the production networks. Not the analytics vlan, but that's OK.

BTullis moved this task from In Review to Done on the Data-Catalog board.