- 3 ganeti vms, for high availability 2-server quorum
Description
Details
Status | Assigned | Task
---|---|---
Resolved | BTullis | T299910 Data Catalog MVP
Resolved | razzi | T301382 Set up opensearch cluster for datahub
Resolved | razzi | T301383 eqiad: 3 VMs requested for datahub opensearch cluster
Resolved | BTullis | T301458 Define LVS load-balancing for OpenSearch cluster
Resolved | BTullis | T302818 Complete monitoring setup of datahubsearch nodes
Event Timeline
I added datahubsearch as a new prefix for servers here: https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions#Servers
So we can call these servers: datahubsearch100[1-3]
Change 762957 had a related patch set uploaded (by Razzi; author: Razzi):
[operations/puppet@production] analytics_cluster::datahub::opensearch: start of puppet role
Change 762957 merged by Razzi:
[operations/puppet@production] analytics_cluster::datahub::opensearch: start of puppet role
Change 763575 had a related patch set uploaded (by Razzi; author: Razzi):
[operations/puppet@production] datahub::opensearch: Fix sdd typo to be ssd
Change 763575 merged by Razzi:
[operations/puppet@production] datahub::opensearch: Fix sdd typo to be ssd
Icinga downtime set by razzi@cumin1001 for 7 days, 0:00:00 1 host(s) and their services with reason: Node is being set up for first time and puppet run failed
datahubsearch1001.eqiad.wmnet
I ran puppet and got an error installing the opensearch package.
I ran puppet twice to see if it was missing an implicit dependency, but it failed on the second run as well.
Full paste: https://phabricator.wikimedia.org/P20996
Relevant bits
Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install opensearch' returned 100: Reading package lists... Building dependency tree... Reading state information... E: Unable to locate package opensearch
Error: /Stage[main]/Opensearch::Packages/Package[opensearch]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install opensearch' returned 100: Reading package lists... Building dependency tree... Reading state information... E: Unable to locate package opensearch
Error: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install liblogstash-gelf-java' returned 100: Reading package lists... Building dependency tree... Reading state information... E: Unable to locate package liblogstash-gelf-java
Error: /Stage[main]/Opensearch::Packages/Package[liblogstash-gelf-java]/ensure: change from 'purged' to 'present' failed: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold install liblogstash-gelf-java' returned 100: Reading package lists... Building dependency tree... Reading state information... E: Unable to locate package liblogstash-gelf-java
Notice: /Stage[main]/Opensearch::Packages/Package[libjson-simple-java]/ensure: created
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/opensearch/lib (file: /etc/puppet/modules/opensearch/manifests/packages.pp, line: 26)
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/opensearch/lib (file: /etc/puppet/modules/opensearch/manifests/packages.pp, line: 26)
Wrapped exception: No such file or directory @ dir_chdir - /usr/share/opensearch/lib
Error: /Stage[main]/Opensearch::Packages/File[/usr/share/opensearch/lib/logstash-gelf.jar]/ensure: change from 'absent' to 'link' failed: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/opensearch/lib (file: /etc/puppet/modules/opensearch/manifests/packages.pp, line: 26)
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/opensearch/lib (file: /etc/puppet/modules/opensearch/manifests/packages.pp, line: 30)
Error: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/opensearch/lib (file: /etc/puppet/modules/opensearch/manifests/packages.pp, line: 30)
Wrapped exception: No such file or directory @ dir_chdir - /usr/share/opensearch/lib
Error: /Stage[main]/Opensearch::Packages/File[/usr/share/opensearch/lib/json-simple.jar]/ensure: change from 'absent' to 'link' failed: Could not set 'link' on ensure: No such file or directory @ dir_chdir - /usr/share/opensearch/lib (file: /etc/puppet/modules/opensearch/manifests/packages.pp, line: 30)
Info: Class[Opensearch::Packages]: Unscheduling all events on Class[Opensearch::Packages]
Error: Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install elasticsearch-curator=5.8.1' returned 100: Reading package lists... Building dependency tree... Reading state information... W: --force-yes is deprecated, use one of the options starting with --allow instead. E: Version '5.8.1' for 'elasticsearch-curator' was not found
Error: /Stage[main]/Opensearch::Curator/Package[elasticsearch-curator]/ensure: change from 'purged' to '5.8.1' failed: Could not update: Execution of '/usr/bin/apt-get -q -y -o DPkg::Options::=--force-confold --force-yes install elasticsearch-curator=5.8.1' returned 100: Reading package lists... Building dependency tree... Reading state information... W: --force-yes is deprecated, use one of the options starting with --allow instead. E: Version '5.8.1' for 'elasticsearch-curator' was not found
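The errors above repeat the same few root causes many times. As a rough aid for reading a log like this, here is a minimal Python sketch that pulls out only the apt-level failures. The function name summarize_apt_failures is hypothetical, and the two patterns are taken from the messages in this paste rather than from any apt specification.

```python
import re

# Patterns copied from the apt messages in the paste above (an assumption:
# other apt versions or locales may word these differently).
MISSING_PACKAGE = re.compile(r"E: Unable to locate package (\S+)")
MISSING_VERSION = re.compile(r"E: Version '([^']+)' for '([^']+)' was not found")


def summarize_apt_failures(log_text):
    """Return (missing_packages, missing_versions) found in a Puppet run log.

    missing_packages is a sorted list of package names apt could not locate.
    missing_versions is a sorted list of (package, version) pairs.
    """
    packages = sorted(set(MISSING_PACKAGE.findall(log_text)))
    versions = sorted(
        {(name, version) for version, name in MISSING_VERSION.findall(log_text)}
    )
    return packages, versions
```

Run against the paste, this should reduce the output to three items: the opensearch and liblogstash-gelf-java packages are missing, and elasticsearch-curator is not available at version 5.8.1. The link errors are a follow-on effect of the opensearch package not being installed.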
elukey@apt1001:/srv/wikimedia$ sudo reprepro lsbycomponent opensearch
opensearch | 1.2.4 | buster-wikimedia | thirdparty/opensearch1 | amd64
The datahub nodes are Bullseye afaics, so there are no packages in apt for it :)
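To make the distribution mismatch explicit, here is a small Python sketch that checks rows of reprepro output for a given release. The helper names are hypothetical, and the column order (package, version, distribution, component, architectures) is assumed from the single row shown above.

```python
def parse_reprepro_row(line):
    """Split one 'package | version | distribution | component | arch' row.

    The five-column layout is an assumption based on the row in this thread.
    """
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 5:
        raise ValueError(f"unexpected reprepro row: {line!r}")
    keys = ("package", "version", "distribution", "component", "architectures")
    return dict(zip(keys, fields))


def available_for(rows, codename):
    """Return True if any row is published for a distribution of that codename."""
    parsed = [parse_reprepro_row(row) for row in rows if row.strip()]
    return any(entry["distribution"].startswith(codename) for entry in parsed)
```

With the row above, available_for(rows, "buster") is true and available_for(rows, "bullseye") is false, which matches the "Unable to locate package" error on the Bullseye hosts.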
Change 763587 had a related patch set uploaded (by Razzi; author: Razzi):
[operations/puppet@production] datahub::opensearch: Change curator version to 5.8.1-1 for
Change 763587 merged by Razzi:
[operations/puppet@production] opensearch: make curator version bullseye compatible
Change 763815 had a related patch set uploaded (by Razzi; author: Razzi):
[operations/puppet@production] analytics_cluster::datahub::opensearch: Enable syslog transport
Change 763815 merged by Razzi:
[operations/puppet@production] analytics_cluster::datahub::opensearch: Enable syslog transport
Change 763844 had a related patch set uploaded (by Razzi; author: Razzi):
[operations/puppet@production] opensearch: change log4j appender.ship_to_logstash.layout
Change 763844 merged by Razzi:
[operations/puppet@production] opensearch: change log4j appender.ship_to_logstash.layout
The VMs have been created and are all running opensearch 1.2.4!
Side note: puppet needed to be run twice for some reason. As I understand it, puppet should apply everything in a single run. If anybody's interested, I saved the output of the two puppet runs here: https://phabricator.wikimedia.org/P21613 (visibility is wmf-restricted in case I missed some secret output)
Looks good, but I think that we need to sort out the firewall between these hosts.
btullis@datahubsearch1001:/etc/opensearch/datahub$ curl http://127.0.0.1:9200/_cat/health
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}
Looking at the existing logstash nodes, they have a ferm configuration fragment present, which is not present on the datahubsearch hosts.
So we're going to need to work out why profile::opensearch::server isn't generating this file.
@colewhite could you perhaps weigh in on how to get the ferm rules accepting traffic between the elasticsearch nodes?
It looks like opensearch instances are filtered at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/opensearch/server.pp#70 and those nodes are the ones that include the ferm::service class.
One thing I've never figured out how to do is see the value of a puppet variable. Is there an easy way to see what $filtered_instances evaluates to, so we can check whether it's missing the datahubsearch100x nodes? Maybe something like this: https://www.puppetcookbook.com/posts/see-all-client-variables.html
The role probably needs profile::base::firewall. I think the ferm rules on those hosts may have been left over from time spent in role::insetup.
The presence of the old jvm gc checks in icinga indicates the ferm rule should be managed, given their proximity.
Change 767611 had a related patch set uploaded (by Razzi; author: Razzi):
[operations/puppet@production] analytics_cluster::datahub::opensearch: add firewall and base_checks
Cool, thanks @colewhite. I also threw in the ::profile::opensearch::monitoring::base_checks since it seems to implement a simple, useful http check.
Change 767611 merged by Razzi:
[operations/puppet@production] analytics_cluster::datahub::opensearch: add firewall and base_checks
Icinga downtime set by razzi@cumin1001 for 7 days, 0:00:00 3 host(s) and their services with reason: Still having errors setting up opensearch
datahubsearch[1001-1003].eqiad.wmnet
The firewall changes appear to be successful:
razzi@datahubsearch1001:/etc/ferm$ nc -z datahubsearch1002 9300
razzi@datahubsearch1001:/etc/ferm$ echo $?
0
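For checking all three nodes in one go, the nc -z test above can be approximated in Python with a plain TCP connect. This is only a sketch: port_is_open is a hypothetical helper, and like nc -z it confirms that the port accepts connections, not that OpenSearch transport is healthy.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds, like `nc -z`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example use from one of the nodes (hostnames as used in this task):
# for host in ("datahubsearch1001", "datahubsearch1002", "datahubsearch1003"):
#     print(host, port_is_open(host, 9300))
```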
However the health check is still coming up 503:
razzi@datahubsearch1001:/etc/ferm$ curl localhost:9200/_cat/health
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}
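The 503 body above is JSON, whereas a healthy _cat/health call returns a plain-text row. A minimal Python sketch for telling the two apart and extracting the error type could look like this; health_error_type is a hypothetical name, and the JSON shape is assumed from the response pasted in this thread.

```python
import json


def health_error_type(body):
    """Return the OpenSearch error type from a response body, or None.

    A plain-text _cat row is not a JSON object, so it yields None.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        return None
    if not isinstance(payload, dict):
        return None
    error = payload.get("error")
    if isinstance(error, dict):
        return error.get("type")
    return None
```

For the body above this gives master_not_discovered_exception, which points at cluster formation rather than at the firewall.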
I thought maybe restarting opensearch would cause it to properly discover nodes, but strangely got a different error of missing files:
Mar 03 01:37:27 datahubsearch1001 systemd[1]: Starting OpenSearch...
Mar 03 01:37:27 datahubsearch1001 systemd-entrypoint[1856577]: /usr/share/opensearch/bin/opensearch-env: line 89: cd: ${OPENSEARCH_PATH_CONF-/etc/opensearch}: No such file or directory
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]: Exception in thread "main" java.nio.file.NoSuchFileException: /usr/share/opensearch/jvm.options
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]: at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]: at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]: at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]: at java.base/sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:218)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]: at java.base/java.nio.file.Files.newByteChannel(Files.java:375)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]: at java.base/java.nio.file.Files.newByteChannel(Files.java:426)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]: at java.base/java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:420)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]: at java.base/java.nio.file.Files.newInputStream(Files.java:160)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]: at org.opensearch.tools.launchers.JvmOptionsParser.readJvmOptionsFiles(JvmOptionsParser.java:182)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]: at org.opensearch.tools.launchers.JvmOptionsParser.jvmOptions(JvmOptionsParser.java:143)
Mar 03 01:37:29 datahubsearch1001 systemd-entrypoint[1856740]: at org.opensearch.tools.launchers.JvmOptionsParser.main(JvmOptionsParser.java:110)
Mar 03 01:37:29 datahubsearch1001 systemd[1]: opensearch.service: Main process exited, code=exited, status=1/FAILURE
Mar 03 01:37:29 datahubsearch1001 systemd[1]: opensearch.service: Failed with result 'exit-code'.
Mar 03 01:37:29 datahubsearch1001 systemd[1]: Failed to start OpenSearch.
Mar 03 01:37:29 datahubsearch1001 systemd[1]: opensearch.service: Consumed 2.772s CPU time.
OK, at least the missing jvm.options error is explained: I ran sudo systemctl restart opensearch when I should have restarted the opensearch_1@datahub.service unit.
There are a bunch of different errors in the /var/log/opensearch/datahub.log, such as
java.nio.file.NoSuchFileException: /usr/share/opensearch/lib/logstash-gelf.jar
which I wouldn't expect, since we disabled gelf, and
[2022-03-03T01:37:33,175][WARN ][stderr ] [datahubsearch1001-datahub] Exception in thread "main" org.opensearch.bootstrap.BootstrapException: java.nio.file.AccessDeniedException: /var/run/opensearch
[2022-03-03T01:37:33,177][WARN ][stderr ] [datahubsearch1001-datahub] Likely root cause: java.nio.file.AccessDeniedException: /var/run/opensearch
This is all on datahubsearch1001; I haven't touched the others.
I see there's an extra unit on datahubsearch1001:
UNIT LOAD ACTIVE SUB DESCRIPTION
● opensearch.service loaded failed failed OpenSearch
● opensearch_1@datahub.service loaded failed failed OpenSearch (cluster datahub)
opensearch-datahub-gc-log-cleanup.timer loaded active waiting Periodic execution of opensearch-datahub-gc-log-cleanup.service
versus 1002:
razzi@datahubsearch1002:~$ systemctl list-units opensearch
UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed. Pass --all to see loaded but inactive units, too.
To show all installed unit files use 'systemctl list-unit-files'.
razzi@datahubsearch1002:~$ systemctl list-units opensearch'*'
UNIT LOAD ACTIVE SUB DESCRIPTION
opensearch_1@datahub.service loaded active running OpenSearch (cluster datahub)
opensearch-datahub-gc-log-cleanup.timer loaded active waiting Periodic execution of opensearch-datahub-gc-log-cleanup.service
I wonder if that extra unit was created when I ran restart on a nonexistent unit; if so, systemctl continues to surprise...
I downtimed the node for the next week so hopefully this messiness won't bother anybody unless they choose to look into it...
The symlink to the gelf jar is not managed by puppet anymore and has to be removed manually. I did this on datahubsearch1001 and restarted opensearch_1@datahub and observed the logs. The service now starts.
The opensearch debian package creates opensearch.service and Puppet creates opensearch_1@datahub.service. Puppet configures opensearch nodes to be "multi-instance capable" T198351 mirroring what the elasticsearch module does, but does not remove the broken unit. Given this is confusing, it may be worthwhile to have Puppet remove opensearch.service.
Hi Cole!
Which symlink was removed? I checked today and didn't find anything in /usr/share/opensearch/lib (curious)
> The opensearch debian package creates opensearch.service and Puppet creates opensearch_1@datahub.service. Puppet configures opensearch nodes to be "multi-instance capable" T198351 mirroring what the elasticsearch module does, but does not remove the broken unit. Given this is confusing, it may be worthwhile to have Puppet remove opensearch.service.
This could be a good candidate for systemctl mask (there are some examples in the puppet repo, easy and convenient to avoid headaches in my opinion).
The log file for the service had lots of messages like this:
[2022-03-04T00:00:17,947][WARN ][o.o.c.c.ClusterFormationFailureHelper] [datahubsearch1001-datahub] master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and [cluster.initial_master_nodes] is empty on this node: <snip>
Although it's not specific to OpenSearch, this is the most useful reference: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/modules-discovery-bootstrap-cluster.html
The key item is that we haven't defined any initial_master_nodes so the cluster is refusing to bootstrap itself.
Our current manifests don't include this as a configurable variable (ref), so I'm just going to bootstrap the server manually on this occasion.
- I stopped puppet
- I modified the file /etc/opensearch/datahub/opensearch.yml to add the parameter.
btullis@datahubsearch1001:/var/log/opensearch$ grep initial_master_nodes /etc/opensearch/datahub/opensearch.yml
#cluster.initial_master_nodes: ["node-1", "node-2"]
cluster.initial_master_nodes: ["datahubsearch1001-datahub","datahubsearch1002-datahub","datahubsearch1003-datahub"]
- I started the service.
Now this node has elected itself.
[2022-03-04T16:07:37,540][INFO ][o.o.c.s.MasterService ] [datahubsearch1001-datahub] elected-as-master ([3] nodes joined) <snip> cluster UUID set to [XD2SLPtvRKqdo0wphCaXRQ]
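If this bootstrap setting ever becomes a configurable value, the line added by hand above could be generated from the host list. The Python sketch below is illustrative only: initial_master_nodes_line is a hypothetical helper, and the node naming convention of short hostname plus "-" plus cluster name is inferred from the node names in the logs in this task, not from the puppet module.

```python
def initial_master_nodes_line(hosts, cluster_name):
    """Build the cluster.initial_master_nodes line for opensearch.yml.

    Assumes each node is named '<short hostname>-<cluster name>', as the
    node names in the logs above suggest.
    """
    node_names = [f"{host}-{cluster_name}" for host in hosts]
    quoted = ",".join(f'"{name}"' for name in node_names)
    return f"cluster.initial_master_nodes: [{quoted}]"


hosts = ["datahubsearch1001", "datahubsearch1002", "datahubsearch1003"]
print(initial_master_nodes_line(hosts, "datahub"))
```

The setting is only consulted when a brand-new cluster forms for the first time, which is why the puppet run that later reverted the file did not break the running cluster.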
The CAT API calls now return as expected.
btullis@datahubsearch1001:~$ curl http://127.0.0.1:9200/_cat/nodes
10.64.16.45 2 96 0 0.00 0.00 0.00 dimr - datahubsearch1002-datahub
10.64.32.27 13 96 0 0.00 0.00 0.00 dimr - datahubsearch1003-datahub
10.64.0.85 1 98 0 0.03 0.15 0.09 dimr * datahubsearch1001-datahub
btullis@datahubsearch1001:~$ curl http://127.0.0.1:9200/_cat/health
1646410362 16:12:42 datahub green 3 3 true 0 0 0 0 0 0 - 100.0%
I re-enabled puppet and ran it, which reverted the changes to /etc/opensearch/datahub/opensearch.yml
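For a scripted version of this check, the _cat/health row can be parsed by position. This Python sketch assumes the column order seen in the output above (epoch, timestamp, cluster, status, total nodes, data nodes, then further columns); the function names are hypothetical.

```python
def parse_cat_health(line):
    """Parse the leading columns of a plain-text _cat/health row.

    Column positions are assumed from the sample output in this task.
    """
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"unexpected _cat/health row: {line!r}")
    return {
        "cluster": fields[2],
        "status": fields[3],
        "node_total": int(fields[4]),
        "node_data": int(fields[5]),
    }


def cluster_is_healthy(line, expected_nodes):
    """Return True if the cluster is green and has the expected node count."""
    health = parse_cat_health(line)
    return health["status"] == "green" and health["node_total"] == expected_nodes
```

For the row above, cluster_is_healthy(row, 3) holds: the datahub cluster is green with all three nodes joined.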
Restarted the service with sudo systemctl restart opensearch_1@datahub.service to make sure that the settings were picked up.
Now most of the icinga alerts associated with port 9200 have resolved as well.
We still have the JVM related alert that was mentioned in T302818 but we can come back to that once we have the prometheus exporter running.
Change 768702 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Move some common resources to the opensearch::server profile
Change 768736 had a related patch set uploaded (by Btullis; author: Btullis):
[operations/puppet@production] Add a profile specific to datahubsearch servers
Change 768736 merged by Btullis:
[operations/puppet@production] Add a profile specific to datahubsearch servers
Change 768702 merged by Btullis:
[operations/puppet@production] Move some common resources to the opensearch::server profile
I've added the firewall rule, so port 9200 is now open to the production networks. Not the analytics vlan, but that's OK.