Wed, Oct 16
Thu, Oct 10
Wed, Oct 9
@dduvall Thanks! I removed the test stage and also forced devdeps to install. We should definitely look for a better way to handle this later, but it's fine as it is for now.
Currently, the build is passing but not publishing yet. Do we need to enable the CI publish stage for the repo?
@dduvall Thanks. I will implement this.
Thu, Oct 3
This issue has come up again. Currently, only enwiki_content_1546970425 is unassigned. _cluster/allocation/explain reports the error: too many shards allocated to this node for index [enwiki_content_1546970425], index setting [index.routing.allocation.total_shards_per_node=1].
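To make the allocation check above easier to repeat, here is a minimal Python sketch of how the output of _cluster/allocation/explain could be filtered for the deciders that refuse the shard. The shard number, node name, and the trimmed sample response are assumptions for illustration only; they are not taken from the actual cluster.

```python
import json


def build_explain_request(index, shard, primary):
    """Build the JSON body for a _cluster/allocation/explain request on one shard."""
    return json.dumps({"index": index, "shard": shard, "primary": primary})


def blocking_deciders(explain_response):
    """Return (node_name, decider, explanation) for every decider that answered NO."""
    results = []
    for node in explain_response.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                results.append(
                    (node.get("node_name"), decider.get("decider"), decider.get("explanation"))
                )
    return results


# Hypothetical, heavily trimmed response; the node name is made up.
sample_response = {
    "index": "enwiki_content_1546970425",
    "shard": 0,  # assumed shard number
    "primary": False,
    "node_allocation_decisions": [
        {
            "node_name": "example-node",
            "deciders": [
                {
                    "decider": "shards_limit",
                    "decision": "NO",
                    "explanation": "too many shards allocated to this node",
                }
            ],
        }
    ],
}

body = build_explain_request("enwiki_content_1546970425", 0, False)
blocked = blocking_deciders(sample_response)
```

In this sketch, `blocked` lists one entry per node and decider that rejected the shard, which is usually enough to see whether index.routing.allocation.total_shards_per_node is the limiting setting.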
Tue, Oct 1
Post-merge builds seem to fail.
Tue, Sep 24
We should talk to Elastic to see how we can move this forward.
Currently, we require jackson-databind 2.8.11 and jackson-annotations 2.8.11 for JsonLayout to work when using SyslogAppender. Debian provides version 2.8.6 of these packages. We should use the correct version to make sure everything works as expected.
Mon, Sep 23
Fri, Sep 20
Sep 18 2019
Sep 16 2019
Sep 12 2019
@Ladsgroup There's no TLS termination on that port for now. We should have it, and I will work on it in the near future. Please use HTTP for now.
Sep 11 2019
Sep 10 2019
Sep 9 2019
Sep 6 2019
This is a known issue. The SRE team is working on a quick solution to restore these services. Thanks.
JsonLayout requires additional dependencies for log4j, including jackson-databind. See https://logging.apache.org/log4j/2.x/runtime-dependencies.html.
- Rebuild log4j with these dependencies.
- Fall back to shipping logs with PatternLayout.
Sep 4 2019
Not sure, but it seems we are missing some configs in our config.yaml patch.
Sep 3 2019
With profile::rsyslog::udp_localhost_compat, rsyslog JSON parsing requires the @cee token, which must be provided according to the standard. Let's use profile::rsyslog::udp_json_logback_compat instead, as it permits parsing JSON from log4j without the token.
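For context, the difference between the two behaviours can be sketched in a few lines of Python. This only illustrates the "@cee:" cookie convention; it is not how rsyslog or either puppet profile is implemented, and the function name and the `require_cookie` flag are hypothetical.

```python
import json

CEE_COOKIE = "@cee:"


def parse_syslog_json(msg, require_cookie=True):
    """Parse the JSON payload of a syslog message.

    With require_cookie=True the message must start with the "@cee:" cookie.
    With require_cookie=False a bare JSON object is accepted too, which is
    what log4j's JsonLayout emits. Returns a dict, or None if not parseable.
    """
    text = msg.lstrip()
    if text.startswith(CEE_COOKIE):
        text = text[len(CEE_COOKIE):]
    elif require_cookie:
        return None
    try:
        parsed = json.loads(text)
    except ValueError:
        return None
    return parsed if isinstance(parsed, dict) else None


with_cookie = parse_syslog_json('@cee: {"level": "INFO"}')
bare_strict = parse_syslog_json('{"level": "INFO"}')
bare_lenient = parse_syslog_json('{"level": "INFO"}', require_cookie=False)
```

Here `bare_strict` is rejected because the cookie is missing, while `bare_lenient` parses the same message, which mirrors why the second profile suits log4j output.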
Sep 2 2019
Aug 29 2019
On another note, I think this check makes sense for other clusters as well.
elastic1029 is back on icinga showing memory errors. See https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=elastic1029&service=Memory+correctable+errors+-EDAC-
Aug 28 2019
My screenshot from Windows 10/Version 76.0.3809.100 (Official Build) (64-bit)
I changed the priority of this to normal. Feel free to change it as you see fit.
Aug 27 2019
@thcipriani Any update on this? It seems stalled or partially resolved.
What's the latest on this? Do you want to follow up with Nuria?
I'm guessing everyone is happy so I'm going to close this.
Aug 26 2019
Aug 22 2019
Aug 16 2019
@Papaul On second thought, we have other servers and losing one elastic node is OK, so this should be set to normal.
Aug 14 2019
This was traced to some initial problems during osm-initial-script and was resolved by reinitializing osm.
After some conversation with @EBernhardson, we discovered that dumps are currently being loaded into the cloudelastic cluster (https://phabricator.wikimedia.org/T220625), which might be related to the slow response time. There is heavy indexing going on in this cluster (9200), which causes icinga alert requests to time out.
Also, we think this slow response time should not impact users.
This issue is solved for now and cloudelastic checks for all ports have been generated on icinga. However, only IPv4 checks were generated, which is OK for now. If there's a need to generate IPv6 checks, we can always reopen this task.
Aug 13 2019
Aug 12 2019
Aug 9 2019
@jbond Thank you!
Your fix is way better than mine. I will look at the patch now.
Aug 8 2019
Aug 6 2019
@MSantos Thank you!
Postgres reinitialization was performed to bring this slave back up. I'll close this task for now and investigate further if it recurs.
Aug 5 2019
Running select * from pg_stat_wal_receiver; on maps1001 returns an empty result. This means the postgres slave is not receiving updates from the master. Also, the master only shows two nodes connected instead of three:
Sadly, I don't think this will work as the host param will not be unique and icinga does not seem to handle that well. Another option might be to create more CNAMEs or more A-records like we have for git and git-ssh here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/master/templates/wikimedia.org#336
@BBlack Yeah, yeah. I've missed your musings on complex systems. Thanks. I will make a patch.
About cloudelastic resolving to icinga1001: I had jbond help me see where cloudelastic.wikimedia.org resolves to, and it seems to be resolving to the correct IP.
@Vgutierrez We could remove the icinga part of the configuration in the configuration.yaml file and define the checks in lvs::monitor_services instead. I think that should work.