Fri, Oct 23
This is a known issue with the current Logstash configuration and one of the primary drivers behind adopting a Common Logging Schema (T234565).
Added request to the above link.
Thu, Oct 22
Thu, Oct 15
Updated prometheus-rsyslog-exporter deployed to the fleet. If the log message comes up again please let us know.
Wed, Oct 14
Thu, Oct 8
Indeed, there is a bit of delay due to retries and the default retry_interval of 1 (minute), which seems appropriate for most cases.
Found a new upstream and have deployed it to netbox-dev2001 and centrallog1001 to run for a few days. If all checks out, we'll roll it to the rest of the fleet.
Wed, Oct 7
I followed the reproduction steps and did not see the \ufeff or <feff> artifacts appear, either in the Grafana Explore tool or when pasting into the terminal. A few differences though: I'm running Chromium and my locale is en_US.UTF-8.
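For anyone retrying the reproduction, here is a minimal Python sketch for checking whether copied text carries a byte order mark (U+FEFF) and for stripping it. The helper names are hypothetical and are not part of Grafana or any tool mentioned here.

```python
from typing import List

BOM = "\ufeff"  # U+FEFF; rendered as \ufeff or <feff> depending on the tool


def find_bom_offsets(text: str) -> List[int]:
    """Return the index of every U+FEFF character in text."""
    return [i for i, ch in enumerate(text) if ch == BOM]


def strip_bom(text: str) -> str:
    """Remove every U+FEFF character, wherever it appears."""
    return text.replace(BOM, "")


# Hypothetical pasted query with a leading BOM.
pasted = "\ufeffsum(rate(http_requests_total[5m]))"
print(find_bom_offsets(pasted))
print(strip_bom(pasted))
```

Running the pasted text through `find_bom_offsets` before and after each copy step should show where the character gets introduced, if it does at all.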
Tue, Oct 6
Patched mtail rolling out to the fleet this morning. Please let me know if you encounter any related issue.
Mon, Oct 5
As of T256418, we have removed StatsD outputs from Logstash. Prometheus-ES-Exporter accepts an Elasticsearch query and exports metrics based on those queries.
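As a rough illustration of the query-to-metric idea (not the exporter's actual code or configuration format), the Python sketch below turns an Elasticsearch search response into one line of Prometheus text exposition. The metric name and the response dict are made up for the example.

```python
def hits_total(response: dict) -> int:
    """Extract the hit count from an Elasticsearch search response.

    Elasticsearch 7+ reports hits.total as {"value": N, "relation": ...};
    older versions report a bare integer. Both shapes are handled.
    """
    total = response["hits"]["total"]
    if isinstance(total, dict):
        return int(total["value"])
    return int(total)


def to_exposition(metric_name: str, response: dict) -> str:
    """Render the hit count as a Prometheus text-format sample line."""
    return f"{metric_name} {hits_total(response)}"


# Hypothetical response for a query counting recent error logs.
response = {"hits": {"total": {"value": 42, "relation": "eq"}, "hits": []}}
print(to_exposition("logstash_error_hits", response))
```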
This is resolved with the removal of the StatsD outputs from Logstash.
Thu, Oct 1
While debugging https://gerrit.wikimedia.org/r/c/operations/puppet/+/631508, @ssingh uncovered a possible bug in how the Puppet YAML parser handles unquoted string values:
Given the YAML:
profile::wikidough::dnsdist::webserver:
  host: 0.0.0.0
  port: 8083
  acl:
    - '0.0.0.0/0'
    - ::/0
the catalog compiler renders:
Error: Evaluation Error: Error while evaluating a Function Call, Lookup of key 'profile::wikidough::dnsdist::webserver' failed: Value for key 'profile::wikidough::dnsdist::webserver', in hash returned from data_hash function 'yaml_data', when using location '/srv/jenkins-workspace/puppet-compiler/25616/change/src/hieradata/role/common/wikidough.yaml', has wrong type, expects Puppet::LookupValue, got Hash[Enum['acl', 'host', 'port'], Any, 3, 3] (file: /srv/jenkins-workspace/puppet-compiler/25616/change/src/modules/profile/manifests/wikidough.pp, line: 6, column: 51) on node malmok.wikimedia.org
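One way to hunt for similar values elsewhere in the Hiera data is a small lint pass. The Python sketch below flags unquoted list items that begin with a colon, such as ::/0. That the leading colon is what trips the lookup is an assumption, not something confirmed here, and the function name is hypothetical.

```python
import re

# Matches a YAML block-sequence item whose plain (unquoted) scalar starts with ':'.
# Assumption: only simple one-line "- value" items need checking.
UNQUOTED_COLON_ITEM = re.compile(r"^\s*-\s+(:\S*)\s*$")


def find_unquoted_colon_items(yaml_text: str):
    """Return (line_number, value) pairs for unquoted list items starting with ':'."""
    findings = []
    for lineno, line in enumerate(yaml_text.splitlines(), start=1):
        match = UNQUOTED_COLON_ITEM.match(line)
        if match:
            findings.append((lineno, match.group(1)))
    return findings


sample = """\
profile::wikidough::dnsdist::webserver:
  host: 0.0.0.0
  port: 8083
  acl:
    - '0.0.0.0/0'
    - ::/0
"""
print(find_unquoted_colon_items(sample))
```

If the assumption holds, quoting the flagged value ('::/0') would be the first thing to try.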
Wed, Sep 30
Tue, Sep 29
I cannot find any indication that the 400s are originating from our servers, in either the webrequest logs or Turnilo.
Mon, Sep 28
Sep 16 2020
By all means. The patch was generated by a tool, and I applied some manual stylistic formatting you may or may not want. Have a look and do with it what you see fit.
Sep 10 2020
It does not appear to be reproducible today. Will reopen if it comes back.
Sep 9 2020
Sep 7 2020
@jcrespo Thanks for bringing this to our attention. The filters on that dashboard indicate they are broken because the filter pattern logstash-* cannot be found on logstash-next.
Sep 3 2020
I tried this today. It was unable to parse Zayo or Telia new scheduled maintenance emails, but successfully parsed NTT and GTT new scheduled maintenance emails. At this point, the project looks like it would need quite a bit of fixing to fit our use case.
Sep 2 2020
Confirmed access to Icinga fixed via IRC.
Sep 1 2020
Aug 21 2020
Superseded by parent task.
Aug 20 2020
Aug 12 2020
Aug 10 2020
Change to router metrics in service-template-node merged.
Change to heap metrics merged into service-runner/prometheus_metrics branch. Thanks @Pchelolo!
Aug 5 2020
prometheus-icinga-exporter 0.8 deployed
Jul 29 2020
I'm not inclined to upstream the patch. The patch is a terrible, terrible hack that happens to fit our use case(s). It is very likely they would not want it as-is (it adds a dependency) and it might break the file rotation handling feature in subtle ways.
Jul 28 2020
sudo puppet lookup works! Thanks!
Jul 27 2020
utils/hiera_lookup shows me the same error.
Jul 20 2020
We (Observability) decided that we would like to explore querying Elasticsearch directly. It has promise due to its flexibility and gives us a clear option to alert on logs.
Jul 13 2020
Jul 7 2020
+wmf2 has been deployed
Jul 6 2020
Jun 30 2020
This has been cleaned up with the rc35 upgrade. No more instances since Jun 15.
mtail rc35 is now deployed across the fleet
wezen is no longer around and mtail has been upgraded to rc35 across the fleet. This message does not appear to be spamming logs on centrallog.
Jun 29 2020
There is a redis exporter available and installed on the rdb servers. However, there are no instances of the redis exporter configured to export key length/size.
Jun 25 2020
An option we discussed recently was to ingest mail generated by the servers into Logstash by either pulling events from a mailbox or piping off events at the mail servers. Once in ES, queries could be run and aggregated emails generated as a daily report and/or alerts generated via log alerting.
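A minimal sketch of the "pipe mail into Logstash" idea, assuming events would be shipped as JSON: parse a raw message with Python's standard email module and flatten it into a dict. The field names (mail_from, mail_subject, and so on) are placeholders, not an agreed schema, and the sample message is invented.

```python
import json
from email import message_from_string
from email.policy import default


def mail_to_event(raw_message: str) -> dict:
    """Convert a raw RFC 5322 message into a flat, JSON-serializable dict."""
    msg = message_from_string(raw_message, policy=default)
    # Prefer the text/plain part; body is None if the message has none.
    body = msg.get_body(preferencelist=("plain",))
    return {
        "mail_from": str(msg["From"]),
        "mail_to": str(msg["To"]),
        "mail_subject": str(msg["Subject"]),
        "message": body.get_content().strip() if body is not None else "",
    }


# Invented example of server-generated mail.
raw = (
    "From: root@example.org\n"
    "To: ops@example.org\n"
    "Subject: cron failed\n"
    "\n"
    "job exited with status 1\n"
)
print(json.dumps(mail_to_event(raw), sort_keys=True))
```

Once events have this shape, the daily report becomes an aggregation over mail_subject or mail_from rather than a mailbox to read through.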
Jun 24 2020
After installing +wmf2 on logstash1007 CPU usage appears to max around 1.3% as opposed to around 4% on +wmf1.
Jun 23 2020
Jun 19 2020
Jun 15 2020
This sounds a lot like something we identified during the audit phase of T234565: a number of fields are created (and ultimately passed through transparently) that have essentially the same data, just different keys. IIRC, we want to consolidate on the source's timestamp and only provide our own if one is not available.
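The consolidation rule described above could look roughly like this in Python. The candidate field names are illustrative guesses, not the schema from T234565, and the fallback to an existing @timestamp assumes that field holds the pipeline's own ingestion time.

```python
from datetime import datetime, timezone

# Hypothetical source-supplied timestamp keys, in order of preference.
SOURCE_TIMESTAMP_KEYS = ("timestamp", "time", "event_time")


def consolidate_timestamp(event, now=None):
    """Return a copy of event with a single '@timestamp' field.

    The first non-empty source timestamp wins. If the source supplied none,
    keep any existing '@timestamp', and otherwise stamp the event with `now`.
    """
    result = dict(event)
    chosen = None
    for key in SOURCE_TIMESTAMP_KEYS:
        value = result.pop(key, None)  # drop the duplicate key either way
        if chosen is None and value:
            chosen = value
    if chosen is None:
        chosen = result.get("@timestamp") or (now or datetime.now(timezone.utc)).isoformat()
    result["@timestamp"] = chosen
    return result


print(consolidate_timestamp({"time": "2020-06-15T10:00:00Z", "message": "hello"}))
```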
Jun 12 2020
Jun 11 2020
From discussion in -analytics, @dcausse indicated that they are safe to remove.
Jun 10 2020
Jun 8 2020
This issue hasn't resurfaced since disabling fsnotify. Moving forward with the upgrade.
Still a problem, but probably not big enough to warrant the effort.