Fixed now and 'load balancers' dashboard adjusted
I've investigated the scope and impact of this issue a bit, namely by joining the transaction IDs for which swift reported ConnectionTimeout in server.log against swift's proxy-access.log. The idea is to see what swift sent back to ATS and with what latency.
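Roughly the shape of the join, as a sketch only; the transaction-id pattern and field positions below are assumptions, not the real log layout:

```python
#!/usr/bin/env python3
"""Sketch: join swift ConnectionTimeout transaction IDs from server.log
with proxy-access.log to see what swift returned to ATS and how fast.
The txn-id regex and field positions are assumptions, not the real format."""
import re

TXN_RE = re.compile(r"tx[0-9a-f]+")  # assumed transaction-id shape

# Collect transaction IDs that hit ConnectionTimeout in server.log.
timeouts = set()
with open("server.log") as f:
    for line in f:
        if "ConnectionTimeout" in line:
            m = TXN_RE.search(line)
            if m:
                timeouts.add(m.group(0))

# For matching transactions, print status code and latency from proxy-access.log.
# Status as field 8 and latency as the last field are placeholders for the
# actual proxy-access.log columns.
with open("proxy-access.log") as f:
    for line in f:
        m = TXN_RE.search(line)
        if m and m.group(0) in timeouts:
            fields = line.split()
            print(m.group(0), fields[8], fields[-1])
```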
Hosts are fully in service now!
Wed, Dec 4
Tue, Dec 3
We've been working with service owners to fix the obvious offenders in terms of "fields spam" and bumped the fields limit to 2048. We're also alerting on indexing failures when Logstash gets errors from Elasticsearch. ATM only kartotherian bumps into the limit, although that doesn't necessarily mean kartotherian is the "fields spammer" in this case. I'll be following up with a patch to further bump the limit to 4096, which should be plenty to fully ingest all the logs we're producing now.
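For reference, the limit in question is Elasticsearch's index.mapping.total_fields.limit; a minimal sketch of bumping it on an existing index via the settings API (the endpoint and index name are placeholders, the durable change goes into the index template) could look like:

```python
import requests

ES = "http://localhost:9200"      # placeholder endpoint
INDEX = "logstash-2019.12.03"     # placeholder index name

# Raise the per-index field count limit. The real change lives in the
# puppet-managed index template so new indices pick it up; this is only
# the ad-hoc equivalent for an index that already exists.
resp = requests.put(
    f"{ES}/{INDEX}/_settings",
    json={"index.mapping.total_fields.limit": 4096},
)
resp.raise_for_status()
print(resp.json())
```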
Similar message but for errors
Mon, Dec 2
While we're talking about metrics for Java, please consider also adding jmx_exporter (in addition to the native metrics) to CAS's JVM, as we are doing for other JVMs across the fleet in T177197: Export Prometheus-compatible JVM metrics from JVMs in production
AFAICS through the latest rebalances we haven't observed any alerts, possibly also due to using multiple servers per port (T222366)
Fri, Nov 29
Thu, Nov 28
All deployed now, boldly resolving
Wed, Nov 27
Thanks for the in-depth investigation and the numbers @colewhite! Indeed, it looks like we'll need to tweak the Logstash pipeline parameters to >= 1000
Tue, Nov 26
FTR, re: paging on librenms alerts, see this plan: https://phabricator.wikimedia.org/T224888#5690188
First, thank you for getting the ball rolling on this proposal! A question: are all of the proposed approaches targeting group B actions only, or would some also tackle group A? Also, I think it'd be helpful if the approaches (or maybe only the most promising ones?) included an outline of what group B actions will turn into.
Mon, Nov 25
Looks like this is all done, resolving
Status update: I've been working on a dashboard with wattage from sentry3 + sentry4. It has a global stacked graph plus a per-site drilldown: https://grafana.wikimedia.org/d/OBD1jy1Zk/filippo-pdu
Thresholds adjusted for global availability and I've updated "frontend traffic" dashboard
FWIW I'm ok with doing whichever is easiest, IIRC we can ship to kafka first and then add rules to log to a separate file.
AFAIK if dashboards have been migrated then deployment-logstash02 should be ready to be turned off
The cause was indeed appservers latency, resolving in favor of T238939
Sat, Nov 23
Found this task only now, but see also T238973: Appservers rising GET latency might have triggered LVS pages
Fri, Nov 22
Looks good! I won't have time to look into this in depth but I'm happy to help if patches need review
Thu, Nov 21
LGTM so far, thanks @mobrovac for working on this!
Yes we can; if you know the name of the field, we can add an explicit mapping to force its type in modules/profile/files/logstash/elasticsearch-template.json
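To illustrate (the field name and type here are hypothetical, and the exact API path depends on the Elasticsearch version), the same override can also be applied ad hoc via the mapping API; the durable fix is the template file above:

```python
import requests

ES = "http://localhost:9200"      # placeholder endpoint
INDEX = "logstash-2019.11.21"     # placeholder index name

# Force a hypothetical field to a fixed type so documents with a
# conflicting value no longer fail dynamic mapping; the equivalent
# stanza would go into elasticsearch-template.json for new indices.
resp = requests.put(
    f"{ES}/{INDEX}/_mapping",
    json={"properties": {"response_time": {"type": "float"}}},
)
resp.raise_for_status()
```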
Indeed, the file is a Nov 2013 upload; we could also search for it in the archive containers in case it got moved there. Re: finding all orphan files, my understanding is that MediaWiki has maintenance scripts to achieve that, but we don't run them and investigate the results on a regular basis.
This is complete!
Indeed, it looks like Prometheus is trying to fetch conf1004.eqiad.wmnet:2379/metrics with no success. Locally on conf1004, even past the firewall, the endpoint doesn't seem to work:
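Roughly the kind of local check I mean, as a sketch; plain HTTP on the etcd client port is assumed here and may not match the actual listener configuration:

```python
import requests

# Probe the metrics endpoint locally on conf1004; plain HTTP assumed.
try:
    resp = requests.get("http://localhost:2379/metrics", timeout=5)
    print(resp.status_code)
    print(resp.text[:200])
except requests.RequestException as exc:
    print(f"request failed: {exc}")
```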
Indeed, that's prometheus@analytics trying to reach burrow-exporter on port 9000 on the kafkamon hosts; burrow-exporter is listening there, but there's clearly no ferm rule. @Ottomata @elukey is port 9000 a legacy configuration we need to clean up, or is it expected to be working?
Wed, Nov 20
Thanks! I think we should go with (2), i.e. investigate integrating fastnetmon and librenms with icinga (or with grafana alerts, and from there icinga checks), so we get all the niceties like IRC notifications, silence/acknowledge, contact groups, etc.
The spam is back after the centrallog2001 reimage (formerly wezen), now running buster. I've band-aided the issue, but it seems we should try one of the latest mtail releases (cc @colewhite)
Tue, Nov 19
I can confirm that a DELETE of https://grafana.wikimedia.org/api/dashboards/uid/000000599 results in a 403; furthermore, I don't see the request reaching grafana1001's Apache logs. I'm adding Traffic since this looks like a regression, perhaps with ATS involved.
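For anyone wanting to reproduce, roughly this (the API token is a placeholder; the point is that the 403 comes back before the request ever shows up on grafana1001):

```python
import requests

# Attempt the dashboard deletion against the public Grafana endpoint.
# "<api-token>" is a placeholder for a valid Grafana API token.
resp = requests.delete(
    "https://grafana.wikimedia.org/api/dashboards/uid/000000599",
    headers={"Authorization": "Bearer <api-token>"},
)
print(resp.status_code, resp.text)
```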