Dec 10 2019
Declining as these points are covered by the alerting roadmap. Feel free to reopen if needed!
We'll indeed be investigating non-SMS alternatives as a requirement for page escalation; resolving, but please reopen if needed!
This was done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/350555
Dec 9 2019
In T159613#5723632, @Reedy wrote: In T159613#5723053, @fgiunchedi wrote: Looks like this error is HHVM-specific and I couldn't find other occurrences in logstash, ok to resolve and keep investigating T230245?
It isn't HHVM-specific (I'm not even sure it's Monolog-specific, but that's the code where it actually surfaces), but maybe where it was appearing in the logs more frequently was (i.e. in 2017/2018, when it was HHVM).
Certainly, as per T230245#5582062, if you run the script on PHP7 and give it a high enough quantity (i.e. the 10K) you'll be able to get the error too.
With my hacky workaround in place, it's probably not happening now and as such isn't in the logs.
Re: ATS and client timeouts and retries: yes, it seems ATS does retry on origin timeout; otherwise a 504 is returned to the user. For cache_upload there are indeed a few 504s on the backend, but none on the TLS frontend.
We have availability-based alerts now (i.e. 5xx over all status codes) for Varnish and ATS; I believe those can be made paging now, as we haven't seen false positives with the 99.5% (warn) and 99% (crit) thresholds.
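To illustrate what those thresholds mean in practice, here is a minimal sketch of the availability check; the response counts are invented for the example and are not real metrics:

```python
# Hedged sketch of the availability check behind the thresholds above;
# the response counts are invented for the example, not real metrics.
def availability(total_responses: int, server_errors: int) -> float:
    """Fraction of responses that were not 5xx."""
    return (total_responses - server_errors) / total_responses

WARN, CRIT = 0.995, 0.99

avail = availability(total_responses=1_000_000, server_errors=7_500)
if avail < CRIT:
    print(f"CRITICAL: availability {avail:.4f} < {CRIT}")
elif avail < WARN:
    print(f"WARNING: availability {avail:.4f} < {WARN}")
else:
    print(f"OK: availability {avail:.4f}")
```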
I'm wondering if we've seen this behavior again? (i.e. certain icinga changes are not applied on puppet refresh)
@EBernhardson has this been done in the end, and does it show up in dashboards?
I'm boldly declining this task for now as there hasn't been activity and/or other use cases / feature requests. Feel free to reopen if needed!
The original issue is gone, taking over this issue for general cleanup on logstash logs (including GC)
We're alerting on kafka-logging consumer lag now, resolving
We've separated indices now, so this specific error has been resolved; there are of course other logging conflicts still left.
Looks like this error is HHVM-specific and I couldn't find other occurrences in logstash, ok to resolve and keep investigating T230245?
Looks like we're still getting this from time to time on wtp hosts:
Done! packet drops are gone
Dec 5 2019
In T236573#5714723, @akosiaris wrote: For what it's worth, I have no usage of these machines nor of the project.
In T174432#3565169, @ema wrote: In T174432#3562830, @BBlack wrote: Are the non-ICMP graphs somehow LVS-specific?
Yes, the metrics are: node_ipvs_backend_connections_active, node_ipvs_incoming_packets_total, node_ipvs_incoming_bytes_total. The icmp graph instead plots node_netstat_Icmp_InMsgs.
The text panel @fgiunchedi added is correct, so I guess that should be enough to clarify the ambiguity? Alternatively, we could move the ICMP graphs to a new dashboard with host-specific metrics only.
Fixed now and 'load balancers' dashboard adjusted
I've investigated the scope and impact of this issue a bit, namely by joining the transaction IDs for which Swift reported ConnectionTimeout in server.log with swift proxy-access.log. The idea is to see what Swift sent back to ATS and with what latency.
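Roughly, the join looked like the sketch below (hedged: the transaction-ID regex and the proxy-access.log field positions are assumptions for illustration, not the exact log formats):

```python
# Hedged sketch of the join described above: collect the transaction IDs
# that hit ConnectionTimeout in server.log, then print the matching
# proxy-access.log entries to see the status and latency sent back to ATS.
# The txid regex and field positions are assumptions for illustration.
import re

timeout_txids = set()
with open("server.log") as f:
    for line in f:
        if "ConnectionTimeout" in line:
            m = re.search(r"\(txn: (tx[0-9a-f-]+)\)", line)  # assumed txid format
            if m:
                timeout_txids.add(m.group(1))

with open("proxy-access.log") as f:
    for line in f:
        fields = line.split()
        if len(fields) < 10:
            continue
        txid, status, latency = fields[-3], fields[8], fields[-1]  # assumed columns
        if txid in timeout_txids:
            print(txid, status, latency)
```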
Hosts are fully in service now!
Dec 4 2019
In T239805#5713046, @Papaul wrote: @fgiunchedi the 10G NIC is dead
Option 1: replace the server with another server
https://netbox.wikimedia.org/dcim/devices/1099/
Option 2: buy another 10G NIC
Dec 3 2019
We've been working with service owners to fix the obvious offenders in terms of "fields spam" and bumped the fields limit to 2048. We're also alerting on indexing failures when Logstash gets errors from Elasticsearch. ATM only kartotherian bumps into the limit, although that doesn't necessarily mean kartotherian is the "fields spammer" in this case. I'll be following up with a patch to further bump the limit to 4096, that should be plenty to fully ingest all logs we're producing now.
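For reference, the limit in question is Elasticsearch's index.mapping.total_fields.limit. A minimal sketch of bumping it to 4096 follows; the index pattern and endpoint are placeholders, and in practice the change goes through the puppetized index template rather than an ad hoc call:

```python
# Hedged sketch: bump index.mapping.total_fields.limit to 4096 on the
# logstash indices. The index pattern and endpoint are placeholders;
# the real change is made via the puppetized index template.
import json
import urllib.request

settings = {"index": {"mapping": {"total_fields": {"limit": 4096}}}}
req = urllib.request.Request(
    "http://localhost:9200/logstash-*/_settings",
    data=json.dumps(settings).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```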
In T189333#5645365, @EBernhardson wrote: In T189333#5488005, @Krinkle wrote: In T189333#5483346, @fgiunchedi wrote: In T189333#5481492, @Krinkle wrote: I re-ran my analysis today, and oddly enough the total number of fields is not only similar but equal to the number of fields there were three months ago. Currently at 7,665 table columns.
That's indeed unexpected, can you share how you are doing the analysis/pulling the field names?
- Open a Logstash dashboard in a Chromium browser, and open the Dev Tools.
- Edit or create a filter bubble in the Kibana UI, and open the channel dropdown.
- Then, from the Console tab in Dev Tools, execute copy($$('ul.uiSelectChoices--autoWidth.ui-select-dropdown')[0].textContent)
This queries the DOM for the <ul> node that represents the channel dropdown menu, then uses textContent (which recursively aggregates and concatenates the textual content of all child list items), and copies it to your clipboard.
Then paste it in a text editor, use some method of removing empty lines, and count them :)
The more direct place to get this information is to click the Management (gear/cog) link in the sidebar and select Index Patterns. This will report all the fields Kibana knows about, along with counts. Today it lists 11091 fields. I'm not sure when exactly this metadata updates, or if it's real time. The refresh button, which gives a big warning about resetting popularity counters, suggests it might not auto-update? We can compare to the actual indices with a bit of jq magic, but it would take a bit to work up.
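As a hedged sketch of what that comparison against the actual indices could look like (the endpoint is a placeholder and the ES 6/7 mapping-layout handling is an assumption, not the exact jq approach mentioned above):

```python
# Hedged sketch: count leaf fields per index straight from the mappings,
# roughly what the "bit of jq magic" above would do. The endpoint is a
# placeholder and the ES 6/7 mapping-layout handling is an assumption.
import json
import urllib.request

def count_leaf_fields(properties: dict) -> int:
    total = 0
    for field in properties.values():
        if "properties" in field:                  # object field: recurse
            total += count_leaf_fields(field["properties"])
        else:                                      # leaf field with a concrete type
            total += 1
        total += len(field.get("fields", {}))      # multi-fields count too
    return total

with urllib.request.urlopen("http://localhost:9200/logstash-*/_mapping") as resp:
    mappings = json.load(resp)

for index, body in mappings.items():
    m = body["mappings"]
    # ES 7 exposes "properties" directly; ES 6 nests them under a doc type
    props = m.get("properties") or next(iter(m.values()), {}).get("properties", {})
    print(index, count_leaf_fields(props))
```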
Similar message but for errors
In T234854#5708171, @elukey wrote: Hello! I took the liberty to ack a lot of criticals/unknowns in Icinga that were related to these new hosts, IIUC these are not in production :)
Dec 2 2019
While we're talking about metrics and such for Java, please consider also adding jmx_exporter (in addition to the native metrics) to CAS' JVM, as we are doing for other JVMs across the fleet in T177197: Export Prometheus-compatible JVM metrics from JVMs in production.
In T151009#5704732, @jbond wrote: I'm tempted to add this directly to Apereo CAS (time permitting), however I'm curious what you had in mind for the service domain names, considering we need one each for codfw and eqiad?
Something like:
https://prometheous.codfw.wikimedia.org/ https://prometheous.eqiad.wikimedia.org/ or did you have something else in mind?
AFAICS through the latest rebalances we haven't observed any alerts, possibly also due to using multiple servers per port (T222366)
Nov 29 2019
Nov 28 2019
All deployed now, boldly resolving
Nov 27 2019
Thanks for the in-depth investigation and the numbers, @colewhite! Indeed it looks like we'll need to tweak the Logstash pipeline parameters to >= 1000.
Nov 26 2019
FTR, re: paging on librenms alerts, see this plan: https://phabricator.wikimedia.org/T224888#5690188
In T224888#5693759, @CDanis wrote: Any preferences or thoughts re: the special tag? Right now I'm leaning towards #page as that seems the most self-explanatory.
First, thank you for getting the ball rolling on this proposal! A question: are all the proposed approaches targeting group B actions only, or would some approaches also tackle group A? Also, I think it would be helpful if the approaches (maybe only the most promising ones?) had an outline of what group B actions will turn into.
In T224888#5690188, @CDanis wrote: I've a proposal for doing this:
- Add some special tag like #NRPE or #page to the names of any LibreNMS alert rules we'd like to make page. For our purpose here this would just be #6 Primary outbound port utilisation over 80% and #25 Primary inbound port utilisation over 80%.
- In a Python NRPE check (see the sketch after this proposal):
- query the API's list of alert rules looking for names with this tag and collect those rule IDs https://docs.librenms.org/API/Alerts/#list_alert_rules
- query the list of state=alerting and status=critical alerts https://docs.librenms.org/API/Alerts/#list_alerts (query params state=1&severity=critical) and then filter alerts based on the above list of rule IDs
- return CRITICAL if any of those are found, UNKNOWN on any scrape errors, OK otherwise
This will prevent turning any LibreNMS critical into a page for the whole team (e.g. the currently-firing "Sensor over limit" for cr3-esams). It will also mean that ACKing alerts within LibreNMS does the right thing. And it makes it fairly straightforward to add/remove alert rules from the set that pages the team.
SGTU?
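A minimal sketch of such a check, using the documented /rules and /alerts endpoints linked above; the LibreNMS base URL, token handling, and tag name are placeholders rather than production values:

```python
#!/usr/bin/env python3
# Hedged sketch of the proposed check: look up LibreNMS alert rules whose
# name carries the paging tag, then report CRITICAL if any critical alert
# for those rules is currently firing. Base URL, token handling and tag
# name are placeholders, not the production values.
import sys
import requests

LIBRENMS = "https://librenms.example.org/api/v0"
HEADERS = {"X-Auth-Token": "REDACTED"}
TAG = "#page"

try:
    rules = requests.get(f"{LIBRENMS}/rules", headers=HEADERS, timeout=10).json()["rules"]
    paging_rule_ids = {r["id"] for r in rules if TAG in r["name"]}

    alerts = requests.get(
        f"{LIBRENMS}/alerts",
        params={"state": 1, "severity": "critical"},
        headers=HEADERS,
        timeout=10,
    ).json()["alerts"]
    firing = [a for a in alerts if a["rule_id"] in paging_rule_ids]
except Exception as exc:
    print(f"UNKNOWN: error querying the LibreNMS API: {exc}")
    sys.exit(3)

if firing:
    print("CRITICAL: paging LibreNMS alert rule(s) firing: "
          + ", ".join(str(a["rule_id"]) for a in firing))
    sys.exit(2)
print("OK: no paging LibreNMS alerts firing")
sys.exit(0)
```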
@Cmjohnson @Jclark-ctr I'm not blocked on this (thus not reassigning), but ms-be1059 is in row D judging by its IP address, while Netbox says row C. I believe Netbox will need updating.
In T237438#5690914, @Cmjohnson wrote: @fgiunchedi These are ready for you for implementation. I removed the ops-eqiad tag. If you have an issue please assign it to me and add the ops-eqiad tag back.