Has been fixed since
Mar 30 2020
Not a bug, timezone is displayed
We're moving Prometheus onto its own dedicated hosts everywhere; I see no reason not to leave the current entry point as it is for now (also, we moved to Apache in the meantime).
With https://gerrit.wikimedia.org/r/580985 merged I'm resolving this task, since check latency is doing better now and we're alerting on excessive latency.
We have https://github.com/wrouesnel/postgres_exporter deployed on the maps hosts; I believe some or all of the metrics you are looking for are available in Grafana/Prometheus. You can get a preview of those from the host itself if you wish with curl -s localhost:9187/metrics, or use Grafana's "explore" function (while logged in). The metrics start with pg_. Hope that helps!
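For example, a quick way to list them on the host (a rough sketch; the 9187 port and the pg_ prefix are as noted above):

  # show the first few PostgreSQL metrics exposed by postgres_exporter
  curl -s localhost:9187/metrics | grep -E '^pg_' | sort | head -n 20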
Mar 26 2020
Lowering priority as I believe things are better now, pending https://gerrit.wikimedia.org/r/c/operations/puppet/+/580985 as the last attempt at lowering check latency further.
Thanks @Andrew! Appreciate it.
Mar 24 2020
Myself, @akosiaris, @colewhite and @Ottomata met today to bikesh^W better understand what service means and other related labels.
Mar 23 2020
Mar 20 2020
Summary of the IRC chat: the current batch of uploads is about halfway finished and will likely be done by early next week, although no byte-size estimates are available. Bots don't seem to have upload rate limits enforced (thanks @Reedy), which I filed as T248177. The one-file-per-page approach is fine as is; depending on the source we do get cases like that.
In T248151#5986783, @Dominicbm wrote: Hi, this is me! 😳 If it's easier, I can get on Telegram or IRC to chat with you about my project. Obviously, I've been going at a high rate, but I don't really want to break Wikimedia!
The flapping seems to have started with the latest version of the exporter AFAICS (around March 18, end of UTC day), so maybe that's a lead too? Also, it happens exclusively in codfw, which I'm assuming is due to the periodic Icinga restarts/reloads we do there.
Mar 19 2020
Not relevant anymore as we're dialing down our graphite usage across the board
Docs have been expanded and are available at https://wikitech.wikimedia.org/wiki/Graphite
Resolving since we have significantly lessened the UDP traffic load
Declining as we haven't been experiencing this problem anymore (fewer dashboards on Graphite)
See also T119719 when Grafana 6.7 is released
GH issue is resolved, and the feature will be available in Grafana 6.7: https://github.com/grafana/grafana/blob/master/CHANGELOG.md#670-beta1-2020-03-12
In T247820#5977711, @colewhite wrote: Good idea forking the original task. Thanks for that!
Mar 18 2020
Top 50 checks as of today, with a little longer time horizon than the previous audit
The new baseline in eqiad for average check latency is ~70s, which isn't great IMHO but is certainly better. Short of deploying more powerful hardware, I think we can keep looking for low-hanging fruit and reduce the load in terms of checks that Icinga needs to run.
Mar 17 2020
In T247538#5975450, @gerritbot wrote: Change 580327 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: relax interval for selected checks
Fix is deployed, looking good!
I've bumped the limits for squid on install1003 and things look good now; the permanent fix is in https://gerrit.wikimedia.org/r/580296
Mar 16 2020
Thanks for the context on service, @akosiaris; it is now much clearer in my mind what the status quo is. In the interest of compatibility and time (and picking our battles) I'd say let's go ahead and keep service as it is; the whole conversation is certainly interesting, but definitely for another task.
This is complete (i.e. T242609: Move thumbor to the logging pipeline), resolving. Feel free to reopen though!
In T238658#5964111, @Ottomata wrote: Perhaps using a single metric name, e.g. 'express_router_request_duration_seconds', for all services is a bad idea? Maybe these should be named per service instead?
The metric won't be a single one for all services though; when Prometheus pulls from k8s services it'll attach tags
By tags do you mean labels, or is this a Prometheus thing I don't know about?! My understanding is that the metric name here is 'express_router_request_duration_seconds', and every service-runner based app will emit a metric with the same name.
@fgiunchedi my suggestion would be to use a per-service metric name for this, instead of one shared by all services that happen to use express. I currently have eventstreams_connected_clients, so this would be eventstreams_request_duration_seconds with path-specific labels.
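To make the two options concrete, here is a hedged sketch of how each would be queried (PromQL via the Prometheus HTTP API; localhost:9090, the service label name and the path value are assumptions for illustration):

  # shared metric name, service selected via a label attached at scrape time
  curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=rate(express_router_request_duration_seconds_count{service="eventstreams"}[5m])'
  # per-service metric name, as suggested above
  curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=rate(eventstreams_request_duration_seconds_count{path="/v2/stream"}[5m])'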
Mar 10 2020
I took a look at this on both prometheus200[34] for up{instance=~"elastic2055.*9108"}: the metric appears yesterday on 2004 at 9:44 and on 2003 at 17:48, whereas both target files (/srv/prometheus/ops/targets/elasticsearch_codfw.yaml) were last modified at 10:10. So 2003 definitely lagged behind and eventually discovered the target, for reasons yet to be determined.
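For reference, the same check can be reproduced on each host along these lines (a sketch; the API endpoint/port below is the Prometheus default and will differ on the production setup):

  # is the target currently known and up?
  curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=up{instance=~"elastic2055.*9108"}'
  # when was the targets file last modified?
  stat -c '%y %n' /srv/prometheus/ops/targets/elasticsearch_codfw.yaml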
In T226986#5954880, @Krinkle wrote: I've saved this as a dashboard called mw-client-errors and linked it from the Kibana homepage.
Mar 9 2020
Not sure if there's a more specific Python 3 + Thumbor task, but the alpha version of Thumbor ships with Python 3 support: https://github.com/thumbor/thumbor/releases/tag/7.0.0a2
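For anyone wanting to try it, something along these lines should work (assuming the 7.0.0a2 pre-release is published on PyPI):

  # install the Thumbor 7 alpha on Python 3; --pre allows pre-release versions
  pip3 install --pre thumbor==7.0.0a2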
In T238658#5953276, @akosiaris wrote: It does have its usefulness. As I pointed out, gauges have the problem that you will never get insights into events that last less than the current polling period (60s currently). Counters have the capability to expose that. But your point about the long-lived connections is correct. I'd say keep both?
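To illustrate that point (hypothetical metric names, Prometheus assumed at its default localhost:9090): increase() over a counter still accounts for connections that opened and closed entirely between two scrapes, while the equivalent gauge only shows whatever happened to be open at each 60s poll.

  # counter: captures short-lived events between scrapes (hypothetical metric name)
  curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=increase(connections_opened_total[5m])'
  # gauge: a point-in-time snapshot at each scrape (hypothetical metric name)
  curl -sG http://localhost:9090/api/v1/query \
    --data-urlencode 'query=connections_open'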
Similarly, the GC log rotation config (OpenJDK 11 on Buster) doesn't seem to work:
It doesn't look like the current log4j config is working as intended:
Stalling since we'll be piggybacking on Proton (and mobileapps) moving to k8s, and thus the logging pipeline. See also T219924 for the full discussion.