Optimistically resolving now that the checklist in the task description is complete.
Wed, Nov 29
Tue, Nov 28
In T349159#9360246, @lmata wrote: can you help me identify what is relevant to:
- Transfer Grafana alerts (AlertManager).
...I'm hoping we may be able to close this task with your latest patch.
In T352128#9362457, @Volans wrote: As we got an email from VO about unassigned overrides I think that the issue here is that only one rotation was assigned and not the one that actually pages:
Mon, Nov 27
Thu, Nov 16
One option that comes to mind is relabeling with something like labelkeep to ingest only the labels we want/need on the prometheus side. That'd let us cut down without modifying the framework at the source.
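A minimal sketch of what that could look like on the scrape job, assuming the labelkeep action in metric_relabel_configs (the label names in the regex are placeholders, not a vetted allowlist):

metric_relabel_configs:
  # labelkeep drops every label whose name does NOT match the regex;
  # __name__ and le need to stay or metric names and histograms break
  - action: labelkeep
    regex: (__name__|job|instance|le|quantile)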
Wed, Nov 15
Prometheus1005 is an R440, which should have 10 total 2.5" bays; today there are (6) 2TB SSDs installed. I think it'd be worth getting the ball rolling on adding another (4) SSDs.
Mon, Nov 13
Thu, Nov 9
I spent some time today experimenting with https://github.com/grafana/cortex-tools, specifically cortextool analyse grafana, which looked promising. Unfortunately it throws parse errors when it encounters a period in the metric name, which makes it unsuitable for Graphite metrics.
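For the record, the invocation was along these lines (address and API key are placeholders):

cortextool analyse grafana --address https://grafana.wikimedia.org --key <grafana-api-key>

It writes a JSON summary of the metrics referenced by dashboards, which cortextool analyse prometheus can then compare against what's actually ingested; it's the dot-separated Graphite metric names in dashboards that trip up the parser.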
Wed, Nov 8
I've had success mounting non-root filesystems that were unreliable (networked fs, external arrays, these kinds of things) using autofs, which these days can be done in systemd.
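For illustration, the fstab form of that with systemd (paths, fs type, and timeout are made up):

# x-systemd.automount generates an automount unit, so the filesystem is mounted
# on first access and, with idle-timeout, unmounted again when it goes unused
remote.example.org:/export  /mnt/unreliable  nfs  noauto,x-systemd.automount,x-systemd.idle-timeout=300  0  0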
Tue, Nov 7
Thanks for the input everyone! Sounds like we have a consensus on option 1. I'll get started with rolling collector VM reboots into 12GB of memory, then upload a patch for the JVMs and go from there.
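For reference, the JVM side of that boils down to bumping the Logstash heap, i.e. something like the following in jvm.options (values are placeholders, the real change goes through puppet):

-Xms6g
-Xmx6g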
Mon, Nov 6
In T350434#9308048, @fgiunchedi wrote: Would we have enough Ganeti resources to bump all the VMs to 12GB?
Fri, Nov 3
Thu, Nov 2
Personally I'm for doing option 1 right away, seeing how we handle the next log spike(s), and going from there. One of the main upsides from my view is that this option changes the fewest things: essentially only the logstash JVM size, as opposed to shuffling services around or introducing new host variants.
After a rolling restart of apache2, CPU utilization is down to essentially 0. Resolving for now; will investigate further if CPU utilization rises again.
In T350402#9302137, @Stashbot wrote: Mentioned in SAL (#wikimedia-operations) [2023-11-02T14:56:17Z] <herron> logstash1025 systemctl restart apache2.service T350402
Oct 31 2023
MW statsd-exporter instances are being scraped by production prometheus
With the above patch, statsd_exporter has been deployed to mw hosts via puppet.
In T344751#9293584, @lmata wrote: In T344751#9292738, @gerritbot wrote: Change 954114 merged by Herron:
[operations/puppet@production] profile::mediawiki::common: set default histogram buckets
Does this mean we're decided? 😃
Oct 27 2023
Oct 23 2023
promtool tsdb create-blocks-from rules is looking promising so far. Here's an SLO recording rule that was deployed 2 days ago with history going back 12 weeks:
The initial approach I'm thinking of is running promtool tsdb create-blocks-from rules against the recording rules created by pyrra (since they are new and have no production dependencies yet) and output that into an unused scratch directory. Then transfer that over to pontoon and experiment with loading/using them, and go from there. Sound alright?
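Concretely, something along these lines (dates, URL, paths, and rule file are placeholders; flag names should be double-checked against the promtool help):

promtool tsdb create-blocks-from rules \
  --start 2023-07-31T00:00:00Z \
  --end 2023-10-23T00:00:00Z \
  --url http://localhost:9090 \
  --output-dir /srv/scratch/pyrra-backfill \
  pyrra-rules.yaml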
Most promising option at the moment looks like https://prometheus.io/docs/prometheus/latest/command-line/promtool/#promtool-tsdb-create-blocks-from-rules
Oct 19 2023
Thank you, this has been useful already!
Oct 18 2023
Oct 17 2023
Perfect, I've updated https://gerrit.wikimedia.org/r/c/operations/puppet/+/954114 to reflect this and I think with a +1 from @Krinkle we'll be good to go.
Oct 16 2023
Thanks, SGTM overall; I'll propose just one amendment.
10:52 AM <+jinxer-wm> (RedisMemoryFull) resolved: Redis memory full on arclamp1001:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_arclamp - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_arclamp&var-instance=arclamp1001:9121&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
Currently maxmemory is set to 1Mb
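For reference, that can be confirmed (and bumped for testing) with redis-cli; a persistent change would need to go through puppet:

redis-cli -h arclamp1001.eqiad.wmnet CONFIG GET maxmemory
redis-cli -h arclamp1001.eqiad.wmnet CONFIG SET maxmemory 512mb   # value illustrative; auth/port flags may be needed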
In T348756#9246345, @hashar wrote: We have a Redis dashboard but arclamp1001.eqiad.wmnet is not collected there.
Oct 13 2023
Above is a patch for initial audit logging of POST data via modsec. Once we have some example data to work with, we can refine the rules to log more human-readable entries.
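For context, the shape of the modsec configuration involved is roughly as follows (directive values are illustrative, not the exact patch):

SecRequestBodyAccess On
SecAuditEngine RelevantOnly
# include part I so the (non-multipart) request body, i.e. POST data, lands in the audit log
SecAuditLogParts ABIJZ
SecAuditLog /var/log/apache2/modsec_audit.log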
In T341606#9248134, @BCornwall wrote: @herron Thanks for all of your help. We've implemented varnish_sli_bad. I followed the formulae presented at the top of grafana-grizzly's slo_definitions.libsonnet and got these results. They seem to differ from the current dashboard's values, however. Would you be kind enough to double-check to make sure I'm not missing anything? Thank you. :)
Oct 11 2023
initial inclination/impressions:
Done!
Oct 6 2023
When I raised the idea of using curator for jaeger earlier this year, the thinking then was essentially that not managing these indices in curator was a feature.
Oct 3 2023
That's been sorted: RECOVERY - Check systemd state on alert1001 is OK: OK - running: The system is fully operational
Oct 2 2023
We (o11y) have moved pyrra to the titan hosts, which makes this task moot. Transitioning to invalid
Sep 27 2023
Sep 25 2023
build2001:~$ sudo -i docker-registryctl delete-tags docker-registry.discovery.wmnet/dispatch We're about to delete the following tags for image docker-registry.discovery.wmnet/dispatch: latest v20220801-1-20220821 v20220801-1-20220828 v20220801-1-20220904 v20220801-1-20220911 v20220801-1-20220918 v20220801-1-20220925 v20220801-1-20221009 v20220801-1-20221016 v20220801-1-20221023 v20220801-1 v20220915-1-20221030 v20220915-1 v20220915-2-20221106 v20220915-2-20221113 v20220915-2 v20220915-3-20221120 v20220915-3-20221127 v20220915-3-20221204 v20220915-3-20221211 v20220915-3-20221218 v20220915-3-20221225 v20220915-3-20230101 v20220915-3-20230108 v20220915-3-20230115 v20220915-3-20230122 v20220915-3-20230129 v20220915-3-20230205 v20220915-3-20230212 v20220915-3-20230305 v20220915-3-20230312 v20220915-3-20230319 v20220915-3-20230326 v20220915-3-20230402 v20220915-3-20230409 v20220915-3-20230416 v20220915-3-20230423 v20220915-3-20230430 v20220915-3-20230507 v20220915-3-20230514 v20220915-3-20230521 v20220915-3-20230528 v20220915-3-20230604 v20220915-3-20230611 v20220915-3-20230612 v20220915-3-20230618 v20220915-3-20230625 v20220915-3-20230702 v20220915-3-20230716 v20220915-3-20230723 v20220915-3-20230730 v20220915-3-20230806 v20220915-3-20230813 v20220915-3-20230820 v20220915-3-20230827 v20220915-3-20230903 v20220915-3-20230910 v20220915-3-20230917 v20220915-3-20230924 v20220915-3 Ok to proceed? (y/n)y
GCP: Project "Dispatch" is now shut down and scheduled to be deleted after Oct 25, 2023.
Sep 21 2023
In T346144#9185491, @RLazarus wrote: I think it wouldn't even need to be "make editable," this only changes the default range for the time picker, right? So you can still use the time picker to punch in the dates of other quarters. Not as clean as having a list to pick from, but no worse than it is now.
Had a quick look at the current file limits for thanos-rule on titan2001; I'm seeing 524k as the current limit.
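For reference, checked along these lines (unit and process names assumed):

systemctl show thanos-rule.service -p LimitNOFILE
cat /proc/$(pgrep -f 'thanos rule' | head -1)/limits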
Sep 20 2023
Sep 12 2023
+1 for trying this. Thinking out loud:
Sep 1 2023
Uploaded the above to get the ball rolling on a patch. As a starting point it essentially borrows the values used for the benthos mw accesslog: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 20, 30, 60]
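In statsd_exporter mapping-config terms that amounts to something like the following sketch (key names vary between exporter versions, so treat it as illustrative):

defaults:
  observer_type: histogram
  histogram_options:
    buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 20, 30, 60]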
Aug 31 2023
Aug 30 2023
Aug 29 2023
Yes, stalling is fine. The original reason for the switch to cfssl was related to adding a SAN to the thanos-fe certificate. That shouldn't be blocked, since we can still use cergen for the time being.
In T326657#9126098, @Clement_Goubert wrote: Hi,
Just FYI, JS and CSS are currently broken on prometheus-{eqiad,codfw}.wikipedia.org due to 401 and 403 errors, with some CORS sprinkled in
Aug 25 2023
@BCornwall FWIW switching from "sli good" to "sli bad" does have the above in mind, namely by working with the small margin of error (by switching the calculation to a bad and total metric) instead of against it (attempting to maintain identical good and total metric values). That'd be actionable in the near term and would avoid the negative SLI, with the exception of edge cases where haproxy is serving 100% errors. With that said, looping in @colewhite and @fgiunchedi for their thoughts and potential alternatives.
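To make that concrete, the two forms in PromQL terms (metric names other than varnish_sli_bad are stand-ins):

# good-based: the error ratio is the small difference of two large numbers, so a
# tiny good/total mismatch can push it negative
1 - sum(rate(varnish_sli_good[5m])) / sum(rate(varnish_sli_total[5m]))
# bad-based: the same mismatch only perturbs the already-small numerator
sum(rate(varnish_sli_bad[5m])) / sum(rate(varnish_sli_total[5m]))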
Aug 24 2023
@lmata could you please confirm if/when you're ready to proceed with decom of the dispatch infra?
In T313229#9117902, @BCornwall wrote: Closing as dispatch has been ruled out as an option: See T308467 for follow-up discussion of where we're going.