Tue, Sep 12
+1 for trying this. Thinking out loud:
Fri, Sep 1
Uploaded the above to get the ball rolling on a patch. As a starting point it is essentially borrowing the values used for benthos mw accesslog [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 20, 30, 60]
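For illustration, a minimal sketch of how an observed value would land in the bucket list above. It assumes Prometheus-style cumulative `le` (less-than-or-equal) histogram semantics, which the comment does not state. The helper names `bucket_for` and `cumulative_counts` are hypothetical and not part of the patch.

```python
from bisect import bisect_left

# Bucket upper bounds copied from the comment above.
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 15, 20, 30, 60]


def bucket_for(value, buckets=BUCKETS):
    """Return the smallest upper bound that is >= value, or +Inf if none is.

    Hypothetical helper: assumes "value <= le" bucket semantics.
    """
    i = bisect_left(buckets, value)
    return buckets[i] if i < len(buckets) else float("inf")


def cumulative_counts(observations, buckets=BUCKETS):
    """Count observations per bucket cumulatively (each bucket includes all smaller ones)."""
    counts = {le: 0 for le in buckets}
    counts[float("inf")] = 0
    for obs in observations:
        for le in counts:
            if obs <= le:
                counts[le] += 1
    return counts
```

With this list, anything above 60 only shows up in the implicit +Inf bucket, so the top bound is worth a second look if larger values are expected.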
Thu, Aug 31
Wed, Aug 30
Tue, Aug 29
Yes, stalling is fine. The original reason for the switch to cfssl was related to adding a SAN to the thanos-fe certificate. That shouldn't be blocked since we can still use cergen for the time being.
Fri, Aug 25
@BCornwall fwiw switching from "sli good" to "sli bad" does have the above in mind, namely by working with the small margin of error (switching the calculation to a bad and total metric) instead of against it (attempting to maintain identical good and total metric values). That'd be actionable in the near term and would avoid the negative sli, with the exception of edge cases where haproxy is serving 100% errors. With that said, looping in @colewhite and @fgiunchedi for their thoughts and potential alternatives.
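To make the arithmetic concrete, here is a rough sketch of one plausible reading of the two formulations. The function names and the counter values in the usage note are made up for illustration; they are not taken from the actual haproxy metrics or recording rules.

```python
def error_ratio_from_good(good, total):
    """'sli good' style: derive the error ratio as 1 - good/total.

    Goes negative whenever skew between the two counters leaves good > total.
    """
    return 1 - good / total


def error_ratio_from_bad(bad, total):
    """'sli bad' style: measure the error ratio directly as bad/total.

    Cannot go negative for non-negative counters.
    """
    return bad / total


def availability_from_bad(bad, total):
    """Availability under the 'sli bad' style.

    Only goes negative if bad > total, i.e. near 100% errors plus counter skew.
    """
    return 1 - bad / total
```

For example, with illustrative values good=1001 and total=1000 the first function returns a negative error ratio, while bad=2 and total=1000 stays non-negative. The remaining edge case is bad=1001 against total=1000, where `availability_from_bad` dips below zero.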
Thu, Aug 24
@lmata could you please confirm if/when ready to proceed with decom of dispatch infra?
Agree, although let's create a decom task for that as there are some services on the alert hosts to clean up as well
Wed, Aug 23
Aug 21 2023
I'm curious if the recent maglev hashing 'mh' inclusion/migration in T263797 provides any improvement here. On paper using 'mh' scheduler should address session stickiness better than 'sh' did.
Aug 18 2023
Aug 17 2023
Aug 15 2023
Thanks for the info. With this in mind I'm going to stall this victorops setup task while the details of the desired alerting/paging/team layout are decided. Once that's sorted please update with the desired team name(s) and members and we'll work on setting that up. Thanks!
Hey @bking, thanks for the task! Could you please point me towards the current team roster in order to get the ball rolling for team/account creations?
Aug 14 2023
Thanks for the patch @BCornwall, LGTM
Aug 10 2023
Aug 8 2023
Aug 7 2023
AFAIK this is sorted, please reopen if follow up is needed
Aug 3 2023
Thanks for creating a tracking task! Quickly adding my notes from looking into this:
Jul 27 2023
Jul 19 2023
Untagging observability to table this wrt the kafka-logging cluster for the time being. Will need to revisit the kafka-logging acl config in more detail as part of planning out the kafka upgrade to 3.x
Jul 18 2023
Jun 29 2023
When attempting to manually reproduce the api call that is erroring:
"code": 401, "message": "API keys are not supported by this API. Expected OAuth2 access token or other authentication credentials that assert a principal. See https://cloud.google.com/docs/authentication",
Looking into alternate credentials
Gmail plugin is now enabled using the existing dispatch google cloud credentials (same as used by drive and docs plugins)
Jun 28 2023
Jun 16 2023
Offhand, we will need to confirm that vo-escalate, vopsbot and the oncall schedule generator will work (or be adjusted to work) normally after this change.
Jun 13 2023
The Grafana hosts are getting tight on disk space, let's bump disk spec for the bullseye hosts to something like 30GB
Jun 9 2023
Jun 8 2023
(fixed in T338127)
Checking in this morning, I see both /srv/mw-log/api.log hourly and /srv/mw-log/*.log daily rotations happened. And with the 6to4 relay in place I'm seeing logs arrive on both sides of the "udp tee". I think we're good here!
Jun 7 2023
Jun 6 2023
3.21.0 looks promising. Here are my notes from testing on a throwaway VM. The behavior varies depending on the version and the configuration order.
Checking back on this after rotations overnight, looks like we're not out of the woods yet. Daily rotations are happening once again, however api.log is now being rotated daily as well. Looks like we're hitting https://github.com/logrotate/logrotate/issues/38
Jun 5 2023
Looks like logrotate is treating the glob meant to exclude api.log as a literal filename 🤦‍♂️
Jun 2 2023
Jun 1 2023
mwlog1002 has been upgraded to bullseye as well, resolving!
May 30 2023
mwlog2002 is up and running now on bullseye. I made a cursory attempt to use python3, but after fixing errors thrown and getting the daemons up and running under python3, it still wasn't writing logs to the filesystem.
END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mwlog2002.codfw.wmnet with OS bullseye
I live-edited reuse-lvm-root-4dev.cfg, adding this to the bottom of the file. After another reimage the host boots into the OS and is accessible from install_console
Mwlog2002 is throwing an error and dropping into grub rescue after reimage with the reuse partitions recipe, going to try and troubleshoot the recipe
May 25 2023
I'm not seeing any traffic arriving at mwlog1002 on port 6379, so I'm treating this as confirmation that mwlog has no remaining redis clients
May 24 2023
May 23 2023
Good catch, thanks. I had to take a look through my history on cumin1001 because I remember decomming these. Turns out I ran the cookbook with the -d (dry run) flag enabled 🤦‍♂️ Will re-run these decoms now.
May 17 2023
May 16 2023
Looking good, I see the first hourly rotated file on disk