Tue, Sep 26
Sounds great, please reach out and/or send reviews if something is amiss with the standard recipes
Mon, Sep 25
I've bumped the 'for' threshold, resolving for now; will reopen if the issue comes back
No recurrence, resolving
Please excuse the drive-by comment; I've worked with @Muehlenhoff on standardizing our partman recipes, and I'm wondering whether the standard raid recipes (i.e. raid1 for / and raid0 for /srv) would work in this case?
Fri, Sep 22
Change deployed; we'll stand by and see if thanos still laments evaluation failures. Note that prometheus itself has experienced some, although that is a completely different issue, tracked in T347167: Temporary prometheus alert evaluation failures on host role change
Thank you for raising this @taavi !
It isn't thanos-rule itself reporting the error message, but the thanos-store that thanos-rule talks to; in other words, the error message is nested
Thu, Sep 21
Wed, Sep 20
I'll optimistically call this specific issue resolved; the final nail in the coffin will be file-based xDS for envoy
I've looked into the puppet logs from the first puppet run on cumin1001:/var/log/spicerack/sre/hosts/reimage/202309130825_filippo_2981305_titan1001.out and the initial failure is because systemd::syslog can't find the envoy user when setting directory ownership:
The test was successful in the sense that we can keep processing the webrequest firehose from codfw (centrallog2002). We do take a performance hit in terms of messages processed, and network bandwidth also takes a significant hit, jumping by ~25MB/s. I believe that in the unlikely event of one centrallog host being unavailable for multiple hours we can still get a reasonable sampled webrequest stream. If even that doesn't work for some reason, it is easy to spin up benthos on different hosts, since it is all stateless and trivially horizontally scalable
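The statelessness claim above is what makes horizontal scaling trivial: if each instance makes its sampling decision from a hash of the record itself, no coordination is needed. A minimal Python sketch of deterministic hash-based sampling (hypothetical field name and rate; not the actual benthos configuration):

```python
import hashlib

def keep(request_id: str, rate: float = 0.125) -> bool:
    """Deterministically keep ~`rate` of requests by hashing a stable
    request identifier. Any instance computes the same decision for the
    same id, so the sampler needs no shared state."""
    h = int.from_bytes(hashlib.sha256(request_id.encode()).digest()[:8], "big")
    return h / 2**64 < rate

# Because the decision depends only on the record, benthos instances on
# different hosts can each sample their share of the firehose independently.
```

The rate and the idea of keying on a request identifier are illustrative assumptions; the point is only that a pure function of the record requires no coordination between hosts.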
For reference, the dashboards @elukey and I are looking at:
Tue, Sep 19
Opened T346759: Investigate and deploy 'max-repeaters = 20' to all librenms devices for follow-ups; this is done
Thank you for reaching out @Urbanecm_WMF and letting us know about metric cleanup! This is done
Mon, Sep 18
+ netops for visibility since this can impact network devices
Setting max-repeaters to 20 definitely had an impact on BGP peer poll time:
I tried the setting above on https://librenms.wikimedia.org/device/device=159/tab=edit/section=snmp/ but after the web UI reloaded the text field was empty, suggesting to me that the setting "didn't take"
This is done; webrequest_live is more robust against partial / unindexable requests
That's correct yes, all done
This is done, we're using prometheus-assemble-config for snmp-exporter too now
Fri, Sep 15
Sweet, thank you @ayounsi !
This is done
Thu, Sep 14
The integration works again; what I did is:
Nothing to do, host was reimaged:
@elukey and I looked into this; it turns out that these records have dt set to - and are therefore not indexed in druid. The current plan is to drop those at the benthos level, before sampling
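The drop amounts to a simple predicate applied before the sampler, so that unindexable records don't consume sampling budget. A minimal Python sketch of the logic (hypothetical record shape; the real thing would live in the benthos pipeline config):

```python
import random

def drop_unindexable(records: list[dict]) -> list[dict]:
    """Drop webrequest records whose 'dt' field is the literal '-',
    since those are never indexed in druid (hypothetical record shape)."""
    return [r for r in records if r.get("dt") != "-"]

def sample(records: list[dict], rate: float = 0.01, rng=None) -> list[dict]:
    """Uniform sampling, applied *after* the drop so invalid records
    don't count against the sampled stream."""
    rng = rng or random.Random(0)
    return [r for r in records if rng.random() < rate]

batch = [{"dt": "-", "uri": "/a"}, {"dt": "2023-09-14T00:00:00Z", "uri": "/b"}]
kept = sample(drop_unindexable(batch), rate=1.0)
```

Ordering matters here: filtering first keeps the effective sampling rate of valid records stable even during an incident that floods the stream with invalid ones.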
Wed, Sep 13
Hosts reimaged with raid0, resolving
@Hokwelum you can find alertmanager onboard documentation for new teams at https://wikitech.wikimedia.org/wiki/Alertmanager . Please add me or other members of observability to the gerrit code reviews and we'll review/merge them for you! Please reach out too if you need further assistance and/or have questions
There's also a related problem, more of a puppet one: if the build-envoy-config exec fails (as it does on the first puppet run), it is never retried unless one of admin-config.yaml or runtime.yaml changes (which triggers the exec again)
This is preventing zero-touch reimage of hosts running envoy AFAICS.
First puppet run does indeed create /etc/envoy/envoy.yaml if it isn't present, trying to fix its permissions
Tue, Sep 12
I was a little too hasty here; I forgot we need raid0 on these hosts to be able to store blocks awaiting compaction, so we'll need to reimage the hosts
New hosts are in service, resolving
I noticed this because it seems that during said incident with many invalid IPs, benthos sent out more messages than I'd have expected:
re: the last point, namely cleaning up the thanos components off thanos-fe (therefore leaving only swift): I initially thought of going the state => absent route, though that seems like more trouble than removing the thanos profiles from the thanos::frontend role and roll-reimaging the thanos-fe hosts. What do you think @MatthewVernon ?
Mon, Sep 11
Looks like the alert is working as expected: https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DPuppetConstantChange
mesh tracing for citoid also enabled in staging now!