After exploring the NPM approach a bit on https://gerrit.wikimedia.org/r/c/operations/puppet/+/720110/ it's clear that we would be better off looking for an alternative tool written in another language, with less convoluted dependencies, that is easier to audit and maintain in the long term.
Wed, Sep 15
Tue, Sep 14
The cluster dropdown should only list cache_text and cache_upload; it currently includes clusters such as appserver and bastion, which obviously don't make much sense for a Varnish SLO dashboard. Other than that I think we're in good shape.
Fri, Sep 10
So far so good testing logagent. Confirmed that it can indeed ingest/parse GELF logs from our elasticsearch-gelf config and output them JSON-formatted to stdout. By wrapping this config in a systemd unit we should be able to pick up these logs with rsyslog and send them onward to kafka logging.
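A minimal sketch of such a unit, assuming logagent is installed at /usr/bin/logagent and the GELF config lives at /etc/logagent/gelf.yml (both paths are assumptions):

```
[Unit]
Description=logagent GELF ingestion shim
After=network.target

[Service]
# stdout goes to the journal, where rsyslog can pick it up
ExecStart=/usr/bin/logagent --config /etc/logagent/gelf.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target
```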
Thu, Sep 9
I'm tempted to try shimming these using an rsyslog listener that emulates GELF and routes these logs to the kafka logging pipeline until the longer-term/upgraded elastic config is in place.
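One possible shape for that shim, sketched as rsyslog config (the port, topic, and broker names are assumptions, and this assumes the GELF payloads arrive uncompressed over UDP):

```
module(load="imudp")
module(load="omkafka")

# GELF defaults to JSON datagrams on UDP 12201
input(type="imudp" port="12201" ruleset="gelf_to_kafka")

template(name="raw_msg" type="string" string="%msg%\n")

ruleset(name="gelf_to_kafka") {
  action(type="omkafka"
         broker=["kafka-logging1001:9092"]
         topic="logstash-gelf"
         template="raw_msg")
}
```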
Wed, Sep 8
According to the logstash input type distributions graph we're down to elasticsearch via GELF as the only non-kafka input.
Unfortunately we're seeing errors like this in the paniclog.
Wed, Sep 1
Tue, Aug 31
Thanks @ema! This is helpful feedback
Mon, Aug 30
Hi @Arnoldokoth welcome! I've just created an account for you, and you should see a VictorOps invite in your email shortly.
Thu, Aug 26
Tue, Aug 24
As a short-term stopgap I've cleaned daemon.log manually on deployment-logstash0 (the same was done on all hosts).
Mon, Aug 23
I opted to remove role::kafka::monitoring in favor of role::kafka::monitoring_buster so the config wouldn't be disrupted when retiring the old hosts. Will upload a patch to update the cumin alias.
Aug 19 2021
Old hosts have been retired and the duplicate role cleaned up, resolving!
Aug 18 2021
Sounds good. Yes, grizzly deploys the jsonnet/grafonnet approach outlined in the task description, and good progress has been made putting that in place.
Aug 16 2021
Disk util on kafka-logging hosts has been stable for 70+ days now, resolving
Hi @MatthewVernon, I see your VO account is now active and you are present in the SRE Batphone rotation as well.
Aug 13 2021
Nice! Regarding upstream improvements, on a related note there will hopefully in the future be better control over partition movement within Kafka itself with https://cwiki.apache.org/confluence/display/KAFKA/KIP-435%3A+Internal+Partition+Reassignment+Batching and similar work (although afaict it is currently stalled). But splitting them out manually seems fine for now.
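Until something like KIP-435 lands, the manual split amounts to feeding kafka-reassign-partitions small reassignment files one batch at a time. A hedged sketch (the topic name, partition, and broker ids below are made-up placeholders):

```shell
# One small reassignment batch; values are illustrative only.
cat > batch1.json <<'EOF'
{"version": 1,
 "partitions": [
   {"topic": "logstash", "partition": 0, "replicas": [1001, 1002, 1003]}
 ]}
EOF

# sanity-check the JSON before handing it to the tool
python3 -m json.tool batch1.json

# then, per batch (Kafka >= 2.4 syntax):
#   kafka-reassign-partitions.sh --bootstrap-server BROKER:9092 \
#     --reassignment-json-file batch1.json --execute
# and re-run with --verify until complete, before starting the next batch
```

Waiting for each batch to finish before submitting the next keeps the replication traffic bounded, which is essentially what KIP-435 would automate.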
Aug 11 2021
Resolving as new hosts and extended SSD retention are in place now. Let's reopen if any issues arise.
Aug 5 2021
New hosts are live in both sites, and shards are relocating onto the new hosts. Next step will be to increase retention on the SSD tier, I think we can safely double it.
Removing the pid from the filename would help keep these under control, and we could increase the filecount to keep more history on disk if needed. Are there any dependencies on the filename format?
Aug 4 2021
+1 to removing the check. We also have since enabled shell TMOUT, which helps clean up cases where shells are left idle; currently that's a 5-day timeout.
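For reference, a 5-day idle timeout expressed as a shell profile snippet (where exactly this lives, e.g. a profile.d file, is an assumption):

```shell
# 5 days in seconds: 5 * 24 * 60 * 60 = 432000
TMOUT=432000
# exported (and often also made readonly) so idle interactive shells exit
export TMOUT
```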
Welcome @MatthewVernon! I've created an account for you in VO with SRE team membership, and you should be receiving an invite via email.
Plan looks good to me!
Aug 3 2021
Along with deploying these we should extend retention on the SSD tier
alerting and several other cleanup patches merged
Aug 2 2021
This alert has cleared and the queue is now ~50% below the icinga threshold.
Jul 27 2021
All elk5 hardware has been decommed at this point.
Jul 22 2021
Jul 19 2021
I've PoC'd this with check_ipmi_sensor, which supports checking the SEL.
The downside of this approach is that potentially old SEL entries will be surfaced on first deployment and will have to be cleared. Going forward, the SEL will need clearing for such errors to let the icinga alert actually clear. Since, if we deploy this, we'll be routinely clearing the SEL on errors, I think it is important to log its entries elsewhere too; for that we can deploy freeipmi-ipmiseld, which polls the SEL and logs to syslog.
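The clear-and-log flow, sketched with freeipmi tooling (an operational sketch, not the exact commands of our eventual deployment):

```
# inspect current SEL entries
ipmi-sel

# once entries are recorded elsewhere, clear them so the icinga
# check can recover
ipmi-sel --clear

# freeipmi-ipmiseld then runs as a daemon, polling the SEL and
# writing entries to syslog so history survives routine clears
```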
sure, sounds good to me!
+1 for option 2, I think that will be a more straightforward approach overall.
Jul 1 2021
Hi @BTullis, sure, I've just added you to analytics-alerts and you should be receiving these emails now.
Hi @FGoodwin, your ldap account has been added to group wmf. I'll transition this to resolved now, but please don't hesitate to reopen if any followup is needed. Thanks!
Looks reasonable to me, and thanks much for writing the patch!
Key updated, but gerrit was unable to update the task due to policy. Resolving!
Jun 30 2021
Verified face-to-face via a Google Meet session.
Hi @tchin, your ldap account is now a member of the wmf group. I'll transition to resolved now but please don't hesitate to reopen if any follow-up is needed. Thanks!
I also have an item on my checklist to say that I should be in the cn=ops LDAP group.
There are instructions on how I can add myself to that group, but only once I have sudo access.
Can anyone confirm this requirement? If so, can it be done on this ticket, or should I raise a new one?
Jun 29 2021
Shell account has been created, and ldap account has been added to group wmf
Sure I'll go ahead and prep a patch. I may have missed it, but what realname should be used for btullis?
@razzi will take care of this, and I will follow up with SRE on enabling root access after the initial access is granted.
Jun 28 2021
Hi @tchin could you please coordinate obtaining a comment of approval on this task from your manager?
Hi @FGoodwin could you please coordinate obtaining a comment of approval on this task from your manager?
Hi @NRodriguez there are a couple steps to check off in order to move forward on this request. When you have a moment could you please...