Mon, Oct 19
What are the downsides to using iptables rules for this?
Fri, Oct 9
I think we're in good shape here now.
This would have been helpful in troubleshooting T264504.
Thu, Oct 8
wiki-mail-codfw.wikimedia.org has been delisted. This should resolve the issue outlined in the description here.
Actually, after some further manual testing I think we have a reason:
As for why mail from eqiad seemed to be working while codfw was not: part of the reason is that the working email examples are Gerrit mails. In addition to having different message contents, these are sent outward via the main mx host interface instead of the wiki-mail-site.wikimedia.org bulk mail interface.
Mon, Oct 5
Hi @Sbailey, the updated SSH key has been deployed to servers by now. Please re-open if any follow-up is needed. Thanks!
The new key has been confirmed via Google Chat and email.
Fri, Oct 2
This has been done. I'll transition the task to resolved now.
Hi @Sbailey, I've reached out to you via Google Chat and by email to verify. Thanks!
Thu, Oct 1
The file bast2002:/home/urbanecm/id_ed25519_wmnet_20201001.pub.sig does indeed match the key in the description and on the patch.
Is there another host in production where you have working access? Placing a file there would work too, just let me know where to check. Otherwise we can figure out another method. Thanks!
The requested access has been enabled and will become active within the next 30 minutes. I'll transition this task to resolved now, but please don't hesitate to re-open if any follow-up is needed. Thanks!
Since this is a somewhat atypical access request (in that the account and group membership are pre-existing, but attributes are changing), please have a close look at https://gerrit.wikimedia.org/r/631455 to ensure it matches the expected outcome. Thanks in advance!
Hi @Sbailey, as a security precaution, could you please use your existing shell access to upload the desired new SSH key onto one of the bastions (let's say bast1002) as a file in your home directory called sbailey_new_ssh_key? Once that is done and confirmed, we'll be ready to move forward with the above patch. Thanks in advance!
Wed, Sep 30
Removing the SRE-Access-Requests tag for now. Please re-add it when ready to proceed with this. Thanks!
Sep 24 2020
Sep 23 2020
Alert1001 is now the active Icinga server. Meta monitoring for alert001 has been enabled as well.
Sep 21 2020
Sep 15 2020
Thanks @JMeybohm. OK, I think we should defer to your expertise on the best way to output these logs from the Kubernetes environment.
Sep 14 2020
Sep 10 2020
Sep 8 2020
Sep 3 2020
The buster kafkamon hosts are now live. I'll let them settle for a bit before moving on to cleanup/teardown of the old hosts.
The Icinga/alerts certificate issue has been fixed and meta monitoring is now working against the new alert001 hosts.
Sep 2 2020
Thanks @Volans, sync_check_icinga_contacts is happy now on alert001.
Aug 28 2020
Aug 27 2020
Hey @elukey, prep work is done for the new hosts. I'll be performing the cut-over in the near future and will keep you on cc.
Aug 26 2020
Currently the sync_check_icinga_contacts unit is in a failed state on alert1001. I've armed the keyholder, but I'm not sure whether there's an additional step to carry out on the wikitech-static host to permit the key from a new host, or even whether the sync should be running from multiple places at the same time.
Thanks! These Galera checks are now green in the new Icinga instance.
Aug 25 2020
This came up again this morning in codfw. The cluster went yellow due to a shard allocation failure on the logstash-2020.08.18 index.
Aug 24 2020
Aug 11 2020
lists.wikimedia.org is now running from the buster host lists1001.wikimedia.org.
Aug 9 2020
Saw this same "Data too large, data for..." error also affecting shard allocation on the HDD hosts yesterday. Manually bumping the heap on the eqiad HDD hosts from 24G to 26G and issuing a /_cluster/reroute?retry_failed=true cleared it. Uploaded https://gerrit.wikimedia.org/r/619032 to persist the setting (and to deploy it to codfw).
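For reference, the reroute call mentioned above could be scripted roughly as follows. This is only a minimal sketch, not what was actually run: the helper names and the http://localhost:9200 endpoint are assumptions for illustration, and the heap bump itself is a separate JVM setting change that is not covered here.

```python
import urllib.parse
import urllib.request


def build_reroute_request(base_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /_cluster/reroute?retry_failed=true.

    base_url is a hypothetical Elasticsearch endpoint, e.g. http://localhost:9200.
    """
    query = urllib.parse.urlencode({"retry_failed": "true"})
    url = f"{base_url.rstrip('/')}/_cluster/reroute?{query}"
    # No request body: we only ask the cluster to retry shards whose
    # allocation previously failed.
    return urllib.request.Request(url, method="POST")


def retry_failed_allocations(base_url: str, timeout: float = 30.0) -> int:
    """Send the reroute request and return the HTTP status code."""
    request = build_reroute_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Only prints what would be sent; call retry_failed_allocations()
    # against a real cluster to actually issue the request.
    req = build_reroute_request("http://localhost:9200")
    print(req.get_method(), req.full_url)
```

Keeping the request construction separate from sending makes it easy to check the URL before pointing it at a production cluster.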
Aug 4 2020
Aug 3 2020
I've updated the description to outline the two options, auto-retrigger and auto-resolve, as offered by VO today.
@Nuria could you please review and give a thumbs up/down on the request for analytics-privatedata-users membership?