Wed, Jan 15
Hey @Papaul, I don't think there is any specific urgency to this and it can wait until he's back, but if it needs to go sooner I could work on it.
Tue, Jan 14
This should be fixed now.
Wed, Jan 8
The row_A ganeti group is running low on memory capacity (please see T239151#5707691). Should we allocate a few of these new hosts to expand the existing row_A ganeti group?
Tue, Jan 7
The eqsin ganeti cluster is now up and running, and a first VM netflow5001 has been created.
The ganeti-test.wikimedia.org VM has been created on row_C, and I've uploaded a patch to assign it role::gerrit: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/562587/
Fri, Jan 3
Hi @Nuria, a friendly ping/bump for approval on this. Happy new year!
I'd like to edit the form but don't currently have permission. Primarily I'd like to add the clinic duty checklist and clarify a few prerequisites for the requestor to complete. These are things that we currently do manually via back-and-forth comments. Adding them to the template should save time on every request. I'd like to update the template like so:
Thanks for the update @Kris_Litson_WMDE
Thu, Jan 2
Sounds good @jcrespo, please pass back to me when you've received the export and uploaded it to the mailman host and I'll see what I can do to import. Thanks!
Hello! Looping in @RStallman-legalteam to coordinate getting your NDA on file.
Removing the SRE-Access-Requests project tag for now. Please update and re-add if/when any further action is needed. Thanks!
Thu, Dec 19
esams and ulsfo are online now, and eqsin should be shortly. Not sure if it's best to do all at once, or per-site, but wanted to get a task created to keep tabs on it.
The esams ganeti cluster is now up and running, and netflow3001 has been created there as a first VM.
These hosts have been reimaged with buster, certs created, and patches uploaded to enable ganeti on these hosts.
Actually, since netflow4001 is not yet puppetized, the instance has been shut down. https://gerrit.wikimedia.org/r/559330 should unblock the first puppet run, and we can restart the instance after it's merged.
For sure, but it's a work in progress currently. Basically I'd like a sanity check that the manual steps make sense and aren't already automated, or better handled in a way that I'm not aware of.
The ulsfo buster ganeti cluster is up and running now, and netflow4001 has been created there as a first VM.
Dec 16 2019
Looked into these alerts a bit, and pulled the source IP addresses for these checks from watchmouse, but I don't see these IPs appearing in the mx logs. I think it is because the exim mx logs are not currently detailed enough. So I'll make the logs a bit more verbose and review again after more log information has been gathered.
Yes, we will need a second logstash stretch instance, and to migrate the Kafka broker ID from deployment-logstash2 to the new host.
Dec 10 2019
@elukey hey, yes that's been fixed by making a newer version of curator available to the new clusters. Haven't seen cron errors from these since Dec 5. Thanks for cleaning up the "config does not exist" entries!
Dec 5 2019
Looking more closely, the problem was due to a "Broker: Leader not available" issue in the deployment-prep kafka logging cluster. After starting deployment-logstash2 back up (the instance had been stopped), logs are flowing again. Longer term we'll likely need another logstash stretch instance and to migrate the broker ID from deployment-logstash2 to the new instance.
Dec 4 2019
Nov 26 2019
Ok, for my own edification, how would the private-only LVS model work if we wanted to stand up a public-facing non-HTTP(S) service in a VM at one or more of these sites?
Will this Ganeti cluster use vlan tagged interfaces, or will separate physical interfaces connect to both public and private vlans? If tagging, are the switchports configured for that yet?
Nov 19 2019
Thanks for the ping, I missed the question. Sure, being added to the Gerrit Manager would work for me!
Nov 15 2019
https://gerrit.wikimedia.org/r/551270 should do the trick for source/dest ports. I don't recall why these weren't parsed out in the first place. While we're at it, would any of the other parts of the ulogd/iptables events be useful as fields?
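For reference, a rough sketch of the kind of field extraction being discussed. It assumes the events follow the iptables LOG-style KEY=value layout (SRC=, DST=, PROTO=, SPT=, DPT=); the sample line, the `parse_event` helper, and the list of wanted keys are made up for illustration and are not what the patch above does.

```python
import re

# Matches KEY=value pairs as they appear in iptables LOG-style lines.
# A key with no value (e.g. "OUT=") yields an empty string.
PAIR_RE = re.compile(r"\b([A-Z]+)=(\S*)")

# Hypothetical selection of keys that might be worth exposing as fields.
WANTED = ("IN", "OUT", "SRC", "DST", "LEN", "TTL", "PROTO", "SPT", "DPT")


def parse_event(line: str) -> dict:
    """Return the wanted KEY=value pairs from one log line as a dict."""
    fields = dict(PAIR_RE.findall(line))
    return {key: fields[key] for key in WANTED if key in fields}


# Invented sample line using documentation-range IP addresses.
sample = (
    "IN=eth0 OUT= SRC=192.0.2.10 DST=198.51.100.7 "
    "LEN=60 TTL=52 PROTO=TCP SPT=51432 DPT=443"
)
event = parse_event(sample)
print(event["SPT"], event["DPT"])
```

Anything in `WANTED` beyond the ports (interface, TTL, length) is just a candidate list for the question above, not a recommendation.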
Nov 13 2019
Nov 8 2019
Nov 7 2019
Since it looks like cp3056 might be down for some time, could we remove it from the config until it's fixed? It would be good to let the ipsec checks in icinga return to green.
Nov 6 2019
In terms of “what” should be escalated, so far we discussed
Nov 5 2019
Yes, I think we're in good shape here
Nov 4 2019
Host monitoring for SCS systems has been added to icinga
Nov 1 2019
Essentially a duplicate of T220390 where audit work has gone into documenting clusters and use cases
The document https://docs.google.com/document/d/1mr217D6eyoGvGUG31M-FVMve9MCOCRKZxUWFZOZirQw/edit#heading=h.mbousz3hsm22 was created and shared via the goal meetings during Q4. Time permitting, this could probably use another round of comments to extend/finalize it and remove any info that's gone stale by now.
To circle back on this, we moved forward with option 2 and are using task T225005 to track the migration effort
Oct 30 2019
Oct 28 2019
Oct 18 2019
Hello! @gsingers, as a last step could you please review and sign the L3 document? Once that's done (and the related patch has a +1 from a peer within SRE) we'll be ready to merge and deploy analytics-privatedata-users group membership.
Oct 16 2019
Transitioning this to resolved as all subtasks have now been resolved. If additional follow-up is needed, please don't hesitate to re-open. Thanks!
The requested group memberships have been provisioned. I'll transition this to resolved now, but please don't hesitate to re-open if any follow up is necessary. Thanks!
Oct 11 2019
Great, thank you!
Access has been granted. Transitioning this to resolved now, but if any follow-up is needed please don't hesitate to re-open. Thanks!
Regarding chat, I'd encourage them to reach out with any questions via IRC. Details about available channels and their associated topics can be found at https://meta.wikimedia.org/wiki/IRC/Channels
Hi @Nuria could you please review this group request for approval?
In T234564#5565056, @Krinkle wrote:
Would it be possible to give type:mediawiki channel:(error OR exception OR fatal) a separate index as well? These are the only critical ones involved in deployment and should not suffer due to spam from random info/debug channels.
We might want to include type:syslog program:php72-fpm and type:scap in there as well.
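To illustrate the routing idea, a minimal sketch in Python. The index prefixes `logstash-deploy` and `logstash`, the daily date suffix, and the `index_for` helper are all hypothetical; the real change would presumably live in the Logstash output config rather than in code like this.

```python
from datetime import date

# Channels named in the quoted request.
CRITICAL_CHANNELS = {"error", "exception", "fatal"}


def index_for(event: dict, day: date) -> str:
    """Pick an index name for an event (index names are hypothetical)."""
    etype = event.get("type")
    critical = (
        (etype == "mediawiki" and event.get("channel") in CRITICAL_CHANNELS)
        or (etype == "syslog" and event.get("program") == "php72-fpm")
        or etype == "scap"
    )
    prefix = "logstash-deploy" if critical else "logstash"
    return f"{prefix}-{day:%Y.%m.%d}"


day = date(2019, 10, 11)
print(index_for({"type": "mediawiki", "channel": "fatal"}, day))
print(index_for({"type": "mediawiki", "channel": "debug"}, day))
```

The point is only that the deployment-critical events get their own index, so noisy info/debug channels can't crowd them out.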
Oct 10 2019
Hello, could you please expand on this request? What resources are meant to be accessed, and do you know specifically what LDAP group?