Fri, Mar 27
Spent some time on IRC with @jbond reproducing this, and indeed puppetdb-populate will fail repeatedly for new hosts until a run is performed with an empty manifest to populate facts. Subsequent runs then succeed.
Thu, Mar 26
Wed, Mar 25
Tue, Mar 24
In my testing, simply removing instances of "index_options":"docs" from the logstash template addresses the issue; please see https://gerrit.wikimedia.org/r/583112
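For reference, a sketch of where the offending setting lives in an Elasticsearch index template mapping (the "message" field name here is illustrative, not necessarily a field from the actual logstash template):

```json
{
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "index_options": "docs"
      }
    }
  }
}
```

Dropping the "index_options": "docs" line lets text fields fall back to the default (frequencies and positions), so queries that need position data work again.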
Mon, Mar 23
Hello! The multimedia-team list has been renamed to structured-data-team, and redirects/forwarding have been put into place. I'll transition this to resolved as a soft close, but please re-open if any follow up is needed. Thanks!
Wed, Mar 18
Tue, Mar 17
Thu, Mar 12
https://gerrit.wikimedia.org/r/579329 seems like low-hanging fruit that could help reduce load
Wed, Mar 11
Tue, Mar 10
Fri, Mar 6
Thu, Mar 5
Additionally, before moving on to max_clause_count I had experimented with settings like "default_field": "*" and "default_field": "message" in the Kibana config option query:queryString:options, but the errors persisted.
In testing I was able to work around this by increasing indices.query.bool.max_clause_count to a value greater than the number of fields matched. This comes with some tradeoffs wrt resource utilization, but it does resolve the issue.
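For clarity, the workaround looks roughly like this; the value shown is illustrative, not the one actually tested or deployed:

```yaml
# elasticsearch.yml (static, per-node setting; requires a restart)
# Raise the boolean clause limit above the number of fields the
# query expands to. The default is 1024.
indices.query.bool.max_clause_count: 4096
```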
Feb 24 2020
Hi @wiki_willy, do you know what the ETA is for these hosts?
Feb 21 2020
Looking a bit closer, I think this is happening because the nodes in labs are assigned their roles/profiles/etc. via the external node classifier in Horizon, which doesn't call role() the way we do in prod, so $::_role never gets set.
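To illustrate the difference (a sketch with a hypothetical node and role name; the internals of role() are simplified here):

```puppet
# Production: node definitions call the role() function, which sets
# the $::_role variable as a side effect before including the class.
node 'an-example1001.eqiad.wmnet' {
    role(logstash)
}

# Labs: Horizon's ENC hands back the role class directly, roughly
# equivalent to a bare include, so role() never runs and $::_role
# is left undefined:
#   include role::logstash
```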
Feb 20 2020
Similar to recent issue T245725
Feb 19 2020
I've cherry picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/571239/ on deployment-puppetmaster04.deployment-prep.eqiad.wmflabs (and made a minor change in patchset 9, since logstash was complaining about the quotes). The config loads ok in logstash.
Feb 14 2020
Learned today that T243226 is tracking the current beta cluster puppetmaster issues
Turning off debug logging on puppetmaster04 (I set logdest = /dev/null in /etc/puppet/puppet.conf) has helped with the disk usage and sluggishness issues. But sadly puppet runs are currently failing with:
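For reference, the change amounts to something like this (section placement is a guess from memory; logdest may also live under [main]):

```ini
# /etc/puppet/puppet.conf on deployment-puppetmaster04
[master]
# Discard the debug-level log output that was filling the disk
# and slowing down agent runs.
logdest = /dev/null
```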
Feb 13 2020
Fwiw I do see logs flowing into logstash-beta generally, but puppet was broken in the beta cluster because the master filled its disk. The puppet master on deployment-puppetmaster04.deployment-prep.eqiad.wmflabs seems to be logging at debug level, making puppet runs super slow and rapidly filling the disk. I don't have time at the moment, but if it's still broken in the morning I'll take a closer look.
Hey @jijiki, usually to test/validate filters like this I'll cherry pick or live-hack the logstash config on the beta cluster and generate the desired traffic there to see how logstash behaves. There are some details at https://wikitech.wikimedia.org/wiki/Logstash#Beta_Cluster_Logstash
Jan 15 2020
Hey @Papaul, I don't think there is any specific urgency to this and it can wait until he's back, but if it needs to go sooner I could work on it.
Jan 14 2020
This should be fixed now.
Jan 8 2020
The row_A ganeti group is running low on memory capacity (please see T239151#5707691). Should we allocate a few of these new hosts to expand the existing row_A ganeti group?
Jan 7 2020
The eqsin ganeti cluster is now up and running, and a first VM netflow5001 has been created.
The ganeti-test.wikimedia.org VM has been created in row_C, and I've uploaded a patch to assign it role::gerrit with https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/562587/
Jan 3 2020
Hi @Nuria, a friendly ping/bump for approval on this. Happy new year!
I'd like to edit the form but don't currently have permission. Primarily I'd like to add the clinic duty checklist and clarify a few prerequisites for the requestor to complete. These are things that we currently do manually via back-and-forth comments. Adding them to the template should save time on every request. I'd like to update the template like so:
Thanks for the update @Kris_Litson_WMDE
Jan 2 2020
Sounds good @jcrespo, please pass this back to me when you've received the export and uploaded it to the mailman host, and I'll see what I can do to import. Thanks!
Hello! Looping in @RStallman-legalteam to coordinate getting your NDA on file.
Removing the SRE-Access-Requests project tag for now. Please update and re-add if/when any further action is needed. Thanks!
Dec 19 2019
esams and ulsfo are online now, and eqsin should be shortly. Not sure if it's best to do this all at once or per-site, but I wanted to get a task created to keep tabs on it.
The esams ganeti cluster is now up and running, and netflow3001 has been created there as a first VM.
These hosts have been reimaged with buster, certs created, and patches uploaded to enable ganeti.
Actually, since netflow4001 is not yet puppetized, the instance has been shut down. https://gerrit.wikimedia.org/r/559330 should unblock the first puppet run, and we can restart the instance after it's merged.
For sure, but it's a work in progress currently. Basically I'd like a sanity check that the manual steps make sense and aren't already automated, or better handled, in a way that I'm not aware of.
The ulsfo buster ganeti cluster is up and running now, and netflow4001 has been created there as a first VM.
Dec 16 2019
Looked into these alerts a bit and pulled the source IP addresses for these checks from watchmouse, but I don't see these IPs appearing in the mx logs. I think that's because the exim mx logs are not currently detailed enough, so I'll make the logs a bit more verbose and review again after more log information has been gathered.
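The kind of change I have in mind is extending exim's log_selector; the specific selectors below are an assumption about what will prove useful, not a final list:

```
# exim configuration fragment: record the remote host/port and the
# local receiving interface for each SMTP connection in the mainlog
log_selector = +smtp_connection +incoming_interface +incoming_port
```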
Yes, we will need a second logstash stretch instance, and to migrate the Kafka broker ID from deployment-logstash2 to the new host.
Dec 10 2019
@elukey hey, yes that's been fixed by making a newer version of curator available to the new clusters. Haven't seen cron errors from these since Dec 5. Thanks for cleaning up the "config does not exist" entries!
Dec 5 2019
Looking more closely, the problem was due to a "Broker: Leader not available" issue in the deployment-prep kafka logging cluster. After starting deployment-logstash2 back up (the instance had been stopped), logs are flowing again. Longer term we'll likely need another logstash stretch instance, and to migrate the broker id over from deployment-logstash2 to the new instance.
Dec 4 2019
Nov 26 2019
Ok, for my own edification: how would the private-only LVS model work if we wanted to stand up a public-facing non-HTTP(S) service in a VM at one or more of these sites?
Will this Ganeti cluster use vlan-tagged interfaces, or will separate physical interfaces connect to the public and private vlans? If tagging, are the switchports configured for that yet?
Nov 19 2019
Thanks for the ping, I missed the question. Sure, being added to the Gerrit Manager group would work for me!
Nov 15 2019
https://gerrit.wikimedia.org/r/551270 should do the trick for source/dest ports. I don't recall why these weren't parsed out in the first place. While we're at it, would any of the other parts of the ulogd/iptables events be useful as fields?
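For context, ulogd/iptables log lines carry space-separated KEY=value pairs (e.g. "SRC=203.0.113.5 DST=198.51.100.7 PROTO=TCP SPT=51515 DPT=22"), so a logstash kv filter along these lines can extract them. This is a sketch, not necessarily what 551270 does, and the key list and prefix are illustrative:

```
filter {
  kv {
    source       => "message"
    include_keys => ["SRC", "DST", "PROTO", "SPT", "DPT"]
    prefix       => "iptables_"
  }
}
```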
Nov 13 2019
Nov 8 2019
Nov 7 2019
Since it looks like cp3056 might be down for some time could we remove it from the config until fixed? It would be good to let the ipsec checks in icinga return to green.
Nov 6 2019
In terms of "what" should be escalated, so far we have discussed: