Wed, Mar 20
Tue, Mar 19
Ah, thanks for clarifying! I agree we probably don't need the full suite of checks on the client nodes, but at the same time I'd like to make sure we continue monitoring elasticsearch client node health on the collectors, since Logstash and Kibana depend on them.
Why do you say the elasticsearch icinga checks are not needed on the logstash elasticsearch data/master nodes? Is the thinking to monitor cluster status from only the client nodes?
Mon, Mar 18
According to Netbox, support for hosts kafka00 expired in Dec 2018. After discussing a bit with @Ottomata, a server refresh with higher-spec hardware would be a reasonable course of action to address both server age and capacity.
Fri, Mar 15
Thu, Mar 14
There appears to be a significant increase (about 1.5 million in the past hour) in log messages from the mediawiki "deprecated" channel, to the effect of "Use of ParserOutput::getModuleScripts was deprecated in MediaWiki 1.33". Could we squelch these somehow?
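Not from the task itself, but as an illustration of one way to squelch these: a Logstash filter of roughly this shape could drop the events before they are indexed. The field names `channel` and `message` are assumptions about our pipeline, not confirmed config.

```
filter {
  # Hypothetical sketch: drop the high-volume MediaWiki deprecation
  # notice before it reaches Elasticsearch. Field names are assumed.
  if [channel] == "deprecated" and [message] =~ /ParserOutput::getModuleScripts/ {
    drop { }
  }
}
```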
Wed, Mar 13
Tue, Mar 12
Thu, Mar 7
A pretty accurate list of stakeholders for a given host can be gleaned from the users, groups, and sudoers config deployed to it.
Tue, Mar 5
Looking at https://grafana.wikimedia.org/d/000000020/graphite-eqiad?refresh=1m&orgId=1&from=now-3h&to=now, disk utilization has increased significantly.
Mon, Mar 4
Sadly this bit us again last week. Details outlined in https://wikitech.wikimedia.org/wiki/Incident_documentation/20190228-logstash
Service migration and OS upgrade work is complete, with ES and Kafka services running from logstash101, and the frontend VMs logstash100 upgraded to stretch.
Fri, Mar 1
Thu, Feb 28
Wed, Feb 27
The Kafka service from logstash1004 has been migrated to logstash1010, and logstash1004 has been transitioned to spare::system.
Mon, Feb 25
Setup of the new hosts is complete. Tracking follow-up steps in T213898.
Feb 22 2019
Looking much better now!
Feb 21 2019
Test build succeeded https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/14773/console, I think we're in good shape now.
compiler1002 is ready to be re-enabled at your earliest convenience
Feb 19 2019
That finished a bit faster than I was expecting! Ready to re-enable in the morning. And FWIW here's an example of a successful manual run https://puppet-compiler.wmflabs.org/compiler1002/2/.
logstash101[0-2] have been added to the logging eqiad elasticsearch cluster, and data is now being relocated from the old logstash100[4-6] hosts onto logstash101[0-2]. This will take some time to complete as there are several TB worth of shards to relocate.
Compiler1002 is back online and successfully ran through a few local test catalog compiles. populate-puppetdb is running now so we should be in good shape to re-enable this host tomorrow morning. Will follow up when that completes.
I've re-created compiler1002 from scratch and am working to bring the puppet compiler service up on the host and validate a few builds locally. I estimate this will take until tonight (eastern time) or tomorrow morning since the local puppetdb takes a while to populate.
Great! Glad to hear it. Resolving
Feb 15 2019
Hi @MarcoAurelio, has this situation improved for you with the above patches merged?
Feb 7 2019
Feb 6 2019
Progress! (I hope...) https://gerrit.wikimedia.org/r/488602 adds an ACL to detect unknown/untrusted hosts that are attempting to issue a MAIL FROM command containing our domain (lists.wikimedia.org in this case). I enabled this briefly in a warn-only mode on lists and it indeed flagged the same IP address from the pastes. From the lists exim log:
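For reference, a warn-only ACL statement along these lines could do the flagging. This is a sketch rather than the exact merged change; the hostlist name and log wording are assumptions.

```
# Sketch only -- see https://gerrit.wikimedia.org/r/488602 for the real change.
# In the MAIL ACL: log (without rejecting) untrusted hosts claiming our domain.
warn
  hosts       = !+relay_from_hosts
  senders     = *@lists.wikimedia.org
  log_message = untrusted host $sender_host_address used local domain in MAIL FROM
```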
Hey @Cmjohnson, sending a friendly ping to see how these builds are going. If there's anything I can do to assist remotely just let me know.
While not necessarily optimal, it is possible to ingest a file with rsyslog. So, if left with no other option we may be able to ingest json this way for forwarding on. At the same time I think having a "human readable" and greppable log file on the host would be useful for quick troubleshooting. Maybe we could settle on more than one output? For example a plain syslog output to populate a greppable local /var/log/firewall.log (or similar) and the central log hosts, and a json file for structured logs that is passed onwards to logstash and friends.
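A rough sketch of the two-output idea in rsyslog's RainerScript syntax. The file paths, tag, target hostname, and the minimal JSON template are all assumptions for illustration, not a tested config.

```
# Hypothetical rsyslog sketch: one plain-text output plus one JSON output.
module(load="imfile")

# Ingest the firewall log file (path and tag are assumed)
input(type="imfile" File="/var/log/firewall.log" Tag="firewall:" Severity="info")

# Minimal JSON-ish template for structured forwarding; a real config would
# likely use mmjsonparse or a richer template
template(name="fwjson" type="list") {
  constant(value="{\"timestamp\":\"") property(name="timereported" dateFormat="rfc3339")
  constant(value="\",\"host\":\"")    property(name="hostname")
  constant(value="\",\"msg\":\"")     property(name="msg" format="json")
  constant(value="\"}\n")
}

if $syslogtag == 'firewall:' then {
  # Greppable structured copy on the local host
  action(type="omfile" file="/var/log/firewall.json" template="fwjson")
  # Forward onwards to logstash and friends (target is a placeholder)
  action(type="omfwd" target="logstash.example.org" port="10514" protocol="tcp")
}
```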
Sadly I'm seeing unexpected backscatter since merging https://gerrit.wikimedia.org/r/488022. Going to revert this for now while looking closer at the cause.
Feb 5 2019
Thanks for the patch! As mentioned in https://gerrit.wikimedia.org/r/488022 a reject rule is now in place based on this subject. But let's keep tuning this to reject based on multiple criteria and try to find a reliable long-term filter. Do you have one or more example messages with full headers that could be shared? FWIW I've seen a few instances of this on the ops list as well, but have already deleted the messages.
Jan 23 2019
http://www.openldap.org/doc/admin24/overlays.html#Password%20Policies (specifically sections 12.2 and 12.10) outline some possibilities for audit logging and password policy that could be useful here
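To make the linked sections concrete, a minimal slapd.conf-style fragment enabling the ppolicy overlay might look like this. The DN, suffix, and policy attribute values are illustrative assumptions, not our actual directory layout.

```
# Hypothetical slapd.conf fragment: enable the password policy overlay
moduleload      ppolicy.la
overlay         ppolicy
ppolicy_default "cn=default,ou=policies,dc=example,dc=org"
```

And an example default policy entry (LDIF, values illustrative):

```
dn: cn=default,ou=policies,dc=example,dc=org
objectClass: pwdPolicy
objectClass: person
cn: default
sn: default
pwdAttribute: userPassword
pwdMaxFailure: 5
pwdLockout: TRUE
pwdInHistory: 5
```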
Jan 22 2019
Jan 11 2019
Adding the usual checklist, even though it's nearly all done. Since this involves sudo privs it's been flagged for review/approval during the next SRE meeting which happens on Monday the 14th. The on-duty SRE will follow up after then. Thanks!
Thanks for the update @noarave
Jan 10 2019
Sure, sounds good!
@Tomthirteen which OS and browser version did this occur on? Also, is it reproducible using different browsers, hosts, etc.? Thanks in advance!
Actually, upon closer inspection I'm not understanding the issue with the current config. The production MX hosts accept mail for valid @wikimedia.org addresses regardless of the originating IP address (unless it is in a dnsbl). It is relaying for other remote domains where the relay whitelist comes into play, and I had misunderstood the description, thinking that tools-mail was attempting to use the prod MX as a smarthost relay for other domains. I also understand now that the current configuration is what I had described as option 3 in T213416#4869863.
Hi @kchapman, I wasn't able to find a mailman list with this name, nor an email server alias. As @Reedy suggests we'll need follow-up from Office-IT or a current list member in Performance-Team. I've added some tags, hopefully that will give the task enough visibility to move forward.
Jan 9 2019
Proceeding with this
Great! It would be fine to paste the public ssh key here in the task.
Jan 8 2019
This list has been created and an initial password emailed by the system to aklapper@wm. The list has been set to "confirm and approve" subscription mode, with archives set to private. With that said, please double check the settings in the list admin interface to ensure they are as expected. Thanks!
Please consider this a soft-close and reopen if any follow up is needed. Thanks!
Hello, this list password has been reset and the new value automatically sent to the owner by the system. Please don't hesitate to re-open if any follow up is needed. Thanks!