removed ganglia from mx1001/mx2001
We talked about this. The problem here is manifold. Some facts:
Mon, Dec 11
Adding Security-Team. What do you guys think about such a key nowadays? Ticket is from 2012 after all.
Fri, Dec 8
@BBlack ^ So i think it's basically a request to add a new domain to the list of canonical domains. Any thoughts how we should proceed on this one?
Thu, Dec 7
So it should be decom'ed?
Is there a ticket to get eventlog2001 back into production? It is in site.pp but doesn't have any roles. Adding it with role spare::system for now.
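For reference, a hypothetical sketch of what the site.pp stanza for re-adding it as a spare could look like (the exact node regex and domain are assumptions, not copied from the actual patch):

```puppet
# Hypothetical sketch only: re-add the host to site.pp with the
# spare role so it keeps receiving puppet runs and security upgrades.
node /^eventlog2001\.codfw\.wmnet$/ {
    role(spare::system)
}
```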
Is the stress test over? If so, T165170 is probably unstalled now. If it is not over yet, then maybe this ticket should be reopened?
Is this unstalled now? The stall reason was "while T168246 is ongoing", but that ticket is resolved. Is it really resolved, though?
This is correct, i believe it's still a valid ToDo for Contint in general.
Let us know if something isn't working.
Alright, this should be resolved now. As requested and approved, using 3 new groups as such:
Hi, is puppetcompiler1001 going to stay in site.pp permanently or was it a temporary thing?
Wed, Dec 6
I noticed this host was up and running but not in site.pp / no roles as part of decoming Ganglia from everything (T177225). It's gone from Icinga but still running. I am adding it back to site.pp with role(spare) per the decom workflow, because this avoids a couple of issues: Ganglia gets removed and the host keeps getting security upgrades.
There was a meeting about this today, involving the actual developer and security. He confirmed he analyzed our pages of the previous years and it will be just like that, only static HTML/CSS/JS, no tracking, no loading from external sites etc. We don't foresee any issues here. It will be released some time in January.
applied on terbium
cronjobs have been updated on terbium and confirmed there are no remnants, just the 3 updated commands. no crons on wasat
Tue, Dec 5
This is blocked technically on T133548 (and partially on the other ones linked in my last comment).
Mon, Dec 4
P.S. Like all shell users, this also gives you a home dir on https://wikitech.wikimedia.org/wiki/People.wikimedia.org so you can use https://people.wikimedia.org/~tjones/ if you like.
P.S. Like all shell users, this also gives you a home dir on https://wikitech.wikimedia.org/wiki/People.wikimedia.org so you can use https://people.wikimedia.org/~cparle/ if you like.
your request has been approved and the code change has been merged.
This should be resolved now (https://gerrit.wikimedia.org/r/#/c/393814/4/modules/admin/data/data.yaml)
your request has been approved and the Gerrit change is merged.
Sat, Dec 2
db2023 - https://gerrit.wikimedia.org/r/394647
db2028,db2029 - https://gerrit.wikimedia.org/r/394725
spare systems - https://gerrit.wikimedia.org/r/394733
test servers - https://gerrit.wikimedia.org/r/394734
Fri, Dec 1
Running the exact same "/usr/bin/apt-get -y -q remove --purge ganglia-monitor" manually on a trusty host (db2029)... works! But when puppet ran it, it failed because there was still a running gmond process...
@jcrespo Did it happen on all hosts or just a few? Do you recall which host the paste above is from? I am surprised since this same patch got applied on a ton of other roles without this issue over the last few days and i don't see an include of ganglia in the mariadb module yet. The difference here that i see so far is just that i apply it via regex and not via individual role classes. Let me try it with just a single host, then based on a role class.
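One possible fix, sketched here as an assumption and not the actual merged change: make Puppet stop the gmond service before it purges the package, so the purge doesn't fail while the daemon is still running.

```puppet
# Hypothetical sketch: enforce ordering so the running gmond is
# stopped before apt purges the package.
service { 'ganglia-monitor':
    ensure => stopped,
    before => Package['ganglia-monitor'],
}

package { 'ganglia-monitor':
    ensure => purged,
}
```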
Thu, Nov 30
Yea, we should have both; it was meant to be additional, not to replace the other. But IRC as well, because that's where people ask.
I saw this host as DOWN when looking at Icinga, as it was in the unacknowledged section (though notifications were disabled).
As Moritz said on a Gerrit patch to enable this for Gerrit/Planet: there are security concerns even with the version in stretch, and it probably becomes realistic to do in buster.
I was thinking about "jouncebot: next" when reading this. It already announces the deployments at the specific times by itself and reads the info from the deployment calendar; wouldn't it work just the same way to add scheduled downtimes to the calendar on wiki and then have the bot announce them on a couple of channels, like an hour and 10 minutes before?
Ok, so what we have meanwhile is this in Hiera, which all applies to any host starting with labtest*.
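As an illustration (the key name and regex are assumptions; the actual Hiera values set under it are elided), a regex-based Hiera entry matching all labtest* hosts could look roughly like this:

```yaml
# Hypothetical sketch of a regex.yaml entry: everything nested under
# this key applies to any host whose name starts with "labtest".
labtest:
  __regex: !ruby/regexp /^labtest.*$/
```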
Wed, Nov 29
@Gilles @Cparle Or would it make sense to just add the maintenance command to a cronjob and let it run automatically at fixed intervals? Does it really need to be manual? I'd be happy to help getting another cron on maintenance servers into puppet.
Would it make sense to run this script automatically via a puppetized cronjob, or does it need manual interaction? It would always be good to reduce the need for manual steps and just automate it, and i'd be happy to help getting another cron on maintenance servers into puppet.
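To sketch what that puppetized cron could look like (the script path, user, and schedule here are all placeholders, not anything agreed on yet):

```puppet
# Hypothetical sketch: run the maintenance script on a fixed schedule
# instead of manually. Path, user and timing would need to be decided.
cron { 'run-maintenance-script':
    ensure  => present,
    user    => 'www-data',
    command => '/usr/local/bin/run-maintenance-script',
    hour    => 3,
    minute  => 0,
}
```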
Confirmed that L3 has already been signed by cparle. A user already exists in the admin module, but in the "ldap_only_users" section. Adding real shell access means moving that account; it should not be duplicated.
'ALL = NOPASSWD: /usr/bin/strace *',
'ALL = NOPASSWD: /usr/sbin/tcpdump *',
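For context, a hypothetical sketch of how these privileges could sit in the admin module's data.yaml (the group name, gid, and member list are assumptions for illustration only):

```yaml
# Hypothetical sketch: a group in modules/admin/data/data.yaml granting
# passwordless strace/tcpdump via the two sudo privileges above.
groups:
  debug-tools:
    gid: 999
    description: passwordless strace and tcpdump for debugging
    members: []
    privileges:
      - 'ALL = NOPASSWD: /usr/bin/strace *'
      - 'ALL = NOPASSWD: /usr/sbin/tcpdump *'
```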
This ticket is now just pending approval in the weekly ops meeting which is next Monday.
+1 and patch looks good to me now (I encouraged using systemctl instead/besides the service command).
@TJones please ignore my comments, i just realized you already have previous shell access (mw-log-readers, analytics-privatedata-users). So all i said about the SSH key and L3 doesn't apply anymore :) This will be much quicker.
We can probably use the existing group called "restricted".
Hi @TJones I will handle this access request. We might have to create a new admin group for this type of access, i will look at that. Meanwhile you could start by creating an ssh key and attaching it to this ticket. Also, please take a look at https://phabricator.wikimedia.org/L3 and sign it. Best, Daniel
Can you find more working ones than i did? Does "medium" have RSS feeds? If so, please add more.
When there are hardware issues, let's close the ticket only after the servers actually get repooled, because otherwise we keep forgetting that. Noticed it from Icinga saying: "Host mw1276 is not in mediawiki-installation dsh group"
Tue, Nov 28
Confirmed. This was about Gerrit logs. If there was a way to request "logstash but just Gerrit" then that would have been the request. Paladox is the one who did the majority of the work to move logs into logstash in the first place (to enable volunteers to read them without shell access) and works with Gerrit upstream so i also encouraged him. That said, it doesn't change that the "nda" group does a lot more than that and the permission system is not fine-tuned enough to allow it "per service".
Mon, Nov 27
Hi @Paladox I know this since we already talked on IRC, but you should add that you already mailed Legal and asked whether the volunteer NDA is the right one, and so on.
Sounds like it would be useful if Partnerships could be added in the Phabricator-based workflow for future updates.
Thu, Nov 23
Sorry, my bad again. That was supposed to be in the other project too. Instance deleted now and recreated where it should have been. You can go ahead.
Is there an issue with the memcached exporter since this looks empty:
Created subtask to make quarry use the mariadb module since that is one of the few things still using it.
Wed, Nov 22
Yeah! planet-hotdog.wmflabs.org supports HTTP/2.0.
Since https://gerrit.wikimedia.org/r/#/c/391241/ was merged (thanks Filippo!) i was able to re-revert the "remove HHVM ganglia from appservers" change, so the files (https://gerrit.wikimedia.org/r/#/c/392764/) have been removed from all appservers.
https://logstash.wikimedia.org/ now shows the first log lines from cobalt :))
@chasemp Should we reinstall these with stretch? I noticed them in site.pp with a comment leading to this ticket.