Hiera defines which releases server is currently "active", like so (hieradata/common.yaml):
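Since the snippet didn't make it into the paste, here's a sketch of the shape of that data (the key name and domains are guesses, only the hostnames are from this ticket):

```yaml
# hieradata/common.yaml (key name hypothetical)
releases_servers:
  - releases1001.eqiad.wmnet
  - releases2001.codfw.wmnet
```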
- https://releases.wikimedia.org/parsoid/ has been created
- it's hosted on releases1001/releases2001 in an active/active setup
- the local path is /srv/org/wikimedia/releases/parsoid/
- a new admin group "releasers-parsoid" has been created and added on the hosts above
- @ssastry has been added to that group
- permissions were fixed on the new directory to allow the group to write to it
Hi Margaret, could you explain a bit why you need that specific group and what you are planning to do? cc: @Nuria
Tried to add the missing contactgroup to the mobileapps services with the change above (copying the wdqs setup), but it seems that wasn't enough to actually make it happen yet. (ran puppet on einsteinium and scb1001)
@Lucas_Werkmeister_WMDE Thank you for the update! I'd say the status "stalled" is still correct in that case until we know more. But that was very helpful background. I'll just keep this open and make it a subtask.
For SSH config also see https://wikitech.wikimedia.org/wiki/Production_shell_access#Standard_config
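For reference, the standard config on that page is roughly of this shape (the bastion host and username here are placeholders; check the wiki page for the current recommendation):

```
# ~/.ssh/config (sketch)
Host bast1002.wikimedia.org
    User <shell-username>

Host *.wmnet
    User <shell-username>
    ProxyJump bast1002.wikimedia.org
```

Listing the bastion separately avoids the jump rule matching the bastion itself.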
on bast1002: Notice: /Stage[main]/Admin/Admin::Hashuser[springle]/Admin::User[springle]/User[springle]/ensure: created
The request has been approved in ops meeting and the change above has been merged.
Afaict the way it works is that the bot has the right to create channels and it joins a channel when the first message comes in.
(We know because after restarts channels didn't get created until the first actual message from the bot).
I think it kind of blocks the "never use PHP5" / "switch to PHP7" thing which also affects appservers and deployment servers. In the last Service Operations meeting it was brought up that we need to replace terbium as part of that. But let me confirm the urgency with @Joe
Which deployment server is this about? Production or deployment-prep or another?
Tue, Apr 17
@Gehel Ok, so after looking at it again, I have a new suggestion: adding permissions to the existing maps-admin group. There are just 5 people in that group and at least 2 would be the same people. Adding yet another level of admin group between "admin" and "root" seems to be going too far.
@RStallman-legalteam Hi, we got another NDA request for WMDE here. Thanks!
@pmiazga Now you should have the +2 and things should just work. I'm calling this ticket resolved. Let us know if there are any unexpected issues.
Setting to Stalled, unless that meeting has already happened. In that case, please let us know the status here on the ticket.
@Tarrow All done. Let us know if there are any unexpected issues.
"tarrow" has been added to the "wmde" LDAP group.
It was broken because the instance was shut down. I don't know what or who shut it down, but it was simply not running.
Sat, Apr 14
regarding a health check on the master itself: which type of check exactly would that be on the master? So far we just have the replication check on the slaves, and there are all these aliases for check_postgres that we could use:
```
ERROR: FATAL: no pg_hba.conf entry for host "2620:0:860:4:208:80:153:110", user "replication", database "template1", SSL on
FATAL: no pg_hba.conf entry for host "2620:0:860:4:208:80:153:110", user "replication", database "template1", SSL off
```
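For that error, the usual fix is a matching pg_hba.conf entry on the master; a sketch built from the host/user/database in the message (the auth METHOD is an assumption):

```
# pg_hba.conf on the master (sketch)
# TYPE     DATABASE   USER         ADDRESS                           METHOD
hostssl    template1  replication  2620:0:860:4:208:80:153:110/128   md5
```

Note that for actual streaming replication connections the DATABASE column has to be the keyword `replication`; the entry above only covers this check's connection to template1.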
- puppet added the nrpe config part on netmon2001 (the slave) but not on netmon1002 (good)
- puppet added the icinga config part on einsteinium
- we got this check now but it has status UNKNOWN https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=netmon2001&service=Postgres+Replication+Lag
The package providing /usr/bin/check_postgres_hot_standby_delay is check-postgres
Fri, Apr 13
How about a new group that lets you run any command as the users postgres and osmupdater (as requested), but not any command as any user? Would that sound ok?
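A sudoers sketch of that idea (assuming the group is the maps-admin one discussed above):

```
# /etc/sudoers.d/maps-admin (sketch)
# members of maps-admin may run any command, but only as postgres or osmupdater
%maps-admin ALL = (postgres,osmupdater) ALL
```

The Runas list in parentheses is what restricts which target users are allowed.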
The name will be *nihonium* (element 113), not nunki as in the first version of the change above. This is still eqiad, not codfw.
The name would be nihonium, element 113.
I _think_ this is what you are missing here:
Deployment access for Piotr has already been granted in T148477 back in October 2016 (including approval in Ops meeting).
Adding a custom contact group to the "LVS HTTP IPv4" service doesn't look trivial to me.
Thu, Apr 12
I debugged this by looking at the generated Icinga config directly on the server.
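In case it helps others, this is the kind of thing I mean; a minimal stand-in (the file layout here is simplified, the real generated config lives wherever icinga.cfg points):

```shell
# Build a tiny stand-in for a generated Icinga object file, then grep it
# the way you would on the real server to see which contact_groups a
# service definition ended up with.
mkdir -p /tmp/icinga-demo
cat > /tmp/icinga-demo/services.cfg <<'EOF'
define service {
    host_name            netmon2001
    service_description  Postgres Replication Lag
    contact_groups       admins
}
EOF
grep -B 2 -A 1 'service_description' /tmp/icinga-demo/services.cfg
```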
18:24 < mutante> Notice: /Stage[main]/Toollabs::Exec_environ/Package[language-pack-mr]/ensure: created
18:25 < mutante> --- /var/lib/locales/supported.d/local 2015-04-22 18:32:38.847090958 +0000
18:25 < mutante> Scheduling refresh of Exec[locale-gen]
@Tarrow yea, the first line can be removed. The second line can be amended to say that RStallman-legalteam should be added to do that. The third line can be amended to say that "Operations" should be added to actually do the technical change after the NDA has been signed. It could also say that the person who is currently listed as "on duty" on the #wikimedia-operations IRC channel can be contacted in case it's blocked by ops. Thanks!
Afaict T191523#4125233 also applies here, so L2 is not actually correct, and wherever the template is that these tickets are created from, it should probably be updated.
@RStallman-legalteam And here's another NDA request for a WMDE developer; I renamed the ticket to keep them apart.
Hi, can you both log in and then look at the "Logged in as " line in the Icinga web UI and copy/paste it over here? Unfortunately there is a caveat where even capitalization matters, so let's compare the exact user name there.
@Mholloway Try this please:
Wed, Apr 11
adding @RStallman-legalteam for the NDA step
The change above will now ensure that Cassandra Icinga checks are not added on the dev cluster. We don't see the results yet because puppet is currently disabled on restbase-dev due to ongoing upgrades.
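The shape of such a guard is roughly like this (the class and variable names here are hypothetical, the actual condition is in the change above):

```puppet
# hypothetical names, for illustration only
if $cluster_name != 'restbase-dev' {
    include ::profile::cassandra::monitoring
}
```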
deploy1001 has been removed from scap hosts and puppet ran on tin. This should have fixed the immediate scap issue.
Tue, Apr 10
I see in your config you have
I suggest creating a fresh instance (one not named after a prod hostname, with a generic name instead) and applying role(mediawiki_maintenance) to it. Then you will see which errors you actually get (or not). Keep the ones that just work and disable the breaking ones in Hiera (the puppet class makes this easy; it already disables all the crons on the inactive maintenance server, currently codfw). I don't think that "keep a list of what should be run manually" is going to work that well.
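Disabling one of the breaking jobs would then be a one-line Hiera override, something of this shape (the key name here is made up; the real parameter names come from the puppet class):

```yaml
# hieradata for the test instance (hypothetical key name)
profile::mediawiki::maintenance::some_breaking_cron_ensure: absent
```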
@Sharvaniharan I checked on releases1001 but i don't see any failed attempts to login with your user. It looks like it's already failing before that, at the connection to the bastion host. Could you please paste your ssh config? Which bastion server are you connecting to in order to jump to releases1001? What error do you see when you try it?
Thanks Gehel for pointing that out. Using the existing package is the better option. Abandoned.
per the last ops meeting and joe's comments:
The version of apache2.conf that canaries and mwdebug have matches the puppet repo template:
I investigated a bit the part ".. on mwdebug1001 and mwdebug1002, .. behaves differently on a pooled app server (e.g. mw1299)".
Seems like the best one; it has the most votes:
This is done in the parent task.
Mon, Apr 9
- converted puppet role to profile
- re-added monitoring section to profile (now the style check is happy about that)
- appears here now again: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=tftp
I have the same suspicion. There are a lot of characters before the 'src=' starts after the image tag, which seems to be the uncommon part about it. While the software does log which feeds it is parsing and whether there is a general error, it does not log details about the parsing/image-stripping part.
Sat, Apr 7