Only a single host is left to upgrade and reboot, baham. But this is a bit more complicated to coordinate.
Additional work is needed to make modules/phabricator/templates/ flexible if we want to be able to actually test prod changes in labs/beta.
It turns out we actually still don't have any instance that is using the prod role and can also be used for testing. The instance that does use the prod role redirects to prod, because the phabricator module has hardcoded "wikimedia.org" values in templates. So this probably never worked before.
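To illustrate the hardcoded-domain problem described above, here is a minimal Python sketch. It is only an analogy: the real templates are Puppet templates in modules/phabricator/templates/, and the redirect line, `HARDCODED`, `PARAMETERIZED` and `render_redirect` are hypothetical names made up for this example.

```python
from string import Template

# Hypothetical template line with the domain baked in. A test instance
# rendering this would still point at production.
HARDCODED = "Redirect permanent / https://phabricator.wikimedia.org/"

# Same line with the server name as a parameter, so a labs/beta
# instance can supply its own value.
PARAMETERIZED = Template("Redirect permanent / https://$server_name/")


def render_redirect(server_name: str) -> str:
    """Render the (hypothetical) redirect line for the given server name."""
    return PARAMETERIZED.substitute(server_name=server_name)


if __name__ == "__main__":
    print(render_redirect("phab-01.wmflabs.org"))
```

With the server name passed in, the same role could be applied to a test instance without it pointing back at production.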
Just one thing to do maybe: Update all wiki references to "https://phab-01"
Talked to paladox. Situation was:
Upgrade phab-01.wmflabs.org to the same version as the production one, so new features can be tested there.
@Aklapper thanks, resolved :) (and made public, NDA was just because this was an RT import, yea, that old)
I see now that "labtestcontrol2003"/"labservices100x" were also affected, but mails stopped 3 days ago and the crons are gone. I see nothing else new. Please reopen if you still get anything.
@ArielGlenn Yes, it's Ganglia-only, and i started the spam when i removed that - because it's Ganglia - in https://gerrit.wikimedia.org/r/#/c/382929/ as part of the general effort to remove all of Ganglia (T177225).
Alright, thanks. So all members of 'wmde' should be added to ldap_only_users. I will add a patch to do that.
Fri, Oct 13
The change i uploaded above intends to disable paging for the specific check "mysql procs running" if a host is in labs/labtest. It's using existing regexes in Hiera that should match all labs hosts (the same ones that also set the cluster). This is just for the mysql process and just for labs, but i think it's a start. It's a reaction to the recent SMS we got when the mysql/maria role was used on labtest hosts and then the process was stopped. I think it should be easy to agree that something called "labs" or "test" means it can't be important enough to send actual SMS, right?
There are other existing checks though that would page and don't already have "is_critical" parameters like that.
The second patch should be a nicer solution: it avoids having a regex in the mariadb module or making any changes there, and just uses Hiera.
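The idea in these comments (decide by hostname pattern whether a check may page) can be sketched roughly as follows. This is a Python illustration only, not the actual Hiera/Puppet change: the pattern in `NON_PAGING_HOST_PATTERNS` and the function `is_critical` are assumptions made up for this example, and the real regexes in Hiera are not shown here.

```python
import re

# Hypothetical pattern for hosts that should never page. The real
# regexes live in Hiera and may look different.
NON_PAGING_HOST_PATTERNS = [
    re.compile(r"^lab(s|test)[a-z]*\d+\."),
]


def is_critical(hostname: str) -> bool:
    """Return False (do not page) if the host matches a labs/labtest pattern."""
    return not any(p.search(hostname) for p in NON_PAGING_HOST_PATTERNS)


if __name__ == "__main__":
    for host in ("labtestcontrol2003.wikimedia.org", "db1001.eqiad.wmnet"):
        print(host, "pages" if is_critical(host) else "does not page")
```

Keeping the patterns in one data structure, separate from the check itself, mirrors the point of the second patch: the module stays untouched and only the lookup data decides who gets paged.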
@fgiunchedi I was wondering if there is something here that replaces the PacketLossLogtailer from udp2log (https://gerrit.wikimedia.org/r/#/c/382913/). udp2log is on the way out too, right? I see Ottomata gave a +1 to removing it.
^ I amended the one for dnsrecursors to include the "ganglia: false" Hiera setting right away and then merged that too. Metrics were converted to a Diamond collector in T169600.
No, sorry, the nda group was never requested and is used for different things, not for controlling access to repos.
removed the ganglia stats for this today
removed ganglia stats for gdnsd (authdns servers).
@kaythaney Great! Sounds good, thank you. I already removed everything on our side.
Thu, Oct 12
Back to Chad. Jenkins should be usable now.
I stopped hhvm on gerrit2001 but i can't remove the package just yet because that requirement also means both hhvm and scap get removed if you attempt to remove hhvm.
@RStallman-legalteam Thank you for confirming. Done, i added him just now.
Hi Andrew, I created this project back in Wikitech and just deleted the only instance in it, so now it's empty. Could you delete the project itself? I don't think i can do it (in Horizon). What permissions do you need for that nowadays, btw? Thanks
fyi, currently there is: CRITICAL: https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts is alerting: Speed Index Internet Explorer Desktop [ALERT] alert, Speed Index Mobile [ALERT] alert.
The project/instance was originally created for T135034 but that is declined now because we don't run OCG anymore. So yea, this can go. I can handle it since i also created it.
Looks like i created that back in May 2016 and there was something related to the issues with exim on labs instances (T135033)
23:00 < MaxSem> uhh, why wold it depend on PHP?
Wed, Oct 11
19:20 mutante: apt: reprepro copy stretch-wikimedia jessie-wikimedia jenkins
Also see T177974 for a similar thing that was attempted in the past and led to removing the channel again now.
This ticket should be linked to all other tickets that suggest _adding_ more channels, as an example of why it might not be the best idea. Requests for "another channel", just like "another list" and "another wiki", come up regularly and often end up like this later.
Looks like this is the relevant list https://wikitech.wikimedia.org/wiki/Prometheus#Ganglia_plugins to see which plugins have been replaced with what. Can any updates be made to that list?
Tue, Oct 10
Yea, aware Jeff has root on the mentioned stat boxes anyways, heh.
@Jgreen Do you just need Hive/Hadoop or do you additionally need sampled webrequest logs and stat boxes with private data? Asking this way because the description for the requested group says "This group should not be used just to grant someone Hadoop access." and you mention just Hive.
I have contacted WMF legal to reach out to Pablo so he can sign the right NDA. I asked and L2 is only for Phabricator access to non-public tickets, but if LDAP groups are involved a different kind of NDA must be signed directly with legal.
@Tobi_WMDE_SW Ok, got it. Thanks for explaining and confirming that.
Sat, Oct 7
@Ottomata Hi, wondering what you think should happen with modules/confluent/manifests/kafka/mirror/jmxtrans.pp and modules/confluent/manifests/kafka/broker/jmxtrans.pp once Ganglia gets removed?
- added self to project and projectadmin in wikidata-dev
- created new instance "wikibase-stretch", replacing "wikibase" instance that was on jessie, let's use "stable" right away
- started profile and merged https://gerrit.wikimedia.org/r/382355
- added "Puppet prefix" "wikibase" in project wikidata-dev
- added role::wikibase to the puppet prefix, other classes
- added proxy wikibase.wmflabs.org in Horizon, assigned to instance wikibase-stretch
- fixed doc root path https://gerrit.wikimedia.org/r/382860
- fixed git clone destination https://gerrit.wikimedia.org/r/382884
- set Hiera values in repo instead of Wiki https://gerrit.wikimedia.org/r/382892
- added missing parameters to profile class https://gerrit.wikimedia.org/r/382891 (paladox)
- fixed typos in Apache config https://gerrit.wikimedia.org/r/382893 (paladox)
- fixed document root again https://gerrit.wikimedia.org/r/382894 (paladox)
- ran puppet again, restarted Apache
Hi, re: "to be able to contribute to AdvancedSearch" does this mean you want the +2 permissions to be able to merge changes in Gerrit in a wikidata-related repo?
added the mobile domain to DNS: https://electcom.m.wikimedia.org/
@jrbs: re: rebasing patches: "git review -d <number of gerrit patch>"; "git rebase -i origin/master"; then if it fails, "git rebase --continue" to see in which file the conflict is; manually edit the file to remove the conflict; "git add <file>"; "git rebase --continue"; "git review"
Fri, Oct 6
Membership should be maximally inclusive: essentially anyone who has demonstrated constructive contributions in any area of the Wikimedia community should be added.
Thu, Oct 5
By the way i couldn't just merge Ladsgroup's change in the new content repo as i just have +1 rights there, but not +2.
Next we should figure out:
We have the repo, and i see your change Amir, thank you! Before i merge the content itself i started with puppet code to install the Apache site and git pull the static files. I was thinking we test that role class first on a cloud VPS with something like wikibase.wmflabs.org or whatever until we are happy to move it to prod. I also started it as a profile following the new puppet style guidelines, so that jenkins-bot likes it now, also with the new style check. I made the server name and admin email variables, set in templates/Hiera, so it will be easy to change them for labs (just a wiki page edit in the Hiera: namespace, not even a commit) and also to move it to prod and the actual domain name.
Wed, Oct 4
https://releases-jenkins.wikimedia.org works now without the /ci/ prefix, is eqiad-only (while releases.wm.org is still active/active on both), Apache syntax isn't broken anymore, puppet run is fine and jenkins says we removed 4 violations.
Heh, that worked! Thanks @QChris :)
Yes, eh.. tentatively closing :) Of course we can still comment here and reopen if necessary.
Looks good, thanks. When i needed one in the past i always got a response from that wiki page and usually @QChris did them.
Yea, this should just wait for the proper setup in codfw. I don't see an emergency here.
..of course pending that DBAs are ok with this running as cron.
Yep, once you know what command exactly you want to run you can ping me about adding that to a puppetized cron job. And yes, modules/mediawiki/manifests/maintenance/ is the right place to look at to see existing examples.
Tue, Oct 3
There are no more alerts now. All the screens/tmux on puppetmaster are closed now. Deployment servers and rhenium are whitelisted. So right now Icinga is clean.
Mon, Oct 2
Yea, so the fix here is to just wait. I think there is not much else that can be done (besides buying a new cert just for these few days). Not sure what this means about "UBN" status. It is what it is.
@kaythaney Thank you for the update. It's appreciated! :)
Sat, Sep 30
Should be good now; not entirely sure what actually stopped it before, though. Please reopen if it doesn't update again in a week.
And now, double checking why the cron job might have failed, i ran the import command again, which wiped the data :p I will have to let it run once again, but it's a nice way to confirm it. I see no reason for it failing. Running it as the same user with the same permissions.
FYI, the 15 either deny access (readapidenied) or have certificate issues or something else