T311290 has been named as the reason for that issue with the cookbook. It should be fixed already.
Thu, Jun 23
Notice: /Stage[main]/Profile::Mediawiki::Deployment::Server/Package[siege]/ensure: removed
Notice: /Stage[main]/Profile::Mediawiki::Deployment::Server/Package[wrk]/ensure: removed
Notice: /Stage[main]/Profile::Mediawiki::Deployment::Server/Package[lua-cjson]/ensure: removed
@BTullis I tried to create one for you but the cookbook failed at the DNS update step:
dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 4 --disk 20 --network private eqiad_C dse-k8s-ctrl1001
Ready to create Ganeti VM dse-k8s-ctrl1001.eqiad.wmnet in the ganeti01.svc.eqiad.wmnet cluster on row C with 2 vCPUs, 4GB of RAM, 20GB of disk in the private network.
Fake secrets were needed to be able to puppet-compile scap changes such as https://gerrit.wikimedia.org/r/c/operations/puppet/+/806397
We have had the "mgmt flapping"-issue in other DCs. In codfw a bunch of them were fixed after Papaul did firmware upgrades on the DRACs.
Thank you @BTullis for all the details. Now I know what DSE means. If the doc could be public, even better. The project description at https://phabricator.wikimedia.org/project/profile/5959/ is also helpful for the casual observer, though.
Thank you for the examples. That makes sense to me. Especially if Dell advises to keep them secret.
Wed, Jun 22
Before we talk about technical implementation and putting this on ice, I am wondering: has anyone even had specific concerns or data fields in mind that should be hidden?
FYI: the design document isn't accessible, and from the tickets alone it's unclear what this is about.
In T310738 there is a request to revert this and move the domain back to WMF infra.
Tue, Jun 21
@Zabe I see. thank you for that!
P.S. What would actually be useful for me is if those only got created _after_ the wiki has been created. Or ideally, if they started out "stalled" and the actual wiki creation changed them to "open". Currently I see them but still need to manually watch the "newprojects" list or so to know when they are _really_ ready to go.
@Zabe Curious why those tickets start with a custom policy in the first place. Is that something we should try to change in the bot creating those?
@dduvall Well, you could try to use it to build a Docker image from a Dockerfile. So far it's just "the buildkitd service is running", and the follow-ups are about making sure it survives reboots and that the next time we set up a gitlab-runner it works more automatically. Don't worry about that part. But I don't think anyone has actually let it build an image yet.
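For the record, a smoke test of that could look roughly like the following. This is only a sketch: it assumes buildctl is installed on the runner and buildkitd is listening on its default socket, and the directory and image name are made up for illustration.

```shell
# Hypothetical smoke test: build a trivial Dockerfile through buildkitd.
mkdir -p /tmp/buildkit-test
cat > /tmp/buildkit-test/Dockerfile <<'EOF'
FROM debian:bullseye-slim
RUN echo "built by buildkitd" > /hello.txt
EOF

# Ask buildkitd to build it and load the result into the local docker daemon.
buildctl build \
  --frontend dockerfile.v0 \
  --local context=/tmp/buildkit-test \
  --local dockerfile=/tmp/buildkit-test \
  --output type=docker,name=buildkit-smoke-test | docker load
```

If that succeeds, we'd know the daemon does more than just run; it can actually complete a build.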
Sat, Jun 18
Fri, Jun 17
The remaining SRE aliases in the file can now be separated into:
- deleted store@ and merchandise@ after they were created in Google; coordinated with Brendan of ITS and Sandra Hust, store manager
Thu, Jun 16
added to DNS
added to DNS:
Thank you for the very quick response!
It's deployed but we have some follow-ups. I guess lowering the prio a bit is appropriate for this state.
buildkitd is now running on all (6) gitlab-runners. It's 6 because the VMs 1001 and 2001 were decommissioned earlier today, and there are 3 physical hosts per DC.
The run of the decom book was at:
After this I ran only the DNS cookbook directly, and this time it finished without such an error. I am not sure whether it actually tried anything, though, because it said "nothing to sync".
Wed, Jun 15
There are incoming redirects into policy.wikimedia.org:
Looks like T310738 would make this obsolete.
just a note for serviceops: policy.wikimedia.org is not currently under the control of SRE/prod servers at WMF. It's hosted at Wordpress VIP.
- deleted aql-sms@ not needed anymore
- deleted order@, orders@, return@ and returns@ after Sandra Hust, manager of store.wikimedia.org, confirmed they aren't public knowledge on the store page and she wasn't even aware of them. They only use merchandise@ and store@, which both go to a single Zendesk email. So first simplify, then move the remaining redirects to ITS (in progress)
Tue, Jun 14
there is always moar :)
The other day I deleted cpt-leads@ (after Tim told me it's ok and hasn't been used for a while) and techcom@ (after asking ITS to create it on the Google side and agreeing with Timo that he is the new admin of that Google group).
Mon, Jun 13
I also saw certificate errors pop up in a different project that uses a local puppetmaster, and we felt like we had not touched anything. I did not get to look yet, but this seemed similar enough, and I was already suspecting some change related to self-hosted puppetmasters.
Confirming @XCollazo-WMF exists and was introduced in the SRE meeting today :) Welcome to WMF. Confirmed the signature and checked all other boxes; just the clinic-duty one is still open.
ah ACK, ok, in that case we will just move forward as planned. Thanks Papaul
A1: serviceops: gitlab2002 is still in state "in setup". While we were going to change that, we will hold back until this is done.
Unfortunately the purchase date is 2016-12-12, so we probably can't get it fixed.
/admin1-> racadm serveraction powercycle
04:32 <+logmsgbot> !log dzahn@cumin2002 conftool action : set/pooled=no; selector: dc=codfw,name=thumbor2004.codfw.wmnet
Sun, Jun 12
Fri, Jun 10
I ACKed the Icinga alerts with a link to this so they are not in "unhandled CRIT" anymore.
@Arnoldokoth The last change we uploaded in our meeting the other day is now merged. I would say we can call this resolved and close the ticket (but also create a new one for the migration to bullseye sometime in the future, which mentions that we should _then_ also do the remaining rename of:
Thu, Jun 9
I talked with Jesse about all this. We agreed I will follow up about the last few things you, Faidon, also mentioned in our mail: cpt-leads@, techcom@ and the remaining fr-tech ones. I just sent mails about these. Then, after that is done, I'll close this ticket as resolved and tell ITS that everything related to wikiPedia.org (the ongoing discussion about dropping things like jimmy@, personal aliases in wikiPedia.org etc.) should be seen as a separate task, and I will hand that over.
@Krinkle Yep, that summary sounds right to me; that's what we had in mind. It's just that some time ago you had said that change, https://gerrit.wikimedia.org/r/c/operations/dns/+/650625/, was not ready to be switched on yet. I don't recall the specific reasons it wasn't ready, but if there is no concern anymore, this should be ready to go anytime. Feel free to invite me via calendar to make this happen; I can deal with a reasonably early time in my timezone.
First and foremost though, the reason gitlab has all public IPs is that we were trying to emulate the Gerrit setup. Gerrit has public IPs and is not behind LVS because we wanted it that way: we wanted to be able to still use Gerrit and merge changes even if the caching layer is down for some reason. For the same reason Icinga has a public IP. Certain services were not supposed to rely on load balancers.
moving gitlab1001.wikimedia.org to gitlab1001.eqiad.wmnet
Wed, Jun 8
@Sabrecalyx If this is a legit request, please replace <groupname> with the actual group name requested and fill out the rationale section.
What happened here is:
mw1415 does not serve 500s anymore. T307755#7990623
This caused T310225, because setting it to pooled=inactive does not mean monitoring will stop checking it, and when the host came back unexpectedly it caused new alerts for 500s on this box, which had not received scap updates. But setting it to pooled=no would have meant deployers would have gotten warnings about an unreachable host for a month. The deeper issue is that there is no right status to set hosts to while they are waiting for hardware repair.
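As a sketch, the trade-off between the two states looks like this with the conftool CLI (hostname is just the one from this task; the state semantics in the comments are as described above):

```shell
# pooled=yes      -> host is in service
# pooled=no       -> depooled, but deployers get scap warnings about the host
# pooled=inactive -> ignored by scap, but monitoring still checks the host,
#                    so a surprise reboot can fire alerts like the 500s here
sudo confctl select 'name=mw1415.eqiad.wmnet' set/pooled=inactive

# Verify the current state of the host:
sudo confctl select 'name=mw1415.eqiad.wmnet' get
```

Neither state really means "parked for hardware repair", which is the gap described above.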
21:13 < mutante> !log mw1415 - scap pull, restart apache, /usr/local/sbin/restart-php7.2-fpm (INFO: The server is depooled from all services. Restarting the service directly)
Something between resolved and declined. Please feel free to reopen though if you feel differently about it.
Does this only affect this instance, or maybe all users who have a local puppetmaster in their VPS project? It seems like we haven't touched anything and it was working before, so the error makes me think something changed somewhere upstream, or alternatively that someone tried to switch between the local project puppetmaster and the regular global puppetmaster.
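For reference, a typical cert reset when a project-local puppetmaster and its client disagree looks roughly like this (Puppet 5-era commands; the instance name is a made-up example, and if the puppetmaster autosigns, the explicit sign step is unnecessary):

```shell
# On the affected instance: discard local SSL state so a new cert is requested
sudo rm -rf /var/lib/puppet/ssl

# On the project puppetmaster: remove the stale cert for that instance
sudo puppet cert clean instance-name.project.eqiad1.wikimedia.cloud

# Back on the instance: re-run the agent, which submits a new cert request
sudo puppet agent --test

# On the puppetmaster again: sign the new request
sudo puppet cert sign instance-name.project.eqiad1.wikimedia.cloud
```

That said, if the root cause is an upstream change rather than local state, this would only paper over it.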
Mon, Jun 6
ok, thank you IF team! Assigning back to me for the moment to follow up. Yes, there was a specific person. I will re-add this with a specific group after discussion.
This is a better link, since it's directly upstream and the latest docs, from 2022:
@Cmjohnson Alright, gotcha! Thanks for the updates and Dell request.
Fri, Jun 3
bundle exec rake 'spdx:convert:module[MODULENAME]'