Wed, Jun 15
Tue, Jun 14
Mon, Jun 13
Thu, Jun 9
Thanks for the explanation. I just want to make sure that, if not a cookbook, we at least have a runbook that makes it very simple for an SRE to run. If the change ever needs to be executed, it would require a timely response, and I want to ensure that neither you nor any other single person would be required to do so. Still, I agree that ideally, if the change needed to be made, you would be the one to execute it.
Wed, Jun 8
@cmooney , for the manual override (https://wikitech.wikimedia.org/wiki/Network_design_-_Eqiad_WMCS_Network_Infra#Manual_Intervention), who can perform the override? Is it possible to have a cookbook to do this, or otherwise make it "easy" or more accessible? Or would this always remain a more complex networking operation? I'm trying to understand how quickly we could recover should we lose a link. Thanks!
Tue, Jun 7
Note: the SQL timeout occurred after just over 24 hours of real time.
Copying in question from IRC:
Moved into ticket for work and tracking: T310097
Mon, Jun 6
Ran webservice restart, and sal.toolforge.org is up again.
Adding in @Andrew
For posterity, to find broken candidates, look at the cron of started services.
Fri, Jun 3
More things restarted in the last hour (when they should have been stable). They are likely broken candidates, but I didn't check.
I'm going to try and set expectations here and say this part of the incident is closed / resolved. Hopefully we don't have to re-open!
Upon further review, the new buster hosts began acting up again with OOM errors, even though free -m showed available memory. After investigation, dcaro noted that there was only a 24 MB swap partition on each. Adding a temporary 1 GB of swap space seems to have resolved the errors.
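For reference, a minimal sketch of adding temporary swap like the fix described above. This assumes root on the affected host; the file path is illustrative, and a permanent fix would instead resize the swap partition or add the file to /etc/fstab.

```shell
# Create a 1 GiB swap file (path is an example, not the one used on the hosts)
fallocate -l 1G /var/tmp/extra-swap
# Swap files must not be readable by other users
chmod 600 /var/tmp/extra-swap
# Format the file as swap and enable it
mkswap /var/tmp/extra-swap
swapon /var/tmp/extra-swap
# Verify the new swap shows up
free -m
```

When the extra swap is no longer needed, `swapoff /var/tmp/extra-swap` followed by removing the file undoes the change.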
Noted that dpkg was broken, and tools-sgeweblight-10-9 and tools-sgeweblight-10-10 were missing the grid service, etc. Fixed dpkg and re-ran puppet to bring them back online.
Thu, Jun 2
Looping in @Andrew. @Kelson note that yes, we are installing new, more capable machines that have more capacity than in years past. Once they are up and running, we can explore mirroring this additional data.
Given this change, is the expectation that new subcommands (like toolforge-build) are implemented as separate binaries? Or is this only expected for toolforge-build? Any thoughts on toolforge-build also being written in golang?
Good catch! It does seem to be a simple permission error. I'm curious if any other behavior changes will be observed once this is corrected.
Might this explain the credential issues? https://wikitech.wikimedia.org/wiki/Grafana.wikimedia.org#Editing_dashboards. You need the right LDAP group, which requires an NDA. I believe T295296 mentions this. I can help with https://wikitech.wikimedia.org/wiki/Volunteer_NDA if this is the only blocker.
@cmooney Let's arrange to move some machines so we can have more optimal routing. @dcaro, do you think it would be easier to move a cephosd versus draining and migrating a cloudvirt? We could move cloudcephosd1015 and cloudcephosd1021. Otherwise, I would suggest moving cloudvirts. cloudvirt1046 and cloudvirt1035 respectively.
Yes, I agree. Let's focus on bringing the new machines online.
May 26 2022
+1 from me.
May 25 2022
Just wondering on the status of these machines. Anything I can help with?
May 10 2022
I was wondering why some of them were spread the way they were (aka outside WMCS dedicated racks), but I see @cmooney updated the racking details with this intention. So yes, please proceed with racking.
May 6 2022
@Ladsgroup Thank you very much for these insights! Please feel free to share any other thoughts on how to improve the service performance or our ability to maintain it.
You can file a troubleshooting ticket with https://phabricator.wikimedia.org/maniphest/task/edit/form/55/, after reviewing https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Hardware_Troubleshooting_Runbook
May 4 2022
@Marostegui Thank you for following up with upstream here! Perhaps they can ultimately produce a fix or resolution. Until then, yes, let's keep utilizing our existing version. Fingers crossed we can one day remove our custom package.
May 3 2022
May 2 2022
@valhallasw Sorry to hear you ran into limitations during your migration! We can consider revising the default. What level would have worked for you?
Would it be possible to use the upstream package at this point? pt-kill is in Debian. I'm not sure what patches are being applied on top of the Debian version, and couldn't find anything beyond https://phabricator.wikimedia.org/T183983#3983899. Can someone provide some context for this package? Thanks!
Apr 28 2022
Apr 21 2022
@wiki_willy @RobH Can you and your team confirm that our most recent specification (T303446) could be ordered in an R440 chassis? In particular, please pay attention to the previous issues (T201352#4671220), such as whether the power requirements can be met. If so, we can transition to an R440/450 chassis depending on pricing.
Apr 20 2022
Apr 15 2022
Apr 13 2022
@Papaul By default, for HA purposes, we include language asking that servers be spread out when needed. However, given these machines are in dev, and not production, you can safely ignore that request and share racks, especially if it is easier / more convenient for you and your team to manage. Thanks for asking. I hope relaxing this requirement helps!
Mar 31 2022
Mar 29 2022
+1 from me. Thank you for migrating over from gridengine!
Mar 23 2022
Should we decide we need support for longer, https://deb.freexian.com/extended-lts/ could be an option. Note that both the timeframe and the specific support would have to be defined. I would also caution that this is NOT a "solution" for avoiding upgrading these instances. However, it could be part of an upgrade plan if needed.
Mar 14 2022
Thanks Arzhel! I don't believe anything else is needed from me. Assigning back to @RobH. Feel free to ping again if I missed something!
Mar 10 2022
@Jclark-ctr I would want confirmation from Infrastructure Foundations that all the necessary network connectivity is present. From what I understand, these machines need the public1 VLAN, and need to serve public traffic for dumps and NFS traffic to cloud and analytics (data engineering), amongst other things. Assuming the new rows are "the same" as the old rows, it should be fine, but I'll let others confirm. @cmooney @ayounsi can you help confirm?
Mar 9 2022
Note, the existing machines are taking up a total of 12U (6U each) in D2 and A4.
@RobH I updated the task to call out that these should be installed into two different rows, and not into WMCS-specific racks. These machines host dumps and manage NFS exports for both cloud and data engineering. They don't utilize a cloud-specific VLAN, and the existing boxes are racked in non-WMCS-specific racks.
For now, we will hold the domain and prevent misuse.
@SCherukuwada Did Toolforge work for your needs? Or would you prefer a cloud vps project? Let us know and we'll create the project.
@dschwen How are things going? Anything further we can help with?