Disk space on the root partition of mwmaint1002 is depleted, which results in failing puppet runs
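A quick way to confirm this kind of condition is to check how full the root filesystem is. A minimal sketch in Python; the helper name and the 90% threshold are assumptions for illustration and not part of any existing tooling:

```python
import shutil


def usage_percent(path="/"):
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


# The 90% threshold is an assumption for illustration only.
percent = usage_percent("/")
status = "low on space" if percent >= 90.0 else "ok"
print(f"/ is {percent:.1f}% full ({status})")
```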
Unlike the rest, mwdebug* are VMs. How large is the difference to the other servers you were seeing?
Adding the Debian maintainer :-) This seems fixed in 0.9-1 so updating stretch-backports to 0.9 could fix this.
The .sock file is created via systemd-tmpfiles, which is only read during boot, so the socket will be created with the next restart.
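If waiting for a restart is not an option, systemd-tmpfiles can also be invoked by hand to apply a tmpfiles.d snippet. A sketch, assuming a hypothetical config path (the actual file name is not given here); it needs root and has not been run against the affected host:

```python
import subprocess


def tmpfiles_create_command(config_path):
    """Build the command that applies one tmpfiles.d snippet immediately."""
    return ["systemd-tmpfiles", "--create", config_path]


def apply_tmpfiles(config_path):
    """Run systemd-tmpfiles for the given snippet; raises if it fails."""
    subprocess.run(tmpfiles_create_command(config_path), check=True)


# Usage (hypothetical path, replace with the snippet shipped by the package):
#   apply_tmpfiles("/etc/tmpfiles.d/example.conf")
```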
This error hasn't resurfaced, I'm closing the task.
Nice, if these are confirmed working, we should import my nfs-utils backport to apt.wikimedia.org. Should these go to a separate component (something like component/nfs13, which is then added to selected NFS servers) or be added in general? Apart from labstore100[4/5], we also have labstore100 and dumpsdata, would these get updated as well or rather not? If yes, we can also simply import the packages to apt.wikimedia.org/main.
Fri, Sep 21
Thu, Sep 20
Ok, I've repooled the server for now.
Thanks, I finished up the reimage via install_console and re-added it to Icinga, looks all fine now.
Wed, Sep 19
Tue, Sep 18
Server went down again at 10:45 UTC.
Mon, Sep 17
I've added the "Datacenter-Switchover-2018" project as this was filed as a response to a question in the staff channel (where the active maintenance server wasn't obvious). Not sure if that's over-stretching the use case of that project, if so, please remove.
nodejs 10 packages for stretch-wikimedia are now available in the repository component "component/node10" for testing. I'm keeping this bug open to track possible further additions (addons etc.)
Fri, Sep 14
As I had made a backport of the megaraid_sas driver for Perc 740/840 to the 4.9 stretch kernel anyway, I ran some tests on backup2001 (which has the new controller) and acamar (which has an older Perc controller running the megaraid_sas driver), which were successful. Submitted to the Debian kernel team in https://salsa.debian.org/kernel-team/linux/merge_requests/61
I've created a backport of the nfs-utils package from stretch for jessie, it's not yet uploaded to apt.wikimedia.org, but available at https://people.wikimedia.org/~jmm/nfs/
Closing the task, please reopen if it doesn't work for you.
Thu, Sep 13
I've added Ian to restbase-roots.
This was approved in the SRE meeting on Monday.
Closing this task, actual implementation will happen via T177385
@RobH Ack, I'll take care of that.
Mon, Sep 10
That would require code changes in Netbox and doesn't seem to warrant the overhead. Alex documented the access at https://wikitech.wikimedia.org/wiki/LDAP/Groups#Specific_groups and I think that's good enough.
@thiemowmde Does the access work for you?
@Peter Does the access work for you?
The server has been running fine for a while, closing the task.
This is resolved, the jessie-based labstore servers have been running 4.9 for a few weeks.
This is fixed in 1.20+deb9u2 which only builds the "linux-meta-4.14" package for stretch, "linux-meta-4.9" isn't relevant/needed for stretch.
Sun, Sep 9
Fri, Sep 7
I now have a stretch netboot image with a 4.14 kernel which PXE boots via the QLogic 41xx adapter. In d-i I'm getting a strange error message which tells me that no modules could be found (although lsmod shows plenty of kernel modules loaded). I need to dig a bit further in d-i to find out what could be causing that.
nodejs 10 packages will be in a separate repository component, allowing applications to gradually move over. We'll continue to support nodejs 6 with security updates until all applications are migrated to 10.
Thu, Sep 6
See https://phabricator.wikimedia.org/T202255#4563157 for the 4.14 kernel.
Status update: I've created a stretch backport of a 4.14 kernel which should support both QLogic 41xx and the new HP Perc megaraid controller properly. To allow this kernel to be used in the PXE boot, I've been working on an updated stretch netboot image with the 4.14 kernel integrated. This has been quite cumbersome. I've fixed up a bunch of issues so far, but the netboot image still fails to load the initrd. I'm seeing the error message
+1 for creating a deb. I can give you an introduction on how to do that if you want.
I also tried to disable HTTP-based PXE boot via https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/458463/, but that didn't work either, same symptoms as above.
Added to pwstore.
Added to pwstore.
Added to pwstore.
Wed, Sep 5
Arzhel and I had a look and this doesn't seem to be ACL-related, the tftp packets are flowing in both directions. This is possibly a bug in the firmware, see https://phabricator.wikimedia.org/T199125#4560182
I tried an installation from cloudvirt1023, but the PXELINUX version on the NIC is affected by a bug in syslinux 6.0.3 as used on the Broadcom NIC and fails to fetch the install image:
Tue, Sep 4
@Papaul: That's expected, this also needs a change to the DHCP config to use the netboot image based on 4.14, e.g. by using the patch at https://gerrit.wikimedia.org/r/457930 or setting this manually on install2002. I'll test this tomorrow (or feel free to go ahead!). The installation still won't be 100% complete, as the 4.14 kernel is not yet uploaded to apt.wikimedia.org and we need another patch to install it in late-setup. With the current image it uses 4.14 in the installer, but then installs the 4.9 kernel in the end, which lacks the updated driver.
Closing this task, opened T203434 for decom.
I've created a custom Linux 4.14 kernel which worked fine in my tests with an updated firmware-qlogic. I've also created a netboot image based on Linux 4.14.
It's based on the last version which was in unstable for 4.14.x (4.14.17), but that's good enough for initial tests. If it's working fine and we decide to keep using it, I'll update the packages to the latest 4.14.x kernel.
1.7.3 has been rolled out to the app servers (some in codfw still need the update, this will be piggybacked on other maintenance later this week).
Mon, Sep 3
Microcode is now enabled on all baremetal servers with an Intel CPU and we haven't seen any issues so far. Closing the task.
Fri, Aug 31
Balazs has been added to pwstore.
One notable change which is to be expected from moving to 10:
Some node modules ship binary blobs in their modules and the official node packages are built against OpenSSL 1.0.2. nodejs 10 only supports OpenSSL 1.1 (which has a different ABI/API) and then those modules fail to load or throw runtime errors.
Upstream discussion is at https://github.com/nodejs/node/issues/21897, but there's no real solution
@ayounsi : I can still reproduce this with an installation of cloudvirt1023. I can see in syslog that atftpd is serving lpxelinux.0 to 10.64.20.42 and I can see on the serial console that the PXE boot firmware doesn't get a reply. Ping me when you have some time to debug this?
I worked on a backport of the driver to 4.9 and I got to the point where the driver loaded along with the firmware, but there were runtime issues which caused connection failures. The errors were related to statistics gathering in the driver (a change I had to backport and which seems to need additional changes). I tried to keep my backport minimal to the qede driver, but all the Qlogic drivers share some common base (e.g. qede also required the qed kernel module) and to fully correct this I'd probably need to cherrypick additional upstream changes for qed. Ideally there would be some officially blessed upstream backport for the 4.9 LTS kernel series. I've asked upstream whether they have something like this.
@aaron: To clarify/confirm: You don't need cluster-wide root access anymore? The only reason we have this discussion is that you were (accidentally) added to the new groups created for the performance team. But none of those actually add anything to your privileges as you already have global root. So please either confirm that
- you want to keep your existing global root access (which you might be using for debugging during outages) so that we strip the superfluous performance team groups
- you don't actually need global root anymore; then we'd keep you in the performance team groups and remove the global root access.
Thu, Aug 30
The hardware side is fixed, but I'm seeing a kernel error, looking into it.
@jcrespo Can you please push the signed key to the keyservers?
The 1.7.3 release is now signed with a key freshly generated by Christoph, without any signatures (yet), while all previous releases were signed by Kunal. @Legoktm it would be good if you could sign keys with everyone now involved in wikidiff releases.
Thanks! I've installed my backported test kernel and figured out which additional firmware we need. It looks promising, the driver gets loaded along with the firmware:
Same issue as https://phabricator.wikimedia.org/T202650#4541158; Aaron has global root and can access these hosts without sitemaps-admins.
Wed, Aug 29
@Papaul : Does this maybe need some additional change in the BIOS to make the server PXE-boot from the internal NIC?
I've added the CVE IDs to the task description.
I don't understand this task. Aaron already had global root, why is that needed at all?
Tue, Aug 28
Thanks, I've repooled the server. Closing the task, will reopen in case there are still issues.