To exclude Firejail as a source of error, I disabled puppet on deployment-chromium01, removed Firejail from the service unit and restarted proton.service. Same effect, Proton still fails:
@RStallman-legalteam from Legal handles those, I'm adding her to the task.
Sounds good, on db2085 there's been no further occurrence after the reboot.
These updates have been fully deployed:
These packages are not used in our production infrastructure:
Fri, Feb 15
I'm puzzled as to why reprepro doesn't pick that up, passing --noskipold didn't help either. For some reason the source package got imported, but none of the binary packages.
Thu, Feb 14
Should be fixed now? JFTR, the command to run afterwards is
JFTR, the next Stretch update (this weekend) will update the kernel to 4.9.144-2, so that can be piggybacked.
Are there other servers of that batch besides db1106 and db2085?
I've imported pcre and libzip and that fixes the php72 component. There's still a Toolforge-specific issue, though: xdebug now depends on php-common, so it also needs to be imported to the thirdparty/php72 repo. Can someone from WMCS take this? Fix is similar to my patch from https://gerrit.wikimedia.org/r/c/operations/puppet/+/490579/, needs to also cover php-defaults.
https://rocm.github.io/ROCmInstall.html#supported-gpus should serve as a useful enough base to select a new GPU I guess (we'll need to figure out what stat1005 supports, though)
Wed, Feb 13
Yeah, I know, that one correctly pulls in a number of debs which are actually in puppet, but there are a number of additional ones which need a closer look (e.g. hpacucli or hpssa, which aren't used anywhere).
Maybe try 4.20-1 from experimental to narrow the kernel oops down?
Tensorflow is also finding its way into Debian, BTW (currently only in experimental): https://packages.qa.debian.org/t/tensorflow.html
Tue, Feb 12
JFTR, the recent Ghostscript update to 9.26 also switched the JPEG2000 library from Jasper to OpenJPEG (current Ghostscript releases no longer support Jasper), so that might have also had an effect.
There are no source packages for the debs. Given that they otherwise seem pretty focused on FLOSS (e.g. https://rocm.github.io/ROCmInstall.html#closed-source-components), that's probably just an oversight and we should ask them to also publish them.
- stat1005 will be reimaged to Debian Stretch when the SRE team is ready (work is currently in progress to import Buster in production).
- Luca will ask the SRE team to create a special POSIX group to allow Erik to be root on stat1005 and experiment with the host when he has time/patience.
stat1005 is now running Debian buster and I've enabled Erik's access.
FYI, Ghostscript on the Thumbor servers got upgraded to 9.26, worth retesting.
FYI: the Ghostscript version on our Thumbor servers got upgraded to 9.26, so this might be worth re-testing.
Closing this old bug, we're now using ghostscript 9.26 everywhere. If there's any specific other Ghostscript-related issue, please open a new task.
Mon, Feb 11
We currently pin prometheus-node-exporter to 0.17.0+ds-2 on the selected hosts and for buster, but yesterday 0.17.0+ds-3 migrated to testing/buster. I could change the puppet code to pick -3 on buster, but I'd say we also upgrade the components for jessie and stretch to -3 and bump it in general? https://packages.qa.debian.org/p/prometheus-node-exporter/news/20190131T180815Z.html lists a number of fixes, and at least the TMPDIR change seems relevant for those as well.
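For reference, a minimal pure-Python sketch of how Debian orders version strings such as 0.17.0+ds-2 and 0.17.0+ds-3 (epoch, then upstream version, then Debian revision, following the comparison rules in Debian Policy). This is an illustration only, not the puppet code; the helper names (`compare_versions` and friends) are made up for this example, and on a real host `dpkg --compare-versions` is the authoritative tool.

```python
import re


def _order(ch):
    # Debian Policy ordering for non-digit characters: '~' sorts before
    # everything (even the end of the string), letters sort before all
    # other characters.
    if ch == "~":
        return -1
    if ch.isalpha():
        return ord(ch)
    return ord(ch) + 256


def _compare_part(a, b):
    # Alternately compare a non-digit prefix and a numeric prefix.
    while a or b:
        na = re.match(r"\D*", a).group()
        nb = re.match(r"\D*", b).group()
        a, b = a[len(na):], b[len(nb):]
        for k in range(max(len(na), len(nb))):
            ca = _order(na[k]) if k < len(na) else 0
            cb = _order(nb[k]) if k < len(nb) else 0
            if ca != cb:
                return -1 if ca < cb else 1
        da = re.match(r"\d*", a).group()
        db = re.match(r"\d*", b).group()
        a, b = a[len(da):], b[len(db):]
        ia, ib = int(da or 0), int(db or 0)
        if ia != ib:
            return -1 if ia < ib else 1
    return 0


def _split(version):
    # [epoch:]upstream[-revision]; a missing epoch counts as 0.
    epoch, sep, rest = version.partition(":")
    if not sep:
        epoch, rest = "0", version
    upstream, sep, revision = rest.rpartition("-")
    if not sep:
        upstream, revision = rest, ""
    return int(epoch), upstream, revision


def compare_versions(a, b):
    """Return -1, 0 or 1 if a sorts before, equal to, or after b."""
    ea, ua, ra = _split(a)
    eb, ub, rb = _split(b)
    if ea != eb:
        return -1 if ea < eb else 1
    return _compare_part(ua, ub) or _compare_part(ra, rb)


if __name__ == "__main__":
    print(compare_versions("0.17.0+ds-2", "0.17.0+ds-3"))
```

A pin on -2 therefore sorts strictly below the -3 build that migrated to testing, which is why the pin has to be bumped explicitly.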
Fri, Feb 8
Still some rough edges to sort out, but bare metal installations are working now:
The debian installer completes, but I can't log in because apparently the first puppet run isn't completed and I can't use any login methods (ssh or direct console access).
The server went down at 12:16, with a number of memory errors logged in SEL:
As discussed on IRC: Let's upgrade to 2.7.1 next week as that fixes a security issue (CVE-2019-3826) in the internal UI (not exposed in production, but in https://beta-prometheus.wmflabs.org/). Change is already pending in Salsa: https://salsa.debian.org/go-team/packages/prometheus/commit/1cd743bc0012935842adb5941258c9ed8bff85fe
Wed, Feb 6
Tue, Feb 5
The archival mechanism doesn't seem very robust either; for user "banyek" the home is still around on e.g. cumin2001 or puppetmaster1001.
Please don't proceed with decom for now; I'm using graphite2002 for some buster tests.
Please don't proceed with decom for now; Filippo uses graphite2001 for prometheus 2 tests and I'm using graphite2002 for some buster tests.
Mon, Feb 4
I'm reopening the task, the server went down again today:
The user has now been removed.
Wed, Jan 30
We could narrow this down further by enabling debug flags for the initrd. I don't remember the specific options off the top of my head, but we can look into this next week. As Manuel mentioned, my hunch is that this is a hw issue which manifests during the reboots, but which is not caused by the kernel change between -7 and -8 itself.
Fri, Jan 25
Is anyone still using Servermon at this point?
Thu, Jan 24
Wed, Jan 23
Tue, Jan 22
We've looked into this; our netboot images don't need an update: In the initrd anna is used instead of apt and it's not affected by CVE-2019-3462.
Mon, Jan 21
The FQDN that server is being renamed to doesn't exist here yet, so it should simply be skipped when setting downtime?
Sun, Jan 20
Jan 19 2019
Jan 18 2019
True that, also note that in the nodejs 10 packages (from component/node10), the nodejs-legacy package is gone. Debian dropped it, we could patch it back in, but it probably makes sense to fix this mid-term.
if you install the nodejs-legacy package, it will provide a symlink from node to nodejs.
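As an illustration of what that symlink amounts to, here is a small sketch that creates a `node` alias pointing at `nodejs` inside a given directory. The function name `make_node_alias` is hypothetical and the demo deliberately runs in a temporary directory instead of /usr/bin; it is not how the Debian package itself is built.

```python
import os
import tempfile


def make_node_alias(bin_dir):
    """Create bin_dir/node as a relative symlink to nodejs, if missing."""
    link = os.path.join(bin_dir, "node")
    if not os.path.lexists(link):
        # Relative target, so the link stays valid if bin_dir is moved.
        os.symlink("nodejs", link)
    return link


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:
        # Stand-in for the real nodejs binary.
        open(os.path.join(demo_dir, "nodejs"), "w").close()
        alias = make_node_alias(demo_dir)
        print(alias, "->", os.readlink(alias))
```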
Jan 17 2019
Jan 16 2019
I've removed Balazs from pwstore.
Jan 15 2019
The reports in that thread are for RHEL 7, which uses 3.10 as the base layer kernel (but with backports for all kinds of drivers, so it's hard to tell how that maps to our 4.9 kernel). One thing we could try is to test the 4.19.12-1~bpo9+1 kernel from stretch-backports. If it still fails in that version, we can easily report it to the upstream maintainers given that 4.19 is the latest LTS branch. Or we point Dell to the thread and ask them to swap the NICs to a known working 10G card.
Jan 14 2019
Jan 11 2019
The 4.9.144-1 kernel is fully production-ready; the point releases for Debian are used to rebase the Stretch kernel to the latest set of 4.9.x bug fixes (although depending on the final date for Stretch 9.7 there might be one further update still).