Wed, Feb 21
This is fully rolled out.
Happened again on ganeti1007, again with page allocation errors.
This is completely rolled out.
Tue, Feb 20
This is complete
Mon, Feb 19
Rachel wanted to double-check within the WMF Legal department, but that will take until tomorrow at least due to the WMF holiday for US staff.
Yeah, that's the most plausible option (and we're already using custom OpenSSL 1.1 packages on Debian jessie to support e.g. chacha), but 1.1.1 has only just seen its first alpha release (and they won't release it in a final version until TLS 1.3 is final).
The two hosts have been switched to role::spare, dropped from conftool and marked as downtime until the end of the year. Unclaiming myself again, the rest is DC ops territory.
Fri, Feb 16
Thu, Feb 15
Yeah, definitely, this is currently only meant for the many common system services we use across the fleet (nrpe, diamond, systemd-timesyncd, atd, prometheus exporters, site-local exim, sshd etc.) and not for outward-facing LVSed services; those need separate tooling improvements.
Ack, the deb8u2 patch for jessie was for a security fix which is also fixed in the stretch version.
If this request is only about getting added to the wmde group (which controls some Gerrit settings related to WMDE projects) you don't need to sign an NDA with the WMF Legal department (but let me clarify that with them, I'll update this task).
Wed, Feb 14
These are fully rolled out:
Valentín has been added to pwstore.
Tue, Feb 13
This should not have any impact on the production scalers. To be extra sure, I've only shut them down for now; if no one complains in the next few days, I'll remove them completely.
Still, PHP 7.1 should be considered independently. Has this specific bug been reported upstream (or is that blocked by Wednesday's upgrade, as upstream doesn't accept bug reports for outdated versions)?
We can import the PHP 7.1 packages from Ondrej Sury to a separate repository component (like component/php71), the maintainer can be trusted. But this would be specific for use by Phabricator, since using an external repo has a number of notable downsides (e.g. no update guarantees as for the Debian updates and, more importantly, no integration with the wider PHP extensions ecosystem, i.e. all PHP packages not built from the main PHP package need to be imported/adapted manually). Looking at phab1001 we have:
- php-apcu (needs an update)
- php5-json (in PHP7 this is part of the main package)
- php5-mailparse (is a custom package anyway)
Yuvi's shell access was removed via https://gerrit.wikimedia.org/r/407577 and I've also just removed him from the wmflabs.org root mail alias and from the cn=nda LDAP group.
Added to cn=ops and cn=wmf LDAP groups.
Mon, Feb 12
Uploaded to apt.wikimedia.org. To add it to a server you can use
Fully rolled out now.
Beta has been upgraded to ICU 57, we'll also upgrade production to that version at some point (no timeline established yet).
Beta/deployment-prep has been upgraded to an HHVM build using ICU 57.
Fri, Feb 9
Even as of right now we have versions 2.1.13 and 2.2.6 (in addition to 3.11.0) in play. Version 2.1.13 is used by maps (which depending on who you talk to Doesn't Matter(tm)), and AQS uses 2.2.6, and probably will for the foreseeable future (they have no plans to upgrade). Even if you disqualify maps, we did keep AQS on a 2.1.x release for a considerable period of time (months?) after we'd moved RESTBase to a 2.2 release.
These are fully rolled out:
These are fully rolled out:
Our internal php-luasandbox package has been rebuilt to only provide the hhvm-luasandbox package (that's kind of confusing given the source package name, but it's only temporary given our migration to PHP 7).
The Jupyterhub spare (notebook1002) was repurposed as kafka1023 in https://phabricator.wikimedia.org/T181518
Thu, Feb 8
The first step is to upgrade the tlsproxy hosts to 1.13.6-2+wmf1 (but still on the existing nginx-full packages).
Wed, Feb 7
Is there a specific reason for calling the repo component cassandra311? That's very specific and adding/removing components requires some puppet churn. IOW do we have reason to believe that Cassandra 3.12 will be incompatible with a 3.11 cluster?
Thanks, I ran "scap pull" and repooled the server.
Fri, Feb 2
That host has a broken sshd config (coming from Phabricator), but it's possible to log in via mgmt and the root password.
Happened again on ganeti1005, similar errors, but this time triggered by a copy of the Archiva data.
Our internal wikidiff2 package has been rebuilt to only provide the hhvm-wikidiff2 package now (and after some fiddling with reprepro I removed the old php-wikidiff2 from apt.wikimedia.org).
That's dependent on goal planning / road map considerations, I only meant to point out the availability in backports since it was mentioned earlier on this task.
mailman3-core, mailman3-hyperkitty, postorius and mailmanclient have been accepted into stretch-backports today.
In addition I'll drop php-wikidiff2 from our internal src:php-wikidiff2 package (so that it only builds hhvm-wikidiff2).
I've uploaded a backport of Kunal's 1.5.1-3 package from Debian testing to stretch-backports. The packages in Debian only support Zend PHP (since Debian doesn't feature more in-depth integration of HHVM in the wider module ecosystem), but we still need hhvm-wikidiff2, so I'll update the internal source package to only build the hhvm-wikidiff2 binary package. (And when we've migrated to PHP7 we can remove the internal package entirely.)
@grin: Thanks for the pointer! Since ClamAV has retracted the broken signature (and will make sure this doesn't reoccur) I'll close this task. We're following ClamAV via jessie-updates, so when this is fixed upstream, we'll pick up the new version on short notice once it's released.
Thu, Feb 1
@Cmjohnson Is this ready to be re-pooled with the new DIMM or are you planning further tests which require the server to be out of service?
@RobH: Given Antoine's comment, let's reclaim, then? This host has almost 2.5 years of warranty remaining.
clamav is socket-activated, maybe it tripped over some rule? I installed the new version and the errors are gone for now, let's keep an eye on it.
Let's simply use a Debian base image, then? With the overhead that Kafka adds, the disk space saving of musl over glibc is negligible anyway.
Wed, Jan 31
During the SRE offsite/onsite we came up with the following plan:
HHVM has been available for stretch-wikimedia for quite a while now (used by the video scalers).
Tue, Jan 30
A revised fix has been released (along with 3.18.8), I'll roll that into our packages: https://hhvm.com/blog/2018/01/30/hhvm-3.24.1.html
Yeah, it's probably easiest to add a new disk and move /var/lib/archiva to it
These are fully rolled out:
Thu, Jan 25
Jan 23 2018
For the HHVM builds on apt.wikimedia.org this has been fixed in 3.18.5+dfsg-1+wmf4 (jessie) and 3.18.5+dfsg-1+wmf4+deb9u1 (stretch). Does Travis use the deb packages provided by Facebook?
Jan 19 2018
@Papaul: ethtool shows "Link detected: no" for both network interfaces, the next time you're in the DC could you please check the cabling? (Not time-critical)
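A check like the one above can be scripted. The following is a minimal sketch, assuming the usual `Link detected: yes|no` line in `ethtool` output; the function name, the interface name `eth0` and the sample text are made up for illustration and are not taken from the host in question.

```python
import re
from typing import Optional


def link_detected(ethtool_output: str) -> Optional[bool]:
    """Return True/False for the 'Link detected' line, or None if it is absent."""
    match = re.search(
        r"^\s*Link detected:\s*(yes|no)\s*$", ethtool_output, re.MULTILINE
    )
    if match is None:
        return None
    return match.group(1) == "yes"


# Illustrative sample only (hypothetical interface and values).
SAMPLE = """Settings for eth0:
\tSpeed: Unknown!
\tDuplex: Unknown! (255)
\tLink detected: no
"""

if __name__ == "__main__":
    print(link_detected(SAMPLE))
```

In practice the text would come from running `ethtool <interface>` on the host; here it is passed in as a string so the parsing can be tried without hardware access.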
The hardware configuration from T181419 seems perfectly fine for Cumin masters. I don't have a good estimate how much cheaper a single CPU/32 GB machine would be compared to this setup, so I'll leave it at your (and Faidon/Mark's) discretion whether evaluating a lower spec option is actually worthwhile.
Stretch packages have also been uploaded in the meantime.
Jan 18 2018
Some general docs (targeted at ops pwstore, but pretty similar) are at https://office.wikimedia.org/wiki/Pwstore
Ops and Releng are using pwstore, which is just using a simple git repository underneath for storage.
Is this solely for T174110 or are we anticipating other use cases?
Jan 17 2018
I've built/uploaded new HHVM packages for jessie (stretch following soon) which disable the broken patch and also reported this upstream at https://github.com/facebook/hhvm/issues/8104
Jan 16 2018
I rebooted this spare host for completeness wrt the Meltdown kernel update and while it's now running the fixed kernel, sshd came up running the /etc/ssh/sshd_config.phabricator instead of the regular /etc/ssh/sshd_config. iridium can still be reached via mgmt and is up for decom, so no point in debugging/fixing this IMO.
Jan 15 2018
Jan 12 2018
Jan 11 2018
Given that this task has been stalled for a while now, should we reimage these servers with stretch before eventually putting them into production?
Jan 10 2018
Fixed kernels are available for trusty now, I've installed them on francium and snapshot100[1,5-7].
Jan 8 2018
That's a bug in the systemd unit of prometheus-blazegraph-exporter, it needs to start after Blazegraph, but the current version doesn't declare that, so systemd tries to start it when multi-user.target is reached. I can fix it some time this week.
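The missing ordering can be illustrated with a small check. This is only a sketch: the unit name `blazegraph.service`, the `ExecStart` path and both unit texts are hypothetical stand-ins, not the actual packaged files. It parses a unit file and reports whether `After=` names the dependency. Note that `After=` only orders startup; it does not by itself pull the other unit in.

```python
import configparser


def declares_after(unit_text: str, dependency: str) -> bool:
    """Return True if the [Unit] section lists `dependency` in After=."""
    # strict=False tolerates repeated keys (the last one wins, which is a
    # simplification compared to how systemd merges repeated After= lines).
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep key case, systemd keys are case-sensitive
    parser.read_string(unit_text)
    after = parser.get("Unit", "After", fallback="")
    return dependency in after.split()


# Hypothetical unit without ordering: started whenever multi-user.target is reached.
BROKEN = """\
[Unit]
Description=Prometheus Blazegraph exporter

[Service]
ExecStart=/usr/bin/prometheus-blazegraph-exporter

[Install]
WantedBy=multi-user.target
"""

# Same unit with the ordering declared (the service name is an assumption).
FIXED = BROKEN.replace(
    "[Unit]\n", "[Unit]\nAfter=blazegraph.service\n", 1
)

if __name__ == "__main__":
    print(declares_after(BROKEN, "blazegraph.service"))
    print(declares_after(FIXED, "blazegraph.service"))
```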
Can you also remove apt::use_experimental from the Hiera settings for deployment-prep? There's no point for deployment-prep to use "experimental" at this point.
Jan 5 2018
@Legoktm already prepared a stretch-backports upload of php-luasandbox, so we can use that one. We could update wikidiff2 in stretch-backports to 1.5.1-3 and stick with the Debian releases?
Jan 4 2018
@Paladox: Most of WMCS runs trusty with either the 3.13 or 4.4 kernel and needs an update by Canonical (which isn't available).
Jan 3 2018
The module was initially blacklisted since there were multiple security issues which exploited privilege escalation bugs in overlayfs. Since then trusty has gained support for disabling unprivileged user namespaces (which was enabled), which was the biggest risk. I'm fine with adding a Hiera setting to disable the blacklist for Docker hosts. For the rest of the fleet we don't have a use for it and should keep it blacklisted.
Jan 2 2018
This host still shows up in puppetdb, i.e. the deactivate step was missed (e.g. visible in https://servermon.wikimedia.org/hosts/).