
MoritzMuehlenhoff (Moritz Mühlenhoff)
User

User Details

User Since
Apr 1 2015, 4:33 PM (338 w, 6 d)
Availability
Available
LDAP User
Moritz Mühlenhoff
MediaWiki User
MMuhlenhoff (WMF)

Recent Activity

Yesterday

MoritzMuehlenhoff closed T275873: Prepare our base system layer for Debian 11/bullseye as Resolved.

Bullseye preparations have completed and it's in active use, closing. For future migration tracking, T291916 can be used.

Tue, Sep 28, 10:59 AM · Patch-For-Review, SRE
MoritzMuehlenhoff triaged T291916: Tracking task for Bullseye migrations in production as Medium priority.
Tue, Sep 28, 10:59 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a parent task for T288804: Move the Data Engineering infrastructure to Debian Bullseye: T291916: Tracking task for Bullseye migrations in production.
Tue, Sep 28, 10:58 AM · Analytics-Clusters
MoritzMuehlenhoff added a subtask for T291916: Tracking task for Bullseye migrations in production: T288804: Move the Data Engineering infrastructure to Debian Bullseye.
Tue, Sep 28, 10:58 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a parent task for T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye: T291916: Tracking task for Bullseye migrations in production.
Tue, Sep 28, 10:57 AM · Discovery-Search, SRE
MoritzMuehlenhoff added a subtask for T291916: Tracking task for Bullseye migrations in production: T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye.
Tue, Sep 28, 10:57 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff created T291916: Tracking task for Bullseye migrations in production.
Tue, Sep 28, 10:56 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff added a comment to T289624: Q1: (Need By: TBD) rack/setup/install centrallog2002.codfw.wmnet.

@MoritzMuehlenhoff I don't know if you saw my comment on Sep 10th, but I am having an issue installing Bullseye. I am getting the error below:

Failed to load ldlinux.c32
Boot failed: press a key to retry, or wait for reset...

Thanks

Tue, Sep 28, 8:41 AM · SRE, SRE Observability (FY2021/2022-Q1), ops-codfw, DC-Ops

Mon, Sep 27

herron awarded T288028: Remove the "Long running screen/tmux" Icinga check a Party Time token.
Mon, Sep 27, 1:54 PM · Observability-Alerting, Patch-For-Review, SRE
lmata awarded T288028: Remove the "Long running screen/tmux" Icinga check a Like token.
Mon, Sep 27, 1:37 PM · Observability-Alerting, Patch-For-Review, SRE
MoritzMuehlenhoff closed T288028: Remove the "Long running screen/tmux" Icinga check as Resolved.

Check is now gone.

Mon, Sep 27, 10:29 AM · Observability-Alerting, Patch-For-Review, SRE
MoritzMuehlenhoff added a comment to T268985: Improve user experience for Kerberos by creating automatic token renewal service.

Looking at /var/log/installer it seems stat1005 was installed in 2019 with Stretch and then later on dist-upgraded to Buster (something we rarely do since we prefer reimages, but it happens). Installing usrmerge in this case (and let's check whether other stat* hosts have the same issue) sounds good to me.

Mon, Sep 27, 10:27 AM · Analytics-Kanban, User-MoritzMuehlenhoff, Analytics-Clusters

Fri, Sep 24

MoritzMuehlenhoff added a comment to T286911: Upgrade MXes to Bullseye.

The two VMs (mx1002/mx2002) which were used to test the Bullseye setup have been taken down.

Fri, Sep 24, 10:50 AM · SRE, Patch-For-Review, Infrastructure-Foundations, Mail

Thu, Sep 23

MoritzMuehlenhoff added a comment to T286911: Upgrade MXes to Bullseye.

Both mx1001 and mx2001 are now running Bullseye. There's a little cleanup/followup work, but the core of the work is completed.

Thu, Sep 23, 2:37 PM · SRE, Patch-For-Review, Infrastructure-Foundations, Mail
MoritzMuehlenhoff added a comment to T290982: Support expired tile deduplication.

I can easily rebuild/upload a fixed package for apt.wikimedia.org, though. Just let me know.

Thu, Sep 23, 12:52 PM · Product-Infrastructure-Team-Backlog (Kanban), Maps

Tue, Sep 21

MoritzMuehlenhoff added a comment to T291387: Ensure Cloud Services platforms will accept new LE issuance chain.

The expected version numbers are
openssl1.0: 1.0.2u-1~deb9u5
gnutls28: 3.5.8-5+deb9u6

Tue, Sep 21, 8:14 AM · PAWS, Cloud-VPS, Toolforge, cloud-services-team (Kanban)
MoritzMuehlenhoff added a comment to T291425: Rebuild CI images affected by OpenSSL compat issue with new Let's Encrypt issuance chain.

The expected version numbers are
openssl1.0: 1.0.2u-1~deb9u5
gnutls28: 3.5.8-5+deb9u6

Tue, Sep 21, 8:14 AM · Patch-For-Review, Release-Engineering-Team (Done by Wed 06 Oct), Continuous-Integration-Config
MoritzMuehlenhoff created T291458: Rebuild production Stretch images with GNUTLS/OpenSSL updates for LE issue chain update.
Tue, Sep 21, 8:13 AM · serviceops, SRE

Mon, Sep 20

MoritzMuehlenhoff added a comment to T283714: Python 3's eventlet.green getaddrinfo timeout in Bullseye.

I was able to get a working python3-eventlet package by integrating the upstream PR. The easy solution for now, IMHO, is to upload the package internally for Bullseye.

Mon, Sep 20, 7:17 PM · Patch-For-Review, SRE, User-fgiunchedi, SRE-swift-storage
MoritzMuehlenhoff added a comment to T290982: Support expired tile deduplication.

Script is now deployed on the masters

Mon, Sep 20, 3:37 PM · Product-Infrastructure-Team-Backlog (Kanban), Maps
MoritzMuehlenhoff updated the task description for T290982: Support expired tile deduplication.
Mon, Sep 20, 3:37 PM · Product-Infrastructure-Team-Backlog (Kanban), Maps
MoritzMuehlenhoff added a comment to T283165: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain.

For production:

  • OpenSSL in Buster and Bullseye is not affected (they only ship OpenSSL 1.1)
  • OpenSSL updates for OpenSSL 1.0.2 in Stretch have been rolled out
  • GNUTLS in Bullseye is not affected
  • GNUTLS in Buster was already fixed in Buster 10.10 (rolled out via T285206)
  • GNUTLS updates for Stretch have been rolled out
Mon, Sep 20, 2:28 PM · Patch-For-Review, Infrastructure-Foundations, SRE, Traffic
MoritzMuehlenhoff added a comment to T290982: Support expired tile deduplication.

Should be fixed here after this change.

Mon, Sep 20, 8:00 AM · Product-Infrastructure-Team-Backlog (Kanban), Maps
MoritzMuehlenhoff created T291353: Check home/HDFS leftovers of mholloway-shell.
Mon, Sep 20, 6:36 AM · Analytics-Kanban, Analytics

Fri, Sep 17

MoritzMuehlenhoff added a comment to T290982: Support expired tile deduplication.

@Jgiannelos One of the tests fails with Python 3.7 (the Python version in Buster):

Fri, Sep 17, 12:40 PM · Product-Infrastructure-Team-Backlog (Kanban), Maps
MoritzMuehlenhoff added a comment to T291052: Deploy PHP patch for DOM replaceChild/removeChild performance.

Ack, I'll upload to apt.wikimedia.org on Monday.

Fri, Sep 17, 12:06 PM · Patch-For-Review, SRE, serviceops

Thu, Sep 16

MoritzMuehlenhoff added a comment to T290982: Support expired tile deduplication.

The approach of the CLI looks good to me. We should now see how to backport the script to Debian Buster for use on the maps clusters.

@MoritzMuehlenhoff do you have any thoughts regarding the debian packaging backport? How can we proceed with this?

Thu, Sep 16, 3:37 PM · Product-Infrastructure-Team-Backlog (Kanban), Maps
MoritzMuehlenhoff added a project to T286911: Upgrade MXes to Bullseye: SRE.
Thu, Sep 16, 3:17 PM · SRE, Patch-For-Review, Infrastructure-Foundations, Mail
MoritzMuehlenhoff added a comment to T286911: Upgrade MXes to Bullseye.

Status update: mx2001 is reimaged to Bullseye and working fine so far. The smart hosts config on our servers has been switched to prefer mx2001 over mx1001 and the MX records of a handful of lesser used domains now point to mx2001.
If there are no further issues, the remaining DNS records will be updated on Monday and, following that, mx1001 will be reimaged some time mid next week.

Thu, Sep 16, 3:17 PM · SRE, Patch-For-Review, Infrastructure-Foundations, Mail
MoritzMuehlenhoff added a comment to T291052: Deploy PHP patch for DOM replaceChild/removeChild performance.

scandium has been upgraded. If tests are fine, I'd upload to apt.wikimedia.org

Thu, Sep 16, 8:05 AM · Patch-For-Review, SRE, serviceops
MoritzMuehlenhoff added a comment to T199911: Systemd session creation fails under I/O load.

Since the Thanos hosts run Buster and a more recent kernel/glibc/systemd, I disabled the cleanup cron job on these hosts, so that we can check whether this got fixed. If Buster is still affected we can add the cron job back.

Thu, Sep 16, 7:55 AM · Infrastructure-Foundations, SRE, SRE-tools
MoritzMuehlenhoff added a comment to T290766: Requesting access to analytics-privatedata-users for Michael Raish (Design Strategy).

Hi @cmooney , actually I just checked again (80 minutes later) and I actually do have the access I need now. Maybe it took a while for everything to fall into place?

Thu, Sep 16, 7:40 AM · SRE, SRE-Access-Requests

Wed, Sep 15

MoritzMuehlenhoff added a comment to T290766: Requesting access to analytics-privatedata-users for Michael Raish (Design Strategy).

Hi @cmooney, thanks for noticing that. Yes, the 'mraish' account was set up when I was still contracting, and I set up the 'Mikeraish' account when I converted and linked it to my wmf email. It would be great to remove the original 'mraish' account and add access to 'Mikeraish' as you suggested. I just signed in to the old account looking for a way to delete it, but I wasn't able to find one. Should this deletion ideally come from your end or from mine?

Wed, Sep 15, 3:20 PM · SRE, SRE-Access-Requests
MoritzMuehlenhoff added a project to T291052: Deploy PHP patch for DOM replaceChild/removeChild performance: SRE.
Wed, Sep 15, 2:42 PM · Patch-For-Review, SRE, serviceops
MoritzMuehlenhoff added a comment to T291052: Deploy PHP patch for DOM replaceChild/removeChild performance.

Sure thing, I'll upgrade scandium tomorrow morning then.

Wed, Sep 15, 2:41 PM · Patch-For-Review, SRE, serviceops
MoritzMuehlenhoff updated subscribers of T291052: Deploy PHP patch for DOM replaceChild/removeChild performance.

I've made an updated PHP 7.2 package with a 7.2 backport of https://github.com/php/php-src/commit/781e6b4d214012e9b9c0cf96a239cdf9f948da91

Wed, Sep 15, 1:32 PM · Patch-For-Review, SRE, serviceops
MoritzMuehlenhoff added a comment to T290984: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4].

That page mentions that at least firmware version NVM 6.01 (for the NIC) and a current driver version are required. According to ethtool, the X710 in ms-be1051 has firmware 6.8, which should be OK. But it doesn't show the LLDP disable option when I run the ethtool "--show-priv-flags" command:

Wed, Sep 15, 10:15 AM · Infrastructure-Foundations, Puppet
MoritzMuehlenhoff added a comment to T290984: error while resolving custom fact "lldp_neighbors" on ms-be105[1-9], ms-be205[1-6] and relforge100[3-4].
  • Decide on a way to have this done at boot-time for affected hosts.
    • That also involves working out how to deal with this via automation; one difficulty is identifying hosts using the affected Intel NIC, and the PCI ID of the affected interface on each (which is part of the path the command gets echoed to).
Wed, Sep 15, 10:12 AM · Infrastructure-Foundations, Puppet
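
The host and interface identification described in the T290984 comment above could be approached roughly as follows. This is only a sketch, not the fix that was deployed: it assumes the affected X710 cards are bound to the Linux i40e driver and that the standard sysfs layout under /sys/class/net is available. The function name and its parameters are hypothetical and chosen for illustration.

```python
import os
from pathlib import Path


def find_nics_by_driver(driver="i40e", sysfs_root="/sys/class/net"):
    """Map interface name -> PCI address for NICs bound to `driver`.

    Assumption: each physical interface directory contains a `device`
    symlink to its PCI device, which in turn has a `driver` symlink.
    """
    matches = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return matches
    for iface in sorted(root.iterdir()):
        device = iface / "device"
        driver_link = device / "driver"
        if not driver_link.exists():
            continue  # virtual interfaces (lo, bridges) have no PCI device
        if os.path.basename(os.path.realpath(driver_link)) != driver:
            continue  # bound to a different driver
        # The resolved device path ends in the PCI address of the NIC.
        matches[iface.name] = os.path.basename(os.path.realpath(device))
    return matches


if __name__ == "__main__":
    for name, pci_address in find_nics_by_driver().items():
        print(f"{name}\t{pci_address}")
```

The sketch only performs discovery. What is then written to the per-device path, and how that is wired into boot-time automation, is left open, as in the comment itself.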
MoritzMuehlenhoff created T291060: Check home/HDFS leftovers of kaywong.
Wed, Sep 15, 7:55 AM · Analytics-Kanban, Analytics

Tue, Sep 14

MoritzMuehlenhoff updated the task description for T210704: Migrate node-based services in production to node10.
Tue, Sep 14, 12:39 PM · Patch-For-Review, Platform Team Initiatives (Containerise Services), serviceops, SRE
MoritzMuehlenhoff updated the task description for T210704: Migrate node-based services in production to node10.
Tue, Sep 14, 12:39 PM · Patch-For-Review, Platform Team Initiatives (Containerise Services), serviceops, SRE

Mon, Sep 13

MoritzMuehlenhoff added a comment to T286911: Upgrade MXes to Bullseye.

mx2001 is now filtered on the routers. In case there are any issues, this can be reverted by merging https://gerrit.wikimedia.org/r/720783 and running 'homer "cr*" merge' on cumin2002.

Mon, Sep 13, 4:09 PM · SRE, Patch-For-Review, Infrastructure-Foundations, Mail
MoritzMuehlenhoff created P17267 (An Untitled Masterwork).
Mon, Sep 13, 3:27 PM
MoritzMuehlenhoff added a comment to T210704: Migrate node-based services in production to node10.

Not sure why restbase is ticked off, though? The restbase hosts in production currently run nodejs 6.11 still.

Mon, Sep 13, 2:12 PM · Patch-For-Review, Platform Team Initiatives (Containerise Services), serviceops, SRE
MoritzMuehlenhoff added a comment to T289779: Create a new ldap group for sre users without root access.

@MoritzMuehlenhoff I created the new sre-admins LDAP group manually as I couldn't see a Puppet way. Pinging in case I missed something.

Mon, Sep 13, 7:12 AM · User-jbond, LDAP, Infrastructure-Foundations, Security

Fri, Sep 10

MoritzMuehlenhoff added a comment to T277739: rsyslog-kubernetes missing in buster-wikimedia.

Bullseye is out and there is no rsyslog-kubernetes in it. Maybe we could start working with upstream to have it in unstable first and possibly in backports?

Fri, Sep 10, 1:26 PM · SRE, observability
MoritzMuehlenhoff added a comment to T283165: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain.

As mentioned in the issue description, Debian backported the fix for OpenSSL, as can be seen on a current Debian Jessie container:

root@69310d82543d:~# cat /etc/debian_version 
8.11
root@69310d82543d:~# openssl version
OpenSSL 1.0.1t  3 May 2016
root@69310d82543d:~# openssl verify -CAfile rsa-2048.chain.crt rsa-2048.crt 
rsa-2048.crt: OK
root@69310d82543d:~# openssl x509 -dates -noout -in rsa-2048.crt 
notBefore=May 10 13:15:07 2021 GMT
notAfter=Aug  8 13:15:07 2021 GMT
Fri, Sep 10, 11:36 AM · Patch-For-Review, Infrastructure-Foundations, SRE, Traffic
MoritzMuehlenhoff added projects to T283165: OpenSSL < 1.1.0 compatibility issues with new LE issuance chain: SRE, Infrastructure-Foundations.
Fri, Sep 10, 11:33 AM · Patch-For-Review, Infrastructure-Foundations, SRE, Traffic
MoritzMuehlenhoff added a comment to T276589: migrate services from cumin2001 to cumin2002.

If it's not too much trouble, it would be nice if cumin2001 could have a MOTD pointing you to cumin2002. If you accidentally log into cumin2001 you'll end up trying to run cookbooks that haven't been updated since May :/

Fri, Sep 10, 9:43 AM · Patch-For-Review, SRE
MoritzMuehlenhoff created T290715: Check home/HDFS leftovers of jmads.
Fri, Sep 10, 7:45 AM · Analytics-Kanban, Analytics

Thu, Sep 9

MoritzMuehlenhoff added a comment to T286905: Add logout.d script for Gerrit.

Adding this functionality goes a little beyond the scope of the logout.d scripts, I think. Right now running these scripts is fully idempotent and every logout action really only logs out, while this would actually modify account state.

Thu, Sep 9, 2:55 PM · Release-Engineering-Team (Radar), Patch-For-Review, Gerrit, Infrastructure-Foundations, User-jbond, CAS-SSO, SRE
MoritzMuehlenhoff closed T287566: Add logout.d script for Wikitech as Resolved.

@Majavah, I've merged your patch and confirmed that it works fine :-) Thanks!

Thu, Sep 9, 2:27 PM · cloud-services-team (Kanban), wikitech.wikimedia.org, Infrastructure-Foundations, SRE
MoritzMuehlenhoff closed T287566: Add logout.d script for Wikitech, a subtask of T283242: Cookbook for centralised logouts and session status queries , as Resolved.
Thu, Sep 9, 2:27 PM · Infrastructure-Foundations, User-jbond, Patch-For-Review, CAS-SSO, SRE

Mon, Sep 6

MoritzMuehlenhoff added a comment to T289802: GitLab major version upgrade: 14.x.

I've uploaded 14.0.10, we can bump the import hook after the initial update is complete.

Mon, Sep 6, 1:53 PM · Release-Engineering-Team (Yak Shaving 🐃🪒), serviceops, User-brennen, GitLab
MoritzMuehlenhoff updated the task description for T284811: Upgrade eqiad/codfw Ganeti clusters to Buster.
Mon, Sep 6, 11:40 AM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff closed T290410: Absent bteshome ldap account (including ldap/wmde removal) as Resolved.

Thanks, access for the nda and wmde groups has been removed.

Mon, Sep 6, 9:54 AM · SRE, LDAP-Access-Requests
MoritzMuehlenhoff closed T290411: Absent ataherivand ldap account (including ldap/wmde removal) as Resolved.

Thanks, access to the wmde group has been removed.

Mon, Sep 6, 9:44 AM · SRE, LDAP-Access-Requests
MoritzMuehlenhoff closed T290412: Absent johlig ldap account (including ldap/wmde removal) as Resolved.

Thanks, access to the wmde group has been removed.

Mon, Sep 6, 9:41 AM · SRE, LDAP-Access-Requests
MoritzMuehlenhoff closed T290413: Absent jkroll ldap account (including ldap/wmde removal) as Resolved.

Thanks, access to the wmde and nda groups has been removed.

Mon, Sep 6, 9:36 AM · SRE, LDAP-Access-Requests

Thu, Sep 2

MoritzMuehlenhoff added a comment to T289135: Upgrade Cirrus Elasticsearch clusters to Debian Bullseye.

Two things here:

Thu, Sep 2, 1:13 PM · Discovery-Search, SRE
MoritzMuehlenhoff added a comment to T271736: Migrate WMF Production from PHP 7.2 to PHP 7.4.

I'm getting tired of putting up with dinosaur versions of PHP in production. PHP is our core business.
7.4.0 was released in November 2019, so presumably will stop receiving bug fixes in less than 3 months. 8.0.x is the latest stable and has lots of nice features, like the JIT. That seems like a reasonable target at this point.

Thu, Sep 2, 12:40 PM · serviceops
MoritzMuehlenhoff created T290242: mw2264 went down.
Thu, Sep 2, 11:24 AM · SRE, serviceops, ops-codfw
MoritzMuehlenhoff created T290232: Check home/HDFS leftovers of gilles.
Thu, Sep 2, 9:01 AM · Performance-Team, Analytics
MoritzMuehlenhoff created T290231: Check home/HDFS leftovers of fdans.
Thu, Sep 2, 8:38 AM · Analytics-Kanban, Analytics

Aug 12 2021

MoritzMuehlenhoff closed T287960: Import the openjdk8 packages in Bullseye as Resolved.

OpenJDK 8u302 has been rebuilt against the bootstrap packages (which were removed) and eventually imported. Resolving this, please reopen if you run into any issues.

Aug 12 2021, 3:49 PM · Infrastructure-Foundations, Analytics, SRE
MoritzMuehlenhoff closed T287960: Import the openjdk8 packages in Bullseye, a subtask of T275873: Prepare our base system layer for Debian 11/bullseye, as Resolved.
Aug 12 2021, 3:49 PM · Patch-For-Review, SRE
MoritzMuehlenhoff added a comment to T288630: Please create a Ganeti VM for Wikidough in ulsfo.

Please don't create new instances with 10G "disks"; these tend to cause more work in the long term, e.g. by filling up the root partition with kernels etc. 15G or 20G is a good minimum.

Aug 12 2021, 3:38 PM · SRE-tools, Infrastructure-Foundations, SRE, Traffic, vm-requests
MoritzMuehlenhoff updated the task description for T288028: Remove the "Long running screen/tmux" Icinga check.
Aug 12 2021, 9:29 AM · Observability-Alerting, Patch-For-Review, SRE
MoritzMuehlenhoff added a comment to T275873: Prepare our base system layer for Debian 11/bullseye.

This is tracked by upstream at https://github.com/prometheus/node_exporter/issues/1892 and their solution is to also mask the RAPL collector (https://github.com/wagdav/homelab/commit/26fc86c6a79a5f1a634c7b313f86c0b6109539c0), so I think we can simply do that fleet-wide (including older distros)? I don't think we're currently using that data in any way.

+1 on disabling the collector (on >= bullseye, since it's been introduced in node-exporter 1.0.0)

Given that this also applies to Bullseye and there's still time to land the fix in testing, I filed https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=991160.

Aug 12 2021, 6:52 AM · Patch-For-Review, SRE

Aug 11 2021

MoritzMuehlenhoff added a project to T287960: Import the openjdk8 packages in Bullseye: Infrastructure-Foundations.
Aug 11 2021, 10:03 AM · Infrastructure-Foundations, Analytics, SRE
MoritzMuehlenhoff added a comment to T287960: Import the openjdk8 packages in Bullseye.

OpenJDK 8 needs OpenJDK 8 to build itself. I'm currently building an initial package on my laptop to bootstrap this (and import it to component/jdk8), which will then be used to build openjdk on deneb (and which will replace the interim package). The component should not be used until the final package is built/imported.

Aug 11 2021, 10:03 AM · Infrastructure-Foundations, Analytics, SRE
fgiunchedi awarded T190693: Extend dpkg Icinga check to also check for inconsistent apt state a Like token.
Aug 11 2021, 9:34 AM · Observability-Alerting, Icinga, observability, SRE

Aug 10 2021

MoritzMuehlenhoff added a comment to T262446: Import row information into Netbox for Ganeti instances.

@MoritzMuehlenhoff @Volans :
Instead of adding a custom field and machinery to keep it up to date, what do you think of reorganizing the existing data:
At least on the network level, it seems more appropriate to call a Netbox cluster what Ganeti calls a group.
For example create an "eqiad row A" cluster, that consists of the row A hypervisors, as they share the same failure domain.
Once done across all rows, create a "cluster group" that regroups all the eqiad clusters, behind ganeti01.svc.eqiad.wmnet

Aug 10 2021, 1:15 PM · Infrastructure-Foundations, netbox
MoritzMuehlenhoff added a comment to T286911: Upgrade MXes to Bullseye.

To prevent this test server from accidentally messing with our existing production mail infrastructure, I'd like to also filter port 25 for mx2002.wikimedia.org on the router level. @cmooney or @ayounsi, is that something you could set up?

Done, let us know when to rollback.

Aug 10 2021, 11:02 AM · SRE, Patch-For-Review, Infrastructure-Foundations, Mail

Aug 4 2021

MoritzMuehlenhoff added a comment to T286206: Create Ganeti test cluster.

The Ganeti test cluster has been set up, along with two test instances (testvm2001/2002). Next it will be used to test the Buster update.

Aug 4 2021, 3:55 PM · Patch-For-Review, Infrastructure-Foundations, SRE
MoritzMuehlenhoff updated subscribers of T286911: Upgrade MXes to Bullseye.

Ok, so let's proceed with option two. There's a test instance mx2002.wikimedia.org which I'll set up with the mx role and Bullseye next week.

Aug 4 2021, 3:02 PM · SRE, Patch-For-Review, Infrastructure-Foundations, Mail
MoritzMuehlenhoff triaged T288036: Switch buffer re-partition - cloudsw1-c8-eqiad as Medium priority.
Aug 4 2021, 2:22 PM · cloud-services-team (Kanban), SRE, Infrastructure-Foundations, netops
MoritzMuehlenhoff triaged T288037: Switch buffer re-partition - cloudsw1-d5-eqiad as Medium priority.
Aug 4 2021, 2:22 PM · cloud-services-team (Kanban), SRE, netops, Infrastructure-Foundations
MoritzMuehlenhoff added a comment to T288028: Remove the "Long running screen/tmux" Icinga check.

I dug deeper and the cause seems to be a 2017 incident mentioned in the meeting notes as:

screen "api-hhvm-restarts" on neodymium restarted a bunch of api servers on Fri (screen from 2016, now stopped)

Aug 4 2021, 9:18 AM · Observability-Alerting, Patch-For-Review, SRE
MoritzMuehlenhoff triaged T288028: Remove the "Long running screen/tmux" Icinga check as Medium priority.
Aug 4 2021, 8:44 AM · Observability-Alerting, Patch-For-Review, SRE
MoritzMuehlenhoff added a comment to T288024: releases1002 /srv/docker DISK SPACE alert.

If there's no immediate fix on the Jenkins side we should add a systemd timer to trigger a cleanup before this escalates to alerts

Aug 4 2021, 7:42 AM · SRE, Release-Engineering-Team
MoritzMuehlenhoff closed T286776: LDAP Access Request for WMDE Employee - Elena Aleynikova as Resolved.

@elal : I've added you to the cn=nda and cn=wmde LDAP groups. You should now be able to access Superset. If you run into any issues, please reopen the task.

Aug 4 2021, 6:17 AM · SRE, LDAP-Access-Requests
MoritzMuehlenhoff added a comment to T286776: LDAP Access Request for WMDE Employee - Elena Aleynikova.

@RLazarus I am confirming the NDA has been signed. Please proceed with the access request. Thanks!

Aug 4 2021, 6:15 AM · SRE, LDAP-Access-Requests
MoritzMuehlenhoff created T288028: Remove the "Long running screen/tmux" Icinga check.
Aug 4 2021, 5:49 AM · Observability-Alerting, Patch-For-Review, SRE
MoritzMuehlenhoff assigned T288024: releases1002 /srv/docker DISK SPACE alert to hashar.

Antoine, could you please have a look whether we can free something?

Aug 4 2021, 5:39 AM · SRE, Release-Engineering-Team
MoritzMuehlenhoff triaged T287983: Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage as Low priority.
Aug 4 2021, 5:38 AM · serviceops, envoy, Sustainability (Incident Followup), SRE, Traffic
MoritzMuehlenhoff claimed T287960: Import the openjdk8 packages in Bullseye.

Sure thing, I'll take care of this next week.

Aug 4 2021, 5:36 AM · Infrastructure-Foundations, Analytics, SRE

Aug 3 2021

MoritzMuehlenhoff added a comment to T287954: Add database host removal from Orchestrator to sre.hosts.decommission cookbook.

How about we create a mechanism similar to the logout.d scripts, but for decom? Let's say we create a new /etc/wikimedia/decom.d directory where each service can drop a decom script (in this case it would be installed on each DB host managed in Orchestrator) with the steps which ought to be taken when a host running this service gets decommed. These files would get executed locally by the decom cookbook (and would trigger the de-registration on orch1001). This way we can flexibly extend custom decom workflows like this without tying them to a change in the decom cookbook (and also keep it leaner).

Aug 3 2021, 2:19 PM · Infrastructure-Foundations, SRE-tools, Orchestrator
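
The decom.d proposal in the T287954 comment above could be sketched along these lines. This is a minimal illustration under stated assumptions, not the decom cookbook's actual code: the function name, the convention of passing the hostname as the only argument, and the dry-run switch are all hypothetical. Only the /etc/wikimedia/decom.d directory comes from the comment.

```python
import os
import subprocess
from pathlib import Path


def run_decom_scripts(directory, hostname, dry_run=False):
    """Run every executable file in `directory` in sorted name order.

    Returns a list of (script_name, exit_code) tuples. In dry-run mode
    nothing is executed and the exit code is reported as None.
    """
    results = []
    path = Path(directory)
    if not path.is_dir():
        return results  # no service registered a decom step on this host
    for script in sorted(path.iterdir()):
        if not script.is_file() or not os.access(script, os.X_OK):
            continue  # skip READMEs, backups and other non-executables
        if dry_run:
            results.append((script.name, None))
            continue
        # Hypothetical calling convention: hostname as the sole argument.
        proc = subprocess.run([str(script), hostname], check=False)
        results.append((script.name, proc.returncode))
    return results


if __name__ == "__main__":
    # "db9999.example" is a placeholder hostname, not a real host.
    for name, code in run_decom_scripts(
        "/etc/wikimedia/decom.d", "db9999.example", dry_run=True
    ):
        print(name, code)
```

Collecting exit codes rather than stopping at the first failure lets the caller decide whether a failed service-specific step should abort the wider decommission; that policy is a design choice the comment leaves open.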
MoritzMuehlenhoff added a comment to T275696: reclaim cescout1001.eqiad.wmnet.

Noticed this during clinic duty: @ssingh If the decom cookbook ran on the host, you can tick off the relevant parts under "Steps for service owner" and reassign to John Clark.

Aug 3 2021, 11:41 AM · DC-Ops, ops-eqiad, SRE, Traffic, decommission-hardware
MoritzMuehlenhoff added a comment to T222113: prometheus: upgrade to >= 2.12.

Since the codfw/eqiad Prometheus hosts are going to be replaced with new HW in Q2, I'm going to force-install prometheus on the Stretch hosts for now. It means the UI won't necessarily be functional in the meantime (i.e. when accessed via ssh tunnels); however, nowadays the UI at https://thanos.wikimedia.org should be used instead for queries.

Aug 3 2021, 8:54 AM · Observability-Metrics, media-backups, Data-Persistence-Backup, User-fgiunchedi, Sustainability (Incident Followup), SRE

Aug 2 2021

MoritzMuehlenhoff added a comment to T287852: Failover m2 master (db1107) to a different host to upgrade its kernel.

@MoritzMuehlenhoff @dpifke @Krinkle @bd808 @hnowlan @kostajh I am planning to failover this host on Thursday at 08:00 AM UTC.
This means there will be a few seconds of read only time (I expect between 5 and 10 seconds) if all goes fine.

Please let me know if there's something that requires action and Thursday isn't a doable date.
Thanks!

Aug 2 2021, 1:25 PM · DBA, Znuny, SRE, Recommendation-API, Infrastructure-Foundations, SRE-tools
MoritzMuehlenhoff renamed T287222: Clean up old Docker images on deneb from deneb.codfw.wmnet root partition is full to Clean up old Docker images on deneb.
Aug 2 2021, 7:02 AM · serviceops, SRE
MoritzMuehlenhoff triaged T287838: Degraded RAID on cloudcephosd1008 as Medium priority.
Aug 2 2021, 7:01 AM · cloud-services-team (Kanban), SRE, ops-eqiad
MoritzMuehlenhoff triaged T287792: WARNING: opcache cache-hit ratio is below 99.99% on multiple eqiad appservers and parsoid servers as Medium priority.
Aug 2 2021, 7:00 AM · serviceops, Performance-Team, SRE
MoritzMuehlenhoff triaged T287546: CommRel support for September 2021 Switchover as Medium priority.
Aug 2 2021, 7:00 AM · CommRel-Specialists-Support (Jul-Sep-2021), Datacenter-Switchover, SRE
MoritzMuehlenhoff created T287845: Check home/HDFS leftovers of gsingers.
Aug 2 2021, 5:26 AM · Analytics

Jul 30 2021

MoritzMuehlenhoff claimed T287763: Add logout.d script for Kerberos.
Jul 30 2021, 2:57 PM · Infrastructure-Foundations, User-jbond, CAS-SSO, SRE
MoritzMuehlenhoff created T287763: Add logout.d script for Kerberos.
Jul 30 2021, 2:56 PM · Infrastructure-Foundations, User-jbond, CAS-SSO, SRE
MoritzMuehlenhoff renamed T287753: Block FUSE (kernel module/package) on hosts which don't need it from Blacklist FUSE to Block FUSE (kernel module/package) on hosts which don't need it.
Jul 30 2021, 12:29 PM · Infrastructure-Foundations, SRE
MoritzMuehlenhoff created T287753: Block FUSE (kernel module/package) on hosts which don't need it.
Jul 30 2021, 12:28 PM · Infrastructure-Foundations, SRE