
MoritzMuehlenhoff (Moritz Mühlenhoff)
User

User Details

User Since
Apr 1 2015, 4:33 PM (236 w, 5 d)
Availability
Available
LDAP User
Moritz Mühlenhoff
MediaWiki User
MMuhlenhoff (WMF) [ Global Accounts ]

Recent Activity

Today

MoritzMuehlenhoff added a comment to T235405: Build cergen for buster.

networkx has some breaking API changes between 1.x and 2.x which are non-trivial to resolve. To unbreak the use of cergen on buster the build has been adapted to use a forward-ported 1.11 package on a separate component for buster-wikimedia (component/cergen, which now also includes cergen itself).

Mon, Oct 14, 5:01 PM · Patch-For-Review, Operations
MoritzMuehlenhoff moved T233289: Decommission ms-be1027 from Backlog to pending onsite steps (eqiad) on the decommission board.
Mon, Oct 14, 3:16 PM · decommission, Operations, ops-eqiad
MoritzMuehlenhoff moved T233080: Decommission analytics1032 from Backlog to pending onsite steps (eqiad) on the decommission board.
Mon, Oct 14, 3:15 PM · decommission, Operations
MoritzMuehlenhoff updated subscribers of T235427: Serve volatile uri from local site.

Adding @BBlack, @ema, @Vgutierrez for explicit input/signoff wrt the GeoIP sub directory.

Mon, Oct 14, 3:02 PM · Patch-For-Review, Operations, Puppet
MoritzMuehlenhoff added a comment to T235250: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster.

Thanks, could we try upgrading the BIOS/firmware initially on 2002? Maybe tomorrow (I'd prepare the server so that it can be taken down without impact)?

Mon, Oct 14, 1:32 PM · ops-codfw, Operations
MoritzMuehlenhoff moved T224475: Return sulfur to spares from Ready for Decommission to pending onsite steps (eqiad) on the decommission board.
Mon, Oct 14, 8:21 AM · decommission, ops-eqiad, Operations
MoritzMuehlenhoff reassigned T224475: Return sulfur to spares from RobH to Cmjohnson.
Mon, Oct 14, 8:21 AM · decommission, ops-eqiad, Operations
MoritzMuehlenhoff created T235406: maps1002: Failed power supply.
Mon, Oct 14, 7:50 AM · Discovery-Search (Current work), Operations, ops-eqiad
MoritzMuehlenhoff created T235405: Build cergen for buster.
Mon, Oct 14, 7:42 AM · Patch-For-Review, Operations

Fri, Oct 11

MoritzMuehlenhoff updated the task description for T232308: Integrate Stretch 9.10/9.11 point updates.
Fri, Oct 11, 12:57 PM · Operations
MoritzMuehlenhoff added a comment to T232707: Requesting access to analytics cluster for Martin Gerlach.

Try running

Fri, Oct 11, 11:20 AM · Analytics, Operations, SRE-Access-Requests
MoritzMuehlenhoff added a comment to T214364: CDH Jessie dependencies not available on Stretch.

Can we narrow down which component needs libssl1.0.0? One of the many outdated/bundled ones?

Fri, Oct 11, 10:33 AM · Analytics, User-Elukey
MoritzMuehlenhoff updated the task description for T224549: Track remaining jessie systems in production.
Fri, Oct 11, 9:24 AM · Operations
MoritzMuehlenhoff triaged T235250: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster as Normal priority.
Fri, Oct 11, 7:35 AM · ops-codfw, Operations
MoritzMuehlenhoff created T235250: No microcode updates loaded on puppetmaster2001/2002 after reimage to Buster.
Fri, Oct 11, 7:34 AM · ops-codfw, Operations

Thu, Oct 10

MoritzMuehlenhoff added a comment to T226089: Make the Kerberos infrastructure production ready.

One other thing (not necessarily now) is to add a monitoring check, e.g. https://exchange.nagios.org/directory/Plugins/Security/check_krb5

My idea was to add initially only a nagios process count check, and then think about something like check_krb5. Would it be reasonable?

Makes sense, let's split this to a separate task.

And we should also have an Icinga check to ensure the replica is up-to-date.

In theory this should be ensured by the replication script exiting with a zero return code, no?

Thu, Oct 10, 1:53 PM · Operations, Analytics-Kanban, User-Elukey, Analytics
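
The replica freshness check discussed in the T226089 comment above could look roughly like the following. This is a minimal sketch, not the check that was deployed: the function name, the thresholds and the idea of feeding it the timestamp of the last successful replication run are all assumptions. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL) follow the usual Nagios/Icinga plugin convention.

```python
import time

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_replica_freshness(last_sync_epoch, now=None,
                            warn_after=3600, crit_after=7200):
    """Map the age of the last successful replication to a plugin status.

    last_sync_epoch, warn_after and crit_after are hypothetical inputs;
    how the timestamp is recorded by the replication script is not
    specified in the task.
    """
    now = time.time() if now is None else now
    age = int(now - last_sync_epoch)
    if age >= crit_after:
        return CRITICAL, f"CRITICAL: last successful replication {age}s ago"
    if age >= warn_after:
        return WARNING, f"WARNING: last successful replication {age}s ago"
    return OK, f"OK: last successful replication {age}s ago"
```

A check of this shape would also alert when the replication script stops running altogether, which a zero exit code alone cannot signal.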
MoritzMuehlenhoff added a comment to T226089: Make the Kerberos infrastructure production ready.

One other thing (not necessarily now) is to add a monitoring check, e.g. https://exchange.nagios.org/directory/Plugins/Security/check_krb5

My idea was to add initially only a nagios process count check, and then think about something like check_krb5. Would it be reasonable?

Makes sense, let's split this to a separate task.

Thu, Oct 10, 1:03 PM · Operations, Analytics-Kanban, User-Elukey, Analytics
MoritzMuehlenhoff created T235163: Investigate GID allocation for system users.
Thu, Oct 10, 10:14 AM · Operations
MoritzMuehlenhoff removed a project from T235161: Improve management of users/groups on servers in production: LDAP.
Thu, Oct 10, 10:12 AM · Operations
MoritzMuehlenhoff created T235162: Restrict GIDs for system users to 499 as the upper boundary.
Thu, Oct 10, 10:11 AM · Patch-For-Review, Operations
MoritzMuehlenhoff created T235161: Improve management of users/groups on servers in production.
Thu, Oct 10, 10:07 AM · Operations
MoritzMuehlenhoff added a comment to T235067: reimage of puppet servers can fail.

I looked into this and it's quite a mess!

Thu, Oct 10, 10:03 AM · Puppet
mmodell awarded T235140: package wikimedia-lvs-realserver for buster a Like token.
Thu, Oct 10, 7:19 AM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, serviceops, Phabricator, Operations
MoritzMuehlenhoff closed T235140: package wikimedia-lvs-realserver for buster, a subtask of T190568: Reimage both phab1001 and phab2001 to stretch / buster, as Resolved.
Thu, Oct 10, 7:18 AM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, serviceops, Phabricator, Operations
MoritzMuehlenhoff closed T235140: package wikimedia-lvs-realserver for buster as Resolved.

The package was already copied between jessie and stretch before, has no runtime dependencies that need updating, and the if-up.d/if-down.d interfaces are unchanged between stretch and buster, so I simply copied the deb over from stretch-wikimedia to buster-wikimedia.

Thu, Oct 10, 7:18 AM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, serviceops, Phabricator, Operations

Wed, Oct 9

MoritzMuehlenhoff added a comment to T226089: Make the Kerberos infrastructure production ready.

One other thing (not necessarily now) is to add a monitoring check, e.g. https://exchange.nagios.org/directory/Plugins/Security/check_krb5

My idea was to add initially only a nagios process count check, and then think about something like check_krb5. Would it be reasonable?

Wed, Oct 9, 2:08 PM · Operations, Analytics-Kanban, User-Elukey, Analytics
MoritzMuehlenhoff added a comment to T226089: Make the Kerberos infrastructure production ready.

One other thing (not necessarily now) is to add a monitoring check, e.g. https://exchange.nagios.org/directory/Plugins/Security/check_krb5

Wed, Oct 9, 11:37 AM · Operations, Analytics-Kanban, User-Elukey, Analytics
MoritzMuehlenhoff added a project to T226089: Make the Kerberos infrastructure production ready: Operations.
Wed, Oct 9, 10:29 AM · Operations, Analytics-Kanban, User-Elukey, Analytics
MoritzMuehlenhoff added a comment to T226089: Make the Kerberos infrastructure production ready.

Another thing we need to do: Add a new flag to data.yaml to annotate that a user is kerberos-enabled (as we need to ensure that Kerberos user principals are also dropped when offboarding users).

Wed, Oct 9, 10:26 AM · Operations, Analytics-Kanban, User-Elukey, Analytics
MoritzMuehlenhoff added a comment to T234462: reclaim/decom/whatever labpuppetmaster1001 and 1002.

There are still references to labpuppetmaster* in Puppet (e.g. hieradata/eqiad/profile/openstack/eqiad1/puppetmaster.yaml); please remove these fully before handing over for reclaim.

Wed, Oct 9, 9:21 AM · decommission, DC-Ops
MoritzMuehlenhoff reassigned T234045: decommission elastic1017 from RobH to Cmjohnson.
Wed, Oct 9, 9:18 AM · Operations, DC-Ops, decommission
MoritzMuehlenhoff moved T234045: decommission elastic1017 from Backlog to pending onsite steps (eqiad) on the decommission board.
Wed, Oct 9, 9:18 AM · Operations, DC-Ops, decommission
MoritzMuehlenhoff moved T234909: decommission auth1001 from Backlog to pending onsite steps (eqiad) on the decommission board.
Wed, Oct 9, 9:18 AM · ops-eqiad, Operations, DC-Ops, decommission
MoritzMuehlenhoff reassigned T234909: decommission auth1001 from MoritzMuehlenhoff to Cmjohnson.
Wed, Oct 9, 9:16 AM · ops-eqiad, Operations, DC-Ops, decommission
MoritzMuehlenhoff updated the task description for T224549: Track remaining jessie systems in production.
Wed, Oct 9, 9:09 AM · Operations
MoritzMuehlenhoff updated the task description for T234909: decommission auth1001.
Wed, Oct 9, 9:01 AM · ops-eqiad, Operations, DC-Ops, decommission
MoritzMuehlenhoff updated the task description for T234909: decommission auth1001.
Wed, Oct 9, 9:00 AM · ops-eqiad, Operations, DC-Ops, decommission

Tue, Oct 8

MoritzMuehlenhoff updated the task description for T234909: decommission auth1001.
Tue, Oct 8, 12:37 PM · ops-eqiad, Operations, DC-Ops, decommission
MoritzMuehlenhoff added a project to T234909: decommission auth1001: Operations.
Tue, Oct 8, 12:18 PM · ops-eqiad, Operations, DC-Ops, decommission
MoritzMuehlenhoff claimed T234909: decommission auth1001.
Tue, Oct 8, 12:17 PM · ops-eqiad, Operations, DC-Ops, decommission
MoritzMuehlenhoff created T234909: decommission auth1001.
Tue, Oct 8, 12:16 PM · ops-eqiad, Operations, DC-Ops, decommission
MoritzMuehlenhoff closed T233821: Move YHSM from auth1001 to auth1002 as Resolved.

Confirmed, thanks.

Tue, Oct 8, 12:14 PM · ops-eqiad, Operations
MoritzMuehlenhoff reopened T151304: tmpreaper possible race condition, a subtask of T132324: Tracking and Reducing cron-spam to root@ , as Open.
Tue, Oct 8, 11:54 AM · Patch-For-Review, Operations
MoritzMuehlenhoff reopened T151304: tmpreaper possible race condition as "Open".

See the earlier discussion on this task: this is still used by Toolforge, so WMCS SREs might still want to tweak the log spam.

Tue, Oct 8, 11:54 AM · serviceops, Operations
MoritzMuehlenhoff added a comment to T233821: Move YHSM from auth1001 to auth1002.

I see in dmesg that it got removed from auth1001, but I don't see it in the logs for auth1002. Is the USB slot in question maybe inactive? Could you try moving it to a different slot?

Tue, Oct 8, 9:05 AM · ops-eqiad, Operations

Mon, Oct 7

MoritzMuehlenhoff added a comment to T234683: Build, package bdsync for Buster.

bdsync was never packaged in Debian; it's an internally packaged tool (originally done by Chase). From a quick glance, rebuilding it for buster should be straightforward.

Mon, Oct 7, 8:58 AM · Cloud-Services, ops-codfw, Operations
MoritzMuehlenhoff added a comment to T224585: Migrate labmon* to Stretch (or Buster, better yet!).

@Phamhi:

  1. Grafana is installed from an external repository. There's already a config to pull in the new deb package for our Buster repository (Chris is working on setting up a Grafana 6 instance on Buster), but it hasn't been imported to our apt.wikimedia.org repository yet. Best to sync up with him on that, as I'm not 100% sure whether it's best to import grafana 5 or 6 initially.
Mon, Oct 7, 7:55 AM · cloud-services-team (Kanban), Operations

Wed, Oct 2

MoritzMuehlenhoff added a comment to T232677: Remove support for Debian Jessie in Cloud Services prior to upstream End Of Life for release.

JFTR: the task description is inaccurate; support ends five years after the jessie release, i.e. April 25 (but the internal hard deadline is end of Q3, as we still need time to properly wrap things up in puppet/repos etc.).

Wed, Oct 2, 9:42 PM · cloud-services-team (Kanban), Cloud-VPS, Epic
MoritzMuehlenhoff updated the task description for T234045: decommission elastic1017.
Wed, Oct 2, 3:14 PM · Operations, DC-Ops, decommission
MoritzMuehlenhoff closed T232310: Integrate Buster 10.1 point update as Resolved.

All done

Wed, Oct 2, 2:17 PM · Operations
MoritzMuehlenhoff updated the task description for T232310: Integrate Buster 10.1 point update.
Wed, Oct 2, 2:17 PM · Operations
MoritzMuehlenhoff updated the task description for T232310: Integrate Buster 10.1 point update.
Wed, Oct 2, 1:56 PM · Operations
MoritzMuehlenhoff updated the task description for T234045: decommission elastic1017.
Wed, Oct 2, 1:20 PM · Operations, DC-Ops, decommission
MoritzMuehlenhoff closed T153468: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs as Resolved.

This is fixed for Buster and Stretch. The remaining ~100 Jessie hosts won't get fixed; they'll vanish in the next six months anyway.

Wed, Oct 2, 8:50 AM · observability, Beta-Cluster-Infrastructure, Patch-For-Review, Upstream, Operations, Beta-Cluster-reproducible, Traffic, DNS
MoritzMuehlenhoff closed T153468: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs, a subtask of T132259: Deployment-prep hosts with puppet errors (tracking), as Resolved.
Wed, Oct 2, 8:50 AM · Puppet, Tracking-Neverending, Beta-Cluster-Infrastructure

Tue, Oct 1

MoritzMuehlenhoff closed T230024: Update component/php72 to 7.2.22, a subtask of T220600: Remove PHP 7.0 from production application servers, as Resolved.
Tue, Oct 1, 8:17 AM · serviceops, Operations
MoritzMuehlenhoff closed T230024: Update component/php72 to 7.2.22 as Resolved.

7.2.22 is rolled out fleet-wide to all servers using PHP 7.2

Tue, Oct 1, 8:17 AM · serviceops, Operations
MoritzMuehlenhoff reopened T233636: Banner History and page view data access for fundraising analysts - Jerrie and Erin as "Open".

There are two issues with the patch merged for Erin Yener: (1) If contractors have a @wikimedia.org address, they should be added to cn=wmf, not cn=nda. (2) Contractors need an entry in data.yaml with the contract end and a point of contact (expiry_date, expiry_contact fields). Otherwise we'll miss dropping their credentials when the contract expires (we ping the point of contact one week before the contract expires and will extend access if the contract is continuing).

Tue, Oct 1, 7:23 AM · Analytics, Operations, SRE-Access-Requests, Fundraising-Backlog
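
The one-week reminder described in the T233636 comment above can be sketched as follows. This is an illustration under stated assumptions, not the tooling actually in use: only the expiry_date and expiry_contact field names come from the comment, while the user names, the addresses and the layout of the parsed data are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical excerpt of data.yaml after parsing. Only expiry_date and
# expiry_contact are field names taken from the comment; the rest is made up.
USERS = {
    "contractor_a": {"expiry_date": date(2019, 10, 20),
                     "expiry_contact": "manager_a@example.org"},
    "contractor_b": {"expiry_date": date(2020, 3, 1),
                     "expiry_contact": "manager_b@example.org"},
    "staff_c": {},  # no contract end recorded
}


def expiring_soon(users, today, window=timedelta(days=7)):
    """Return (user, contact) pairs whose contract ends within the window."""
    due = []
    for name, entry in sorted(users.items()):
        expiry = entry.get("expiry_date")
        if expiry is not None and today <= expiry <= today + window:
            due.append((name, entry.get("expiry_contact")))
    return due
```

Entries without an expiry_date are skipped here, which is exactly the failure mode the comment warns about: a contractor missing those fields never shows up in the reminder list.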

Mon, Sep 30

MoritzMuehlenhoff added a comment to T217114: Migrate Proton to nodejs 10.

But nodejs on the proton* hosts is still on nodejs 6?

Mon, Sep 30, 4:01 PM · Product-Infrastructure-Team-Backlog, Proton
MoritzMuehlenhoff updated the task description for T232310: Integrate Buster 10.1 point update.
Mon, Sep 30, 1:06 PM · Operations
MoritzMuehlenhoff added a comment to T230024: Update component/php72 to 7.2.22.

@Dzahn: The current update for PHP 7.2.22 is a little special as there was an upstream change in the shipped default conffile (a new option for sqlite was added). You can use the following via Cumin:

Mon, Sep 30, 9:45 AM · serviceops, Operations
MoritzMuehlenhoff updated the task description for T232343: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab).
Mon, Sep 30, 8:05 AM · Mail, Operations

Fri, Sep 27

MoritzMuehlenhoff updated the task description for T232310: Integrate Buster 10.1 point update.
Fri, Sep 27, 3:26 PM · Operations
MoritzMuehlenhoff updated the task description for T232310: Integrate Buster 10.1 point update.
Fri, Sep 27, 2:42 PM · Operations
MoritzMuehlenhoff created T234047: Extend firewall rules for new corp LDAP replicas.
Fri, Sep 27, 2:26 PM · Operations
MoritzMuehlenhoff claimed T224557: Migrate ldap/corp replicas to Stretch/Buster.
Fri, Sep 27, 2:10 PM · Operations
MoritzMuehlenhoff updated the task description for T234045: decommission elastic1017.
Fri, Sep 27, 2:09 PM · Operations, DC-Ops, decommission
MoritzMuehlenhoff assigned T234045: decommission elastic1017 to Gehel.
Fri, Sep 27, 2:09 PM · Operations, DC-Ops, decommission
MoritzMuehlenhoff created T234045: decommission elastic1017.
Fri, Sep 27, 2:08 PM · Operations, DC-Ops, decommission
MoritzMuehlenhoff updated the task description for T232310: Integrate Buster 10.1 point update.
Fri, Sep 27, 1:19 PM · Operations
MoritzMuehlenhoff updated the task description for T224549: Track remaining jessie systems in production.
Fri, Sep 27, 1:14 PM · Operations
MoritzMuehlenhoff added a comment to T153468: Ferm's upstream Net::DNS Perl library questionable handling of NOERROR responses without records causing puppet errors when we try to @resolve AAAA in labs.

I've built the 2.4.2pre package for stretch-wikimedia and tested it on a few servers successfully (also comparing iptables output to spot any potential regression from the 2.3->2.4 move). I'll upload that on Monday as it needs some coordination with Arturo to synchronise the rollout to Cloud VPS (as the ferm package ships ferm.conf as a conffile, to rule out issues with unattended-upgrades overwriting the puppetised version).

Fri, Sep 27, 12:03 PM · observability, Beta-Cluster-Infrastructure, Patch-For-Review, Upstream, Operations, Beta-Cluster-reproducible, Traffic, DNS
MoritzMuehlenhoff added a comment to T233937: Add U2F/FIDO as second factor for CAS.

John and I have discussed next steps on IRC: Initially we'll make U2F opt-in via a memberOf/LDAP check. At a later step we'll add TOTP support (ideally in a way that allows importing the existing registrations from the wikitech endpoint), and by then we'll need MFA selection either by means of the Groovy script or via the selector support included in 6.1: https://apereo.github.io/2019/05/13/cas61x-mfa-selection-strategies/

Fri, Sep 27, 11:29 AM · Patch-For-Review, Operations
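
The opt-in logic from the T233937 comment above (require U2F only for users in a given LDAP group) can be illustrated with a small sketch. This only shows the decision itself; in practice it would live in the CAS configuration rather than in Python. The function name and the group DN are hypothetical, and comparing DNs as lowercased strings is a simplification of real DN matching.

```python
def u2f_required(member_of, optin_group_dn):
    """Return True if the user's memberOf values include the opt-in group.

    member_of is the list of group DNs from the user's LDAP entry;
    optin_group_dn is a hypothetical group used to opt users in.
    """
    wanted = optin_group_dn.strip().lower()
    return any(dn.strip().lower() == wanted for dn in member_of)


# Hypothetical DNs, for illustration only.
OPTIN_GROUP = "cn=u2f-optin,ou=groups,dc=example,dc=org"
groups = ["cn=ops,ou=groups,dc=example,dc=org",
          "CN=u2f-optin,ou=groups,dc=example,dc=org"]
needs_u2f = u2f_required(groups, OPTIN_GROUP)
```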
MoritzMuehlenhoff added a comment to T233142: setup/install krb2001/WMF6577.

Very strange, the Debian install works in setting up RAIDs and LVM volumes, but fails when installing grub. I noticed that the host has multiple huge disks (4TB each), and they all get a GPT partition (checked via d-i shell's parted_devices). Could it be that grub is not working with these huge disk partitions/sizes?

Fri, Sep 27, 11:06 AM · Operations, User-Elukey, Analytics
MoritzMuehlenhoff added a comment to T233950: Revisit Tomcat deployment of CAS.

Ack, what I meant was using the Tomcat packages as shipped in Debian

Fri, Sep 27, 10:51 AM · Operations
MoritzMuehlenhoff moved T212934: etcd1004-1006 is unused and idle, use the cluster or kill it. from Ready for Decommission to Blocked on Service Owners on the decommission board.
Fri, Sep 27, 10:44 AM · decommission, Prod-Kubernetes, Kubernetes, serviceops
MoritzMuehlenhoff moved T208585: Decommission esams cache_misc hosts from Ready for Decommission to Blocked on Service Owners on the decommission board.
Fri, Sep 27, 10:44 AM · ops-esams, decommission, Operations, Traffic
MoritzMuehlenhoff moved T227485: Decommission analytics10[28-31,33-41] from Ready for Decommission to Blocked on Service Owners on the decommission board.
Fri, Sep 27, 10:44 AM · decommission, Operations
MoritzMuehlenhoff moved T208585: Decommission esams cache_misc hosts from Blocked on Service Owners to Ready for Decommission on the decommission board.
Fri, Sep 27, 10:43 AM · ops-esams, decommission, Operations, Traffic
MoritzMuehlenhoff moved T199321: Return graphite200[12] to spares pool from Blocked on Service Owners to Ready for Decommission on the decommission board.
Fri, Sep 27, 10:43 AM · decommission, User-fgiunchedi, Operations
MoritzMuehlenhoff moved T227485: Decommission analytics10[28-31,33-41] from Blocked on Service Owners to Ready for Decommission on the decommission board.
Fri, Sep 27, 10:43 AM · decommission, Operations
MoritzMuehlenhoff moved T212934: etcd1004-1006 is unused and idle, use the cluster or kill it. from Blocked on Service Owners to Ready for Decommission on the decommission board.
Fri, Sep 27, 10:43 AM · decommission, Prod-Kubernetes, Kubernetes, serviceops
MoritzMuehlenhoff reassigned T199321: Return graphite200[12] to spares pool from MoritzMuehlenhoff to RobH.

These are now ready to be wiped/reclaimed as spares.

Fri, Sep 27, 10:42 AM · decommission, User-fgiunchedi, Operations
MoritzMuehlenhoff updated the task description for T199321: Return graphite200[12] to spares pool.
Fri, Sep 27, 10:41 AM · decommission, User-fgiunchedi, Operations
MoritzMuehlenhoff closed T220362: Evaluate SSO solutions as Resolved.

This was done a while ago, we've settled on Apereo CAS.

Fri, Sep 27, 8:45 AM · Operations

Thu, Sep 26

MoritzMuehlenhoff added a comment to T233906: Broken network connection on ganeti2001 after reboot.

Maybe the NIC on the server broke? Are there some self-tests/diagnostics for that on the hardware side?

Thu, Sep 26, 2:47 PM · Operations
MoritzMuehlenhoff created T233951: Systemd hardening of CAS service unit.
Thu, Sep 26, 1:14 PM · Operations
MoritzMuehlenhoff created T233950: Revisit Tomcat deployment of CAS.
Thu, Sep 26, 1:13 PM · Operations
MoritzMuehlenhoff created T233949: Fine-tune CAS logging.
Thu, Sep 26, 1:12 PM · Operations
MoritzMuehlenhoff created T233948: Review ticket policies.
Thu, Sep 26, 1:12 PM · Operations
MoritzMuehlenhoff created T233947: CAS build as a deb.
Thu, Sep 26, 1:11 PM · Operations
MoritzMuehlenhoff created T233946: Validate user lockout.
Thu, Sep 26, 1:10 PM · Operations
MoritzMuehlenhoff created T233945: Banning IPs / subnets from accessing login/validation endpoint.
Thu, Sep 26, 1:10 PM · Operations
MoritzMuehlenhoff created T233944: Log / alert on too many failing logins / Throttling login attempts.
Thu, Sep 26, 1:08 PM · Operations
MoritzMuehlenhoff created T233942: Maintain session history / audit log.
Thu, Sep 26, 1:07 PM · Operations
MoritzMuehlenhoff created T233941: Validate Single Logout Flow.
Thu, Sep 26, 1:06 PM · Operations
MoritzMuehlenhoff created T233940: CLI tools for CAS administration.
Thu, Sep 26, 1:06 PM · Operations
MoritzMuehlenhoff created T233939: Wikimedia theme for SSO login page.
Thu, Sep 26, 1:04 PM · Operations
MoritzMuehlenhoff created T233938: SSO kill switch for crucial services.
Thu, Sep 26, 1:04 PM · Operations
MoritzMuehlenhoff created T233937: Add U2F/FIDO as second factor for CAS.
Thu, Sep 26, 1:03 PM · Patch-For-Review, Operations