May 14 13:25:03 moscovium RT[3653758]: [3653758] RT::Handle=HASH(0x7f0e51f826e8) couldn't execute the query 'SELECT main.* FROM CustomRoles main JOIN ObjectCustomRoles ObjectCustomRo>
                                               DBIx::SearchBuilder::Handle::SimpleQuery(RT::Handle=HASH(0x7f0e51f826e8), "SELECT main.* FROM CustomRoles main JOIN ObjectCustomRoles O>
                                               DBIx::SearchBuilder::_DoSearch(RT::CustomRoles=HASH(0x7f0e801f27d0)) called at /usr/share/request-tracker4/lib/RT/SearchBuilder.pm line>
                                               RT::SearchBuilder::_DoSearch(RT::CustomRoles=HASH(0x7f0e801f27d0)) called at /usr/share/perl5/DBIx/SearchBuilder.pm line 513

Tue, May 14, 8:12 PM · collaboration-services

Dzahn renamed T364911: ProbeDown - moscovium from ProbeDown to ProbeDown - moscovium.

Tue, May 14, 8:11 PM · collaboration-services

Dzahn closed T364897: SystemdUnitFailed - logrotate.service - aphlict1002 as Resolved.

May 14 16:00:02 aphlict1002 systemd[1]: aphlict_logrotate.service: Succeeded.
May 14 16:00:02 aphlict1002 systemd[1]: logrotate.service: Succeeded.
May 14 17:00:02 aphlict1002 systemd[1]: aphlict_logrotate.service: Succeeded.
May 14 17:00:02 aphlict1002 systemd[1]: logrotate.service: Succeeded.
May 14 18:00:02 aphlict1002 systemd[1]: logrotate.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
May 14 18:00:02 aphlict1002 systemd[1]: logrotate.service: Failed with result 'exit-code'.
May 14 18:00:02 aphlict1002 systemd[1]: aphlict_logrotate.service: Succeeded.
May 14 19:00:02 aphlict1002 systemd[1]: aphlict_logrotate.service: Succeeded.
May 14 19:00:02 aphlict1002 systemd[1]: logrotate.service: Succeeded.
May 14 20:00:02 aphlict1002 systemd[1]: aphlict_logrotate.service: Succeeded.
May 14 20:00:02 aphlict1002 systemd[1]: logrotate.service: Succeeded.
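
The journal excerpt above shows a single failed logrotate.service run at 18:00, followed by clean runs. A minimal sketch of how such an excerpt could be checked automatically, assuming the syslog-style line layout seen above; the regex and the helper names `run_results` and `latest_state` are illustrative, not part of any existing tooling:

```python
import re

# Two hourly runs copied from the journal excerpt above.
JOURNAL = """\
May 14 18:00:02 aphlict1002 systemd[1]: logrotate.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
May 14 18:00:02 aphlict1002 systemd[1]: logrotate.service: Failed with result 'exit-code'.
May 14 18:00:02 aphlict1002 systemd[1]: aphlict_logrotate.service: Succeeded.
May 14 19:00:02 aphlict1002 systemd[1]: aphlict_logrotate.service: Succeeded.
May 14 19:00:02 aphlict1002 systemd[1]: logrotate.service: Succeeded.
"""

# Matches "<Mon DD HH:MM:SS> <host> systemd[1]: <unit>.service: <message>".
LINE_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ [\d:]{8}) \S+ systemd\[1\]: "
    r"(?P<unit>\S+\.service): (?P<msg>.*)$"
)


def run_results(text):
    """Yield (timestamp, unit, 'ok' | 'failed') for each final-result line."""
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue
        if m["msg"].startswith("Succeeded"):
            yield m["ts"], m["unit"], "ok"
        elif m["msg"].startswith("Failed with result"):
            yield m["ts"], m["unit"], "failed"
        # Other messages (e.g. "Main process exited ...") are ignored.


def latest_state(text):
    """Map each unit to the outcome of its most recent run in the excerpt."""
    state = {}
    for ts, unit, outcome in run_results(text):
        state[unit] = (ts, outcome)
    return state


failures = [(ts, unit) for ts, unit, outcome in run_results(JOURNAL) if outcome == "failed"]
state = latest_state(JOURNAL)
print(failures)
print(state["logrotate.service"])
```

Reporting the most recent outcome per unit alongside the list of failures makes it easy to see that the failure was transient, which matches the reason for resolving the task.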

Tue, May 14, 8:11 PM · collaboration-services

Dzahn added a comment to T364897: SystemdUnitFailed - logrotate.service - aphlict1002.

[aphlict1002:~] $ sudo systemctl status logrotate
● logrotate.service - Rotate log files
     Loaded: loaded (/lib/systemd/system/logrotate.service; static)
     Active: inactive (dead) since Tue 2024-05-14 20:00:02 UTC; 9min ago
TriggeredBy: ● logrotate.timer
       Docs: man:logrotate(8)
             man:logrotate.conf(5)
    Process: 1519778 ExecStart=/usr/sbin/logrotate /etc/logrotate.conf (code=exited, status=0/SUCCESS)
   Main PID: 1519778 (code=exited, status=0/SUCCESS)
        CPU: 25ms

Tue, May 14, 8:09 PM · collaboration-services

Dzahn renamed T364897: SystemdUnitFailed - logrotate.service - aphlict1002 from SystemdUnitFailed to SystemdUnitFailed - logrotate.service - aphlict1002.

Tue, May 14, 8:08 PM · collaboration-services

Dzahn added a comment to T350478: Investigate docker-gc.service failures on GitLab runners.

Ah, right. Sorry, mixed up 2 different sets of hosts.

Tue, May 14, 4:58 PM · collaboration-services

Dzahn updated subscribers of T364863: InterfaceSpeedError - mw2286.

@Jhancock.wm cc: @RLazarus I depooled the server and set a downtime of 24 hours.

Tue, May 14, 4:44 PM · serviceops, SRE, ops-codfw

Dzahn renamed T364863: InterfaceSpeedError - mw2286 from InterfaceSpeedError to InterfaceSpeedError - mw2286.

Tue, May 14, 4:41 PM · serviceops, SRE, ops-codfw

Dzahn added a project to T364880: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment: serviceops-radar.

Tue, May 14, 4:39 PM · serviceops-radar, Scap, SRE, Patch-For-Review

Dzahn added a comment to T364880: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment.

Lucas is right. I can confirm the test passes with any wikimedia.org subdomain as long as the path stays /wiki/Main_Page and starts failing as expected once that path changes.

Tue, May 14, 4:39 PM · serviceops-radar, Scap, SRE, Patch-For-Review

Dzahn added a comment to T364880: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment.

scap runs httpbb /srv/deployment/httpbb-tests/appserver/* --hosts=mwdebug.discovery.wmnet --https_port=4444 --retry_on_timeout

Tue, May 14, 4:27 PM · serviceops-radar, Scap, SRE, Patch-For-Review

Dzahn added a project to T364880: Confusing failed httpbb check for totoro.wikimedia.org during scap deployment: SRE.

Tue, May 14, 4:21 PM · serviceops-radar, Scap, SRE, Patch-For-Review

Dzahn added a comment to T350478: Investigate docker-gc.service failures on GitLab runners.

Also see T364773

Tue, May 14, 3:46 PM · collaboration-services

Dzahn closed T364521: SystemdUnitFailed - docker-gc.service - gitlab-runner1004 as Resolved.

Follow-ups are happening on the parent tasks.

Tue, May 14, 3:45 PM · collaboration-services

Dzahn closed T364521: SystemdUnitFailed - docker-gc.service - gitlab-runner1004, a subtask of T364773: Configure Docker builder GC settings for CI, as Resolved.

Tue, May 14, 3:45 PM · Continuous-Integration-Infrastructure, Release-Engineering-Team

Dzahn closed T364521: SystemdUnitFailed - docker-gc.service - gitlab-runner1004, a subtask of T350478: Investigate docker-gc.service failures on GitLab runners, as Resolved.

Tue, May 14, 3:45 PM · collaboration-services

Dzahn added a subtask for T350478: Investigate docker-gc.service failures on GitLab runners: T364521: SystemdUnitFailed - docker-gc.service - gitlab-runner1004.

Tue, May 14, 3:44 PM · collaboration-services

Dzahn added a subtask for T364773: Configure Docker builder GC settings for CI: T364521: SystemdUnitFailed - docker-gc.service - gitlab-runner1004.

Tue, May 14, 3:44 PM · Continuous-Integration-Infrastructure, Release-Engineering-Team

Dzahn added parent tasks for T364521: SystemdUnitFailed - docker-gc.service - gitlab-runner1004: T364773: Configure Docker builder GC settings for CI, T350478: Investigate docker-gc.service failures on GitLab runners.

Tue, May 14, 3:44 PM · collaboration-services

Dzahn added a comment to T364521: SystemdUnitFailed - docker-gc.service - gitlab-runner1004.

T364773

Tue, May 14, 3:43 PM · collaboration-services

Dzahn added a comment to T364217: Gerrit ssh host key fails on fedora39.

Thanks for confirming that, @Tarrow. Good to know we have a workaround.

Tue, May 14, 3:10 PM · collaboration-services, Release-Engineering-Team, Gerrit

Dzahn awarded T364342: Switch Gerrit from Java 11 to Java 17 a Like token.

Tue, May 14, 3:04 PM · Release-Engineering-Team, Gerrit, collaboration-services

Mon, May 13

Dzahn closed T364414: Requesting access to deployment for ecarg/Grace Choi as Resolved.

Mon, May 13, 6:14 PM · SRE, SRE-Access-Requests

Dzahn updated the task description for T364414: Requesting access to deployment for ecarg/Grace Choi.

Mon, May 13, 6:11 PM · SRE, SRE-Access-Requests

Dzahn added a comment to T364414: Requesting access to deployment for ecarg/Grace Choi.

@ecarg Your user is now in the deployment group on the deployment server. Give it about 30 minutes and you should have all the access needed for an actual deployment.

Mon, May 13, 6:11 PM · SRE, SRE-Access-Requests

Dzahn moved T364414: Requesting access to deployment for ecarg/Grace Choi from Manager/NDA Approval/Confirmation to Ready To Go on the SRE-Access-Requests board.

Mon, May 13, 6:09 PM · SRE, SRE-Access-Requests

Dzahn placed T364414: Requesting access to deployment for ecarg/Grace Choi up for grabs.

Mon, May 13, 5:34 PM · SRE, SRE-Access-Requests

Dzahn added a comment to T364740: Site: codfw 2 VM request for staging-codfw kube-apiserver.

@JMeybohm I noticed I can't manually run puppet agent on this host. It says I don't have the sudo privileges for it. So I think puppet never ran to set up the initial users.

Mon, May 13, 4:31 PM · SRE, Infrastructure-Foundations, vm-requests, Prod-Kubernetes, Kubernetes

hashar awarded T334517: upgrade contint servers to bullseye a Yellow Medal token.

Mon, May 13, 2:53 PM · Patch-For-Review, Release-Engineering-Team (Radar), collaboration-services

Dzahn created P62365 (An Untitled Masterwork).

Mon, May 13, 2:40 PM

Dzahn updated the task description for T334517: upgrade contint servers to bullseye.

Mon, May 13, 2:33 PM · Patch-For-Review, Release-Engineering-Team (Radar), collaboration-services

Dzahn updated the task description for T334517: upgrade contint servers to bullseye.

Mon, May 13, 2:28 PM · Patch-For-Review, Release-Engineering-Team (Radar), collaboration-services

Dzahn updated the task description for T334517: upgrade contint servers to bullseye.

Mon, May 13, 2:19 PM · Patch-For-Review, Release-Engineering-Team (Radar), collaboration-services

Dzahn updated the task description for T334517: upgrade contint servers to bullseye.

Mon, May 13, 2:12 PM · Patch-For-Review, Release-Engineering-Team (Radar), collaboration-services

Fri, May 10

Dzahn added a comment to T363415: add bullseye support to deployment server puppet role - upgrade deployment server in devtools.

In T363415#9778509, @Dzahn wrote:

This still needs https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026193 to be merged to be able to call it resolved.

Fri, May 10, 9:23 PM · collaboration-services, serviceops, SRE

Dzahn closed T363415: add bullseye support to deployment server puppet role - upgrade deployment server in devtools, a subtask of T360964: replace buster machines in devtools project, as Resolved.

Fri, May 10, 9:22 PM · Patch-For-Review, Cloud-VPS (Debian Buster Deprecation), collaboration-services

Dzahn closed T363415: add bullseye support to deployment server puppet role - upgrade deployment server in devtools, a subtask of T291916: Tracking task for Bullseye migrations in production, as Resolved.

Fri, May 10, 9:22 PM · Epic, Infrastructure-Foundations, SRE

Dzahn closed T363415: add bullseye support to deployment server puppet role - upgrade deployment server in devtools, a subtask of T364656: replace production buster deployment servers, as Resolved.

Fri, May 10, 9:22 PM · Release-Engineering-Team, collaboration-services, serviceops, SRE

Dzahn closed T363415: add bullseye support to deployment server puppet role - upgrade deployment server in devtools as Resolved.

Fri, May 10, 9:22 PM · collaboration-services, serviceops, SRE

Dzahn closed T363415: add bullseye support to deployment server puppet role - upgrade deployment server in devtools, a subtask of T364417: deploy1003 implementation tracking, as Resolved.

Fri, May 10, 9:22 PM · serviceops

Dzahn added a comment to T363415: add bullseye support to deployment server puppet role - upgrade deployment server in devtools.

T364656 will be about upgrading / replacing the production deployment servers.

Fri, May 10, 9:12 PM · collaboration-services, serviceops, SRE

Dzahn updated the task description for T363415: add bullseye support to deployment server puppet role - upgrade deployment server in devtools.

Fri, May 10, 9:11 PM · collaboration-services, serviceops, SRE

Dzahn added subtasks for T291916: Tracking task for Bullseye migrations in production: T364656: replace production buster deployment servers, T364417: deploy1003 implementation tracking.

Fri, May 10, 9:09 PM · Epic, Infrastructure-Foundations, SRE

Dzahn added a parent task for T364656: replace production buster deployment servers: T291916: Tracking task for Bullseye migrations in production.

Fri, May 10, 9:09 PM · Release-Engineering-Team, collaboration-services, serviceops, SRE

Dzahn added a parent task for T364417: deploy1003 implementation tracking: T291916: Tracking task for Bullseye migrations in production.

Fri, May 10, 9:09 PM · serviceops

Dzahn added a subtask for T364656: replace production buster deployment servers: T363415: add bullseye support to deployment server puppet role - upgrade deployment server in devtools.

Fri, May 10, 9:08 PM · Release-Engineering-Team, collaboration-services, serviceops, SRE

Dzahn added parent tasks for T363415: add bullseye support to deployment server puppet role - upgrade deployment server in devtools: T364656: replace production buster deployment servers, T364417: deploy1003 implementation tracking.

Fri, May 10, 9:08 PM · collaboration-services, serviceops, SRE

Dzahn added a subtask for T364417: deploy1003 implementation tracking: T363415: add bullseye support to deployment server puppet role - upgrade deployment server in devtools.

Fri, May 10, 9:08 PM · serviceops

Dzahn added a comment to T364417: deploy1003 implementation tracking.

@akosiaris I made T364656 and suggest seeing that either as a parent task or simply merging this in there.

Fri, May 10, 9:07 PM · serviceops

Dzahn added a subtask for T364656: replace production buster deployment servers: T364417: deploy1003 implementation tracking.

Fri, May 10, 9:06 PM · Release-Engineering-Team, collaboration-services, serviceops, SRE

Dzahn added a parent task for T364417: deploy1003 implementation tracking: T364656: replace production buster deployment servers.

Fri, May 10, 9:06 PM · serviceops

Dzahn created T364656: replace production buster deployment servers.

Fri, May 10, 9:05 PM · Release-Engineering-Team, collaboration-services, serviceops, SRE

Dzahn changed the status of T323073: Make https://git.wikimedia.org not redirect to Phabricator Diffusion from Open to Stalled.

Fri, May 10, 8:09 PM · Phabricator, Patch-For-Review, Diffusion, Release-Engineering-Team, collaboration-services

Dzahn changed the status of T363009: Retire legalpad phabricator/phorge application? from Open to Stalled.

Fri, May 10, 8:08 PM · Znuny, Trust-and-Safety, collaboration-services, Phabricator

Dzahn added a comment to T364521: SystemdUnitFailed - docker-gc.service - gitlab-runner1004.

Looks like the timeout is already fully puppetized and in Hiera:

Fri, May 10, 8:06 PM · collaboration-services

Dzahn added a comment to T364521: SystemdUnitFailed - docker-gc.service - gitlab-runner1004.

Current unit file of docker-gc.service:

Fri, May 10, 8:02 PM · collaboration-services

Dzahn added a comment to T364521: SystemdUnitFailed - docker-gc.service - gitlab-runner1004.

Is there a setting that can be changed to allow more docker-gc.service failures within a particular window before alerting?

Fri, May 10, 7:58 PM · collaboration-services

Dzahn assigned T364414: Requesting access to deployment for ecarg/Grace Choi to thcipriani.

Thanks for the manager approval. I will upload a patch and am assigning this to the group approver for consideration :)

Fri, May 10, 7:36 PM · SRE, SRE-Access-Requests

Dzahn updated the task description for T364414: Requesting access to deployment for ecarg/Grace Choi.

Fri, May 10, 7:33 PM · SRE, SRE-Access-Requests

Dzahn changed the status of T364414: Requesting access to deployment for ecarg/Grace Choi from Open to In Progress.

Fri, May 10, 7:31 PM · SRE, SRE-Access-Requests

Dzahn updated the task description for T364414: Requesting access to deployment for ecarg/Grace Choi.

Fri, May 10, 7:31 PM · SRE, SRE-Access-Requests

Dzahn updated the task description for T363514: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access).

Fri, May 10, 3:22 PM · SRE, SRE-Access-Requests

Dzahn updated the task description for T363514: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access).

Fri, May 10, 3:22 PM · SRE, SRE-Access-Requests

Dzahn moved T364588: Requesting access to cassandra-staging-devs for xcollazo from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.

Fri, May 10, 3:19 PM · SRE, SRE-Access-Requests

Dzahn changed the status of T364588: Requesting access to cassandra-staging-devs for xcollazo from Open to In Progress.

Fri, May 10, 3:19 PM · SRE, SRE-Access-Requests

Dzahn changed the status of T364588: Requesting access to cassandra-staging-devs for xcollazo, a subtask of T364584: Create accounts in the Cassandra staging cluster for the Data Platform team members, from Open to In Progress.

Fri, May 10, 3:19 PM · Commons-Impact-Metrics

Dzahn assigned T364588: Requesting access to cassandra-staging-devs for xcollazo to KOfori.

Fri, May 10, 3:19 PM · SRE, SRE-Access-Requests

Dzahn changed the status of T363514: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access) from Open to In Progress.

Fri, May 10, 3:17 PM · SRE, SRE-Access-Requests

Dzahn moved T363514: Requesting access to analytics-privatedata-users for YLiou_WMF (no server access) from Manager/NDA Approval/Confirmation to Patch in Review on the SRE-Access-Requests board.

Fri, May 10, 3:16 PM · SRE, SRE-Access-Requests

Dzahn closed T364618: SystemdUnitFailed - backup-restore.service - gitlab1003 as Declined.

Fri, May 10, 2:32 PM · collaboration-services

Dzahn added a parent task for T364618: SystemdUnitFailed - backup-restore.service - gitlab1003: Unknown Object (Task).

Fri, May 10, 2:32 PM · collaboration-services

Dzahn renamed T364618: SystemdUnitFailed - backup-restore.service - gitlab1003 from SystemdUnitFailed to SystemdUnitFailed - backup-restore.service - gitlab1003.

Fri, May 10, 2:31 PM · collaboration-services

Dzahn added a comment to T364622: Review/cleanup content of /srv/private/modules/secret/secrets/ssl in the private repo.

Things I think we can delete, at first glance:

Fri, May 10, 2:30 PM · Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE

Thu, May 9

Dzahn awarded T360414: Phase out cergen for Observability services a Barnstar token.

Thu, May 9, 3:11 PM · Patch-For-Review, SRE Observability (FY2023/2024-Q4), observability, SRE

Wed, May 8

Dzahn merged T364519: ProbeDown into T364489: ProbeDown - phab1004.

Wed, May 8, 11:40 PM · collaboration-services

Dzahn merged task T364519: ProbeDown into T364489: ProbeDown - phab1004.

Wed, May 8, 11:40 PM · collaboration-services

Dzahn closed T364510: SystemdUnitFailed - contint1003 - envoyproxy as Resolved.

22:54 < jinxer-wm> RESOLVED: SystemdUnitFailed: wmf_auto_restart_envoyproxy.service on contint1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed

Wed, May 8, 10:56 PM · collaboration-services

Dzahn added a comment to T358237: Ganeti VM for contint migration.

added envoy to contint1003 to fix T364510

Wed, May 8, 10:53 PM · Patch-For-Review, collaboration-services, SRE, Continuous-Integration-Infrastructure, vm-requests

Dzahn added a comment to T364510: SystemdUnitFailed - contint1003 - envoyproxy.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1028796 was merged recently; it adds an auto-restart service for envoyproxy.

Wed, May 8, 10:23 PM · collaboration-services

Dzahn added a comment to T364510: SystemdUnitFailed - contint1003 - envoyproxy.

This is the test server for releng from T358237

Wed, May 8, 10:21 PM · collaboration-services

Dzahn renamed T364510: SystemdUnitFailed - contint1003 - envoyproxy from SystemdUnitFailed to SystemdUnitFailed - contint1003 - envoyproxy.

Wed, May 8, 10:16 PM · collaboration-services

Dzahn claimed T364510: SystemdUnitFailed - contint1003 - envoyproxy.

Wed, May 8, 10:16 PM · collaboration-services

Dzahn renamed T364489: ProbeDown - phab1004 from ProbeDown to ProbeDown - phab1004.

Wed, May 8, 6:46 PM · collaboration-services

Dzahn added a comment to T364217: Gerrit ssh host key fails on fedora39.

@Tarrow You could check whether it works with ssh -o RequiredRSASize=1024, just to debug or until we can fix this.
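
For anyone scripting the ssh -o RequiredRSASize=1024 test suggested above, here is a minimal sketch that only assembles the command line and does not run it. Assumptions that are not from this task: port 29418 (Gerrit's default SSH port), the placeholder user name `example-user`, the hypothetical helper name `gerrit_ssh_argv`, and `gerrit version` as a harmless remote command. The RequiredRSASize option itself needs an OpenSSH client of version 9.1 or later.

```python
import shlex


def gerrit_ssh_argv(user, host="gerrit.wikimedia.org", port=29418, min_rsa_bits=1024):
    """Build an ssh argument list that lowers the client's minimum accepted RSA key size.

    The port and the remote command are assumptions for illustration only.
    """
    return [
        "ssh",
        "-o", f"RequiredRSASize={min_rsa_bits}",  # debugging workaround, not a permanent fix
        "-p", str(port),
        f"{user}@{host}",
        "gerrit", "version",
    ]


argv = gerrit_ssh_argv("example-user")  # placeholder account name
print(shlex.join(argv))
```

Passing the option per invocation keeps the relaxed key-size check limited to this one connection instead of weakening the client configuration globally.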

Wed, May 8, 6:43 PM · collaboration-services, Release-Engineering-Team, Gerrit

Dzahn added a comment to T364217: Gerrit ssh host key fails on fedora39.

quoting from a serverfault.com question: "If your getting the "Invalid key length" error, the problem isn't your Ciphers (that may be it's own problem, but if you're getting a key, SSH has agreed to a Cipher)"

Wed, May 8, 6:40 PM · collaboration-services, Release-Engineering-Team, Gerrit

Dzahn added a comment to T333029: Gitlab-Github mirror broken.

Here is the gerrit -> github replication config; it's in hieradata/common/profile/gerrit.yaml in the puppet repo.

Wed, May 8, 2:38 PM · Patch-For-Review, GitLab (Integrations), mwcli

Tue, May 7

Dzahn added a comment to T364416: Q4:rack/setup/install deploy1003.

FWIW, for the person who will add the production puppet role to this later: this has only recently become possible, but it should be mostly unblocked now. Details are in T363415. It still needs one more patch, though, and your review there would be great.

Tue, May 7, 6:16 PM · SRE, serviceops, ops-eqiad, DC-Ops

Dzahn moved T364414: Requesting access to deployment for ecarg/Grace Choi from Untriaged to Manager/NDA Approval/Confirmation on the SRE-Access-Requests board.

Tue, May 7, 6:12 PM · SRE, SRE-Access-Requests

Dzahn added a comment to T364414: Requesting access to deployment for ecarg/Grace Choi.

@Mcastro Please confirm if you approve

Tue, May 7, 6:10 PM · SRE, SRE-Access-Requests

Dzahn added a comment to T364414: Requesting access to deployment for ecarg/Grace Choi.

@thcipriani please consider for approval (https://wikimedia.namely.com/people/eaebb898-01ba-404e-8cf8-2ed33c4e0d04/show/personal/employee-information/)

Tue, May 7, 6:10 PM · SRE, SRE-Access-Requests

Dzahn updated the task description for T364414: Requesting access to deployment for ecarg/Grace Choi.

Tue, May 7, 6:09 PM · SRE, SRE-Access-Requests