User Details
- User Since: Jan 23 2023, 12:05 PM
- Availability: Available
- LDAP User: EoghanGaffney
- MediaWiki User: EGaffney-WMF
Wed, May 1
Both hosts have now been reprovisioned with public IPs. Thanks @Arnoldokoth for taking care of lists1004!
Tue, Apr 23
I've enabled this setting and approved @brennen's change to include this in the settings file.
Fri, Apr 19
This is related to T362989
Fri, Apr 12
The switchover of the replicas completed successfully.
Thu, Apr 11
This is related to T361219 and was fixed with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1018796
This is related to T361219
Tue, Apr 9
We'll need a maintenance window of around 4 hours, and we'll use most of it.
Apr 8 2024
The documentation for this setting is at https://docs.gitlab.com/ee/security/two_factor_authentication.html#enforce-2fa-for-administrator-users
Mar 30 2024
I reverted this change in this CR since it was running into an issue where it couldn't find /srv/gitlab_backup/*_gitlab_backup.tar. This is because the rsync is run as a systemd unit with an ExecStart which doesn't wrap commands in a shell, so no globbing will work. Alternative solutions are either to wrap it in something like sh -c, or to wrap the whole thing in a shell script which gets executed by the systemd service.
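To illustrate the failure mode: systemd's ExecStart= passes its arguments verbatim, with no shell in between, so a pattern like *_gitlab_backup.tar reaches the command as a literal string. A small throwaway demo (paths here are made up; this is not the real backup directory or unit):

```shell
# Create a scratch dir with one file matching the backup naming pattern.
demo=$(mktemp -d)
touch "$demo/20240330_gitlab_backup.tar"

# Emulate ExecStart= handing the pattern over literally (no shell expansion):
ls "$demo/*_gitlab_backup.tar" 2>/dev/null \
  && echo "literal glob: found" \
  || echo "literal glob: no such file"

# Wrapping in `sh -c` hands the pattern to a shell, which expands the glob:
sh -c "ls $demo/*_gitlab_backup.tar" >/dev/null \
  && echo "sh -c: glob expanded"
```

So the fix is either an ExecStart of the form /bin/sh -c '…' or moving the whole command into a wrapper script that the unit executes.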
This was related to some backup testing work I was doing on gitlab1004 to see if the rsync changes in T361219 would work. I think what happened was that the gitlab backup script stopped the ssh-gitlab service, but the backup process was terminated, so the auto_restart service gave up.
Mar 25 2024
The cookbook completed successfully; however, it registered a failure because the final puppet run was blocked by a cron job. I've rectified this in another patch to the failover cookbook.
This was related to the switchover between gitlab-replica and gitlab-replica-old. I've made improvements to the cookbook to leave a rollback in a better position.
Mar 22 2024
Moving this back to the backlog for the moment; it's not something we expect to do right now. Before we consider this, we'll need to get some input from the community about their workflow and how they typically use the service.
Mar 8 2024
We've ticked all the boxes here for the most part. The two outstanding items are monitoring spam rates and one-click unsubscribe.
Feb 23 2024
Hey @Krd, the attachment storage change will be invisible to the VRT community: attachments will continue to be loaded and displayed as normal, and we're not anticipating needing to purge old attachments/tickets yet. For this particular piece of work we don't anticipate making any user-facing changes, but if we do we'll definitely involve the community. (That's likely to come out of T358065; I'd be interested to hear the best way to solicit feedback from the community for changes like these.)
Feb 22 2024
@Matthewrbowker That's one of the things we're intending to establish before we make any kind of recommendation to the VRT team in terms of cost/benefit. I believe this will not affect permalinks to tickets, and tickets will still be searchable. As far as I can see, the only change is that to find archived tickets the user may need to add an "archived" attribute to the search form.
@brennen Good catch. I didn't spot this in our replica/test instance because I was logged in. Perhaps worth remembering to try a logged-out browser as well.
Feb 20 2024
I've configured a GenericAdmin job to archive tickets that were closed more than a year ago; it runs twice an hour. I'm going to leave it running and start looking at results on Thursday. I believe the article_search_index table is the one that the search functions use; before archival it is 8937 MB in size.
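For reference, a sketch of how a table-size figure like this can be read from information_schema (the database name here is an assumption, not the actual VRTS schema name):

```shell
# Hypothetical database name "vrts"; the information_schema columns are standard MariaDB/MySQL.
sudo mysql vrts -e "
  SELECT table_name,
         ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
  FROM information_schema.tables
  WHERE table_name = 'article_search_index';"
```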
We've answered the questions that we're able to answer for the moment, so I'm moving this back to the Backlog until we have physical machines in place. Once that happens, we'll work on a plan to migrate.
@jcrespo That's useful context, thanks!
This was due to work to remove the spare disk we had for testing in T355980
Feb 13 2024
We ran the migration yesterday with --tolerant allowing it to continue on failures. Unfortunately, we filled up the disk. Some observations:
Feb 12 2024
This was deployed and running correctly!
This is not the simple transition that I expected it to be. Here's what we know so far:
Feb 6 2024
I'm in the process of adding the disk image to the ganeti instance of vrts1002, using this command: sudo gnt-instance modify --disk add:size=600g vrts1002.eqiad.wmnet
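As a hedged sketch of the surrounding steps (device name and the need for a reboot are assumptions; Ganeti generally needs an instance restart for a cold-added disk to appear in the guest):

```shell
# On the Ganeti master: add the disk, then restart the instance so the guest sees it.
sudo gnt-instance modify --disk add:size=600g vrts1002.eqiad.wmnet
sudo gnt-instance reboot vrts1002.eqiad.wmnet

# Inside the guest: confirm the new ~600G block device, then format and mount it.
lsblk                      # device name assumed, e.g. /dev/vdb
sudo mkfs.ext4 /dev/vdb    # hypothetical device; adjust to what lsblk shows
```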
Feb 1 2024
We got a reply from znuny support to say that they'll add the index in a future release, but that there's no reason we shouldn't be able to do this ourselves now.
Deployed this to the production instance today. It doesn't seem to have brought us much in terms of improvement, but we are closer to implementing all of the performance warnings from the support data collector page.
Jan 30 2024
We tried rolling this out to ticket-test.wm.o earlier, and it went pretty quickly. The questions are answered below:
There shouldn't be anything directly sending from phabricator.wm.o; it should route through the mx* hosts
@thcipriani Can you give a quick approval for @Aklapper, please?
Jan 29 2024
@jhathaway and I took a look through, and I've updated the checklist above. The tl;dr is that we're looking OK from the Phabricator side. We should probably remove the existing SPF records that point to ip6:2620:0:861:102:10:64:16:101 and ip6:2620:0:860:103:10:192:32:54: these are essentially redundant, but they also don't resolve publicly, so they might run foul of the requirement that records be forward-and-reverse resolvable.
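As a sketch only (the surrounding mechanisms are stand-ins, not the actual record contents), the cleanup amounts to dropping the two ip6 mechanisms from the SPF TXT record:

```dns
; Hypothetical before/after; "mx" and "-all" stand in for whatever else the record contains.
; before: "v=spf1 mx ip6:2620:0:861:102:10:64:16:101 ip6:2620:0:860:103:10:192:32:54 -all"
; after:  "v=spf1 mx -all"
```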
The patch did get the script to run, but there seems to be an error from Phabricator in addition to this:
Jan 23 2024
While we keep working on getting this fixed, I think the best option is to remove the timer for the job entirely, to avoid spurious alerts.
Jan 8 2024
@akosiaris That's great, thanks so much for digging into those and writing it up!
Jan 6 2024
I restarted clamav-daemon and apache2 on the host; both had been stopped. It looks like it was a memory pressure issue again. Short term, we could look at auto-restarting apache/clamav on failure; longer term, we should investigate whether increasing the memory allocation of the VM would be possible/worthwhile. I'll take care of more in-depth investigation and follow-up on Monday.
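For the short-term option, a systemd drop-in is the usual mechanism; a minimal sketch (the override path is the standard drop-in location, applied via systemctl edit plus systemctl daemon-reload, with the same pattern for apache2):

```ini
# /etc/systemd/system/clamav-daemon.service.d/override.conf
[Service]
Restart=on-failure
RestartSec=30s
```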
Dec 7 2023
This is done; please reach out if there are any issues!
Hi @XiaoXiao-WMF, this should be done! Please reach out if you're having any problems!