fnegri (Francesco Negri)
Site Reliability Engineer

User Details

User Since
Jul 18 2022, 2:39 PM (202 w, 4 d)
Availability
Available
IRC Nick
dhinus
LDAP User
FNegri
MediaWiki User
FNegri-WMF [ Global Accounts ]

Recent Activity

Yesterday

fnegri added a comment to T428139: [toolsdb] Transaction History Length growing too much.

I have just stopped the mixnmatch "cache warmer". I think that is sufficient for now; let's see if it makes a difference.

Fri, Jun 5, 2:43 PM · tools-platform-team, cloud-services-team, Toolforge
fnegri added a comment to T428139: [toolsdb] Transaction History Length growing too much.

I tried two approaches to understand which user/db is driving the high rate of "Data Reads".

Fri, Jun 5, 12:42 PM · tools-platform-team, cloud-services-team, Toolforge
fnegri updated subscribers of T428139: [toolsdb] Transaction History Length growing too much.

The history length started to increase around 2026-05-28, which coincides with this increase in InnoDB I/O:

Fri, Jun 5, 11:34 AM · tools-platform-team, cloud-services-team, Toolforge
fnegri added a comment to T428139: [toolsdb] Transaction History Length growing too much.

The situation after a few hours has improved:

Screenshot 2026-06-05 at 10.58.33.png (622×1 px, 63 KB)

Fri, Jun 5, 9:00 AM · tools-platform-team, cloud-services-team, Toolforge

Thu, Jun 4

fnegri added a comment to T428139: [toolsdb] Transaction History Length growing too much.

I suspect there is also an underlying issue that I haven't discovered yet, and the stuck queries from heritage and dimastbkbot might be a symptom rather than the root cause.

Thu, Jun 4, 9:11 PM · tools-platform-team, cloud-services-team, Toolforge
fnegri updated subscribers of T428139: [toolsdb] Transaction History Length growing too much.

I also lowered idle_transaction_timeout to 60:

Thu, Jun 4, 8:46 PM · tools-platform-team, cloud-services-team, Toolforge
fnegri moved T428139: [toolsdb] Transaction History Length growing too much from Backlog to In progress on the tools-platform-team board.
Thu, Jun 4, 8:30 PM · tools-platform-team, cloud-services-team, Toolforge
fnegri changed the status of T428139: [toolsdb] Transaction History Length growing too much from Open to In Progress.
Thu, Jun 4, 8:27 PM · tools-platform-team, cloud-services-team, Toolforge
fnegri added a comment to T428139: [toolsdb] Transaction History Length growing too much.

I have temporarily stopped the dimastbkbot tool. I sent an email to the maintainer and also posted a message to their user page linking to this Phabricator task.

Thu, Jun 4, 8:23 PM · tools-platform-team, cloud-services-team, Toolforge
fnegri added a comment to T428139: [toolsdb] Transaction History Length growing too much.

I can see similar queries in the slow query log since last year (2025-11-18) so I'm not sure why the "Transaction History Length" has only started increasing now.

Thu, Jun 4, 5:15 PM · tools-platform-team, cloud-services-team, Toolforge
fnegri added a comment to T428139: [toolsdb] Transaction History Length growing too much.

Something is compounding the effect of the queries from s52323 (dimastbkbot): queries from s51138 (heritage).

Thu, Jun 4, 5:03 PM · tools-platform-team, cloud-services-team, Toolforge
fnegri updated the task description for T427187: ToolsDB disk space usage growing too fast.
Thu, Jun 4, 10:21 AM · tools-platform-team, Toolforge
fnegri closed T427187: ToolsDB disk space usage growing too fast as Resolved.

@magnusmanske thank you very much!

Thu, Jun 4, 10:06 AM · tools-platform-team, Toolforge
fnegri created T428139: [toolsdb] Transaction History Length growing too much.
Thu, Jun 4, 10:04 AM · tools-platform-team, cloud-services-team, Toolforge

Wed, Jun 3

fnegri added a comment to T427187: ToolsDB disk space usage growing too fast.

Unrelated but also contributing to the disk space growth, history length is growing too much:

Wed, Jun 3, 6:54 PM · tools-platform-team, Toolforge
fnegri attached a referenced file: F84095650: Screenshot 2026-05-25 at 12.19.36.png.
Wed, Jun 3, 5:40 PM · tools-platform-team, Toolforge
fnegri added a comment to T427187: ToolsDB disk space usage growing too fast.

No joy. I will deactivate the wikidata-terminator update script for now until I understand what's wrong.

Wed, Jun 3, 5:36 PM · tools-platform-team, Toolforge
fnegri added a comment to T427187: ToolsDB disk space usage growing too fast.

I created T428087: [toolsdb] Add db-level and user-level monitoring to improve our monitoring.

Wed, Jun 3, 5:11 PM · tools-platform-team, Toolforge
fnegri created T428087: [toolsdb] Add db-level and user-level monitoring.
Wed, Jun 3, 5:10 PM · Toolforge, cloud-services-team
fnegri closed T198508: Updating documentation to mention errors due to Django + MySQL + utf8mb4 index limitations/workarounds on ToolsDB as Resolved.

I'm optimistically marking this as Resolved, as this should no longer be an issue now that ToolsDB is running MariaDB 10.6. I also removed the related notes from Wikitech.

Wed, Jun 3, 4:45 PM · cloud-services-team, User-srodlund, Documentation, Toolforge
fnegri closed T301967: toolsdb: evaluate storage usage by some tools, a subtask of T301951: toolsdb: full disk on clouddb1001 broke clouddb1002 (secondary) replication, as Resolved.
Wed, Jun 3, 4:27 PM · Cloud-Services-Origin-User, Cloud-Services-Worktype-Unplanned, User-dcaro, cloud-services-team (Kanban), Toolforge, Data-Services
fnegri closed T301967: toolsdb: evaluate storage usage by some tools as Resolved.

DB storage is tracked in the subtask T291782
NFS storage does not seem to be an immediate issue

Wed, Jun 3, 4:27 PM · cloud-services-team, Toolforge
fnegri added a comment to T427187: ToolsDB disk space usage growing too fast.

In https://gerrit.wikimedia.org/r/1297114 I reduced expire_logs_days from 14 days to 10 days, which gives us some breathing room while we continue to investigate:

Wed, Jun 3, 3:03 PM · tools-platform-team, Toolforge
fnegri added a comment to T427187: ToolsDB disk space usage growing too fast.

@magnusmanske thanks for looking! I'm not entirely sure whether the number of updates by wikidata-terminator increased recently, or whether they are the cause of the increase in binlogs we have been seeing since 2026-05-14. However, it's definitely doing a lot of updates.

Wed, Jun 3, 1:53 PM · tools-platform-team, Toolforge
fnegri created P93712 ToolsDB binlog count by db.
Wed, Jun 3, 1:50 PM
fnegri added a comment to T427187: ToolsDB disk space usage growing too fast.

On 2026-06-02 disk usage started increasing again:

Wed, Jun 3, 10:12 AM · tools-platform-team, Toolforge

Mon, Jun 1

fnegri merged T427417: Can't connect to ToolsDB on PAWS into T188406: Provide access to user created databases in PAWS.
Mon, Jun 1, 5:01 PM · cloud-services-team, PAWS
fnegri merged task T427417: Can't connect to ToolsDB on PAWS into T188406: Provide access to user created databases in PAWS.
Mon, Jun 1, 5:01 PM · cloud-services-team, Data-Services, PAWS
fnegri added a comment to T427417: Can't connect to ToolsDB on PAWS.

There's a Phab for that! ™

Mon, Jun 1, 5:00 PM · cloud-services-team, Data-Services, PAWS
fnegri added a comment to T427417: Can't connect to ToolsDB on PAWS.

Judging from this line it looks like we are deliberately not allowing PAWS to connect to ToolsDB.

Mon, Jun 1, 4:53 PM · cloud-services-team, Data-Services, PAWS
fnegri placed T354295: [puppet] Remove expired and unused certs from modules/profile/files/ssl/ and modules/base/files/ca up for grabs.

So this task is to remove any unused certificates from modules/profile/files/ssl/ that are expired

Mon, Jun 1, 2:41 PM · SRE, Cloud-Services-Worktype-Unplanned, Cloud-Services-Origin-User, User-dcaro
fnegri added a comment to T427801: [lima-kilo] SSL certificate errors after restarting the VM.

Workaround: running toolforge_load_users_to_ldap.sh fixed it.

Mon, Jun 1, 1:22 PM · cloud-services-team, Toolforge
fnegri created T427801: [lima-kilo] SSL certificate errors after restarting the VM.
Mon, Jun 1, 1:12 PM · cloud-services-team, Toolforge
fnegri updated subscribers of T427187: ToolsDB disk space usage growing too fast.

I tried to find queries affecting many rows, and I found that we're seeing millions of row updates per day on s51205__terminator_p.items (about 50 UPDATEs per second), which might explain the growth of the space occupied by the binlogs.
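
As a rough sanity check on figures like these, a sustained per-second rate can be converted into a daily total. The sketch below is illustrative only: the function name is hypothetical and the sample rate of 50 statements per second is an example input, not a measurement taken from ToolsDB.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def updates_per_day(updates_per_second: float) -> float:
    """Convert a sustained per-second rate into a daily total."""
    return updates_per_second * SECONDS_PER_DAY


# Example input only (assumption): a sustained rate of 50 statements per second.
print(f"{updates_per_day(50):,.0f} updates per day")
```

A rate in the tens of statements per second is already enough to reach millions of row updates per day.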

Mon, Jun 1, 10:04 AM · tools-platform-team, Toolforge

Fri, May 29

fnegri added a comment to T427470: Unable to add MariaDB index to (large) table.

Took 70 minutes in total:

Fri, May 29, 5:41 PM · tools-platform-team, Toolforge
fnegri moved T427470: Unable to add MariaDB index to (large) table from Backlog to Done on the tools-platform-team board.
Fri, May 29, 5:41 PM · tools-platform-team, Toolforge
fnegri changed the status of T427470: Unable to add MariaDB index to (large) table from Open to In Progress.

I'm not sure what's causing the query to be interrupted. We do have a default limit of 1 hour that can be changed with SET SESSION max_statement_time, but when you hit that limit, you get an explicit error:

Fri, May 29, 2:24 PM · tools-platform-team, Toolforge

Thu, May 28

fnegri renamed T427187: ToolsDB disk space usage growing too fast from ToolsDB disk space growing too fast to ToolsDB disk space usage growing too fast.
Thu, May 28, 1:08 PM · tools-platform-team, Toolforge
fnegri added a comment to T427187: ToolsDB disk space usage growing too fast.

I'm actually not entirely convinced that s51138__heritage_p is the issue here. It definitely runs a lot of queries, but I have no evidence that the number has increased compared to last month. We could also have some tool generating a small number of very intensive queries that update a lot of rows, and that would also explain the increase in binlog size. Queries of this type are unfortunately harder to find by scanning the binlog files.

Thu, May 28, 12:04 PM · tools-platform-team, Toolforge
fnegri updated subscribers of T427187: ToolsDB disk space usage growing too fast.

Disk space has stabilized as I expected:

Screenshot 2026-05-28 at 08.08.55.png (646×1 px, 64 KB)

Thu, May 28, 6:33 AM · tools-platform-team, Toolforge

Wed, May 27

fnegri added a comment to T427187: ToolsDB disk space usage growing too fast.

An update from today: binlogs are still growing, but they should hopefully stop growing in the next 24 hours as the retention window moves forward. The number of binlogs generated per day remains higher than usual (more than 600 files per day), so we should find out what's causing this.

Wed, May 27, 5:07 PM · tools-platform-team, Toolforge
fnegri added a comment to T427352: Remove obsolete maintain-kubeusers limitranges.

Happened following this commit https://gitlab.wikimedia.org/repos/cloud/toolforge/maintain-kubeusers/-/commit/9b51acbff6ecb7db733f289d996a517a6f56c596

Wed, May 27, 9:09 AM · tools-platform-team, cloud-services-team, Toolforge

Mon, May 25

fnegri added a comment to T427187: ToolsDB disk space usage growing too fast.

Binlogs are stored in 100M chunks. It looks like we are still generating more binlogs than usual even though the spike in "deletes per day" has ended. There might be other operations that are causing increased binlog activity.
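
To get a feel for what this means for disk usage, the sketch below multiplies the number of binlog files per day by the chunk size and the retention window. It assumes that "100M" means 100 MiB per file, and it uses 600 files per day and retention windows of 14 and 10 days as example inputs. The function name is hypothetical, and the result is a back-of-the-envelope estimate rather than a measurement.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def binlog_footprint_gib(files_per_day: int, retention_days: int, chunk_mib: int = 100) -> float:
    """Estimate the steady-state disk space held by binlogs, in GiB.

    Assumes every binlog file is a full chunk of `chunk_mib` MiB.
    """
    total_bytes = files_per_day * retention_days * chunk_mib * MIB
    return total_bytes / GIB


# Example inputs (assumptions): 600 files per day, 14 vs 10 days of retention.
for days in (14, 10):
    print(f"{days} days of retention: ~{binlog_footprint_gib(600, days):.0f} GiB")
```

Under these assumptions, shortening the retention window reduces the steady-state footprint proportionally, which is why a shorter window buys time without addressing the underlying write rate.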

Mon, May 25, 4:29 PM · tools-platform-team, Toolforge
fnegri added a comment to T427187: ToolsDB disk space usage growing too fast.

Looking at the ToolsDB debugging dashboard, it's not a repeat of T409716: [toolsdb] ibdata1 growing on primary, because in that case "ToolsDB primary-replica size difference" would be growing.

Mon, May 25, 10:19 AM · tools-platform-team, Toolforge
fnegri renamed T427187: ToolsDB disk space usage growing too fast from toolsdb disk space growing too fast to ToolsDB disk space growing too fast.
Mon, May 25, 10:07 AM · tools-platform-team, Toolforge
fnegri moved T427187: ToolsDB disk space usage growing too fast from Backlog to In progress on the tools-platform-team board.
Mon, May 25, 10:01 AM · tools-platform-team, Toolforge
fnegri edited projects for T427187: ToolsDB disk space usage growing too fast, added: tools-platform-team; removed cloud-services-team.
Mon, May 25, 10:01 AM · tools-platform-team, Toolforge
fnegri changed the status of T427187: ToolsDB disk space usage growing too fast from Open to In Progress.
Mon, May 25, 10:00 AM · tools-platform-team, Toolforge
fnegri created T427187: ToolsDB disk space usage growing too fast.
Mon, May 25, 10:00 AM · tools-platform-team, Toolforge

Fri, May 22

fnegri added a comment to T420203: Extend sre.mysql.upgrade to work with multiinstance hosts.

there's something wrong with the looping logic

Fri, May 22, 2:44 PM · tools-platform-team, Patch-For-Review, Data-Persistence, Data-Services, cloud-services-team
fnegri closed T422527: [wikireplicas] Upgrade clouddbs to 10.11.16 as Resolved.
Fri, May 22, 2:40 PM · tools-platform-team, cloud-services-team, Data-Services
fnegri moved T422527: [wikireplicas] Upgrade clouddbs to 10.11.16 from In progress to Done on the tools-platform-team board.

All hosts have been upgraded and rebooted.

Fri, May 22, 2:40 PM · tools-platform-team, cloud-services-team, Data-Services
fnegri updated the task description for T422527: [wikireplicas] Upgrade clouddbs to 10.11.16.
Fri, May 22, 2:38 PM · tools-platform-team, cloud-services-team, Data-Services
fnegri added a comment to T426790: Quota increase request for project osmit.

+1

Fri, May 22, 2:05 PM · WMIT-Infrastructure, Cloud-VPS (Quota-requests)
fnegri added a comment to T420203: Extend sre.mysql.upgrade to work with multiinstance hosts.

I did run the cookbook with test-cookbook on 3 more hosts and it worked fine (clouddb1015, clouddb1016, clouddb1017).

Fri, May 22, 1:33 PM · tools-platform-team, Patch-For-Review, Data-Persistence, Data-Services, cloud-services-team
fnegri added a comment to T427060: clouddb1017 getting stuck during shutdown.

Agreed, I was thinking of testing another shutdown in one or two weeks from now, and check if it gets stuck again. If it doesn't, I will mark it as resolved.

Fri, May 22, 1:18 PM · Data-Persistence, tools-platform-team, Data-Services
fnegri added a comment to T422527: [wikireplicas] Upgrade clouddbs to 10.11.16.

The cookbook for clouddb1017 took longer than expected because of T427060: clouddb1017 getting stuck during shutdown. It eventually completed, but failed on the last step because replication lag was not zero:

Fri, May 22, 1:14 PM · tools-platform-team, cloud-services-team, Data-Services
fnegri triaged T427060: clouddb1017 getting stuck during shutdown as Low priority.

There's a threaddump in /root/shutdown-threaddump if you want to have a look, it was taken before the SIGABRT.

Fri, May 22, 1:08 PM · Data-Persistence, tools-platform-team, Data-Services
fnegri renamed T427060: clouddb1017 getting stuck during shutdown from clouddb1017 getting stuck during shut down to clouddb1017 getting stuck during shutdown.
Fri, May 22, 1:01 PM · Data-Persistence, tools-platform-team, Data-Services
fnegri added a comment to T427060: clouddb1017 getting stuck during shutdown.

Rebooted the host and restarted mariadb, this is the startup log:

Fri, May 22, 1:00 PM · Data-Persistence, tools-platform-team, Data-Services
fnegri updated subscribers of T427060: clouddb1017 getting stuck during shutdown.

@fgiunchedi attempted a SIGABRT that resulted in:

Fri, May 22, 12:56 PM · Data-Persistence, tools-platform-team, Data-Services
fnegri added a comment to T427060: clouddb1017 getting stuck during shutdown.

Yes, it's not the first time this has happened. I created a task to have a paper trail and see if we can find ways to prevent it.

Fri, May 22, 12:43 PM · Data-Persistence, tools-platform-team, Data-Services
fnegri created T427060: clouddb1017 getting stuck during shutdown.
Fri, May 22, 12:37 PM · Data-Persistence, tools-platform-team, Data-Services

Thu, May 21

fnegri updated the task description for T422527: [wikireplicas] Upgrade clouddbs to 10.11.16.
Thu, May 21, 5:23 PM · tools-platform-team, cloud-services-team, Data-Services
fnegri changed the status of T420203: Extend sre.mysql.upgrade to work with multiinstance hosts from Open to In Progress.

I took a shot at adapting the cookbook to support clouddbs. I tested the patch above on clouddb1014 and it worked; you can see the full output in the Paste below:

Thu, May 21, 5:04 PM · tools-platform-team, Patch-For-Review, Data-Persistence, Data-Services, cloud-services-team
fnegri updated the task description for T422527: [wikireplicas] Upgrade clouddbs to 10.11.16.
Thu, May 21, 4:59 PM · tools-platform-team, cloud-services-team, Data-Services
fnegri created P92809 Upgrading a clouddb host with sre.mysql.upgrade.
Thu, May 21, 4:59 PM
fnegri claimed T420203: Extend sre.mysql.upgrade to work with multiinstance hosts.
Thu, May 21, 4:09 PM · tools-platform-team, Patch-For-Review, Data-Persistence, Data-Services, cloud-services-team
fnegri changed the status of T422527: [wikireplicas] Upgrade clouddbs to 10.11.16 from Open to In Progress.
Thu, May 21, 10:25 AM · tools-platform-team, cloud-services-team, Data-Services

Wed, May 20

fnegri renamed T415165: Install a clouddb host with Debian Trixie from Install a clouddb hosts with Debian Trixie to Install a clouddb host with Debian Trixie.
Wed, May 20, 1:41 PM · tools-platform-team, Data-Services, Data-Persistence
fnegri closed T415165: Install a clouddb host with Debian Trixie as Resolved.

clouddb1015 is running on Trixie and repooled.

Wed, May 20, 1:41 PM · tools-platform-team, Data-Services, Data-Persistence
fnegri closed T415165: Install a clouddb host with Debian Trixie, a subtask of T409162: Q2:rack/setup/install clouddb1026-1033, as Resolved.
Wed, May 20, 1:41 PM · ops-eqiad, cloud-services-team (Hardware), DC-Ops, SRE
fnegri closed T415165: Install a clouddb host with Debian Trixie, a subtask of T407472: Install a testing db with Debian Trixie, as Resolved.
Wed, May 20, 1:41 PM · DBA
fnegri added a comment to T415165: Install a clouddb host with Debian Trixie.

The cookbook did actually PASS, but an exception was raised while writing the PASS comment (that you can see above), which caused the following FAIL message to be posted.

Wed, May 20, 1:06 PM · tools-platform-team, Data-Services, Data-Persistence
fnegri created P92684 Spicerack exception while posting to Phabricator.
Wed, May 20, 1:05 PM
fnegri added a comment to T415165: Install a clouddb host with Debian Trixie.

wikireplicas-utils was also missing in trixie; in this case a simple copy worked:

Wed, May 20, 12:53 PM · tools-platform-team, Data-Services, Data-Persistence
fnegri added a comment to T415165: Install a clouddb host with Debian Trixie.

I found the source at https://gerrit.wikimedia.org/r/q/project:operations/debs/wmf-pt-kill but I don't know what procedure should be followed for building it, could you or @Marostegui please rebuild the package for trixie?

Wed, May 20, 11:29 AM · tools-platform-team, Data-Services, Data-Persistence
fnegri added a comment to T415165: Install a clouddb host with Debian Trixie.

There's a broken dependency:

Wed, May 20, 11:15 AM · tools-platform-team, Data-Services, Data-Persistence
fnegri added a comment to T415165: Install a clouddb host with Debian Trixie.

fnegri@apt1002:~$ sudo -i reprepro copy trixie-wikimedia bookworm-wikimedia wmf-pt-kill

Wed, May 20, 11:12 AM · tools-platform-team, Data-Services, Data-Persistence
fnegri added a comment to T415165: Install a clouddb host with Debian Trixie.

Reimage completed, MariaDB is running, but puppet is failing with:

Wed, May 20, 11:09 AM · tools-platform-team, Data-Services, Data-Persistence
fnegri updated the task description for T422527: [wikireplicas] Upgrade clouddbs to 10.11.16.
Wed, May 20, 10:04 AM · tools-platform-team, cloud-services-team, Data-Services
fnegri changed the status of T415165: Install a clouddb host with Debian Trixie from Open to In Progress.
Wed, May 20, 9:55 AM · tools-platform-team, Data-Services, Data-Persistence
fnegri changed the status of T415165: Install a clouddb host with Debian Trixie, a subtask of T409162: Q2:rack/setup/install clouddb1026-1033, from Open to In Progress.
Wed, May 20, 9:55 AM · ops-eqiad, cloud-services-team (Hardware), DC-Ops, SRE
fnegri changed the status of T415165: Install a clouddb host with Debian Trixie, a subtask of T407472: Install a testing db with Debian Trixie, from Open to In Progress.
Wed, May 20, 9:55 AM · DBA
fnegri added a comment to T415165: Install a clouddb host with Debian Trixie.

I'm reimaging clouddb1015 to trixie today, sorry for the delay.

Wed, May 20, 9:55 AM · tools-platform-team, Data-Services, Data-Persistence
fnegri added a comment to T359650: [jobs-api] Create storage layer, and save business models in persistent storage.

@Raymond_Ndibe the draft looks good, I'm ok with sending it today. A few minor fixes:

  • with toolforge -> with Toolforge
  • toolforge jobs -> I would wrap it in quotes to clarify it's a cli command, "toolforge jobs"
  • temporal -> temporary
  • line length is not consistent, please use the same length for all lines
Wed, May 20, 9:15 AM · Toolforge (Push-to-Deploy), tools-platform-team, User-Raymond_Ndibe

Mon, May 18

fnegri added a comment to T424362: Define update process for Toolforge builder/runner images.

I would like to test this assumption with this patch that updates our config to use the latest available version of heroku:24, both for default builds and for builds using --use-latest-versions.

Mon, May 18, 1:49 PM · Patch-For-Review, tools-platform-team, Toolforge
fnegri closed T426016: heroku builder and runner 24_0.21.8 rejects harbor ip host as Resolved.

Verified on lima-kilo on Linux, nuked the VM when ./start-devenv.sh asked and ran the verification commands

Mon, May 18, 9:59 AM · Patch-For-Review, Toolforge, tools-platform-team
fnegri added a comment to T424362: Define update process for Toolforge builder/runner images.

Side-note: Heroku automatically restarts all containers every 24 hours, so any updates to the stacks are rolled out automatically:

Mon, May 18, 9:23 AM · Patch-For-Review, tools-platform-team, Toolforge
fnegri created T426584: [toolsbeta] probe flapping on ipv6 only.
Mon, May 18, 8:57 AM · cloud-services-team, Toolforge

Fri, May 15

fnegri updated subscribers of T426016: heroku builder and runner 24_0.21.8 rejects harbor ip host.

After merging all the patches above, things are working fine on my machine.

Fri, May 15, 4:04 PM · Patch-For-Review, Toolforge, tools-platform-team

Thu, May 14

fnegri closed T426304: toolsbeta tools are not reachable as Resolved.

Re-deploying istio-system usually does not help, since that is a no-op. Today it did, because it made changes to the pod definitions, which triggered the deployment to get re-created.

Thu, May 14, 1:58 PM · Toolforge, tools-platform-team
fnegri closed T426304: toolsbeta tools are not reachable, a subtask of T426321: [istio-gateway] Deploying the component can cause an outage, as Resolved.
Thu, May 14, 1:58 PM · cloud-services-team, Toolforge
fnegri added a subtask for T426321: [istio-gateway] Deploying the component can cause an outage: T426304: toolsbeta tools are not reachable.
Thu, May 14, 1:57 PM · cloud-services-team, Toolforge
fnegri added a parent task for T426304: toolsbeta tools are not reachable: T426321: [istio-gateway] Deploying the component can cause an outage.
Thu, May 14, 1:57 PM · Toolforge, tools-platform-team
fnegri created T426321: [istio-gateway] Deploying the component can cause an outage.
Thu, May 14, 1:56 PM · cloud-services-team, Toolforge
fnegri added a comment to T426304: toolsbeta tools are not reachable.

@taavi after reading https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Runbooks/IstioGatewayPodMisplaced, my current understanding is:

  • whenever we deploy the "istio-gateway" component (which is not often), there’s a possibility that the new pods are misplaced
  • misplaced istio-gateway pods can cause all tools to become unreachable
  • redeploying istio-gateway won’t help, because that only updates the ConfigMap. The only ways to fix it are to redeploy istio-system (like I did today) or to manually delete the misplaced pod (as recommended in the runbook above)
Thu, May 14, 1:05 PM · Toolforge, tools-platform-team
fnegri added a comment to T426304: toolsbeta tools are not reachable.

This alert recovered thanks to the re-deployment.

Thu, May 14, 11:12 AM · Toolforge, tools-platform-team
fnegri added a comment to T426304: toolsbeta tools are not reachable.

The Helm diff above is because this branch was deployed in toolsbeta a few days ago: https://gitlab.wikimedia.org/repos/cloud/toolforge/toolforge-deploy/-/merge_requests/1235/diffs

Thu, May 14, 11:06 AM · Toolforge, tools-platform-team
fnegri added a comment to T426304: toolsbeta tools are not reachable.

Redeploying istio-system seems to have fixed it, there was an unexpected Helm difference:

Thu, May 14, 10:39 AM · Toolforge, tools-platform-team