fgiunchedi (Filippo Giunchedi)
/* No comment */

User Details

User Since
Oct 3 2014, 8:06 AM (584 w, 4 d)
Availability
Available
IRC Nick
godog
LDAP User
Filippo Giunchedi
MediaWiki User
FGiunchedi (WMF)

Recent Activity

Today

fgiunchedi added a comment to T410989: Remove second network connection for cloudcephosd hosts with single uplink.

JFYI we can now proceed with cloudcephosd1052 too

Tue, Dec 16, 1:21 PM · SRE, DC-Ops
fgiunchedi added a comment to T399180: Cloudcephosd: migrate to single network uplink.

I think the easiest would be to:

  • Remove the spurious enp13s0f1np1 config, run puppet to verify no other changes will be applied
  • Make sure the ifupdown config matches e.g. cloudcephosd1050, modulo addresses
  • Reboot the host and verify addresses/interfaces come up as expected

Yeah I think this should be ok.

Tue, Dec 16, 1:20 PM · netops, Infrastructure-Foundations, SRE

Yesterday

fgiunchedi added a comment to T412506: Investigation into ToolforgeKubernetesNodeNotReady 2025-12-12 page.

Checking the kube-state-metrics container logs I get the following:

Mon, Dec 15, 11:46 AM · Toolforge, cloud-services-team

Fri, Dec 12

fgiunchedi added a comment to T412506: Investigation into ToolforgeKubernetesNodeNotReady 2025-12-12 page.

Something I noticed is that kube-state-metrics as a whole has lately been failing occasional scrapes from Prometheus: (long URL, toolforge.org isn't allowed on w.wiki) https://prometheus.svc.toolforge.org/tools/graph?g0.expr=sum_over_time(up%7Bjob%3D%22k8s-kube-state-metrics%22%7D%5B1h%5D)%2F%20count_over_time(up%7Bjob%3D%22k8s-kube-state-metrics%22%7D%5B1h%5D)&g0.tab=0&g0.stacked=0&g0.show_exemplars=0&g0.range_input=4w&g0.end_input=2025-12-12%2016%3A54%3A05&g0.moment_input=2025-12-12%2016%3A54%3A05
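
For readability, the query in that URL can be decoded with a few lines of Python. This is only a sketch: it unquotes the `g0.expr` value copied from the link, and `scrape_success_ratio` is a hypothetical helper (not part of Prometheus) that mirrors what the expression computes, namely the fraction of successful scrapes (`up == 1`) in a window.

```python
from urllib.parse import unquote

# The g0.expr parameter, copied verbatim from the URL in the comment above.
ENCODED_EXPR = (
    "sum_over_time(up%7Bjob%3D%22k8s-kube-state-metrics%22%7D%5B1h%5D)"
    "%2F%20count_over_time(up%7Bjob%3D%22k8s-kube-state-metrics%22%7D%5B1h%5D)"
)

# Percent-decoding yields the PromQL expression as typed in the graph UI.
EXPR = unquote(ENCODED_EXPR)


def scrape_success_ratio(up_samples):
    """Hypothetical helper: same arithmetic as the query, on a plain list.

    `up_samples` holds the 0/1 values of the `up` metric within one window;
    sum / count is the share of scrapes that succeeded.
    """
    if not up_samples:
        raise ValueError("no samples in window")
    return sum(up_samples) / len(up_samples)


if __name__ == "__main__":
    print(EXPR)
    print(scrape_success_ratio([1, 1, 0, 1]))
```

A ratio below 1 for a given hour means at least one scrape of the job failed in that hour.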

Fri, Dec 12, 4:55 PM · Toolforge, cloud-services-team
fgiunchedi updated the task description for T412506: Investigation into ToolforgeKubernetesNodeNotReady 2025-12-12 page.
Fri, Dec 12, 4:53 PM · Toolforge, cloud-services-team
fgiunchedi created T412506: Investigation into ToolforgeKubernetesNodeNotReady 2025-12-12 page.
Fri, Dec 12, 10:30 AM · Toolforge, cloud-services-team

Thu, Dec 4

fgiunchedi added a comment to T399180: Cloudcephosd: migrate to single network uplink.

I took a look at why cloudcephosd1052 still has its second NIC up, currently:

Thu, Dec 4, 10:51 AM · netops, Infrastructure-Foundations, SRE
fgiunchedi closed T399807: Allow team customization for service::catalog probes, a subtask of T411470: Page on cloudweb/horizon down, as Resolved.
Thu, Dec 4, 10:26 AM · Horizon, cloud-services-team
fgiunchedi closed T399807: Allow team customization for service::catalog probes as Resolved.

This is done! I'll follow up with an announcement to sre@

Thu, Dec 4, 10:26 AM · Observability-Alerting

Wed, Dec 3

fgiunchedi added a comment to T411470: Page on cloudweb/horizon down.

I dug into this a little, currently:

Wed, Dec 3, 8:50 AM · Horizon, cloud-services-team

Tue, Dec 2

fgiunchedi added a comment to T411470: Page on cloudweb/horizon down.

Availability as seen by network probes:

Tue, Dec 2, 10:36 AM · Horizon, cloud-services-team
fgiunchedi created T411470: Page on cloudweb/horizon down.
Tue, Dec 2, 10:19 AM · Horizon, cloud-services-team

Mon, Dec 1

fgiunchedi added a comment to T411274: Audit and standardize on UTC timezone for grafana.wmcloud.org dashboards.

ALSO: make sure grafana's default is UTC on new dashboards

Mon, Dec 1, 3:36 PM · cloud-services-team
fgiunchedi added a comment to T411343: thanos-store OOMing on titan eqiad.

I'm aware there is/was work going on on thanos/titan in T410152: Disk space saturation (/srv) on Titan hosts and perhaps related

Mon, Dec 1, 9:26 AM · observability, SRE
fgiunchedi created T411343: thanos-store OOMing on titan eqiad.
Mon, Dec 1, 9:25 AM · observability, SRE

Fri, Nov 28

fgiunchedi added a comment to T411274: Audit and standardize on UTC timezone for grafana.wmcloud.org dashboards.

To do the audit effectively, we can adapt search-grafana-dashboards.js from https://wikitech.wikimedia.org/wiki/Grafana#Search/audit_metrics_usage_across_dashboards

Fri, Nov 28, 2:59 PM · cloud-services-team
fgiunchedi created T411274: Audit and standardize on UTC timezone for grafana.wmcloud.org dashboards.
Fri, Nov 28, 2:58 PM · cloud-services-team
fgiunchedi added a comment to T391369: If the inactive clouddumps host goes down, it causes a ripple effect on Cloud VPS and Toolforge.

I briefly looked at the clouddumps1002 downtime from yesterday, and of course there was ~30m downtime for dumps.w.o since 1002 serves those:

Fri, Nov 28, 11:32 AM · cloud-services-team (FY2025/26-Q1-Q2), Toolforge, Cloud-VPS
fgiunchedi created T411248: Plan to make clouddumps more resilient and easier to operate.
Fri, Nov 28, 11:20 AM · Data-Services, Cloud-VPS, cloud-services-team

Thu, Nov 27

fgiunchedi updated the task description for T410983: VM metadata service slow response.
Thu, Nov 27, 4:20 PM · cloud-services-team, Cloud-VPS
fgiunchedi added a comment to T411193: SystemdUnitDown and SystemdUnitDownForLong.

Something else to note: the alerts are deployed in eqiad only, not codfw

Thu, Nov 27, 2:37 PM · cloud-services-team
fgiunchedi created T411193: SystemdUnitDown and SystemdUnitDownForLong.
Thu, Nov 27, 2:35 PM · cloud-services-team
fgiunchedi added a comment to T326325: Updates of passwords of users created with postgresql::user / PostgreSQL change to scram-sha256.

Leaving a suggestion for a workaround here for the record: while having a native pg facility to detect password changes would be optimal, I think what we could do is write the (hashed, possibly salted with a salt we control) passwords on the filesystem (e.g. one per file). Then use the following logic:

  • if the password file doesn't exist, the user and its password need to be created in pg and the fs
  • if the password file does exist, compare its value with the current puppet password. If they differ then update pg and the fs. If they don't then there's nothing to do.
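
The logic above could be sketched roughly as follows. This is an illustration only, not the puppet implementation: `password_digest`, `sync_password`, the one-file-per-user layout under `state_dir` and the `apply_to_pg` callback are all hypothetical names, and HMAC-SHA256 is just one possible choice for the salted hash.

```python
import hashlib
import hmac
from pathlib import Path


def password_digest(password, salt):
    # Salted hash, so the cleartext password is never written to disk.
    # `salt` is bytes we control (assumption: HMAC-SHA256 is acceptable here).
    return hmac.new(salt, password.encode("utf-8"), hashlib.sha256).hexdigest()


def sync_password(user, password, state_dir, salt, apply_to_pg):
    """Return "create", "update" or "noop" and keep pg and the fs in sync.

    `apply_to_pg(user, password, action)` is a hypothetical callback that
    would run the actual CREATE/ALTER statement against PostgreSQL.
    """
    state_file = Path(state_dir) / user
    wanted = password_digest(password, salt)

    if not state_file.exists():
        action = "create"   # no record on the fs: create user and password
    elif state_file.read_text().strip() == wanted:
        return "noop"       # stored hash matches puppet's password
    else:
        action = "update"   # hashes differ: password changed in puppet

    apply_to_pg(user, password, action)
    # Write the fs record only after pg was updated successfully.
    state_file.write_text(wanted + "\n")
    return action
```

Writing the file after the database call means a failed pg update gets retried on the next run rather than being recorded as done.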
Thu, Nov 27, 10:41 AM · Infrastructure-Foundations, SRE
fgiunchedi added a comment to T411081: Improve how virt networks are configured in cloudgw.

Something I wanted to add: I'm not very familiar with that part of the puppet codebase, though I was wondering if we can start referring to the networks and their attributes by name. It would let us write puppet code like "give me the subnets for VXLAN/IPv6-dualstack, either v4 or v6 or both", in other words get to more understandable and manageable code. Ditto for other attributes/properties of a given network, like its gateway, etc. Let me know what you think!
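
To make the idea concrete, here is a rough sketch of a by-name lookup, in Python rather than puppet for brevity. Everything in it is assumed for illustration: the `NETWORKS` structure, the `subnets` helper, and the addresses, which are documentation prefixes and not the real cloudgw networks.

```python
import ipaddress

# Hypothetical data: networks keyed by name, each with its attributes.
# Prefixes are RFC 5737 / RFC 3849 documentation ranges, not real ones.
NETWORKS = {
    "VXLAN/IPv6-dualstack": {
        "subnets": ["192.0.2.0/24", "2001:db8::/64"],
        "gateway": "192.0.2.1",
    },
}


def subnets(name, family=None):
    """Return the subnets of the named network.

    family=4 or family=6 filters by IP version; None returns both.
    """
    nets = [ipaddress.ip_network(s) for s in NETWORKS[name]["subnets"]]
    if family is None:
        return nets
    return [n for n in nets if n.version == family]
```

With something like this, `subnets("VXLAN/IPv6-dualstack", 6)` reads the same way as the sentence describing it, and other attributes such as the gateway hang off the same name.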

Thu, Nov 27, 10:23 AM · tools-infrastructure-team, Cloud-VPS

Wed, Nov 26

fgiunchedi added a comment to T411081: Improve how virt networks are configured in cloudgw.

Looks good to me! Definitely good to refactor these bits

Wed, Nov 26, 10:32 AM · tools-infrastructure-team, Cloud-VPS
fgiunchedi closed T411023: pontoon join-stack should ask for puppetserver host key verification as Resolved.
Wed, Nov 26, 8:39 AM · Pontoon

Tue, Nov 25

fgiunchedi created T411023: pontoon join-stack should ask for puppetserver host key verification.
Tue, Nov 25, 2:26 PM · Pontoon
fgiunchedi added a comment to T410989: Remove second network connection for cloudcephosd hosts with single uplink.

Thinking it through, what is probably best:

  1. We disable the switch interfaces terminating these second ports now
  2. We re-run the PuppetDB import script for these hosts to update their Netbox interface based on current status
  3. We ask DC-Ops to remove all the cables on site, deleting each one in Netbox as they go

I can take care of 1 and 2 I think, then we can ask dc-ops to do the rest.

Tue, Nov 25, 1:23 PM · SRE, DC-Ops
fgiunchedi added a comment to T410989: Remove second network connection for cloudcephosd hosts with single uplink.

@fgiunchedi have all the cables been removed on site?

Typically I would ask DC-Ops to remove in Netbox when they remove on site, to minimize the length of time there is any discrepancy between Netbox and reality.

Tue, Nov 25, 1:07 PM · SRE, DC-Ops
fgiunchedi created T410989: Remove second network connection for cloudcephosd hosts with single uplink.
Tue, Nov 25, 9:10 AM · SRE, DC-Ops
fgiunchedi added a comment to T399180: Cloudcephosd: migrate to single network uplink.

The logical side on the host side is done. Next up is deleting the interfaces from Netbox for the hosts and unplugging the network cables. I'll file subtasks.

Tue, Nov 25, 9:09 AM · netops, Infrastructure-Foundations, SRE
fgiunchedi updated the task description for T399180: Cloudcephosd: migrate to single network uplink.
Tue, Nov 25, 9:06 AM · netops, Infrastructure-Foundations, SRE
fgiunchedi triaged T410983: VM metadata service slow response as Low priority.
Tue, Nov 25, 8:53 AM · cloud-services-team, Cloud-VPS
fgiunchedi updated the task description for T410983: VM metadata service slow response.
Tue, Nov 25, 8:29 AM · cloud-services-team, Cloud-VPS
fgiunchedi updated the task description for T410983: VM metadata service slow response.
Tue, Nov 25, 8:23 AM · cloud-services-team, Cloud-VPS
fgiunchedi updated the task description for T410983: VM metadata service slow response.
Tue, Nov 25, 8:19 AM · cloud-services-team, Cloud-VPS
fgiunchedi created T410983: VM metadata service slow response.
Tue, Nov 25, 8:12 AM · cloud-services-team, Cloud-VPS

Mon, Nov 24

fgiunchedi added a comment to T410712: [bug] PAWS: User servers failing to spawn (timeout 30s).

I can't currently reproduce the issue -- navigating to https://hub-paws.wmcloud.org/ spawns a container for me and works as expected. Is this still a problem @Sadeiiw67 ?

Mon, Nov 24, 7:58 AM · cloud-services-team, PAWS

Fri, Nov 21

fgiunchedi closed T407586: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware as Resolved.

I'm calling this one done since we have a workaround in place; I'll follow up on the Debian bug for an actual fix

Fri, Nov 21, 8:25 AM · Upstream, cloud-services-team, SRE

Thu, Nov 20

fgiunchedi added a comment to T407140: Plan networking for Toolforge-on-Metal experiment.

Ok thanks @fgiunchedi for the info.

I think that seems doable. As per the sub-task about a VRF I think that will be needed. And route leaking is not something I really want to do, so we will probably need the "cloudgw" to route traffic between the new VRF and the existing cloud-private / Openstack networks.

Thu, Nov 20, 10:04 AM · Infrastructure-Foundations, netops, Toolforge, tools-infrastructure-team
fgiunchedi created T410601: Improve "reuse" feature for standard partman recipes.
Thu, Nov 20, 7:53 AM · User-MoritzMuehlenhoff, Infrastructure-Foundations, SRE

Wed, Nov 19

fgiunchedi added a comment to T407586: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware.

Reported to Debian as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1121006

Wed, Nov 19, 1:50 PM · Upstream, cloud-services-team, SRE
fgiunchedi updated the task description for T407586: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware.
Wed, Nov 19, 7:58 AM · Upstream, cloud-services-team, SRE

Tue, Nov 18

fgiunchedi added a comment to T407586: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware.

And lsblk -t for comparison:

Tue, Nov 18, 3:40 PM · Upstream, cloud-services-team, SRE
fgiunchedi added a comment to T407586: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware.

As far as I understand the problem, lvm metadata size and alignment can be related to the underlying block device reported data, specifically the "optimal i/o size":

Tue, Nov 18, 3:26 PM · Upstream, cloud-services-team, SRE
fgiunchedi added a comment to T407586: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware.

I investigated this a little more today. The issue is grub_lvm_detect allocating memory based on the lvm metadata areas, of which locn[2] is of size 4293914624 (or 0xFFF00000)

Tue, Nov 18, 2:47 PM · Upstream, cloud-services-team, SRE
fgiunchedi added a comment to T407586: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware.

I'm giving debugging this issue one more go; as part of this, we now have pause-reboot.cfg included for cloudcontrol2010-dev, which will avoid mucking around on apt1002

Tue, Nov 18, 10:16 AM · Upstream, cloud-services-team, SRE
fgiunchedi updated the task description for T399180: Cloudcephosd: migrate to single network uplink.
Tue, Nov 18, 10:13 AM · netops, Infrastructure-Foundations, SRE
fgiunchedi added a comment to T409690: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan.

Something else I forgot: I'm assuming this also applies to codfw, i.e. we'll be moving those hosts to a single NIC as well?

Tue, Nov 18, 10:12 AM · netops, Infrastructure-Foundations, SRE
fgiunchedi updated the task description for T399180: Cloudcephosd: migrate to single network uplink.
Tue, Nov 18, 10:08 AM · netops, Infrastructure-Foundations, SRE

Mon, Nov 17

fgiunchedi added a comment to T409690: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan.

Thank you @cmooney ! FYI as per Andrew we really only care about cloudcephosd1035 through cloudcephosd1052 since the rest will be decom'd soon anyways

Mon, Nov 17, 5:08 PM · netops, Infrastructure-Foundations, SRE
fgiunchedi updated the task description for T399180: Cloudcephosd: migrate to single network uplink.
Mon, Nov 17, 5:06 PM · netops, Infrastructure-Foundations, SRE
fgiunchedi closed T409912: Clean host puppet certificate upon destroy as Resolved.
Mon, Nov 17, 8:48 AM · Pontoon
fgiunchedi closed T409905: Inject netbox-hiera data for stack hosts as Resolved.
Mon, Nov 17, 8:48 AM · Pontoon

Nov 13 2025

fgiunchedi updated the task description for T409890: [toolsdb] pt-heartbeat service should automatically follow the primary.
Nov 13 2025, 10:52 AM · cloud-services-team, Toolforge

Nov 12 2025

fgiunchedi created T409912: Clean host puppet certificate upon destroy.
Nov 12 2025, 11:53 AM · Pontoon
fgiunchedi created T409905: Inject netbox-hiera data for stack hosts.
Nov 12 2025, 10:52 AM · Pontoon
fgiunchedi updated the task description for T399180: Cloudcephosd: migrate to single network uplink.
Nov 12 2025, 8:46 AM · netops, Infrastructure-Foundations, SRE
fgiunchedi created T409890: [toolsdb] pt-heartbeat service should automatically follow the primary.
Nov 12 2025, 7:03 AM · cloud-services-team, Toolforge

Nov 11 2025

fgiunchedi added a comment to T367370: Shift frack alerting to use prometheus-alertmanager instead of icinga.

Thank you for following up @Dwisehaupt and I'm glad to know you are making progress! These days I am no longer part of Observability, and will defer to @hnowlan to take it from here

Nov 11 2025, 8:41 AM · Patch-For-Review, Observability-Alerting, Fundraising-Backlog, fundraising-tech-ops

Nov 10 2025

fgiunchedi added a comment to T409690: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan.

Yes please @cmooney, much appreciated! Note that this is currently not a blocker / not high priority, in the sense that I'll be test-driving the change on 1048 and 1049, which are configured correctly AFAICT. Having it done and dusted would of course be great

Nov 10 2025, 2:41 PM · netops, Infrastructure-Foundations, SRE
fgiunchedi added a comment to T399180: Cloudcephosd: migrate to single network uplink.

@taavi @Andrew @cmooney what do you think of the above?

The plan sounds good. We need to audit and make sure all the primary links to the cloudcephosd* hosts are set to trunk mode, with the storage vlan as one of the tagged interfaces (in Netbox, and then homer run if we change anything). This change can be made non-disruptively in advance however, so shouldn't be a blocker.

Nov 10 2025, 9:14 AM · netops, Infrastructure-Foundations, SRE
fgiunchedi created T409690: Audit and verify all cloudcephosd have their primary interface tagged and access to cloud-storage vlan.
Nov 10 2025, 9:12 AM · netops, Infrastructure-Foundations, SRE

Nov 6 2025

fgiunchedi added a comment to T407140: Plan networking for Toolforge-on-Metal experiment.

Thank you @cmooney for the summary, I'll add a few thoughts I had while working on the Toolforge on Metal project design document.

Nov 6 2025, 9:53 AM · Infrastructure-Foundations, netops, Toolforge, tools-infrastructure-team
fgiunchedi added a project to T409287: [toolsdb] Destroy tools-db-4 and create new host: Sustainability (Incident Followup).
Nov 6 2025, 9:42 AM · Toolforge (Toolforge iteration 25), Sustainability (Incident Followup), cloud-services-team (FY2025/26-Q1-Q2)
fgiunchedi added a comment to T357977: [toolforge.infra] run and monitor our own sample tools.

Mentioning it here as a followup to T409244: Toolforge outage: toolsdb out of space: it is important we do monitor the ability to read/write toolsdb, and possibly page on it

Nov 6 2025, 9:39 AM · cloud-services-team, Patch-For-Review, User-aborrero, Toolforge
fgiunchedi created T409404: [toolsdb] Add filesystem space alerts.
Nov 6 2025, 9:37 AM · Toolforge (Toolforge iteration 25), cloud-services-team (FY2025/26-Q1-Q2), Sustainability (Incident Followup)

Nov 5 2025

fgiunchedi updated the task description for T409294: Fix MTU on single-NIC Ceph nodes.
Nov 5 2025, 3:12 PM · cloud-services-team, Cloud-VPS
fgiunchedi added a comment to T408543: MTU setting in IPv6 VMs causes issues with Docker.
  1. Move cloudvirt/cloudnet to support jumbo frames ensuring VMs can have a 1500 byte MTU
Nov 5 2025, 2:34 PM · Patch-For-Review, cloud-services-team, Cloud-VPS
fgiunchedi added a comment to T409244: Toolforge outage: toolsdb out of space.

Doing a comparison with the replica on tools-db-6, there's ~800G free there:

Nov 5 2025, 7:41 AM · cloud-services-team, Toolforge
fgiunchedi added a comment to T409244: Toolforge outage: toolsdb out of space.

disk space free trend for tools-db-4 over the last 30d

Nov 5 2025, 7:26 AM · cloud-services-team, Toolforge
fgiunchedi added a comment to T409244: Toolforge outage: toolsdb out of space.

tools-db-4 storage volume is out of space, I'll use this task for tracking

Nov 5 2025, 7:19 AM · cloud-services-team, Toolforge

Nov 3 2025

fgiunchedi created T409029: Flapping wikitech-static icinga alert.
Nov 3 2025, 7:50 AM · wikitech.wikimedia.org, cloud-services-team

Oct 30 2025

fgiunchedi triaged T408707: [jobs-api] apply topology constraints as Medium priority.
Oct 30 2025, 10:04 AM · Toolforge (Toolforge iteration 25), Patch-For-Review, cloud-services-team
fgiunchedi triaged T408766: [toolforge_run_functional_tests] Doesn't support alternate (fork) repo urls, unexpectedly continues on missing branch as Medium priority.
Oct 30 2025, 10:04 AM · cloud-services-team, Toolforge
fgiunchedi triaged T408767: [toolforge_run_functional_tests] copy/paste format doesn't work as Medium priority.
Oct 30 2025, 10:04 AM · Toolforge (Toolforge iteration 25), cloud-services-team
fgiunchedi triaged T408786: Upgrade cert-manager past 1.15 as Medium priority.
Oct 30 2025, 10:03 AM · cloud-services-team, Toolforge
fgiunchedi triaged T408787: [striker] tool creation validation is not applying project prefix consistently as Medium priority.
Oct 30 2025, 10:03 AM · Striker, cloud-services-team
fgiunchedi added a comment to T408543: MTU setting in IPv6 VMs causes issues with Docker.

A major downside of doing it via Puppet is that that won't have any effect on instances with non-Puppetized services (so basically anything not managed by WMF SREs).

Oct 30 2025, 7:30 AM · Patch-For-Review, cloud-services-team, Cloud-VPS

Oct 29 2025

fgiunchedi changed the status of T399180: Cloudcephosd: migrate to single network uplink from Stalled to Open.
Oct 29 2025, 7:49 AM · netops, Infrastructure-Foundations, SRE
fgiunchedi added a comment to T408543: MTU setting in IPv6 VMs causes issues with Docker.

I'm +1 on getting docker's puppetization to do the right thing, in other words detect the default route's mtu and set said value as docker's default.
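
As a rough sketch of "detect the default route's MTU and hand it to docker" (in Python, not puppet): the helpers below parse the JSON printed by `ip -j route show default` and `ip -j link show`, then render a daemon.json fragment with docker's `mtu` setting. The function names and the sample device/MTU values are made up for illustration; how puppet would actually collect and apply this is left open.

```python
import json


def default_route_device(routes_json):
    """Pick the device of the first default route.

    `routes_json` is the output of `ip -j route show default`
    (or `ip -6 -j route show default` for IPv6).
    """
    for route in json.loads(routes_json):
        if route.get("dst") == "default" and "dev" in route:
            return route["dev"]
    return None


def link_mtu(links_json, device):
    """Look up the MTU of `device` in the output of `ip -j link show`."""
    for link in json.loads(links_json):
        if link.get("ifname") == device:
            return link.get("mtu")
    return None


def docker_daemon_config(mtu):
    # Fragment for /etc/docker/daemon.json; a real config would merge this
    # with whatever other settings are already managed there.
    return json.dumps({"mtu": mtu}, indent=2)
```

For example, feeding in a default route via a hypothetical `eth0` whose link reports `"mtu": 1450` would produce `{"mtu": 1450}` as the daemon setting.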

Oct 29 2025, 7:35 AM · Patch-For-Review, cloud-services-team, Cloud-VPS
fgiunchedi triaged T408574: [jobs-api] handle qualified image names as Medium priority.
Oct 29 2025, 7:20 AM · Toolforge (Toolforge iteration 25), Patch-For-Review, cloud-services-team
fgiunchedi triaged T408633: Access to grafana.wmcloud.org as Medium priority.
Oct 29 2025, 7:20 AM · cloud-services-team, Toolforge

Oct 28 2025

fgiunchedi triaged T408286: Pywikibot OAuth/BotPassword authentication fails when login to third-party Wikimedia sites (Superset, Commons Query) as Medium priority.
Oct 28 2025, 11:12 AM · PendingChangesBot, Pywikibot
fgiunchedi triaged T408387: CloudVPS instance for ProVe as Medium priority.
Oct 28 2025, 11:11 AM · cloud-services-team (FY2025/26-Q1-Q2), Cloud-VPS (Project-requests)
fgiunchedi added a comment to T370037: Cloud VPS: extend tofu-infra coverage.

I'm thinking this over again, and I'm fairly compelled by the fact that resources managed by tofu are in source control with a history.

That said, I know that @fgiunchedi found the current workflow very confusing. Filippo, can you talk about your experience a bit here? One thing I wonder about is how both deployments (codfw1dev and eqiad1) are currently coupled, so if tofu can't apply in codfw1dev then we also can't change things in eqiad1; I'm pretty sure that needs to be changed.

Oct 28 2025, 11:10 AM · Cloud-VPS, User-aborrero, Epic, cloud-services-team

Oct 27 2025

fgiunchedi triaged T408371: [jobs-api] Allow toolforge job schedule to specify a time zone as Medium priority.
Oct 27 2025, 10:05 AM · cloud-services-team, Toolforge
fgiunchedi triaged T408331: Policy for (external) contributions / code style as Medium priority.
Oct 27 2025, 10:05 AM · cloud-services-team, Toolforge
fgiunchedi triaged T408354: [wikireplicas] Create views for new wiki pcmwikiquote as Medium priority.
Oct 27 2025, 10:05 AM · cloud-services-team (FY2025/26-Q1-Q2), Data-Services
fgiunchedi triaged T408346: [wikireplicas] Create views for new wiki minwikisource as Medium priority.
Oct 27 2025, 10:04 AM · cloud-services-team (FY2025/26-Q1-Q2), Data-Services
fgiunchedi triaged T397332: quarry: Use a proper Python package manager as Medium priority.
Oct 27 2025, 10:04 AM · Patch-For-Review, cloud-services-team, RoadToWiki, Quarry
fgiunchedi triaged T408321: [builds] What are the current best practices for CI? as Medium priority.
Oct 27 2025, 10:04 AM · cloud-services-team, Toolforge
fgiunchedi triaged T408108: [build service] Python pack is outdated, does not support latest Python 3.14 stable release as Medium priority.
Oct 27 2025, 10:04 AM · cloud-services-team, Toolforge
fgiunchedi triaged T407485: Set up x1 replication to an-redacteddb1001 as Medium priority.
Oct 27 2025, 10:04 AM · Data-Platform-SRE (2025.11.07 - 2025.11.28), Essential-Work, Data-Engineering, Data-Services, Data-Persistence, cloud-services-team, Privacy Engineering
fgiunchedi triaged T408002: [functional tests,toolforge-deploy] functional tests are optimistic about retries/timeouts as Medium priority.
Oct 27 2025, 10:04 AM · Patch-For-Review, cloud-services-team, Toolforge
fgiunchedi triaged T408034: toolforge jobs dump includes booleans as strings as Medium priority.
Oct 27 2025, 10:03 AM · cloud-services-team, Toolforge
fgiunchedi triaged T408028: [components-api] tries to decode server errors as json as Medium priority.
Oct 27 2025, 10:03 AM · cloud-services-team, Toolforge
fgiunchedi triaged T407586: latest Trixie image (as of 2025-10-16) grub failure on R450 hardware as High priority.
Oct 27 2025, 10:03 AM · Upstream, cloud-services-team, SRE
fgiunchedi closed T407868: JobUnavailable Reduced availability for job maintain_dbusers_eqiad in cloud@eqiad as Resolved.

Alert is gone, optimistically resolving

Oct 27 2025, 10:01 AM · cloud-services-team
fgiunchedi closed T407688: JobUnavailable Reduced availability for job openstack in cloud@eqiad as Resolved.
Oct 27 2025, 10:01 AM · cloud-services-team