jcrespo (Jaime Crespo)
Sr Site Reliability Engineer

User Details

User Since
May 11 2015, 8:31 AM (569 w, 4 d)
Availability
Available
IRC Nick
jynus
LDAP User
Jcrespo
MediaWiki User
JCrespo (WMF) [ Global Accounts ]

Recent Activity

Yesterday

jcrespo added a comment to T415237: etherpad table size is 233GB / plan to delete all etherpads in April 2026.

My suggestion is to move the current etherpad to etherpad-old right now, and once the due time has passed, we remove it. That will fix the "we need some time to prepare for the change".

Fri, Apr 10, 7:27 AM · User-notice, collaboration-services, Wikimedia-Etherpad, Data-Persistence

Thu, Apr 9

jcrespo added a comment to T420464: Setup ms-backup[12]00[34] and stop using and prepare for decommission ms-backup[12]00[12].

Leaving the old hosts disabled during the weekend, and we will run the decommissioning scripts next week, to verify the new hosts work well on their own.

Thu, Apr 9, 3:56 PM · Data-Persistence-Backup, media-backups
jcrespo raised the priority of T156544: Create backups of Wikimedia content in diverse geographic places from Low to High.
Thu, Apr 9, 3:54 PM · Data-Persistence-Backup, Internet-Archive, Offline-Working-Group
jcrespo added a subtask for T420464: Setup ms-backup[12]00[34] and stop using and prepare for decommission ms-backup[12]00[12]: T422852: decommission ms-backup2001 & ms-backup2002.
Thu, Apr 9, 3:42 PM · Data-Persistence-Backup, media-backups
jcrespo added a parent task for T422852: decommission ms-backup2001 & ms-backup2002: T420464: Setup ms-backup[12]00[34] and stop using and prepare for decommission ms-backup[12]00[12].
Thu, Apr 9, 3:42 PM · media-backups, Data-Persistence-Backup, decommission-hardware
jcrespo created T422852: decommission ms-backup2001 & ms-backup2002.
Thu, Apr 9, 3:41 PM · media-backups, Data-Persistence-Backup, decommission-hardware
jcrespo added a parent task for T422851: decommission ms-backup1001 & ms-backup1002: T420464: Setup ms-backup[12]00[34] and stop using and prepare for decommission ms-backup[12]00[12].
Thu, Apr 9, 3:40 PM · media-backups, Data-Persistence-Backup, decommission-hardware
jcrespo added a subtask for T420464: Setup ms-backup[12]00[34] and stop using and prepare for decommission ms-backup[12]00[12]: T422851: decommission ms-backup1001 & ms-backup1002.
Thu, Apr 9, 3:40 PM · Data-Persistence-Backup, media-backups
jcrespo created T422851: decommission ms-backup1001 & ms-backup1002.
Thu, Apr 9, 3:39 PM · media-backups, Data-Persistence-Backup, decommission-hardware
jcrespo added a project to T420623: netbox report error for puppetdb serial versus netbox serial for backup1012: collaboration-services.

@jcrespo we can do this next week Wednesday April 15th at 10am CT . Thank you.

Thu, Apr 9, 9:04 AM · collaboration-services, SRE, ops-eqiad, DC-Ops
jcrespo added a comment to T422777: Migrate an entire sX section to Debian Trixie.

I will wait, as usual, to be the last of the replicas to be upgraded. One question: as a reimage could cause a package upgrade, please let me know which is the latest version that can be installed, as dbprov hosts should have the same or a higher minor version for the backups to be considered safe.

Thu, Apr 9, 8:57 AM · DBA
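The version rule in the comment above can be sketched as a small check. This is only an illustration: the function names are hypothetical, and it assumes that "same or higher" means comparing the full numeric version (for example 10.11.16) component by component rather than as text.

```python
import re


def parse_version(text):
    """Extract (major, minor, patch) from a string such as '10.11.16-MariaDB'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def backup_host_is_safe(replica_version, dbprov_version):
    """Return True when the dbprov host runs the same or a newer version.

    Hypothetical helper: tuples compare numerically, so 10.11.16 is
    correctly treated as newer than 10.11.9 (a plain string comparison
    would get that wrong).
    """
    return parse_version(dbprov_version) >= parse_version(replica_version)


same = backup_host_is_safe("10.11.16", "10.11.16")
older_dbprov = backup_host_is_safe("10.11.16", "10.11.9")
```

Here `same` is true and `older_dbprov` is false, which is the situation the comment asks to avoid before a reimage.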
jcrespo added a comment to T421719: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.

We can work together on that, the process is a bit more manual, and requires editing the host's /etc/network/interfaces file, updating netbox and updating the switch port config.

Thu, Apr 9, 8:52 AM · DBA, Ceph, SRE-swift-storage, User-Eevans, Data-Persistence

Wed, Apr 8

jcrespo closed T419970: backup2005 power supplies fried or overvoltage as Resolved.

@Jhancock.wm I want to thank you deeply for the work, a lot! Please note your work will pay off, as regenerating the backups would have taken hundreds of hours, plus dozens of hours of manual work from me just to set up the replacement.

Wed, Apr 8, 5:45 PM · SRE, DC-Ops, Data-Persistence-Backup, media-backups, ops-codfw
jcrespo added a comment to T420623: netbox report error for puppetdb serial versus netbox serial for backup1012.

Let me know when you can.

Wed, Apr 8, 4:04 PM · collaboration-services, SRE, ops-eqiad, DC-Ops
jcrespo added a comment to T419970: backup2005 power supplies fried or overvoltage.

@jcrespo would loading the disks from a foreign config be acceptable for you? or will that cause issues with recovery?

Wed, Apr 8, 8:04 AM · SRE, DC-Ops, Data-Persistence-Backup, media-backups, ops-codfw

Tue, Apr 7

jcrespo added a comment to T420623: netbox report error for puppetdb serial versus netbox serial for backup1012.

backup1012 hosts gerrit backups hourly. As long as we put it down just before maintenance, it could be done any time, provided we stop it and downtime it properly and the downtime doesn't last more than a couple of hours.

Tue, Apr 7, 6:49 AM · collaboration-services, SRE, ops-eqiad, DC-Ops

Wed, Apr 1

jcrespo closed T369253: Alert email sent from backupmon1001 didn't reach engineer's google inbox (was: check-dbbackup-time sometimes doesn't send email alerts) as Resolved.

This was fixed 5 months ago when the new version was released for bookworm, and then for trixie: https://gitlab.wikimedia.org/repos/sre/wmfbackups/-/commits/v0.8.4_bookworm?ref_type=tags

Wed, Apr 1, 2:17 PM · Data-Persistence-Automations, database-backups, Data-Persistence-Backup
jcrespo added a comment to T421719: Data persistance: Re-IP eqiad private baremetal hosts to new per-rack vlans/subnets.

we're now asking service owners to re-image their existing baremetal servers

Wed, Apr 1, 8:13 AM · DBA, Ceph, SRE-swift-storage, User-Eevans, Data-Persistence

Mon, Mar 30

jcrespo updated subscribers of T421729: Create cluster32 and cluster33 in existing es6 and es7 hosts.

@Ladsgroup @Marostegui @FCeratto-WMF FYI

Mon, Mar 30, 2:24 PM · Patch-For-Review, database-backups, DBA
jcrespo updated the task description for T421729: Create cluster32 and cluster33 in existing es6 and es7 hosts.
Mon, Mar 30, 2:24 PM · Patch-For-Review, database-backups, DBA
jcrespo created T421729: Create cluster32 and cluster33 in existing es6 and es7 hosts.
Mon, Mar 30, 2:21 PM · Patch-For-Review, database-backups, DBA

Thu, Mar 26

jcrespo added a comment to T196336: Icinga passive checks go awol and downtime stops working.

It's been consistent behavior for some weeks now - both downtimes are removed at once after the reboot occurs. Not sure if this is a cookbook issue or an icinga issue, given the context of this Icinga bug.

Thu, Mar 26, 5:17 PM · Observability-Alerting, SRE, Icinga, observability
jcrespo added a comment to T419970: backup2005 power supplies fried or overvoltage.

Might be next week before i can finish that out. I'll let you know

Thu, Mar 26, 3:01 PM · SRE, DC-Ops, Data-Persistence-Backup, media-backups, ops-codfw
jcrespo added a comment to T419970: backup2005 power supplies fried or overvoltage.

Any update? Even a "No work done, I plan to work on this next X" would be useful.

Thu, Mar 26, 1:43 PM · SRE, DC-Ops, Data-Persistence-Backup, media-backups, ops-codfw
jcrespo added a comment to T420506: Setup backup[12]01[456789] & backup[12]020 and migrate data to them; prepare for decommission backup[12]00[34567].

I've started the data migration to the new hosts @ eqiad, emptying backup1004. Emptying a single host would take around 16 days, maybe more. Other hosts can be done in parallel, to some extent, but only by ramping up the transfer speed very slowly.

Thu, Mar 26, 11:36 AM · Data-Persistence-Backup, media-backups, database-backups, bacula

Tue, Mar 24

jcrespo added a comment to T420041: db1253 depooled following host crash.

Technically there are logs (I've updated the header), they are just useless for us, as they are not specific enough. A few issues pointing to (but not necessarily caused by) the IME happened in the past; there was never a clear explanation, it just ended up working after a few firmware updates.

Tue, Mar 24, 3:19 PM · DBA
jcrespo updated the task description for T420041: db1253 depooled following host crash.
Tue, Mar 24, 3:17 PM · DBA
jcrespo added a comment to T420041: db1253 depooled following host crash.

To summarize, no hardware fault was detected in idrac, racadm, journald logs, dmesg.
After the maintenance freeze we could clone the host and repool it.
We could consider running cpu, memory and I/O stress tests in the meantime just in case.

Tue, Mar 24, 3:00 PM · DBA

Mon, Mar 23

jcrespo updated subscribers of T420873: Degraded RAID on db1170.

That's an s7 core host, it is for @FCeratto-WMF to make the call.

Mon, Mar 23, 4:42 PM · DBA, SRE, DC-Ops, ops-eqiad

Fri, Mar 20

jcrespo closed T410020: Evaluate garage as a replacement for an S3-compatible replacement for minio, a subtask of T262668: WMF media storage must be adequately backed up, as Resolved.
Fri, Mar 20, 1:58 PM · media-backups, Data-Persistence-Backup, Epic, Goal, SRE, SRE-swift-storage
jcrespo closed T410020: Evaluate garage as a replacement for an S3-compatible replacement for minio as Resolved.

We evaluated Garage and, while it is a nice cloud system for personal use, it didn't fit the needs for backup handling, so we went for an arguably less technologically advanced, but simpler, more flexible, and more reliable approach that better fits the existing client automation, even if performance was sacrificed: an s3 proxy to the filesystem (versitygw).

Fri, Mar 20, 1:58 PM · Patch-For-Review, Data-Persistence, media-backups, Data-Persistence-Backup, SRE
jcrespo closed T410028: Unexpected media growth led to low disk resources on several media backup hosts as Resolved.
Fri, Mar 20, 1:43 PM · Infrastructure Security, media-backups, Data-Persistence-Backup, SRE, Data-Persistence
jcrespo added a comment to T420177: clouddb1013 crashed after the upgrade to mariadb 10.11.16.

Sorry I wasn't clear enough, I was helping debug. Neither I nor our team own clouddb (nor am I a DBA; I won't be touching clouddbs), and we don't handle backups or recoveries for the cloud env either (I have no backups for cloud).

Fri, Mar 20, 12:01 PM · SecTeam-Processed, Upstream, Product Safety and Integrity, cloud-services-team, DBA, Data-Persistence, Data-Services
jcrespo added a comment to T420177: clouddb1013 crashed after the upgrade to mariadb 10.11.16.

I see some mentions of table corruption on the 25th of February. I checked the hardware and it seems fine. Given it has only happened on s3, my guess is that it is due to data corruption. You should reload from a logical copy; this may be causing a segfault every time it reads a certain bit-rotten table file.

Fri, Mar 20, 11:46 AM · SecTeam-Processed, Upstream, Product Safety and Integrity, cloud-services-team, DBA, Data-Persistence, Data-Services
jcrespo added a comment to T410028: Unexpected media growth led to low disk resources on several media backup hosts.

Backups are slowly flowing on eqiad, too:

db1204.eqiad.wmnet[mediabackups]> select count(*), location FROM backups group by location;
+----------+----------+
| count(*) | location |
+----------+----------+
| 30309863 |        1 |
| 29274314 |        2 |
| 31524753 |        3 |
| 30900062 |        4 |
| 28463614 |        5 |
| 14012226 |        6 |
|        5 |        7 |
|        1 |        8 |
|        4 |        9 |
|        3 |       10 |
|        3 |       11 |
|        6 |       12 |
|        6 |       13 |
|        3 |       14 |
|        3 |       15 |
|        8 |       16 |
|        4 |       17 |
|        4 |       18 |
|        2 |       19 |
|        3 |       20 |
|        4 |       21 |
|        3 |       22 |
|        3 |       23 |
|        6 |       24 |
|        6 |       25 |
|        1 |       26 |
|        6 |       27 |
|        6 |       28 |
|        3 |       29 |
|        7 |       30 |
+----------+----------+
30 rows in set (30.025 sec)
Fri, Mar 20, 11:33 AM · Infrastructure Security, media-backups, Data-Persistence-Backup, SRE, Data-Persistence
jcrespo added a comment to T420506: Setup backup[12]01[456789] & backup[12]020 and migrate data to them; prepare for decommission backup[12]00[34567].

These old hosts cannot be decommissioned yet, their data needs to be migrated to the new hosts first; they are still in production, just read only!

Fri, Mar 20, 9:21 AM · Data-Persistence-Backup, media-backups, database-backups, bacula
jcrespo updated the task description for T420506: Setup backup[12]01[456789] & backup[12]020 and migrate data to them; prepare for decommission backup[12]00[34567].
Fri, Mar 20, 9:20 AM · Data-Persistence-Backup, media-backups, database-backups, bacula
jcrespo added a comment to T420708: Unresponsive management for backup2005.mgmt:22.

:'-(

Fri, Mar 20, 9:19 AM · SRE, DC-Ops, ops-codfw
jcrespo added a comment to T410028: Unexpected media growth led to low disk resources on several media backup hosts.

The trend is clear here: old objects had a certain average size, but for new ones (shard 6 only had new ones, yet it filled up just as much space) the average size is 3 times larger while the upload speed is maintained, meaning we now consume space 3 times faster than usual.

Fri, Mar 20, 9:17 AM · Infrastructure Security, media-backups, Data-Persistence-Backup, SRE, Data-Persistence
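The reasoning in the comment above (same upload speed, 3 times larger objects, so space is consumed 3 times faster) can be written out as a short calculation. This is a rough sketch: the function names and the numbers passed to them are hypothetical and are not taken from the task.

```python
def fill_rate(objects_per_day, avg_object_bytes):
    """Bytes of backup storage consumed per day."""
    return objects_per_day * avg_object_bytes


def days_until_full(free_bytes, objects_per_day, avg_object_bytes):
    """Days before the free space runs out at a constant upload rate."""
    return free_bytes / fill_rate(objects_per_day, avg_object_bytes)


# Hypothetical figures: only the 3x ratio between object sizes matters.
old_days = days_until_full(free_bytes=600, objects_per_day=10, avg_object_bytes=2)
new_days = days_until_full(free_bytes=600, objects_per_day=10, avg_object_bytes=6)
```

With the object count per day held constant, tripling the average object size divides the remaining time by three, whatever the absolute values are.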
jcrespo added a comment to T410028: Unexpected media growth led to low disk resources on several media backup hosts.

After fixing some authentication and some region configuration issues, backups are flowing now. This view is really nice, it shows the 24 new shards getting filled:

cumin2024@db2183.codfw.wmnet[mediabackups]> select count(*), location FROM backups group by location;
+----------+----------+
| count(*) | location |
+----------+----------+
| 30299611 |        1 |
| 30295929 |        2 |
| 30291310 |        3 |
| 30297141 |        4 |
| 28000043 |        5 |
| 13729535 |        6 |
|      103 |        7 |
|      108 |        8 |
|      128 |        9 |
|      113 |       10 |
|       27 |       11 |
|       33 |       12 |
|       29 |       13 |
|       33 |       14 |
|       48 |       15 |
|       43 |       16 |
|       50 |       17 |
|       44 |       18 |
|      108 |       19 |
|       98 |       20 |
|       96 |       21 |
|      110 |       22 |
|      117 |       23 |
|       92 |       24 |
|      123 |       25 |
|      108 |       26 |
|      120 |       27 |
|      114 |       28 |
|      118 |       29 |
|       95 |       30 |
+----------+----------+
30 rows in set (42.289 sec)
Fri, Mar 20, 9:12 AM · Infrastructure Security, media-backups, Data-Persistence-Backup, SRE, Data-Persistence
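To follow the new shards over time without reading the table by eye, the client output pasted above could be parsed with a few lines of Python. This is a minimal sketch: `parse_mysql_table` is a hypothetical helper, the sample is a four-row excerpt of the codfw output, and treating locations 7 and above as the new shards is an assumption based on the "24 new shards" remark.

```python
def parse_mysql_table(text):
    """Turn the boxed output of the mysql/mariadb client into a list of dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, the prompt line and the "rows in set" line
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first boxed line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


sample = """
+----------+----------+
| count(*) | location |
+----------+----------+
| 28000043 |        5 |
| 13729535 |        6 |
|      103 |        7 |
|      108 |        8 |
+----------+----------+
"""

counts = {int(row["location"]): int(row["count(*)"]) for row in parse_mysql_table(sample)}
# Assumption: locations 7 and above are the newly added shards.
new_shards = {location: n for location, n in counts.items() if location >= 7}
new_total = sum(new_shards.values())
```

Running the same parsing on successive snapshots would show whether new files are being spread evenly across the added shards.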
jcrespo added a comment to T419970: backup2005 power supplies fried or overvoltage.

Any update?

Fri, Mar 20, 8:00 AM · SRE, DC-Ops, Data-Persistence-Backup, media-backups, ops-codfw
jcrespo merged task T420613: Unresponsive management for backup2005.mgmt:22 into T419970: backup2005 power supplies fried or overvoltage.
Fri, Mar 20, 8:00 AM · SRE, ops-codfw, DC-Ops
jcrespo merged T420613: Unresponsive management for backup2005.mgmt:22 into T419970: backup2005 power supplies fried or overvoltage.
Fri, Mar 20, 8:00 AM · SRE, DC-Ops, Data-Persistence-Backup, media-backups, ops-codfw

Thu, Mar 19

jcrespo updated the task description for T420464: Setup ms-backup[12]00[34] and stop using and prepare for decommission ms-backup[12]00[12].
Thu, Mar 19, 5:29 PM · Data-Persistence-Backup, media-backups
jcrespo merged task T420308: Unresponsive management for backup2005.mgmt:22 into T419970: backup2005 power supplies fried or overvoltage.
Thu, Mar 19, 10:10 AM · SRE, DC-Ops, ops-codfw
jcrespo merged T420308: Unresponsive management for backup2005.mgmt:22 into T419970: backup2005 power supplies fried or overvoltage.
Thu, Mar 19, 10:09 AM · SRE, DC-Ops, Data-Persistence-Backup, media-backups, ops-codfw

Wed, Mar 18

jcrespo triaged T410028: Unexpected media growth led to low disk resources on several media backup hosts as High priority.
Wed, Mar 18, 6:04 PM · Infrastructure Security, media-backups, Data-Persistence-Backup, SRE, Data-Persistence
jcrespo added a subtask for T410028: Unexpected media growth led to low disk resources on several media backup hosts: T420506: Setup backup[12]01[456789] & backup[12]020 and migrate data to them; prepare for decommission backup[12]00[34567].
Wed, Mar 18, 6:04 PM · Infrastructure Security, media-backups, Data-Persistence-Backup, SRE, Data-Persistence
jcrespo added a parent task for T420506: Setup backup[12]01[456789] & backup[12]020 and migrate data to them; prepare for decommission backup[12]00[34567]: T410028: Unexpected media growth led to low disk resources on several media backup hosts.
Wed, Mar 18, 6:04 PM · Data-Persistence-Backup, media-backups, database-backups, bacula
jcrespo added a subtask for T410028: Unexpected media growth led to low disk resources on several media backup hosts: T420464: Setup ms-backup[12]00[34] and stop using and prepare for decommission ms-backup[12]00[12].
Wed, Mar 18, 6:03 PM · Infrastructure Security, media-backups, Data-Persistence-Backup, SRE, Data-Persistence
jcrespo added a parent task for T420464: Setup ms-backup[12]00[34] and stop using and prepare for decommission ms-backup[12]00[12]: T410028: Unexpected media growth led to low disk resources on several media backup hosts.
Wed, Mar 18, 6:03 PM · Data-Persistence-Backup, media-backups
jcrespo triaged T420506: Setup backup[12]01[456789] & backup[12]020 and migrate data to them; prepare for decommission backup[12]00[34567] as High priority.
Wed, Mar 18, 5:58 PM · Data-Persistence-Backup, media-backups, database-backups, bacula
jcrespo created T420506: Setup backup[12]01[456789] & backup[12]020 and migrate data to them; prepare for decommission backup[12]00[34567].
Wed, Mar 18, 5:58 PM · Data-Persistence-Backup, media-backups, database-backups, bacula
jcrespo added a comment to T420464: Setup ms-backup[12]00[34] and stop using and prepare for decommission ms-backup[12]00[12].

See also:
https://phabricator.wikimedia.org/rOPUP4de4557a74826b67cdb445968548f27d95ff8e72
https://phabricator.wikimedia.org/rOPUP4fe93ca2f1495ae760e5bae12d27e1ae8a242293

Wed, Mar 18, 12:35 PM · Data-Persistence-Backup, media-backups
jcrespo triaged T420464: Setup ms-backup[12]00[34] and stop using and prepare for decommission ms-backup[12]00[12] as High priority.
Wed, Mar 18, 12:34 PM · Data-Persistence-Backup, media-backups
jcrespo created T420464: Setup ms-backup[12]00[34] and stop using and prepare for decommission ms-backup[12]00[12].
Wed, Mar 18, 12:33 PM · Data-Persistence-Backup, media-backups

Mon, Mar 16

jcrespo added a project to T410028: Unexpected media growth led to low disk resources on several media backup hosts: Infrastructure Security.

Moritz: I would like your assessment on deploying a new storage service for media backups through the profile temporarily called mediabackups::new_storage, so that it uses the existing mediabackups::storage for migration purposes, but will eventually replace it.

Mon, Mar 16, 3:29 PM · Infrastructure Security, media-backups, Data-Persistence-Backup, SRE, Data-Persistence
jcrespo added a comment to T419970: backup2005 power supplies fried or overvoltage.

It's not posting at the moment. I have some tricks to try today and if not, i have some decommed servers i can pull parts from. It's my intention to have it back up by the end of my day. TY for letting me know that errors are okay. just need to get it to boot.

Mon, Mar 16, 2:59 PM · SRE, DC-Ops, Data-Persistence-Backup, media-backups, ops-codfw
jcrespo updated subscribers of T420205: Remove deprecated Type=simple from custom systemd units.

Only adding Moritz for awareness of trixie, not needing or expecting any work soon.

Mon, Mar 16, 12:41 PM · Infrastructure-Foundations
jcrespo triaged T420205: Remove deprecated Type=simple from custom systemd units as Low priority.
Mon, Mar 16, 12:38 PM · Infrastructure-Foundations
jcrespo updated the task description for T420205: Remove deprecated Type=simple from custom systemd units.
Mon, Mar 16, 12:37 PM · Infrastructure-Foundations
jcrespo created T420205: Remove deprecated Type=simple from custom systemd units.
Mon, Mar 16, 12:34 PM · Infrastructure-Foundations
jcrespo added a comment to T419970: backup2005 power supplies fried or overvoltage.

Please take your time, as I said it can be down for some time.

Mon, Mar 16, 10:19 AM · SRE, DC-Ops, Data-Persistence-Backup, media-backups, ops-codfw

Fri, Mar 13

jcrespo updated subscribers of T419980: ICU 72 upgrade: `categorylinks` table swap.

DBAs: Given the interactions with clouddbs you may want to be on top of that, as this is likely to be run on a primary db.

Fri, Mar 13, 2:47 PM · Data-Engineering-Radar, DBA, Data-Persistence, Data-Engineering, Schema-change, User-Raine, ServiceOps new, ServiceOps-Upgrades-Hardware, ServiceOps-Mediawiki
jcrespo updated subscribers of T419958: db1258 connection went down at 10:43Z.

^ @Jclark-ctr Matthew (and Effie) were only the people on call. This should be directed to the owners of the service, the DBAs: @Ladsgroup @Marostegui and @FCeratto-WMF , some of which are not around today.

Fri, Mar 13, 12:13 PM · SRE, DC-Ops, ops-eqiad, DBA, Sustainability (Incident Followup), Data-Persistence
jcrespo created T419970: backup2005 power supplies fried or overvoltage.
Fri, Mar 13, 12:08 PM · SRE, DC-Ops, Data-Persistence-Backup, media-backups, ops-codfw
jcrespo updated the task description for T419958: db1258 connection went down at 10:43Z.
Fri, Mar 13, 11:39 AM · SRE, DC-Ops, ops-eqiad, DBA, Sustainability (Incident Followup), Data-Persistence
jcrespo updated the task description for T419958: db1258 connection went down at 10:43Z.
Fri, Mar 13, 11:38 AM · SRE, DC-Ops, ops-eqiad, DBA, Sustainability (Incident Followup), Data-Persistence
jcrespo renamed T419958: db1258 connection went down at 10:43Z from db1258 went down at 10:43Z to db1258 connection went down at 10:43Z.
Fri, Mar 13, 11:20 AM · SRE, DC-Ops, ops-eqiad, DBA, Sustainability (Incident Followup), Data-Persistence
jcrespo triaged T419958: db1258 connection went down at 10:43Z as Medium priority.
Fri, Mar 13, 11:18 AM · SRE, DC-Ops, ops-eqiad, DBA, Sustainability (Incident Followup), Data-Persistence
jcrespo updated the task description for T419958: db1258 connection went down at 10:43Z.
Fri, Mar 13, 11:14 AM · SRE, DC-Ops, ops-eqiad, DBA, Sustainability (Incident Followup), Data-Persistence
jcrespo updated the task description for T419958: db1258 connection went down at 10:43Z.
Fri, Mar 13, 11:13 AM · SRE, DC-Ops, ops-eqiad, DBA, Sustainability (Incident Followup), Data-Persistence
jcrespo added a project to T419958: db1258 connection went down at 10:43Z: ops-eqiad.
Fri, Mar 13, 11:12 AM · SRE, DC-Ops, ops-eqiad, DBA, Sustainability (Incident Followup), Data-Persistence
jcrespo added a project to T419958: db1258 connection went down at 10:43Z: DBA.
Fri, Mar 13, 10:51 AM · SRE, DC-Ops, ops-eqiad, DBA, Sustainability (Incident Followup), Data-Persistence
jcrespo updated the task description for T419957: Notice: Missing Signed-By in the sources.list(5) entry for 'http://mirrors.wikimedia.org/debian'.
Fri, Mar 13, 10:47 AM · Infrastructure Security, Infrastructure-Foundations
jcrespo updated the task description for T419957: Notice: Missing Signed-By in the sources.list(5) entry for 'http://mirrors.wikimedia.org/debian'.
Fri, Mar 13, 10:46 AM · Infrastructure Security, Infrastructure-Foundations
jcrespo updated subscribers of T419957: Notice: Missing Signed-By in the sources.list(5) entry for 'http://mirrors.wikimedia.org/debian'.
Fri, Mar 13, 10:45 AM · Infrastructure Security, Infrastructure-Foundations
jcrespo created T419957: Notice: Missing Signed-By in the sources.list(5) entry for 'http://mirrors.wikimedia.org/debian'.
Fri, Mar 13, 10:44 AM · Infrastructure Security, Infrastructure-Foundations

Thu, Mar 12

jcrespo added a comment to T411766: Replace MinIO component.

We at backups (data persistence) are probably going to replace minio with a low-tech, low-performance, high reliability s3 proxy. It seems to work ok for backups, but may not be the right choice for high performance needs. Let me know if you want to talk about it.

Thu, Mar 12, 3:33 PM · Wikibase Cloud

Mar 5 2026

jcrespo added a comment to T416578: Fix power power accounting for misc cluster.

Yes, zarcillo seems to be fine, as these hosts are indeed misc, so the issue must be somewhere else:

cumin2024@db1215.eqiad.wmnet[zarcillo]> select * from instances where name like 'db22%' and `group`='misc';
+--------+--------------------+------+---------+------------+-------+
| name   | server             | port | version | last_start | group |
+--------+--------------------+------+---------+------------+-------+
| db2232 | db2232.codfw.wmnet | 3306 | NULL    | NULL       | misc  |
| db2233 | db2233.codfw.wmnet | 3306 | NULL    | NULL       | misc  |
| db2234 | db2234.codfw.wmnet | 3306 | NULL    | NULL       | misc  |
| db2235 | db2235.codfw.wmnet | 3306 | NULL    | NULL       | misc  |
+--------+--------------------+------+---------+------------+-------+
4 rows in set (0.002 sec)
Mar 5 2026, 3:25 PM · observability, Grafana, DBA

Mar 4 2026

jcrespo updated the task description for T418772: Eqiad: lsw1-d7-eqiad BGP maintenance.
Mar 4 2026, 8:38 AM · Prod-Kubernetes, ServiceOps new, netops, Infrastructure-Foundations, SRE
jcrespo updated the task description for T418772: Eqiad: lsw1-d7-eqiad BGP maintenance.
Mar 4 2026, 8:26 AM · Prod-Kubernetes, ServiceOps new, netops, Infrastructure-Foundations, SRE
jcrespo added a comment to T418772: Eqiad: lsw1-d7-eqiad BGP maintenance.

@Papaul for backup1007, dbprov1004, while they are production hosts with important content, a small network interruption will not cause any issue. Just give us a heads up if the window gets larger. Let me downtime them for a day. Let me update the ticket.

Mar 4 2026, 8:25 AM · Prod-Kubernetes, ServiceOps new, netops, Infrastructure-Foundations, SRE

Mar 3 2026

jcrespo added a comment to T418839: Edits aren't saving correctly.

For context, replication broke on databases, edits were not lost during the incident, but it took an abnormal number of minutes to appear as applied everywhere. Sorry for the disruption, things should be fine, but please report if you see anything else out of the ordinary.

Mar 3 2026, 10:37 AM · DBA, SRE, Wikimedia-Incident
jcrespo added a comment to T288448: Possible obsolete files in MariaDB TLS configuration.

Sure, do as you see fit, all will be good, the old thingy was setup way before expose_puppet_certs existed. I would even go further and setup in the future dedicated certificates outside of puppet, but as you wish.

Mar 3 2026, 8:17 AM · DBA

Feb 26 2026

jcrespo added a comment to T414718: Q3:rack/setup/install ms-backup100[34].

@Jclark-ctr I just merged the new recipe, please give it 30 minutes to propagate, and it should be done. Apologies again for the mistake.

Feb 26 2026, 2:20 PM · Data-Persistence, SRE, ops-eqiad, DC-Ops
jcrespo updated the task description for T414717: Q3:rack/setup/install ms-backup200[34].
Feb 26 2026, 11:30 AM · Data-Persistence, SRE, ops-codfw, DC-Ops
jcrespo updated the task description for T414718: Q3:rack/setup/install ms-backup100[34].
Feb 26 2026, 11:28 AM · Data-Persistence, SRE, ops-eqiad, DC-Ops
jcrespo added a comment to T414718: Q3:rack/setup/install ms-backup100[34].

Will do, sorry; these should use standard recipes, so it should be easy to update.

Feb 26 2026, 11:27 AM · Data-Persistence, SRE, ops-eqiad, DC-Ops

Feb 25 2026

jcrespo added a comment to T417247: Reimage gerrit2002.

@jcrespo do you confirm backups are now OK?

Feb 25 2026, 3:26 PM · Patch-For-Review, Gerrit, collaboration-services

Feb 24 2026

jcrespo added a comment to T417247: Reimage gerrit2002.

This is the real error:

Feb 24 2026, 6:11 PM · Patch-For-Review, Gerrit, collaboration-services
jcrespo added a comment to T417247: Reimage gerrit2002.

Then that message may be misleading and not the cause of the issues, but the error is real:

Feb 24 2026, 6:06 PM · Patch-For-Review, Gerrit, collaboration-services
jcrespo reopened T417247: Reimage gerrit2002, a subtask of T387833: Gerrit switchover process, as Open.
Feb 24 2026, 5:21 PM · Gerrit, Patch-For-Review, collaboration-services
jcrespo reopened T417247: Reimage gerrit2002 as "Open".
Feb 24 2026, 5:21 PM · Patch-For-Review, Gerrit, collaboration-services
jcrespo added a comment to T417247: Reimage gerrit2002.

Backups from gerrit2002 are failing with:

Feb 24 2026, 5:21 PM · Patch-For-Review, Gerrit, collaboration-services

Feb 23 2026

jcrespo closed T329158: systemd-timer puppet code triggers an execution when applying a schedule change as Resolved.
Feb 23 2026, 3:59 PM · Puppet-Core, Infrastructure-Foundations

Feb 19 2026

jcrespo added a comment to T376370: mariadb: create a synthetic monitoring indicator for dc switchover readiness.

Suggestion: I wonder if there could be a more or less automated way to know if there is an ongoing schema change, e.g. that the script "locks" some file with the name of the section/dc and that is shown on a control panel somewhere, e.g. on grafana? I don't know if that would be easy to implement.

@FCeratto-WMF is working on that with https://zarcillo.wikimedia.org/ui/locks and https://gitlab.wikimedia.org/repos/sre/schema-changes/-/merge_requests/42/diffs#7329d389feef6faed22f45ef93afd8d94da66ec0

Feb 19 2026, 11:50 AM · Data-Persistence-Automations, DBA
jcrespo added a comment to T376370: mariadb: create a synthetic monitoring indicator for dc switchover readiness.

Suggestion: I wonder if there could be a more or less automated way to know if there is an ongoing schema change, e.g. that the script "locks" some file with the name of the section/dc and that is shown on a control panel somewhere, e.g. on grafana? I don't know if that would be easy to implement.

Feb 19 2026, 11:00 AM · Data-Persistence-Automations, DBA
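The lock-file idea in the suggestion above could look roughly like the sketch below. It is an illustration only and does not describe how zarcillo or the schema-change scripts actually work: the file naming scheme, the directory and both function names are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def schema_change_lock(lock_dir, section, dc):
    """Hold a lock file named after the section and datacenter for the duration of a change."""
    path = os.path.join(lock_dir, f"{section}.{dc}.lock")
    # O_EXCL makes the creation atomic: a second script raises FileExistsError.
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())
        os.close(fd)
        yield path
    finally:
        os.unlink(path)


def active_locks(lock_dir):
    """List 'section.dc' names with an ongoing change, e.g. for a dashboard exporter."""
    return sorted(name[:-len(".lock")] for name in os.listdir(lock_dir) if name.endswith(".lock"))


lock_dir = tempfile.mkdtemp()
with schema_change_lock(lock_dir, "s7", "eqiad"):
    during = active_locks(lock_dir)
after = active_locks(lock_dir)
```

A monitoring job could poll `active_locks` and expose the result as a metric. One known limitation of this approach is that a crashed script leaves a stale file behind, so a real implementation would likely need an expiry or a liveness check.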

Feb 18 2026

jcrespo added a comment to T414724: Q3:rack/setup/install backup2015.

Fixed at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1240203

Feb 18 2026, 9:27 AM · Data-Persistence, SRE, ops-codfw, DC-Ops
jcrespo added a comment to T414727: Q3:rack/setup/install backup20[16-20].

@jcrespo i need an edit to the site.pp file. the backup20XX servers have eqiad in the name. they should be codfw. Thank you!

Feb 18 2026, 8:03 AM · Data-Persistence, SRE, ops-codfw, DC-Ops