
jcrespo (Jaime Crespo)
Sr Database Administrator

Projects (12)

User Details

User Since
May 11 2015, 8:31 AM (474 w, 22 h)
Availability
Available
IRC Nick
jynus
LDAP User
Jcrespo
MediaWiki User
JCrespo (WMF) [ Global Accounts ]

Recent Activity

Fri, Jun 7

jcrespo updated the task description for T358741: Decommission db2096-db2120.
Fri, Jun 7, 10:24 AM · DBA
jcrespo created T366892: decommission db2102.codfw.wmnet.
Fri, Jun 7, 10:23 AM · Patch-For-Review, decommission-hardware
jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.
Fri, Jun 7, 10:00 AM · Data-Persistence, Data-Persistence-Backup
jcrespo added a comment to T362883: decommission db2099.codfw.wmnet.

This is ready for dc-ops.

Fri, Jun 7, 8:59 AM · SRE, DC-Ops, ops-codfw, decommission-hardware
jcrespo placed T362883: decommission db2099.codfw.wmnet up for grabs.
Fri, Jun 7, 8:59 AM · SRE, DC-Ops, ops-codfw, decommission-hardware
jcrespo added a comment to T366877: decommission db2098.codfw.wmnet.

This is ready for dc ops.

Fri, Jun 7, 8:58 AM · SRE, ops-codfw, DC-Ops, decommission-hardware
jcrespo renamed T366877: decommission db2098.codfw.wmnet from decommission db2098 to decommission db2098.codfw.wmnet.
Fri, Jun 7, 8:58 AM · SRE, ops-codfw, DC-Ops, decommission-hardware
jcrespo placed T362802: decommission db2097.codfw.wmnet up for grabs.
Fri, Jun 7, 8:56 AM · SRE, DC-Ops, ops-codfw, decommission-hardware
jcrespo added a comment to T362802: decommission db2097.codfw.wmnet.

This is ready for dc ops.

Fri, Jun 7, 8:56 AM · SRE, DC-Ops, ops-codfw, decommission-hardware
jcrespo updated the task description for T358741: Decommission db2096-db2120.
Fri, Jun 7, 8:52 AM · DBA
jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.
Fri, Jun 7, 8:40 AM · Data-Persistence, Data-Persistence-Backup
jcrespo updated the task description for T362883: decommission db2099.codfw.wmnet.
Fri, Jun 7, 8:22 AM · SRE, DC-Ops, ops-codfw, decommission-hardware
jcrespo updated the task description for T366877: decommission db2098.codfw.wmnet.
Fri, Jun 7, 8:21 AM · SRE, ops-codfw, DC-Ops, decommission-hardware
jcrespo updated the task description for T362802: decommission db2097.codfw.wmnet.
Fri, Jun 7, 8:21 AM · SRE, DC-Ops, ops-codfw, decommission-hardware
jcrespo updated the task description for T358741: Decommission db2096-db2120.
Fri, Jun 7, 7:42 AM · DBA
jcrespo added a subtask for T360751: Upgrade backup sources to MariaDB 10.6: T362883: decommission db2099.codfw.wmnet.
Fri, Jun 7, 7:32 AM · Data-Persistence, Data-Persistence-Backup
jcrespo added a parent task for T362883: decommission db2099.codfw.wmnet: T360751: Upgrade backup sources to MariaDB 10.6.
Fri, Jun 7, 7:32 AM · SRE, DC-Ops, ops-codfw, decommission-hardware
jcrespo added a subtask for T360751: Upgrade backup sources to MariaDB 10.6: T366877: decommission db2098.codfw.wmnet.
Fri, Jun 7, 7:26 AM · Data-Persistence, Data-Persistence-Backup
jcrespo added a parent task for T366877: decommission db2098.codfw.wmnet: T360751: Upgrade backup sources to MariaDB 10.6.
Fri, Jun 7, 7:26 AM · SRE, ops-codfw, DC-Ops, decommission-hardware
jcrespo added a subtask for T360751: Upgrade backup sources to MariaDB 10.6: T362802: decommission db2097.codfw.wmnet.
Fri, Jun 7, 7:26 AM · Data-Persistence, Data-Persistence-Backup
jcrespo added a parent task for T362802: decommission db2097.codfw.wmnet: T360751: Upgrade backup sources to MariaDB 10.6.
Fri, Jun 7, 7:26 AM · SRE, DC-Ops, ops-codfw, decommission-hardware
jcrespo created T366877: decommission db2098.codfw.wmnet.
Fri, Jun 7, 7:16 AM · SRE, ops-codfw, DC-Ops, decommission-hardware

Wed, May 29

jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.
Wed, May 29, 4:28 PM · Data-Persistence, Data-Persistence-Backup
jcrespo updated the task description for T364290: Upgrade s1 to MariaDB 10.6.
Wed, May 29, 4:27 PM · DBA
jcrespo added a comment to T364290: Upgrade s1 to MariaDB 10.6.

I will migrate the backups to 10.6 without yet removing the 10.4 backup sources.

Wed, May 29, 3:28 PM · DBA
jcrespo added a comment to T363581: Build a machine-readable catalogue of mariadb tables in production.

@Volans not Amir, but Re: your first question, my understanding is that this was a compromise to make sure there was something good enough and simple in the short term, rather than overengineering from the start. That doesn't mean that what you suggest is discarded, but rather that it is something that could be improved later on. For example, I am personally interested in having a queryable service/API later for backup checks, but this is better than nothing ATM, with relatively small effort. Later on, a database could import the file and generate it, for example. So I am a fan of iterating slowly as long as it is an improvement 0:-D.

Wed, May 29, 2:21 PM · DBA

Thu, May 23

jcrespo updated the task description for T365607: Reprovision missing files due to backup1005 hw issues.
Thu, May 23, 8:06 AM · Data-Persistence-Backup, media-backups

Wed, May 22

jcrespo added a parent task for T361087: backup1005 crashed: T365607: Reprovision missing files due to backup1005 hw issues.
Wed, May 22, 2:46 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo added a comment to T365607: Reprovision missing files due to backup1005 hw issues.

Followup to T361087.

Wed, May 22, 2:46 PM · Data-Persistence-Backup, media-backups
jcrespo added a subtask for T365607: Reprovision missing files due to backup1005 hw issues: T361087: backup1005 crashed.
Wed, May 22, 2:46 PM · Data-Persistence-Backup, media-backups
jcrespo triaged T365607: Reprovision missing files due to backup1005 hw issues as High priority.
Wed, May 22, 2:46 PM · Data-Persistence-Backup, media-backups
jcrespo created T365607: Reprovision missing files due to backup1005 hw issues.
Wed, May 22, 2:46 PM · Data-Persistence-Backup, media-backups
jcrespo closed T365217: Degraded RAID on backup2010 as Resolved.

I did a disk stress test for an hour or so and saw no media errors, SMART errors, or RAID controller weirdness.

image.png (1×1 px, 141 KB)
Resolving for now.

Wed, May 22, 1:50 PM · Data-Persistence-Backup, Data-Persistence, DC-Ops, SRE, ops-codfw
jcrespo added a comment to T365217: Degraded RAID on backup2010.

A disk was rebuilt on the 17th of May:

Wed, May 22, 7:25 AM · Data-Persistence-Backup, Data-Persistence, DC-Ops, SRE, ops-codfw

Tue, May 21

jcrespo added a comment to T363812: Setup backups for es6, es7 and archive old read only backups.
  • Stop es4 and es5 backups
  • Generate a full clusterX and clusterY last backup
  • Archive it into long term backups
  • Remove dump user
Tue, May 21, 11:31 AM · database-backups, Data-Persistence-Backup

Tue, May 14

jcrespo added a parent task for T364447: Make es4 and es5 RO: T363812: Setup backups for es6, es7 and archive old read only backups.
Tue, May 14, 8:43 AM · DBA
jcrespo added a subtask for T363812: Setup backups for es6, es7 and archive old read only backups: T364447: Make es4 and es5 RO.
Tue, May 14, 8:43 AM · database-backups, Data-Persistence-Backup
jcrespo removed a subtask for T364447: Make es4 and es5 RO: T363812: Setup backups for es6, es7 and archive old read only backups.
Tue, May 14, 8:42 AM · DBA
jcrespo removed a parent task for T363812: Setup backups for es6, es7 and archive old read only backups: T364447: Make es4 and es5 RO.
Tue, May 14, 8:42 AM · database-backups, Data-Persistence-Backup
jcrespo added a subtask for T364447: Make es4 and es5 RO: T363812: Setup backups for es6, es7 and archive old read only backups.
Tue, May 14, 8:42 AM · DBA
jcrespo removed a subtask for T355285: Productionize es10[35-40]: T363812: Setup backups for es6, es7 and archive old read only backups.
Tue, May 14, 8:42 AM · DBA
jcrespo edited parent tasks for T363812: Setup backups for es6, es7 and archive old read only backups, added: T364447: Make es4 and es5 RO; removed: T355285: Productionize es10[35-40].
Tue, May 14, 8:42 AM · database-backups, Data-Persistence-Backup
jcrespo updated the task description for T363812: Setup backups for es6, es7 and archive old read only backups.
Tue, May 14, 8:41 AM · database-backups, Data-Persistence-Backup
jcrespo updated the task description for T363812: Setup backups for es6, es7 and archive old read only backups.
Tue, May 14, 8:04 AM · database-backups, Data-Persistence-Backup

Mon, May 13

jcrespo added a comment to T364296: Reimage db1215 and db2185 (zarcillo) to bookworm.

Sorry about it :( how can I help?

Mon, May 13, 8:44 AM · DBA
jcrespo added a comment to T364296: Reimage db1215 and db2185 (zarcillo) to bookworm.

Thanks, the upgrade is no issue, but the data will have a lot of backup errors due to not being depooled before maintenance, so it will need some work.

Mon, May 13, 8:14 AM · DBA

May 9 2024

jcrespo added a comment to T360751: Upgrade backup sources to MariaDB 10.6.

All backups will now be generated from 10.6 servers, with the exception of s1. I am leaving a couple of hosts on 10.4 before upgrading or decommissioning them.

May 9 2024, 10:13 AM · Data-Persistence, Data-Persistence-Backup
jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.
May 9 2024, 9:59 AM · Data-Persistence, Data-Persistence-Backup
jcrespo triaged T362509: Setup new dbprov hosts and decommission the old ones as High priority.
May 9 2024, 9:58 AM · Patch-For-Review, database-backups, Data-Persistence-Backup
jcrespo added a comment to T363812: Setup backups for es6, es7 and archive old read only backups.

@Marostegui es6 and es7 backups are enabled, and a first run was done here. They seem mostly empty, though:

image.png (558×2 px, 141 KB)

May 9 2024, 9:20 AM · database-backups, Data-Persistence-Backup
jcrespo updated the task description for T363812: Setup backups for es6, es7 and archive old read only backups.
May 9 2024, 9:16 AM · database-backups, Data-Persistence-Backup
jcrespo closed T363995: Commons: File:Gnome-edit-delete.svg not found as Resolved.
May 9 2024, 7:11 AM · SRE-swift-storage, Commons

May 7 2024

jcrespo added a comment to T363995: Commons: File:Gnome-edit-delete.svg not found.

[2024-05-06 14:33:33,903] INFO:backup '9/96/Gnome-edit-delete.svg' downloaded
[2024-05-06 14:33:33,904] INFO:backup sha256 sum of Gnome-edit-delete.svg is a45ec2020e0997a031bdd62a0dc30a518c82d9c1a100d6e8420bb2a6f938c48f
[2024-05-06 14:33:33,912] WARNING:backup A file with the same sha265 as "commonswiki Gnome-edit-delete.svg 3a36632c6569fcdf45aca81a71b53ae4faf80083 2024-05-06 14:16:14" was already uploaded, skipping.

So it was a "coincidence" that it was backed up, and not the other way around. I will check the backup logs to see why it was failing.
May 7 2024, 9:01 AM · SRE-swift-storage, Commons
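
The log lines in the comment above show a file being hashed and then skipped because the same content had already been uploaded. A minimal sketch of that kind of content-hash deduplication follows. It is an illustration only, not the actual backup tool's code: the function names `sha256_of` and `backup_file` and the in-memory `uploaded` index are hypothetical.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of the given file contents."""
    return hashlib.sha256(data).hexdigest()


def backup_file(title: str, data: bytes, uploaded: dict[str, str]) -> bool:
    """Store a file unless identical content was already backed up.

    `uploaded` is a hypothetical index mapping content hash -> title of the
    first file stored with that hash. Returns True if the file was stored,
    False if it was skipped as a duplicate.
    """
    digest = sha256_of(data)
    if digest in uploaded:
        # Same bytes already backed up under some title: skip the upload.
        return False
    uploaded[digest] = title
    return True


if __name__ == "__main__":
    index: dict[str, str] = {}
    print(backup_file("first-title.svg", b"<svg/>", index))   # stored
    print(backup_file("second-title.svg", b"<svg/>", index))  # skipped, same content
```

With a scheme like this, a file can end up present in the backup only because another file with identical content was stored earlier, which matches the "coincidence" described in the comment.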

May 6 2024

jcrespo added a comment to T363995: Commons: File:Gnome-edit-delete.svg not found.

It was failing back in 2021:

May 6 2024, 5:53 PM · SRE-swift-storage, Commons
jcrespo added a comment to T363995: Commons: File:Gnome-edit-delete.svg not found.
May 6 2024, 2:39 PM · SRE-swift-storage, Commons
jcrespo added a comment to T363995: Commons: File:Gnome-edit-delete.svg not found.

Here are the 2 file versions (the hashes can be used to check that they are the same file):

May 6 2024, 1:15 PM · SRE-swift-storage, Commons

Apr 30 2024

jcrespo closed T361087: backup1005 crashed as Resolved.
Apr 30 2024, 10:45 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo changed the status of T363812: Setup backups for es6, es7 and archive old read only backups from Open to In Progress.
Apr 30 2024, 10:42 AM · database-backups, Data-Persistence-Backup
jcrespo changed the status of T363812: Setup backups for es6, es7 and archive old read only backups, a subtask of T355285: Productionize es10[35-40], from Open to In Progress.
Apr 30 2024, 10:42 AM · DBA
jcrespo created T363812: Setup backups for es6, es7 and archive old read only backups.
Apr 30 2024, 10:35 AM · database-backups, Data-Persistence-Backup
jcrespo updated the task description for T362509: Setup new dbprov hosts and decommission the old ones.
Apr 30 2024, 10:23 AM · Patch-For-Review, database-backups, Data-Persistence-Backup

Apr 25 2024

jcrespo added a comment to T361087: backup1005 crashed.

In any case, at this point I'd prefer to do an in-place upgrade rather than a reimage, given how unreliable a reimage is and how impactful it can be for stateful services.

Apr 25 2024, 3:38 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo updated subscribers of T361087: backup1005 crashed.

Booting failed (PXE):

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al


Debian 12 (bookworm) amd64 (Wikimedia edition)

                                              boot: 
Loading debian-installer/amd64/linux... ok
Loading debian-installer/amd64/initrd.gz...
Boot failed: press a key to retry, or wait for reset...

Hmm. Not sure if we've seen this problem before. DHCP clearly worked as did the debian image download, but Linux failed to load for some reason.

@jcrespo the only difference was selecting bullseye rather than bookworm on the second attempt?

Apr 25 2024, 3:29 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo added a comment to T361087: backup1005 crashed.

It booted into bullseye.

Apr 25 2024, 11:40 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo added a comment to T361087: backup1005 crashed.

Booting failed (PXE):

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al
Apr 25 2024, 11:15 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

Apr 24 2024

jcrespo claimed T361087: backup1005 crashed.

Will reimage soon.

Apr 24 2024, 4:51 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups
jcrespo awarded T363186: Cache mw-mcrouter service ClusterIP in apcu cache a Love token.
Apr 24 2024, 12:02 PM · Patch-For-Review, MediaWiki-Platform-Team, MediaWiki-libs-BagOStuff, MediaWiki-Engineering, serviceops, Sustainability (Incident Followup)

Apr 23 2024

jcrespo closed T349397: Migrate the matomo host to bookworm as Resolved.

Looking good now:

Apr 23 2024, 6:11 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Data-Engineering
jcrespo closed T349397: Migrate the matomo host to bookworm, a subtask of T288804: Upgrade the Data Engineering infrastructure to Debian Bullseye, as Resolved.
Apr 23 2024, 6:09 PM · Data-Platform-SRE, Epic
jcrespo added a comment to T349397: Migrate the matomo host to bookworm.

Hi, backups of the matomo database failed with:

Apr 23 2024, 3:46 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Data-Engineering
jcrespo reopened T349397: Migrate the matomo host to bookworm, a subtask of T288804: Upgrade the Data Engineering infrastructure to Debian Bullseye, as Open.
Apr 23 2024, 3:45 PM · Data-Platform-SRE, Epic
jcrespo reopened T349397: Migrate the matomo host to bookworm as "Open".
Apr 23 2024, 3:45 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Data-Engineering
jcrespo added a comment to T360751: Upgrade backup sources to MariaDB 10.6.

update: on both eqiad and codfw we are generating dumps and snapshots in 10.6 for x1, s2, s6, s5, s3.

Apr 23 2024, 11:20 AM · Data-Persistence, Data-Persistence-Backup

Apr 18 2024

jcrespo updated the task description for T362509: Setup new dbprov hosts and decommission the old ones.
Apr 18 2024, 9:08 AM · Patch-For-Review, database-backups, Data-Persistence-Backup
jcrespo added a comment to T362421: magru network setup.

Hi, after 73470d0dca68abee0 ntp no longer auto-restarts, but after one of the latest changes (I believe b48874a81565b7051be39659c056), it is pending. Can it be restarted, or should it be kept with the old config for a while and acked?

Apr 18 2024, 9:07 AM · Patch-For-Review, netops, Infrastructure-Foundations, SRE

Apr 17 2024

jcrespo created T362766: 2024-04-17 mw-on-k8s eqiad outage.
Apr 17 2024, 11:22 AM · serviceops, Sustainability (Incident Followup)
jcrespo created P60740 (An Untitled Masterwork).
Apr 17 2024, 8:39 AM
jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.
Apr 17 2024, 8:00 AM · Data-Persistence, Data-Persistence-Backup
jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.
Apr 17 2024, 7:55 AM · Data-Persistence, Data-Persistence-Backup

Apr 16 2024

jcrespo updated the task description for T358741: Decommission db2096-db2120.
Apr 16 2024, 1:54 PM · DBA
jcrespo added a comment to T358936: Kubernetes apiserver probe failures on restart.

Hi, today we had another occurrence of this. We didn't consider it a full-blown incident because there was no (or almost no) direct impact on users while the service was down. After kubemaster1002 was detected as down during its automatic restart (due to a puppet change), it took a long time to come back, with lots of incoming network connections stuck or failing and CPU usage maxed out. https://grafana.wikimedia.org/goto/KbF5zPaIg?orgId=1

image.png (410×1 px, 58 KB)

Apr 16 2024, 11:04 AM · Prod-Kubernetes, serviceops, SRE
jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.
Apr 16 2024, 8:15 AM · Data-Persistence, Data-Persistence-Backup
jcrespo added a comment to T360751: Upgrade backup sources to MariaDB 10.6.

@Marostegui Update: backups for x1, s2, s6, s5 and s3 are currently generating dumps and snapshots with MariaDB 10.6 on codfw. Doing s5 and s3 on eqiad next. You may see a lot of 10.4 servers, but they are idle and kept only just in case; they are not active and will eventually be either upgraded or discarded.

Apr 16 2024, 8:11 AM · Data-Persistence, Data-Persistence-Backup
jcrespo closed T362611: Alert in need of triage: SystemdUnitFailed (instance db2200:9100) as Resolved.
[09:44] <jinxer-wm> (SystemdUnitFailed) resolved: prometheus-mysqld-exporter.service on db2200:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
Apr 16 2024, 7:59 AM · DBA, sre-alert-triage

Apr 15 2024

jcrespo added a comment to T355353: Q3:rack/setup/install dbprov100[56].

Thanks a lot, everybody!

Apr 15 2024, 8:34 PM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops
jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.
Apr 15 2024, 5:54 PM · Data-Persistence, Data-Persistence-Backup
jcrespo placed T362311: Decommission db2101 (was: db2101 crashed) up for grabs.

CC @ABran-WMF in case I missed something.

Apr 15 2024, 5:52 PM · SRE, ops-codfw, decommission-hardware, DC-Ops, database-backups, Data-Persistence-Backup, DBA
jcrespo updated the task description for T362311: Decommission db2101 (was: db2101 crashed).
Apr 15 2024, 5:51 PM · SRE, ops-codfw, decommission-hardware, DC-Ops, database-backups, Data-Persistence-Backup, DBA
jcrespo updated the task description for T358741: Decommission db2096-db2120.
Apr 15 2024, 9:26 AM · DBA
jcrespo added a subtask for T358741: Decommission db2096-db2120: T362311: Decommission db2101 (was: db2101 crashed).
Apr 15 2024, 9:25 AM · DBA
jcrespo added a parent task for T362311: Decommission db2101 (was: db2101 crashed): T358741: Decommission db2096-db2120.
Apr 15 2024, 9:25 AM · SRE, ops-codfw, decommission-hardware, DC-Ops, database-backups, Data-Persistence-Backup, DBA
jcrespo renamed T362311: Decommission db2101 (was: db2101 crashed) from db2101 crashed to Decommission db2101 (was: db2101 crashed).
Apr 15 2024, 9:21 AM · SRE, ops-codfw, decommission-hardware, DC-Ops, database-backups, Data-Persistence-Backup, DBA
jcrespo created T362509: Setup new dbprov hosts and decommission the old ones.
Apr 15 2024, 8:03 AM · Patch-For-Review, database-backups, Data-Persistence-Backup

Apr 12 2024

jcrespo added a comment to T355422: Productionize db2196-db2220.

I think we can resolve this and track that at T358741, as long as everybody is aware.

Apr 12 2024, 10:51 AM · database-backups, DBA
jcrespo added a project to T355422: Productionize db2196-db2220: database-backups.

This is now done, although it depends on the definition of productionize, as some of the backup sources have the exact same data and config as the original ones, but have not yet taken over the service, and some backups still use the old hosts.

Apr 12 2024, 10:35 AM · database-backups, DBA
jcrespo updated the task description for T355422: Productionize db2196-db2220.
Apr 12 2024, 10:33 AM · database-backups, DBA
jcrespo reopened T355353: Q3:rack/setup/install dbprov100[56] as "Open".

hi, we cannot ssh into dbprov1006.eqiad.wmnet

Apr 12 2024, 7:25 AM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops

Apr 11 2024

jcrespo added a comment to T360149: Create a database for Striker test instance.

No need. I just wanted to warn the DBAs, although you may find it interesting, as the last issue was with wikireplicas. No need to change anything at the moment (actual data and name) except for the current grants providing access.

Apr 11 2024, 4:06 PM · Data-Persistence-Backup, Patch-For-Review, DBA, cloud-services-team, Striker
jcrespo reassigned T360149: Create a database for Striker test instance from jcrespo to ABran-WMF.

Please see my last comment. Other than that, my work is done.

Apr 11 2024, 3:52 PM · Data-Persistence-Backup, Patch-For-Review, DBA, cloud-services-team, Striker
jcrespo moved T360149: Create a database for Striker test instance from Done to Refine on the DBA board.
Apr 11 2024, 3:51 PM · Data-Persistence-Backup, Patch-For-Review, DBA, cloud-services-team, Striker