@Volans not Amir, but Re: your first question, my understanding is that this was a compromise to make sure there was something good enough and simple in the short term, rather than overengineering from the start. That doesn't mean that what you suggest is discarded, just that it is something that could be improved later on. For example, I am personally interested in having a queryable service/API later for backup checks, but this is better than nothing ATM, with relatively small effort. Later on, a database could import the file and generate it, for example. So I am a fan of iterating slowly as long as it is an improvement 0:-D.

Wed, May 29, 2:21 PM · DBA

Thu, May 23

jcrespo updated the task description for T365607: Reprovision missing files due to backup1005 hw issues.

Thu, May 23, 8:06 AM · Data-Persistence-Backup, media-backups

Wed, May 22

jcrespo added a parent task for T361087: backup1005 crashed: T365607: Reprovision missing files due to backup1005 hw issues.

Wed, May 22, 2:46 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

jcrespo added a comment to T365607: Reprovision missing files due to backup1005 hw issues.

Followup to T361087.

Wed, May 22, 2:46 PM · Data-Persistence-Backup, media-backups

jcrespo added a subtask for T365607: Reprovision missing files due to backup1005 hw issues: T361087: backup1005 crashed.

Wed, May 22, 2:46 PM · Data-Persistence-Backup, media-backups

jcrespo triaged T365607: Reprovision missing files due to backup1005 hw issues as High priority.

Wed, May 22, 2:46 PM · Data-Persistence-Backup, media-backups

jcrespo created T365607: Reprovision missing files due to backup1005 hw issues.

Wed, May 22, 2:46 PM · Data-Persistence-Backup, media-backups

jcrespo closed T365217: Degraded RAID on backup2010 as Resolved.

I did a disk stress test for an hour or so and saw no media errors, SMART errors or RAID controller weirdness.

Resolving for now.

Wed, May 22, 1:50 PM · Data-Persistence-Backup, Data-Persistence, DC-Ops, SRE, ops-codfw

jcrespo added a comment to T365217: Degraded RAID on backup2010.

A disk was rebuilt on the 17th of May:

Wed, May 22, 7:25 AM · Data-Persistence-Backup, Data-Persistence, DC-Ops, SRE, ops-codfw

Tue, May 21

jcrespo added a comment to T363812: Setup backups for es6, es7 and archive old read only backups.

Stop es4 and es5 backups
Generate a full clusterX and clusterY last backup
Archive it into long term backups
Remove dump user

Tue, May 21, 11:31 AM · database-backups, Data-Persistence-Backup

Tue, May 14

jcrespo added a parent task for T364447: Make es4 and es5 RO: T363812: Setup backups for es6, es7 and archive old read only backups.

Tue, May 14, 8:43 AM · DBA

jcrespo added a subtask for T363812: Setup backups for es6, es7 and archive old read only backups: T364447: Make es4 and es5 RO.

Tue, May 14, 8:43 AM · database-backups, Data-Persistence-Backup

jcrespo removed a subtask for T364447: Make es4 and es5 RO: T363812: Setup backups for es6, es7 and archive old read only backups.

Tue, May 14, 8:42 AM · DBA

jcrespo removed a parent task for T363812: Setup backups for es6, es7 and archive old read only backups: T364447: Make es4 and es5 RO.

Tue, May 14, 8:42 AM · database-backups, Data-Persistence-Backup

jcrespo added a subtask for T364447: Make es4 and es5 RO: T363812: Setup backups for es6, es7 and archive old read only backups.

Tue, May 14, 8:42 AM · DBA

jcrespo removed a subtask for T355285: Productionize es10[35-40]: T363812: Setup backups for es6, es7 and archive old read only backups.

Tue, May 14, 8:42 AM · DBA

jcrespo edited parent tasks for T363812: Setup backups for es6, es7 and archive old read only backups, added: T364447: Make es4 and es5 RO; removed: T355285: Productionize es10[35-40].

Tue, May 14, 8:42 AM · database-backups, Data-Persistence-Backup

jcrespo updated the task description for T363812: Setup backups for es6, es7 and archive old read only backups.

Tue, May 14, 8:41 AM · database-backups, Data-Persistence-Backup

jcrespo updated the task description for T363812: Setup backups for es6, es7 and archive old read only backups.

Tue, May 14, 8:04 AM · database-backups, Data-Persistence-Backup

Mon, May 13

jcrespo added a comment to T364296: Reimage db1215 and db2185 (zarcillo) to bookworm.

In T364296#9789045, @Marostegui wrote:

Sorry about it :( how can I help?

Mon, May 13, 8:44 AM · DBA

jcrespo added a comment to T364296: Reimage db1215 and db2185 (zarcillo) to bookworm.

Thanks, the upgrade is no issue, but data will have a lot of backup errors due to not being depooled before maintenance; it will need some work.

Mon, May 13, 8:14 AM · DBA

May 9 2024

jcrespo added a comment to T360751: Upgrade backup sources to MariaDB 10.6.

All backups will now be generated from 10.6 servers, with the exception of s1. Leaving a couple of hosts with 10.4 before upgrading them/decomming them.

May 9 2024, 10:13 AM · Data-Persistence, Data-Persistence-Backup

jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.

May 9 2024, 9:59 AM · Data-Persistence, Data-Persistence-Backup

jcrespo triaged T362509: Setup new dbprov hosts and decommission the old ones as High priority.

May 9 2024, 9:58 AM · Patch-For-Review, database-backups, Data-Persistence-Backup

jcrespo added a comment to T363812: Setup backups for es6, es7 and archive old read only backups.

@Marostegui es6 and es7 backups are enabled, and a first run was done here. They seem mostly empty, though:

May 9 2024, 9:20 AM · database-backups, Data-Persistence-Backup

jcrespo updated the task description for T363812: Setup backups for es6, es7 and archive old read only backups.

May 9 2024, 9:16 AM · database-backups, Data-Persistence-Backup

jcrespo closed T363995: Commons: File:Gnome-edit-delete.svg not found as Resolved.

May 9 2024, 7:11 AM · SRE-swift-storage, Commons

May 7 2024

jcrespo added a comment to T363995: Commons: File:Gnome-edit-delete.svg not found.

In T363995#9775321, @jcrespo wrote:
[2024-05-06 14:33:33,903] INFO:backup '9/96/Gnome-edit-delete.svg' downloaded
[2024-05-06 14:33:33,904] INFO:backup sha256 sum of Gnome-edit-delete.svg is a45ec2020e0997a031bdd62a0dc30a518c82d9c1a100d6e8420bb2a6f938c48f
[2024-05-06 14:33:33,912] WARNING:backup A file with the same sha265 as "commonswiki Gnome-edit-delete.svg 3a36632c6569fcdf45aca81a71b53ae4faf80083 2024-05-06 14:16:14" was already uploaded, skipping.
So it was a "coincidence" that it was backed up, and not the other way around. I will check the backup logs to see why it was failing.
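
As an illustration of the behaviour those log lines describe (skipping an upload when a file with the same sha256 already exists), here is a minimal Python sketch. It is not the actual media backup code: `sha256_of`, `backup_if_new` and the in-memory `known_hashes` set are hypothetical names invented for this example, and a real system would presumably keep the hashes in a database.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_if_new(path: Path, known_hashes: set[str]) -> bool:
    """Hypothetical helper: record a file unless its content hash is already known.

    Returns True if the file was newly recorded, False if it was skipped
    because a file with identical content had been recorded before.
    """
    file_hash = sha256_of(path)
    if file_hash in known_hashes:
        return False  # identical content already backed up, skip
    known_hashes.add(file_hash)
    return True
```

Under these assumptions, two file versions with identical bytes produce the same digest, so only the first one is recorded and the second is skipped, whatever their titles or upload timestamps are.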

May 7 2024, 9:01 AM · SRE-swift-storage, Commons

May 6 2024

jcrespo added a comment to T363995: Commons: File:Gnome-edit-delete.svg not found.

It was failing back in 2021:

May 6 2024, 5:53 PM · SRE-swift-storage, Commons

jcrespo added a comment to T363995: Commons: File:Gnome-edit-delete.svg not found.

In T363995#9763970, @MatthewVernon wrote:

May 6 2024, 2:39 PM · SRE-swift-storage, Commons

jcrespo added a comment to T363995: Commons: File:Gnome-edit-delete.svg not found.

Here are the 2 file versions (with the hash it can be checked that they are the same files):

May 6 2024, 1:15 PM · SRE-swift-storage, Commons

Apr 30 2024

jcrespo closed T361087: backup1005 crashed as Resolved.

Apr 30 2024, 10:45 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

jcrespo changed the status of T363812: Setup backups for es6, es7 and archive old read only backups from Open to In Progress.

Apr 30 2024, 10:42 AM · database-backups, Data-Persistence-Backup

jcrespo changed the status of T363812: Setup backups for es6, es7 and archive old read only backups, a subtask of T355285: Productionize es10[35-40], from Open to In Progress.

Apr 30 2024, 10:42 AM · DBA

jcrespo created T363812: Setup backups for es6, es7 and archive old read only backups.

Apr 30 2024, 10:35 AM · database-backups, Data-Persistence-Backup

jcrespo updated the task description for T362509: Setup new dbprov hosts and decommission the old ones.

Apr 30 2024, 10:23 AM · Patch-For-Review, database-backups, Data-Persistence-Backup

Apr 25 2024

jcrespo added a comment to T361087: backup1005 crashed.

In any case, at this point I'd prefer to do an in-place upgrade rather than a reimage, given how unreliable a reimage is and how impactful it can be for stateful services.

Apr 25 2024, 3:38 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

jcrespo updated subscribers of T361087: backup1005 crashed.

In T361087#9744977, @cmooney wrote:
In T361087#9744384, @jcrespo wrote:
Booting failed (PXE):
PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al


Debian 12 (bookworm) amd64 (Wikimedia edition)

                                              boot: 
Loading debian-installer/amd64/linux... ok
Loading debian-installer/amd64/initrd.gz...
Boot failed: press a key to retry, or wait for reset...
Hmm. Not sure if we've seen this problem before. DHCP clearly worked as did the debian image download, but Linux failed to load for some reason.

@jcrespo the only difference was selecting bullseye rather than bookworm on the second attempt?

Apr 25 2024, 3:29 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

jcrespo added a comment to T361087: backup1005 crashed.

It booted into bullseye.

Apr 25 2024, 11:40 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

jcrespo added a comment to T361087: backup1005 crashed.

Booting failed (PXE):

PXELINUX 6.03 lwIP 20150819 Copyright (C) 1994-2014 H. Peter Anvin et al

Apr 25 2024, 11:15 AM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

Apr 24 2024

jcrespo claimed T361087: backup1005 crashed.

Will reimage soon.

Apr 24 2024, 4:51 PM · SRE, ops-eqiad, DC-Ops, Data-Persistence-Backup, media-backups

jcrespo awarded T363186: Cache mw-mcrouter service ClusterIP in apcu cache a Love token.

Apr 24 2024, 12:02 PM · Patch-For-Review, MediaWiki-Platform-Team, MediaWiki-libs-BagOStuff, MediaWiki-Engineering, serviceops, Sustainability (Incident Followup)

Apr 23 2024

jcrespo closed T349397: Migrate the matomo host to bookworm as Resolved.

Looking good now:

Apr 23 2024, 6:11 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Data-Engineering

jcrespo closed T349397: Migrate the matomo host to bookworm, a subtask of T288804: Upgrade the Data Engineering infrastructure to Debian Bullseye, as Resolved.

Apr 23 2024, 6:09 PM · Data-Platform-SRE, Epic

jcrespo added a comment to T349397: Migrate the matomo host to bookworm.

hi, backups of matomo database failed with:

Apr 23 2024, 3:46 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Data-Engineering

jcrespo reopened T349397: Migrate the matomo host to bookworm, a subtask of T288804: Upgrade the Data Engineering infrastructure to Debian Bullseye, as Open.

Apr 23 2024, 3:45 PM · Data-Platform-SRE, Epic

jcrespo reopened T349397: Migrate the matomo host to bookworm as "Open".

Apr 23 2024, 3:45 PM · Data-Platform-SRE (2024.04.15 - 2024.05.05), Data-Engineering

jcrespo added a comment to T360751: Upgrade backup sources to MariaDB 10.6.

update: on both eqiad and codfw we are generating dumps and snapshots in 10.6 for x1, s2, s6, s5, s3.

Apr 23 2024, 11:20 AM · Data-Persistence, Data-Persistence-Backup

Apr 18 2024

jcrespo updated the task description for T362509: Setup new dbprov hosts and decommission the old ones.

Apr 18 2024, 9:08 AM · Patch-For-Review, database-backups, Data-Persistence-Backup

jcrespo added a comment to T362421: magru network setup.

Hi, after 73470d0dca68abee0 ntp no longer auto-restarts, but after one of the latest changes (I believe b48874a81565b7051be39659c056), it is pending. Can it be restarted, or should it be kept with the old config for a while and acked?

Apr 18 2024, 9:07 AM · Patch-For-Review, netops, Infrastructure-Foundations, SRE

Apr 17 2024

jcrespo created T362766: 2024-04-17 mw-on-k8s eqiad outage.

Apr 17 2024, 11:22 AM · serviceops, Sustainability (Incident Followup)

jcrespo created P60740 (An Untitled Masterwork).

Apr 17 2024, 8:39 AM

jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.

Apr 17 2024, 8:00 AM · Data-Persistence, Data-Persistence-Backup

jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.

Apr 17 2024, 7:55 AM · Data-Persistence, Data-Persistence-Backup

Apr 16 2024

jcrespo updated the task description for T358741: Decommission db2096-db2120.

Apr 16 2024, 1:54 PM · DBA

jcrespo added a comment to T358936: Kubernetes apiserver probe failures on restart.

Hi, today we had another occurrence of this. We didn't consider it a full-blown incident because there was no (or almost no) direct impact on users while the service was down. After kubemaster1002 was detected as down during its automatic restart (due to a puppet change), it took a long time to come back, with lots of incoming network connections stuck or failing and CPU usage maxed out. https://grafana.wikimedia.org/goto/KbF5zPaIg?orgId=1

Apr 16 2024, 11:04 AM · Prod-Kubernetes, serviceops, SRE

jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.

Apr 16 2024, 8:15 AM · Data-Persistence, Data-Persistence-Backup

jcrespo added a comment to T360751: Upgrade backup sources to MariaDB 10.6.

@Marostegui Update: backups for x1, s2, s6, s5 and s3 are currently generating dumps and snapshots with MariaDB 10.6 on codfw. Doing s5 and s3 on eqiad next. You may see a lot of 10.4 servers, but they are idle and only kept just in case; they will eventually be upgraded or discarded.

Apr 16 2024, 8:11 AM · Data-Persistence, Data-Persistence-Backup

jcrespo closed T362611: Alert in need of triage: SystemdUnitFailed (instance db2200:9100) as Resolved.

[09:44] <jinxer-wm> (SystemdUnitFailed) resolved: prometheus-mysqld-exporter.service on db2200:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed

Apr 16 2024, 7:59 AM · DBA, sre-alert-triage

Apr 15 2024

jcrespo added a comment to T355353: Q3:rack/setup/install dbprov100[56].

Thank you a lot, to everybody!

Apr 15 2024, 8:34 PM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops

jcrespo updated the task description for T360751: Upgrade backup sources to MariaDB 10.6.

Apr 15 2024, 5:54 PM · Data-Persistence, Data-Persistence-Backup

jcrespo placed T362311: Decommission db2101 (was: db2101 crashed) up for grabs.

CC @ABran-WMF in case I missed something.

Apr 15 2024, 5:52 PM · SRE, ops-codfw, decommission-hardware, DC-Ops, database-backups, Data-Persistence-Backup, DBA

jcrespo updated the task description for T362311: Decommission db2101 (was: db2101 crashed).

Apr 15 2024, 5:51 PM · SRE, ops-codfw, decommission-hardware, DC-Ops, database-backups, Data-Persistence-Backup, DBA

jcrespo updated the task description for T358741: Decommission db2096-db2120.

Apr 15 2024, 9:26 AM · DBA

jcrespo added a subtask for T358741: Decommission db2096-db2120: T362311: Decommission db2101 (was: db2101 crashed).

Apr 15 2024, 9:25 AM · DBA

jcrespo added a parent task for T362311: Decommission db2101 (was: db2101 crashed): T358741: Decommission db2096-db2120.

Apr 15 2024, 9:25 AM · SRE, ops-codfw, decommission-hardware, DC-Ops, database-backups, Data-Persistence-Backup, DBA

jcrespo renamed T362311: Decommission db2101 (was: db2101 crashed) from db2101 crashed to Decommission db2101 (was: db2101 crashed).

Apr 15 2024, 9:21 AM · SRE, ops-codfw, decommission-hardware, DC-Ops, database-backups, Data-Persistence-Backup, DBA

jcrespo created T362509: Setup new dbprov hosts and decommission the old ones.

Apr 15 2024, 8:03 AM · Patch-For-Review, database-backups, Data-Persistence-Backup

Apr 12 2024

jcrespo added a comment to T355422: Productionize db2196-db2220.

I think we can resolve this and track that at T358741, as long as everybody is aware.

Apr 12 2024, 10:51 AM · database-backups, DBA

jcrespo added a project to T355422: Productionize db2196-db2220: database-backups.

This is now done, although it depends on the definition of productionize, as some of the backup sources have the exact same data and config as the original ones but have not yet taken over the service, and some backups still use the old hosts.

Apr 12 2024, 10:35 AM · database-backups, DBA

jcrespo updated the task description for T355422: Productionize db2196-db2220.

Apr 12 2024, 10:33 AM · database-backups, DBA

jcrespo reopened T355353: Q3:rack/setup/install dbprov100[56] as "Open".

hi, we cannot ssh into dbprov1006.eqiad.wmnet

Apr 12 2024, 7:25 AM · Patch-For-Review, SRE, Data-Persistence, ops-eqiad, DC-Ops

Apr 11 2024

jcrespo added a comment to T360149: Create a database for Striker test instance.

No need. I just wanted to warn the DBAs, although you may find it interesting, as the last issue was with wikireplicas. No need to change anything at the moment (actual data and name), other than the current grants providing access.

Apr 11 2024, 4:06 PM · Data-Persistence-Backup, Patch-For-Review, DBA, cloud-services-team, Striker

jcrespo reassigned T360149: Create a database for Striker test instance from jcrespo to ABran-WMF.

Please see my last comment. Other than that, my work is done.

Apr 11 2024, 3:52 PM · Data-Persistence-Backup, Patch-For-Review, DBA, cloud-services-team, Striker

jcrespo moved T360149: Create a database for Striker test instance from Done to Refine on the DBA board.

Apr 11 2024, 3:51 PM · Data-Persistence-Backup, Patch-For-Review, DBA, cloud-services-team, Striker

jcrespo (Jaime Crespo)
Sr Database Administrator
