
jcrespo (Jaime Crespo)
Sr Database Administrator

Projects (12)

User Details

User Since
May 11 2015, 8:31 AM (447 w, 6 d)
Availability
Available
IRC Nick
jynus
LDAP User
Jcrespo
MediaWiki User
JCrespo (WMF)

Recent Activity

Tue, Dec 5

jcrespo added a comment to T352695: MediaWiki PHPUnit test suite can result in non-test database being modified.

Could it be https://jira.mariadb.org/browse/MDEV-28334 ?

Tue, Dec 5, 6:06 PM · MW-1.42-notes (1.42.0-wmf.9; 2023-12-12), Upstream, Data-Persistence, MediaWiki-libs-Rdbms, MediaWiki-Core-Tests
jcrespo added a comment to T352695: MediaWiki PHPUnit test suite can result in non-test database being modified.

Apparently a temporary table with the name _ can match any SHOW TABLES LIKE query.

MariaDB [wiki1]> CREATE TEMPORARY TABLE _ (a INT);
Query OK, 0 rows affected (0.001 sec)

MariaDB [wiki1]> SHOW TABLES LIKE 'T352695';
+---------------------------+
| Tables_in_wiki1 (T352695) |
+---------------------------+
| _                         |
+---------------------------+
1 row in set (0.001 sec)
Tue, Dec 5, 5:55 PM · MW-1.42-notes (1.42.0-wmf.9; 2023-12-12), Upstream, Data-Persistence, MediaWiki-libs-Rdbms, MediaWiki-Core-Tests
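
The session in T352695 above suggests a defensive check on the caller's side. A minimal sketch, assuming the caller already holds the rows returned by SHOW TABLES LIKE and wants to keep only those that really are the requested table; the helper name `exact_table_matches` is hypothetical and is not taken from MediaWiki's Rdbms code:

```python
def exact_table_matches(rows, table_name):
    """Keep only result rows whose first column equals the requested name.

    `rows` is assumed to be an iterable of tuples, as a typical DB-API
    cursor would return for SHOW TABLES LIKE (an assumption of this sketch).
    """
    return [row[0] for row in rows if row[0] == table_name]


# The row reported in the session above: a stray temporary table named "_".
rows = [("_",)]
print(exact_table_matches(rows, "T352695"))
```

Comparing names for equality in application code avoids trusting the server-side pattern match, at the cost of one extra pass over the (usually tiny) result set.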
jcrespo added a comment to T343707: Migrate SRE repositories to GitLab - Archiving unused Gerrit repositories.

The default setting is to allow maintainers to push and merge and set force push to off. All SREs are owners of /repos/sre so in an emergency situation would be able to enable force push for the main branch of a repository if needed - would that be sufficient from your point of view (with additional documentation on how to do it)?

Tue, Dec 5, 4:55 PM · Projects-Cleanup, Release-Engineering-Team (Priority Backlog 📥), collaboration-services
jcrespo added a comment to T352655: Automate backup file deletion by using a batch file.

Docs: https://wikitech.wikimedia.org/wiki/Media_storage/Backups#Batch_query%2C_recovery_and_deletion

Tue, Dec 5, 2:52 PM · Data-Persistence, Data-Persistence-Backup, media-backups
jcrespo closed T352655: Automate backup file deletion by using a batch file as Resolved.

This is now done (with further improvements down the line); I will now document it on wikitech.

Tue, Dec 5, 2:30 PM · Data-Persistence, Data-Persistence-Backup, media-backups
jcrespo added a comment to T343707: Migrate SRE repositories to GitLab - Archiving unused Gerrit repositories.

I migrated:

Tue, Dec 5, 1:12 PM · Projects-Cleanup, Release-Engineering-Team (Priority Backlog 📥), collaboration-services
jcrespo committed rOSMB900eb03bf85b: Implement batch deletion, restoration and query of files (authored by jcrespo).
Implement batch deletion, restoration and query of files
Tue, Dec 5, 11:48 AM
jcrespo committed rOSMBd3f54f903346: Prepare for 0.3.0 release (authored by jcrespo).
Prepare for 0.3.0 release
Tue, Dec 5, 11:48 AM
jcrespo placed T350020: Access request to deleted image files in the production Swift cluster up for grabs.
Tue, Dec 5, 11:44 AM · SRE, UploadWizard, Structured-Data-Backlog, SRE-Access-Requests

Mon, Dec 4

jcrespo added a comment to T352655: Automate backup file deletion by using a batch file.

@Marostegui (or anyone else) If you have the time, could I ask you for a quick "cli UI" review of the above patch? I have installed a test 0.3.0 package on ms-backup1002 only.

Mon, Dec 4, 10:06 PM · Data-Persistence, Data-Persistence-Backup, media-backups
jcrespo added a comment to T350020: Access request to deleted image files in the production Swift cluster.

Updating title to reflect current request.

Mon, Dec 4, 11:52 AM · SRE, UploadWizard, Structured-Data-Backlog, SRE-Access-Requests
jcrespo renamed T350020: Access request to deleted image files in the production Swift cluster from Access request to deleted image files in the backup cluster to Access request to deleted image files in the production Swift cluster.
Mon, Dec 4, 11:52 AM · SRE, UploadWizard, Structured-Data-Backlog, SRE-Access-Requests
Marostegui awarded T352655: Automate backup file deletion by using a batch file a Love token.
Mon, Dec 4, 11:45 AM · Data-Persistence, Data-Persistence-Backup, media-backups
jcrespo triaged T352655: Automate backup file deletion by using a batch file as High priority.
Mon, Dec 4, 11:44 AM · Data-Persistence, Data-Persistence-Backup, media-backups
jcrespo created T352655: Automate backup file deletion by using a batch file.
Mon, Dec 4, 11:43 AM · Data-Persistence, Data-Persistence-Backup, media-backups

Fri, Dec 1

jcrespo committed rOSMB1b10de348e3c: Migrate TLS configuration to separate file and prepare for puppet call (authored by jcrespo).
Migrate TLS configuration to separate file and prepare for puppet call
Fri, Dec 1, 7:41 PM
jcrespo committed rOSMB772e81d8a6c5: Prepare for 0.2.0 release (authored by jcrespo).
Prepare for 0.2.0 release
Fri, Dec 1, 7:41 PM
jcrespo committed rOSMB5c8ae59a39a8: add_recent_uploads: Be more resilient against errors (authored by jcrespo).
add_recent_uploads: Be more resilient against errors
Fri, Dec 1, 7:41 PM
jcrespo added a comment to T350020: Access request to deleted image files in the production Swift cluster.

@jcrespo , would it be possible to use the internal reverse proxy to directly download deleted images via HTTP like here?

Fri, Dec 1, 7:13 PM · SRE, UploadWizard, Structured-Data-Backlog, SRE-Access-Requests
jcrespo added a comment to T327157: Create and deploy the logic to generate incremental backups of MediaWiki media files, to keep its file storage backup up to date, automatically.

This is now ready for Puppet 7 and the new CA; the only pending thing is to add it to the configuration.

Fri, Dec 1, 3:40 PM · Data-Persistence, media-backups, Data-Persistence-Backup, Goal

Wed, Nov 29

jcrespo updated the task description for T313582: Migrate bacula director to new hardware and setup independent bacula directors/storage/metadata for each primary datacenter for increased redundancy.
Wed, Nov 29, 12:27 PM · Patch-For-Review, Goal, bacula, Data-Persistence-Backup
jcrespo added a comment to T351253: Add support for GitLab markdown linebreak requirement.

Sorry for the terrible suggestion, but how difficult would it be to change the GitLab parser to parse Bug: footers differently and link them to Phabricator/render them nicely on GitLab? I think it would be important to know whether that is something that could potentially happen in the future. My guess is that it is a standard inherited from Bugzilla times that at some point was implemented in Gerrit (?).

Wed, Nov 29, 12:17 PM · GitLab (Upstream pit of despair 🕳️), commit-message-validator

Tue, Nov 28

jcrespo committed rOSMBb14c95e8f4db: Increase unit test coverage for File, MySQLMedia and MySQLMetadata (authored by jcrespo).
Increase unit test coverage for File, MySQLMedia and MySQLMetadata
Tue, Nov 28, 10:47 AM
jcrespo committed rOSMB6f4dfd5300c3: Add tmpdir removal, now that upload is stable (authored by jcrespo).
Add tmpdir removal, now that upload is stable
Tue, Nov 28, 10:47 AM

Mon, Nov 27

jcrespo added a comment to T347740: wmfbackups packages for Debian Bookworm.

Thank you, and sorry for the urgency. Normally this kind of package always keeps backwards compatibility (and it did here too) and only has optional improvements, but this was a kind of emergency due to the Puppet 7 upgrade.

Mon, Nov 27, 2:50 PM · cloud-services-team (FY2023/2024-Q1-Q2), Infrastructure-Foundations, Packaging
jcrespo added a comment to T347740: wmfbackups packages for Debian Bookworm.

I'm afraid you don't have the latest version (https://debmonitor.wikimedia.org/packages/python3-wmfbackups); you should upgrade to 0.8.3+deb12u2.

Mon, Nov 27, 11:12 AM · cloud-services-team (FY2023/2024-Q1-Q2), Infrastructure-Foundations, Packaging

Thu, Nov 23

jcrespo added a comment to T351895: Make it easy to retrieve disk usage trends on backup storage for hw provisioning.

We can retrieve disk utilization for the last year from Prometheus:

backups.png (762×1 px, 166 KB)

Thu, Nov 23, 3:14 PM · database-backups, media-backups, bacula, Data-Persistence-Backup
jcrespo triaged T351895: Make it easy to retrieve disk usage trends on backup storage for hw provisioning as Medium priority.
Thu, Nov 23, 3:14 PM · database-backups, media-backups, bacula, Data-Persistence-Backup
jcrespo created T351895: Make it easy to retrieve disk usage trends on backup storage for hw provisioning.
Thu, Nov 23, 3:14 PM · database-backups, media-backups, bacula, Data-Persistence-Backup
jcrespo added a project to T320636: smart-data-dump fails occasionally due to facter timeouts: Puppet (Puppet 7.0).
Thu, Nov 23, 1:53 PM · Puppet (Puppet 7.0), SRE Observability (FY2022/2023-Q2), Observability-Alerting
jcrespo added a comment to T351725: Daily backup job not running for gerrit1003.

And in the latter there are quite a lot of big .db files,

See T351658#9352274.

"Gerrit hasn't been restarted in a while and did not have an opportunity to compact the H2 databases"

and the following comments after that.

Thu, Nov 23, 9:26 AM · collaboration-services, bacula, Data-Persistence-Backup

Wed, Nov 22

jcrespo added a comment to T351725: Daily backup job not running for gerrit1003.

/var/lib/gerrit2/review_site/cache sounds like a cache. Is it really necessary for a site recovery, or would the cache get regenerated? I am not opposed to backing it up anyway just in case, but maybe not hourly? Obviously, I speak from a place of ignorance, and any change should be tested, but I want to know whether the hourly backups are really effective, given their size (and how they used to be much smaller).

Wed, Nov 22, 9:07 AM · collaboration-services, bacula, Data-Persistence-Backup
jcrespo moved T191804: Allow to store files between 4 and 5 GB from Triage to Done on the Data-Persistence-Backup board.
Wed, Nov 22, 8:29 AM · Data-Persistence-Backup, media-backups, SRE-swift-storage, MediaWiki-File-management, Commons, Multimedia
jcrespo added a project to T191804: Allow to store files between 4 and 5 GB: Data-Persistence-Backup.

This is now deployed and media-backups schema is up to date. Media backups are flowing as usual. I am no longer a blocker here.

Wed, Nov 22, 8:29 AM · Data-Persistence-Backup, media-backups, SRE-swift-storage, MediaWiki-File-management, Commons, Multimedia
jcrespo closed T351617: Several backup alerts fired as Resolved.

All fixed.

Screenshot_20231122_092458.png (656×2 px, 189 KB)

Wed, Nov 22, 8:26 AM · Data-Persistence-Backup

Tue, Nov 21

jcrespo moved T351617: Several backup alerts fired from Triage to Done on the Data-Persistence-Backup board.

All work is now done here; only the last snapshots need to finish before this issue can be resolved.

Screenshot_20231121_193022.png (1×2 px, 229 KB)

Tue, Nov 21, 6:31 PM · Data-Persistence-Backup
jcrespo assigned T351725: Daily backup job not running for gerrit1003 to LSobanski.

I have unclogged things on my side (bacula) and backups are now flowing normally. Usually I would close this as resolved, but I am leaving it to you in case you want to check something else on the client side (gerrit1003), especially regarding my questions at https://phabricator.wikimedia.org/T351725#9350316. Please reassign to whoever it corresponds.

Tue, Nov 21, 6:21 PM · collaboration-services, bacula, Data-Persistence-Backup
jcrespo added a comment to T351725: Daily backup job not running for gerrit1003.

New filesystem order arrived! https://grafana.wikimedia.org/goto/2I_2H9IIz?orgId=1

Tue, Nov 21, 5:53 PM · collaboration-services, bacula, Data-Persistence-Backup
jcrespo added a comment to T351725: Daily backup job not running for gerrit1003.

Despite https://gerrit.wikimedia.org/r/c/operations/puppet/+/976286 , I wonder if backups are being executed well: the full backup is around 50GB, but right now 20GB are copied each hour. I wonder if that makes sense, whether 20GB of new files are really generated each hour or there is some avoidable overhead, especially given that a few months ago it was 8GB, and I would expect Gerrit usage to decline as people start using GitLab more.

Tue, Nov 21, 5:44 PM · collaboration-services, bacula, Data-Persistence-Backup
jcrespo added a comment to T351725: Daily backup job not running for gerrit1003.

"Job ... is waiting. Cannot find any appendable volumes"

Tue, Nov 21, 5:19 PM · collaboration-services, bacula, Data-Persistence-Backup
jcrespo added a comment to T351617: Several backup alerts fired.

Metadata from yesterday's backups are starting to pour in:

Screenshot_20231121_125655.png (419×2 px, 57 KB)

Tue, Nov 21, 11:58 AM · Data-Persistence-Backup
jcrespo added a comment to T284150: Bring an-mariadb100[12] into service.

BTullis closed this task as Resolved.

Tue, Nov 21, 11:31 AM · Patch-For-Review, Data-Platform-SRE
jcrespo changed the status of T351617: Several backup alerts fired from Open to In Progress.

I believe I have fixed the issue; the Puppet code was wrong after the package fix at T351491.

Tue, Nov 21, 11:16 AM · Data-Persistence-Backup
jcrespo awarded T351588: Puppet failing ln dbprov2004 a Love token.
Tue, Nov 21, 10:21 AM · Data-Persistence-Backup

Fri, Nov 17

jcrespo closed T351491: pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on 'db1164.eqiad.wmnet' ([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123))") on backup as Resolved.

This should now be fixed with the new packages + puppet code (but we will see over the weekend).

Fri, Nov 17, 4:10 PM · Puppet (Puppet 7.0), Data-Persistence, Data-Persistence-Backup, database-backups
jcrespo added a project to T191804: Allow to store files between 4 and 5 GB: media-backups.
Fri, Nov 17, 11:09 AM · Data-Persistence-Backup, media-backups, SRE-swift-storage, MediaWiki-File-management, Commons, Multimedia
jcrespo added a comment to T339894: cloudservices: codfw1dev: fix backups.

I believe the work on those tickets was done here, CC @fnegri. But please double check if there are additional dbs that were not part of the migration.

Fri, Nov 17, 10:38 AM · User-aborrero, cloud-services-team, Cloud-VPS
jcrespo merged tasks T284483: migrate clouddb backups (openstack) from the old mysqldump system to the new wmfbackups (mydumper/mariabackup), T316664: Make a script to backup galera/openstack databases into T339894: cloudservices: codfw1dev: fix backups.
Fri, Nov 17, 10:37 AM · User-aborrero, cloud-services-team, Cloud-VPS
jcrespo merged task T316664: Make a script to backup galera/openstack databases into T339894: cloudservices: codfw1dev: fix backups.
Fri, Nov 17, 10:37 AM · cloud-services-team, Cloud-VPS
jcrespo merged task T284483: migrate clouddb backups (openstack) from the old mysqldump system to the new wmfbackups (mydumper/mariabackup) into T339894: cloudservices: codfw1dev: fix backups.
Fri, Nov 17, 10:37 AM · cloud-services-team, database-backups, Data-Persistence-Backup, Data-Services
jcrespo added a comment to T347740: wmfbackups packages for Debian Bookworm.

Please note that an important update of the wmfbackups package for compatibility with Puppet 7 will be pushed soon (wmfbackups 0.8.3 and its related subpackages), see T351491. It should not affect cloud usage, as you don't use the dbbackups stats gathering, but you should upgrade ASAP to avoid future errors.

Fri, Nov 17, 10:24 AM · cloud-services-team (FY2023/2024-Q1-Q2), Infrastructure-Foundations, Packaging
jcrespo triaged T351491: pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on 'db1164.eqiad.wmnet' ([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123))") on backup as Unbreak Now! priority.
Fri, Nov 17, 8:05 AM · Puppet (Puppet 7.0), Data-Persistence, Data-Persistence-Backup, database-backups
jcrespo created T351491: pymysql.err.OperationalError: (2003, "Can't connect to MySQL server on 'db1164.eqiad.wmnet' ([SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1123))") on backup.
Fri, Nov 17, 8:04 AM · Puppet (Puppet 7.0), Data-Persistence, Data-Persistence-Backup, database-backups

Tue, Nov 14

jcrespo awarded T212783: cumin: Make output path sane and flexible (was: allow to suppress output and progress bars) a Grey Medal token.
Tue, Nov 14, 12:43 PM · Cumin, Infrastructure-Foundations
jcrespo awarded T330882: transferpy should not log cumin subcomands as ERRORs on a normal, succesful run a Grey Medal token.
Tue, Nov 14, 12:42 PM · Patch-For-Review, database-backups, Data-Persistence-Backup
jcrespo claimed T330882: transferpy should not log cumin subcomands as ERRORs on a normal, succesful run.
Tue, Nov 14, 12:40 PM · Patch-For-Review, database-backups, Data-Persistence-Backup
jcrespo added a comment to T212783: cumin: Make output path sane and flexible (was: allow to suppress output and progress bars).

Volans, based on your comments at T330882, the original scope I needed for transfer.py is already fulfilled. I am guessing this is just open to "improve it" beyond that, in the latest, increased scope?

Tue, Nov 14, 11:16 AM · Cumin, Infrastructure-Foundations

Mon, Nov 13

jcrespo awarded T347390: Create backups for puppetservers a Orange Medal token.
Mon, Nov 13, 4:08 PM · Patch-For-Review, Data-Persistence-Backup, Puppet-Infrastructure, Puppet (Puppet 7.0), Infrastructure-Foundations, SRE
jcrespo committed rOSMBee2c49752320: sql: Migrate mediabackups metadata size from int to bigint (authored by jcrespo).
sql: Migrate mediabackups metadata size from int to bigint
Mon, Nov 13, 10:05 AM

Nov 10 2023

jcrespo added a comment to T350924: Swift container for archived mariadb tables.

+1

Nov 10 2023, 3:34 PM · SRE-swift-storage
jcrespo added a comment to T191804: Allow to store files between 4 and 5 GB.

Indeed, the same schema change for production has to be applied to backup metadata, as we mirrored the size from mediawiki as an unsigned int:

Nov 10 2023, 12:32 PM · Data-Persistence-Backup, media-backups, SRE-swift-storage, MediaWiki-File-management, Commons, Multimedia
jcrespo added a comment to T191804: Allow to store files between 4 and 5 GB.

Thank you, @AlexisJazz, that's useful feedback that without doubt will make our media storage happy. Still, there are additional technical operations and challenges to overcome. Cost is not so much the concern (especially for enwiki needs); we are more concerned about Commons, with its almost half a PB of storage. But servers still have to be purchased, racked and installed, data resharded, and everything planned, and that takes some time. It is not a question of just "buying larger disks". :-D

Nov 10 2023, 12:25 PM · Data-Persistence-Backup, media-backups, SRE-swift-storage, MediaWiki-File-management, Commons, Multimedia
jcrespo added a comment to T191804: Allow to store files between 4 and 5 GB.

@AlexisJazz While we are happy that you are excited about this, it is far from ready for discussion. Developers just handed over the code, but it still requires a lot of preparation and discussion before system administrators can implement it at WMF, due to the scale of operations. There are a lot of open questions regarding the extra Swift space needed, backup compatibility, schema change deployment, and much other work.

Nov 10 2023, 9:35 AM · Data-Persistence-Backup, media-backups, SRE-swift-storage, MediaWiki-File-management, Commons, Multimedia

Nov 9 2023

jcrespo added a comment to T350020: Access request to deleted image files in the production Swift cluster.

Feel free to contact other SREs who can support you (they can be those in data engineering, as they may know more about Hadoop); they can get back to me directly too, if that helps.

Nov 9 2023, 4:07 PM · SRE, UploadWizard, Structured-Data-Backlog, SRE-Access-Requests
jcrespo added a comment to T191804: Allow to store files between 4 and 5 GB.

Please keep me in the loop on the progress. While this doesn't affect production, I may have assumed in some cases that files were always smaller than 4 GB for backups, and I may need to review storage compatibility, even if it is just applying the same schema change to the backup metadata.

Nov 9 2023, 3:11 PM · Data-Persistence-Backup, media-backups, SRE-swift-storage, MediaWiki-File-management, Commons, Multimedia
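
The 4 GB assumption mentioned above follows from the width of a 4-byte unsigned integer column. A small sketch of the arithmetic, not taken from the media-backups code; the constant and function names are illustrative, and the signed BIGINT bound is used as a conservative assumption about the migrated column:

```python
UINT32_MAX = 2**32 - 1   # largest value a 4-byte unsigned integer can hold
INT64_MAX = 2**63 - 1    # largest value an 8-byte signed integer (BIGINT) can hold


def fits_uint32(size_bytes: int) -> bool:
    """True if a file size in bytes can be stored in a 4-byte unsigned int."""
    return 0 <= size_bytes <= UINT32_MAX


four_gib = 4 * 1024**3
five_gib = 5 * 1024**3

print(fits_uint32(four_gib - 1))   # one byte under 4 GiB
print(fits_uint32(five_gib))       # a 5 GiB file
print(five_gib <= INT64_MAX)       # the same size against the BIGINT bound
```

A file of exactly 4 GiB is already one byte past the unsigned int limit, so any size between 4 and 5 GB needs the wider column.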
jcrespo committed rOSTPf6f4f6ed9192: Tranferrer: Enable transfers other than misc, core or x1 sections (authored by jcrespo).
Tranferrer: Enable transfers other than misc, core or x1 sections
Nov 9 2023, 10:25 AM
jcrespo added a project to T277160: Make recover-dump show the time taken: database-backups.
Nov 9 2023, 10:10 AM · database-backups, Data-Persistence-Backup
jcrespo added a project to T277162: recover-mariadb should use logging (logger) to indicate actions taken: database-backups.
Nov 9 2023, 9:09 AM · database-backups, Data-Persistence-Backup, Patch-For-Review
jcrespo added a project to T277754: Improve filename regex in cli/recover-dump : database-backups.
Nov 9 2023, 9:08 AM · database-backups, Data-Persistence-Backup, good first task

Nov 8 2023

jcrespo triaged T350020: Access request to deleted image files in the production Swift cluster as High priority.
Nov 8 2023, 8:10 PM · SRE, UploadWizard, Structured-Data-Backlog, SRE-Access-Requests
aaron awarded T133523: Decide how to improve parsercache replication, sharding and HA a Orange Medal token.
Nov 8 2023, 5:54 PM · SRE-Sprint-Week-Sustainability-March2023, MW-1.39-notes (1.39.0-wmf.22; 2022-07-25), Patch-For-Review, Epic, Sustainability (Incident Followup), DBA
jcrespo added a comment to T350020: Access request to deleted image files in the production Swift cluster.

While checking the things I need to apply the change, I need 2 additional data points-

Nov 8 2023, 4:54 PM · SRE, UploadWizard, Structured-Data-Backlog, SRE-Access-Requests
jcrespo added a comment to T350020: Access request to deleted image files in the production Swift cluster.

One last thing- legal rarely comments on stuff here in public on Phab- you may want to reach them directly.

Nov 8 2023, 4:36 PM · SRE, UploadWizard, Structured-Data-Backlog, SRE-Access-Requests
jcrespo added a comment to T350020: Access request to deleted image files in the production Swift cluster.

Ok. Then it seems we are ok for the most part. I will start working on access; as this is the first time such access has been requested, please be patient. It won't be as fast as a regular access request, but it shouldn't be too hard either. I will take the opportunity to document the grant creation in case there are other requests in the future.

Nov 8 2023, 4:32 PM · SRE, UploadWizard, Structured-Data-Backlog, SRE-Access-Requests
jcrespo added a comment to T330882: transferpy should not log cumin subcomands as ERRORs on a normal, succesful run.

Thank you a lot, worker.reporter was exactly what I think @MatthewVernon wanted (ignoring logging by the executioner and roll our own logging of what was an error), not ok_codes. I've updated the ticket to reflect more accurately the request.

Nov 8 2023, 3:34 PM · Patch-For-Review, database-backups, Data-Persistence-Backup
jcrespo renamed T330882: transferpy should not log cumin subcomands as ERRORs on a normal, succesful run from transferpy should take advantage of cumin's ok_codes to avoid spurious ERRORs to transferpy should not log cumin subcomands as ERRORs on a normal, succesful run.
Nov 8 2023, 3:30 PM · Patch-For-Review, database-backups, Data-Persistence-Backup

Nov 7 2023

jcrespo added a comment to T350022: Switchover m1 master (db1164-> db1119).

Looks good to me!

Nov 7 2023, 4:16 PM · DBA
jcrespo added a comment to T344036: Productionize db12[26-49].

It can be used for production for now, temporarily, but I will eventually need it for testing backups. Sadly, testing is not at the top of the priorities, but the idea was to finally use it (or a replacement) for that this fiscal year.

Nov 7 2023, 9:22 AM · DBA

Nov 6 2023

jcrespo closed T347674: Expand bacula space by provisioning new backup hosts, a subtask of T313582: Migrate bacula director to new hardware and setup independent bacula directors/storage/metadata for each primary datacenter for increased redundancy, as Resolved.
Nov 6 2023, 6:00 PM · Patch-For-Review, Goal, bacula, Data-Persistence-Backup
jcrespo closed T347674: Expand bacula space by provisioning new backup hosts as Resolved.

After much struggle, reimage worked correctly.

Nov 6 2023, 6:00 PM · bacula, Data-Persistence-Backup
jcrespo added a comment to T284150: Bring an-mariadb100[12] into service.

Ah, I see the issue:

Nov 6 2023, 1:30 PM · Patch-For-Review, Data-Platform-SRE
jcrespo added a comment to T284150: Bring an-mariadb100[12] into service.

the given socket does not have a known format

Nov 6 2023, 1:20 PM · Patch-For-Review, Data-Platform-SRE

Nov 2 2023

jcrespo added a comment to T350192: On-call batphone escalation configuration holidays FY2023-24.

One additional request: the last week of the year is a WMF holiday and I was scheduled for clinic duty. I still want to do my part, so could you override that (nobody is supposed to be around during Christmas week) and add me at the end of the schedule? Or, if that is too much of a change, I can just be the person to ping if someone cannot do their clinic duty turn, or help the person after Christmas, whatever you see fit.

Nov 2 2023, 5:38 PM · SRE Observability (FY2023/2024-Q2)
jcrespo awarded T350360: Evaluate "drop in" replacement for nrpe scripts a Party Time token.
Nov 2 2023, 10:43 AM · Observability-Alerting
jcrespo updated the task description for T138562: Improve regular production database backups handling.
Nov 2 2023, 9:52 AM · Sustainability (Incident Followup), Data-Persistence-Backup, Epic
jcrespo updated subscribers of T156462: Framework to transfer files over the LAN.

Adding Arnaud so he can add this ticket to his reading list, for context of previous conversations about transference tooling needs- but requiring nothing else.

Nov 2 2023, 9:17 AM · DBA
jcrespo closed T156462: Framework to transfer files over the LAN as Resolved.

Yeah, there are already tickets open for the pending issues (logs, arguments, etc.).

Nov 2 2023, 9:16 AM · DBA
jcrespo closed T156462: Framework to transfer files over the LAN, a subtask of T138562: Improve regular production database backups handling, as Resolved.
Nov 2 2023, 9:16 AM · Sustainability (Incident Followup), Data-Persistence-Backup, Epic
jcrespo closed T156462: Framework to transfer files over the LAN, a subtask of T156461: [META ticket] Automation for our DBs tracking task, as Resolved.
Nov 2 2023, 9:16 AM · DBA, Epic
jcrespo added a comment to T327384: Simplify manual runs of generate-mysqld-exporter-config (was: Some prometheus jobs for codfw are reporting availability problems / uncollectable despite metrics being collected).

It happened again, bumping priority.

Nov 2 2023, 9:11 AM · Observability-Metrics, DBA
jcrespo raised the priority of T327384: Simplify manual runs of generate-mysqld-exporter-config (was: Some prometheus jobs for codfw are reporting availability problems / uncollectable despite metrics being collected) from Low to Medium.
Nov 2 2023, 9:10 AM · Observability-Metrics, DBA

Oct 31 2023

jcrespo added a comment to T350020: Access request to deleted image files in the production Swift cluster.

We should discuss this a bit- as this changes not only the initial hypothesis, but also the restrictions of your project:

Oct 31 2023, 1:27 PM · SRE, UploadWizard, Structured-Data-Backlog, SRE-Access-Requests
jcrespo claimed T350020: Access request to deleted image files in the production Swift cluster.
Oct 31 2023, 1:08 PM · SRE, UploadWizard, Structured-Data-Backlog, SRE-Access-Requests
jcrespo added a comment to T347740: wmfbackups packages for Debian Bookworm.
reprepro changes:
add bookworm-wikimedia deb main amd64 wmfbackups 0.8.3+deb12u1 -- pool/main/w/wmfbackups/wmfbackups_0.8.3+deb12u1_amd64.deb
add bookworm-wikimedia deb main amd64 python3-wmfbackups 0.8.3+deb12u1 -- pool/main/w/wmfbackups/python3-wmfbackups_0.8.3+deb12u1_amd64.deb
add bookworm-wikimedia deb main amd64 wmfbackups-check 0.8.3+deb12u1 -- pool/main/w/wmfbackups/wmfbackups-check_0.8.3+deb12u1_amd64.deb
add bookworm-wikimedia deb main amd64 wmfbackups-remote 0.8.3+deb12u1 -- pool/main/w/wmfbackups/wmfbackups-remote_0.8.3+deb12u1_amd64.deb
Oct 31 2023, 12:53 PM · cloud-services-team (FY2023/2024-Q1-Q2), Infrastructure-Foundations, Packaging
jcrespo added a comment to T349360: Clean up dbbackups.backup_files table.

@Ladsgroup Yeah, I thought so. Let's both try to come up with a strategy that is relatively simple to automate, and we can discuss it next Monday: either reducing granularity or archival (for which there is currently no mechanism in place).

Oct 31 2023, 9:12 AM · database-backups, Data-Persistence-Backup

Oct 30 2023

jcrespo added a comment to T350022: Switchover m1 master (db1164-> db1119).

For bacula it will be a bit more complicated- the first week there is usually more overload. So maybe, if it can be done tomorrow, wait until 11 am UTC (or a week later). I can prepare the patches by then.

Oct 30 2023, 1:55 PM · DBA
jcrespo added a comment to T350022: Switchover m1 master (db1164-> db1119).

Tomorrow (or if you need more time, one week later) would be a good day: backups will have run the night before (they usually finish at 5:20 UTC) and you are free to do any maintenance there.

Oct 30 2023, 1:53 PM · DBA
jcrespo added a comment to T347740: wmfbackups packages for Debian Bookworm.

I was planning on creating a package for bookworm soon, but I cannot provide any timeline.

Oct 30 2023, 10:22 AM · cloud-services-team (FY2023/2024-Q1-Q2), Infrastructure-Foundations, Packaging

Oct 24 2023

jcrespo added a comment to T343109: Recover dbstore1007:s2 from the database provisioning service.

As a followup- consider in the future documenting the special grants on puppet. We don't have a good solution to monitor and assign them, but at least for now we document them at this location: https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/templates/mariadb/grants/analytics-replica.sql

Oct 24 2023, 2:50 PM · Data-Platform-SRE, Data-Engineering, DBA