jcrespo (Jaime Crespo)
Sr Database Administrator

User Details

User Since
May 11 2015, 8:31 AM (350 w, 6 h)
Availability
Available
IRC Nick
jynus
LDAP User
Jcrespo
MediaWiki User
JCrespo (WMF) [ Global Accounts ]

Recent Activity

Today

jcrespo triaged T299920: Rebalance db1102 backup source, which often causes alert spam due to network throughput as Medium priority.
Mon, Jan 24, 2:55 PM · database-backups, Data-Persistence-Backup
jcrespo created T299920: Rebalance db1102 backup source, which often causes alert spam due to network throughput.
Mon, Jan 24, 2:54 PM · database-backups, Data-Persistence-Backup
jcrespo triaged T299876: Upgrade database backup sources to Bullseye + MariaDB 10.4 as Low priority.
Mon, Jan 24, 8:15 AM · database-backups, Data-Persistence-Backup
jcrespo added a project to T299876: Upgrade database backup sources to Bullseye + MariaDB 10.4: database-backups.
Mon, Jan 24, 8:03 AM · database-backups, Data-Persistence-Backup

Fri, Jan 21

jcrespo triaged T299764: Document media recovery use case proposals and decide their priority as High priority.
Fri, Jan 21, 1:09 PM · media-backups, Data-Persistence-Backup, Goal, SRE
jcrespo created T299764: Document media recovery use case proposals and decide their priority.
Fri, Jan 21, 1:07 PM · media-backups, Data-Persistence-Backup, Goal, SRE
jcrespo added a comment to T299624: Switchover m1 master (db1159 -> db1128).

^@Marostegui I just remembered that dbbackups point to db1159, and not the proxy, due to the current TLS certificate limitation, and the worry about sensitive data being accessed cross-datacenter. It will have to be deployed after switchover. I can do it but involving you in case I don't happen to be around.

Fri, Jan 21, 11:43 AM · Patch-For-Review, DBA
jcrespo added a comment to T299624: Switchover m1 master (db1159 -> db1128).

+1 as owner of the dbbackups database; I am not sure if there is anything else there.

Fri, Jan 21, 11:19 AM · Patch-For-Review, DBA

Wed, Jan 19

jcrespo added a comment to T297605: Shutdown Tendril and dbtree.

Any dbprov host: on /srv/backups/snapshots/latest if it is a snapshot, or on /srv/backups/dumps/latest if it is a logical dump. Ideally with a recognizable file or dir name, even if the section is made up (e.g. dump.tendril.<date>).

Wed, Jan 19, 10:28 AM · serviceops-radar, Patch-For-Review, DBA
jcrespo added a comment to T270101: Grants not working with DB hosts with to ipv6.

To expand marostegui's answer (as I also researched it at T271148#6735477):

Can we "just" add the following

Wed, Jan 19, 9:50 AM · Infrastructure-Foundations, netbox, DBA
jcrespo added a comment to T299479: Upgrade s6 to Bullseye.

I need to talk to LSobanski. I mentioned the need to think about when to upgrade the backup infrastructure, but I don't remember if we ended up scheduling it for this quarter or for a future one. It is certainly on the TODO list, but because there should be no hard dependency on databases this time, I can take care of this sooner or later. It may require some non-trivial upgrades and dependency upgrades, or building new packages for backup software that may not be ready yet or may need some planning (e.g. the bacula upgrade, while technically not a dependency, will be a relatively large project). It also needs weighing against the other backup priorities (if there are more urgent things, we may delay it a bit).

Wed, Jan 19, 9:07 AM · Patch-For-Review, DBA
jcrespo awarded T295965: Test MariaDB 10.4 with Bullseye a Love token.
Wed, Jan 19, 6:42 AM · Patch-For-Review, DBA
jcrespo removed projects from T268258: transfer.py argument parsing exception: Google-Summer-of-Code (2021), good first task.
Wed, Jan 19, 4:40 AM · Data-Persistence-Backup, Patch-For-Review
jcrespo removed projects from T277160: Make recover-dump show the time taken: Google-Summer-of-Code (2021), good first task.
Wed, Jan 19, 4:39 AM · Data-Persistence-Backup, Patch-For-Review
jcrespo removed projects from T277162: recover-mariadb should use logging (logger) to indicate actions taken: Google-Summer-of-Code (2021), good first task.
Wed, Jan 19, 4:38 AM · Data-Persistence-Backup, Patch-For-Review
jcrespo added a comment to T262668: WMF media storage must be adequately backed up.

Codfw first pass finished for all wikis, this is the percentage of errors:

Wed, Jan 19, 4:37 AM · Patch-For-Review, media-backups, Data-Persistence-Backup, Epic, Goal, SRE, SRE-swift-storage

Tue, Jan 18

jcrespo awarded T299416: New table: title table a Like token.
Tue, Jan 18, 6:28 PM · Patch-For-Review, User-Ladsgroup, DBA, Platform Engineering
jcrespo updated subscribers of T299387: Bad revision in German Wikipedia.

Sorry to add @daniel but you are the biggest expert I know about metadata tables, and the content table in particular. Could you help me come up with an explanation why something like "object type text table references on dewiki" may have regressed? Or php serialization changes? See my above comments.

Tue, Jan 18, 6:20 PM · MediaWiki-Revision-backend, Wikimedia-production-error
jcrespo added a comment to T299387: Bad revision in German Wikipedia.

After reading some docs, the text row seems to indicate "it has the same content as text row oldid = 5815762" (which is revision 5806950: https://de.wikipedia.org/w/index.php?title=Enzym&oldid=5806950 ), and that seems very plausible in context.

Tue, Jan 18, 6:07 PM · MediaWiki-Revision-backend, Wikimedia-production-error
jcrespo added a comment to T299387: Bad revision in German Wikipedia.

I think I found it, the referenced text row says:

Tue, Jan 18, 5:52 PM · MediaWiki-Revision-backend, Wikimedia-production-error
jcrespo added a comment to T299387: Bad revision in German Wikipedia.

Hey, should I try to search backups for that blob? I guess there is a very small chance it is there, but it takes very little to check. Do you know the missing key on ES database to search for?

Tue, Jan 18, 5:16 PM · MediaWiki-Revision-backend, Wikimedia-production-error

Wed, Jan 12

jcrespo added a comment to T299095: Links tables corrupted due to incorrectly parenthesized delete queries.

I mean the insert traffic will be there already

Wed, Jan 12, 11:51 PM · MW-1.38-notes (1.38.0-wmf.17; 2022-01-10), Wikimedia-Incident, Patch-For-Review, Platform Engineering, Wikimedia-production-error
jcrespo added a comment to T299095: Links tables corrupted due to incorrectly parenthesized delete queries.

Exceptions seem to be nice now after full revert. :-)

Wed, Jan 12, 11:43 PM · MW-1.38-notes (1.38.0-wmf.17; 2022-01-10), Wikimedia-Incident, Patch-For-Review, Platform Engineering, Wikimedia-production-error
jcrespo added a comment to T299095: Links tables corrupted due to incorrectly parenthesized delete queries.

The current .16 background issues seem to be exceptions for failing to lock pages at:
/srv/mediawiki/php-1.38.0-wmf.16/includes/deferred/LinksUpdate.php
https://logstash.wikimedia.org/goto/b7d2781a9e247a0595896ba649c19a6b

Wed, Jan 12, 11:25 PM · MW-1.38-notes (1.38.0-wmf.17; 2022-01-10), Wikimedia-Incident, Patch-For-Review, Platform Engineering, Wikimedia-production-error
jcrespo added a comment to T299095: Links tables corrupted due to incorrectly parenthesized delete queries.

Here are the symptoms that continue :-(:

Wed, Jan 12, 10:19 PM · MW-1.38-notes (1.38.0-wmf.17; 2022-01-10), Wikimedia-Incident, Patch-For-Review, Platform Engineering, Wikimedia-production-error
jcrespo added a comment to T299095: Links tables corrupted due to incorrectly parenthesized delete queries.

The number of row writes per second doesn't seem to have yet gone back to pre-deployment levels:

Wed, Jan 12, 9:41 PM · MW-1.38-notes (1.38.0-wmf.17; 2022-01-10), Wikimedia-Incident, Patch-For-Review, Platform Engineering, Wikimedia-production-error

Tue, Jan 11

jcrespo added a comment to T295965: Test MariaDB 10.4 with Bullseye.

I just saw that db2078 has a failed service: prometheus-mysqld-exporter.service. I haven't researched further and don't know if it fails because of the new package, is a one-time failure because of the reimage, or is a WIP/known issue, but I am reporting it here, as you told me to bring up anything weird I saw. This doesn't affect backups or my work in any way.

Tue, Jan 11, 6:20 PM · Patch-For-Review, DBA

Mon, Jan 10

jcrespo closed T274206: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] as Resolved.

Deployed the change as this:

commit baeb288d4d9713814ac88e9537bbcf0ece5bb9e4
Author: generate-dns-snippets <noc@wikimedia.org>
Date:   Mon Jan 10 16:17:22 2022 +0000
Mon, Jan 10, 4:55 PM · SRE, ops-eqiad, DC-Ops

Dec 23 2021

jcrespo added a comment to T262668: WMF media storage must be adequately backed up.

Commonswiki codfw backup copy, with 91823709 files backed up and a 0.04% error rate.

Dec 23 2021, 5:13 PM · Patch-For-Review, media-backups, Data-Persistence-Backup, Epic, Goal, SRE, SRE-swift-storage
jcrespo closed T160229: Back up of Commons files, a subtask of T262668: WMF media storage must be adequately backed up, as Resolved.
Dec 23 2021, 5:09 PM · Patch-For-Review, media-backups, Data-Persistence-Backup, Epic, Goal, SRE, SRE-swift-storage
jcrespo closed T160229: Back up of Commons files as Resolved.

Commonswiki is now backed up on 2 geographically redundant locations within WMF infrastructure.

Dec 23 2021, 5:09 PM · Goal, media-backups, Datasets-Archiving, SRE, Datasets-General-or-Unknown, Community-Wishlist-Survey-2016, Commons
jcrespo claimed T274206: (Need By: 2021-03-31) rack/setup/install ms-backup100[12].

Thank you for the feedback! As we all think this was not disabled for some reason (I've seen the setup scripts fail on some steps sometimes), I will do: https://wikitech.wikimedia.org/wiki/DNS/Netbox#How_can_I_add_the_IPv6_AAAA/PTR_records_to_a_host_that_doesn't_have_it? as documented, and that should take care of it.

Dec 23 2021, 3:42 PM · SRE, ops-eqiad, DC-Ops
jcrespo reopened T274206: (Need By: 2021-03-31) rack/setup/install ms-backup100[12] as "Open".

I just realized with jbond that the 2 servers (ms-backup1001.eqiad.wmnet and ms-backup1002.eqiad.wmnet) do not have ipv6 AAAA dns records.

Dec 23 2021, 12:37 PM · SRE, ops-eqiad, DC-Ops

Dec 22 2021

jcrespo created P18255 DB name.
Dec 22 2021, 7:34 PM

Dec 21 2021

jcrespo created T298120: Delay run of weekly backups for es4 and s5 content to avoid running it while the dump is incomplete.
Dec 21 2021, 6:03 PM · Data-Persistence-Backup, bacula, database-backups
jcrespo awarded T298110: Provide an easier way to drop spam mail a Like token.
Dec 21 2021, 5:00 PM · Patch-For-Review, Infrastructure-Foundations, Mail

Dec 17 2021

jcrespo added a comment to T262668: WMF media storage must be adequately backed up.

Codfw commonswiki backups are at 75% completion (68854627 files/301887395014767 bytes backed up), and will likely finish by next week.

Dec 17 2021, 10:19 AM · Patch-For-Review, media-backups, Data-Persistence-Backup, Epic, Goal, SRE, SRE-swift-storage

Dec 16 2021

jcrespo added a comment to T262668: WMF media storage must be adequately backed up.

This is a prototype version of the (trivial/non-massive) recovery script, interactive version:

Dec 16 2021, 3:40 PM · Patch-For-Review, media-backups, Data-Persistence-Backup, Epic, Goal, SRE, SRE-swift-storage
jcrespo awarded T296537: Check and fix GRANT issues of wikiuser a Yellow Medal token.
Dec 16 2021, 2:32 PM · User-Ladsgroup, DBA
jcrespo updated the task description for T294974: (Need By: TBD) rack/setup/install backup1008.
Dec 16 2021, 8:53 AM · SRE, Data-Persistence-Backup, ops-eqiad, DC-Ops
jcrespo awarded T294973: (Need By: TBD) rack/setup/install backup2008 a Like token.
Dec 16 2021, 8:51 AM · Patch-For-Review, SRE, Data-Persistence-Backup, ops-codfw, DC-Ops
jcrespo added a comment to P18236 (An Untitled Masterwork).

You had a few typos:

Dec 16 2021, 8:42 AM

Dec 15 2021

jcrespo edited P18236 (An Untitled Masterwork).
Dec 15 2021, 4:34 PM
jcrespo created P18236 (An Untitled Masterwork).
Dec 15 2021, 4:33 PM
jcrespo added a comment to T262668: WMF media storage must be adequately backed up.

A first pass on eqiad finished successfully: 101,970,844 files backed up successfully, with a total size of 373,335,321,603,376 bytes and an error rate (by size) of 0.035%.

Dec 15 2021, 3:23 PM · Patch-For-Review, media-backups, Data-Persistence-Backup, Epic, Goal, SRE, SRE-swift-storage
jcrespo closed T280232: Uncached wiki requests partially unavailable due to excessive request rates from a bot as Resolved.

I don't think it is worth this being open anymore- there indeed is a need to review it to generate a better description, but hopefully it will be caught up in the incident review process that is kickstarting now.

Dec 15 2021, 11:18 AM · SRE, Wikimedia-Incident
jcrespo added a comment to T262668: WMF media storage must be adequately backed up.

This is the list of media backup errors (making it NDA-only, as I haven't checked yet everything there is non-private):
{P18232}

Dec 15 2021, 10:11 AM · Patch-For-Review, media-backups, Data-Persistence-Backup, Epic, Goal, SRE, SRE-swift-storage

Dec 14 2021

jcrespo added a comment to T262668: WMF media storage must be adequately backed up.

Proof it is working:

Dec 14 2021, 6:40 PM · Patch-For-Review, media-backups, Data-Persistence-Backup, Epic, Goal, SRE, SRE-swift-storage
jcrespo committed rLPRIc67b997429c6: mediabackup: Add dummy age private key for mediabackups (authored by jcrespo).
mediabackup: Add dummy age private key for mediabackups
Dec 14 2021, 4:41 PM
jcrespo updated the task description for T294973: (Need By: TBD) rack/setup/install backup2008.
Dec 14 2021, 2:22 PM · Patch-For-Review, SRE, Data-Persistence-Backup, ops-codfw, DC-Ops

Dec 13 2021

Nemo_bis awarded T262668: WMF media storage must be adequately backed up a Burninate token.
Dec 13 2021, 8:10 PM · Patch-For-Review, media-backups, Data-Persistence-Backup, Epic, Goal, SRE, SRE-swift-storage
jcrespo added a comment to T69818: Local private files on deployment host should be backed up somewhere.

Check now I've made another restore into /home/krinkle/restore2/ and I think you should be able to see it. I think I see the issue- the setgid forces the inheritance from the parent, which will do weird things when restoring to not-root. This is an important thing to notice when doing non-destructive restores, but I think it shouldn't hit us for a regular restore.

Dec 13 2021, 7:09 PM · bacula, Sustainability (Incident Followup), Deployments
jcrespo added a comment to T69818: Local private files on deployment host should be backed up somewhere.

Actually, you may have found something that I am unable to answer you on: that directory (the original one) seems to have the setgid bit on, which of course bacula won't maintain. If that is intended, or how to handle it, I cannot tell you. I would ask you to inquire what the best way to solve this for the original files is, or if we have to work around it.

Dec 13 2021, 5:28 PM · bacula, Sustainability (Incident Followup), Deployments
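
The setgid observation in the T69818 comments can be checked before and after a restore. This is a minimal sketch using only the Python standard library; the function names are hypothetical and it is not part of bacula or of any WMF tooling.

```python
import os
import stat


def mode_has_setgid(mode: int) -> bool:
    """Return True if the setgid bit is set in a st_mode value."""
    return bool(mode & stat.S_ISGID)


def directory_has_setgid(path: str) -> bool:
    """Return True if the directory at `path` has the setgid bit on.

    When the bit is set on a directory, new entries inherit the
    directory's group, which is the inheritance behaviour described
    in the comments above.
    """
    return mode_has_setgid(os.stat(path).st_mode)
```

Comparing `directory_has_setgid()` for the original directory and for the restore target would show whether the bit differs between them.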
jcrespo added a comment to T69818: Local private files on deployment host should be backed up somewhere.

Thanks, this was useful to detect something applicable for future automated recoveries. Bacula backs up full paths, and keeps the permissions of those paths, so restore, restore/srv, and restore/srv/mediawiki-staging were kept as root.

Dec 13 2021, 5:13 PM · bacula, Sustainability (Incident Followup), Deployments
jcrespo added a comment to T294355: Several Wikidata Grafana boards missing data before October 2021.
Terminated Jobs:
 JobId  Level      Files    Bytes   Status   Finished        Name 
====================================================================
396417  Full     108,320    11.70 G  OK       13-Dec-21 09:34 graphite1004.eqiad.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily
396418  Full     108,320    11.70 G  OK       13-Dec-21 09:35 graphite2003.codfw.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily

https://grafana.wikimedia.org/d/413r2vbWk/bacula?orgId=1&var-site=eqiad&var-job=graphite1004.eqiad.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily&from=1639384883943&to=1639388483943

Dec 13 2021, 9:39 AM · Data-Persistence, Data-Persistence-Backup, bacula, SRE Observability (FY2021/2022-Q2), Wikidata Analytics, Wikidata, Graphite
jcrespo added a comment to T294355: Several Wikidata Grafana boards missing data before October 2021.
Running Jobs:
Console connected using TLS at 13-Dec-21 09:20
 JobId  Type Level     Files     Bytes  Name              Status
======================================================================
396417  Back Full      4,568    412.9 M graphite1004.eqiad.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily is running
396418  Back Full          0         0  graphite2003.codfw.wmnet-Weekly-Mon-production-srv-carbon-whisper-daily is running
====
Dec 13 2021, 9:22 AM · Data-Persistence, Data-Persistence-Backup, bacula, SRE Observability (FY2021/2022-Q2), Wikidata Analytics, Wikidata, Graphite

Dec 10 2021

jcrespo added a comment to T294355: Several Wikidata Grafana boards missing data before October 2021.

Let me give it a deeper look, while the patch by itself looks good as is, I want to check if a different (non-default) backup policy would be more advantageous in frequency and space. :-)

Dec 10 2021, 1:36 PM · Data-Persistence, Data-Persistence-Backup, bacula, SRE Observability (FY2021/2022-Q2), Wikidata Analytics, Wikidata, Graphite
jcrespo added a comment to T296537: Check and fix GRANT issues of wikiuser.

db1102 is back up again, BTW (T296546).

Dec 10 2021, 12:33 PM · User-Ladsgroup, DBA

Dec 9 2021

jcrespo awarded T295706: Improve TransactionProfiler as replacement for tendril's slow queries a Love token.
Dec 9 2021, 7:57 PM · Performance-Team-publish, MW-1.38-notes (1.38.0-wmf.9; 2021-11-16), Patch-For-Review, Performance-Team (Radar), Developer Productivity, Wikimedia-Rdbms, DBA, User-Ladsgroup
jcrespo added a comment to T297297: Investigate the unusual dbs in s3.

Related: T173606

Dec 9 2021, 11:53 AM · DBA
jcrespo added a comment to T297297: Investigate the unusual dbs in s3.

Sorry, when I said closed, I really meant deleted.dblist.

Dec 9 2021, 11:47 AM · DBA
jcrespo added a comment to T297297: Investigate the unusual dbs in s3.

sys schema is https://mariadb.com/kb/en/sys-schema/ and should be on all wmf databases, and the admin user should have access to it on mw databases.
ops is used for the log and functions of the query killer, should be on all mw databases (at least for now).
You should check against the closed.dblist- I have been an advocate that if those were to be kept, they shouldn't be on s3, but on a hypothetical "s0" very small section, (eg. on a vm) to save resources.
Others will be relics of the past of maintenance jobs/creation of wikis by mistake.

Dec 9 2021, 7:46 AM · DBA

Dec 3 2021

jcrespo added a comment to T296992: Remove dbbackups-dashboard project and shutdown its instances.

@Majavah: A reminder that I will leave it to you or the rest of the cloud services team to mark its status appropriately at https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge#dbbackups-dashboard for your internal coordination.

Dec 3 2021, 11:16 AM · cloud-services-team (Kanban), Cloud-VPS (Project-requests)
jcrespo awarded T296992: Remove dbbackups-dashboard project and shutdown its instances a Like token.
Dec 3 2021, 11:13 AM · cloud-services-team (Kanban), Cloud-VPS (Project-requests)
jcrespo added a comment to T296546: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102.

The host is down and ready to be serviced. Please let us know if the stick could be removed successfully, or if any other issue arises, as it may require puppet memory adjustments before putting it back into production.

Dec 3 2021, 11:13 AM · SRE, Data-Persistence-Backup, database-backups, ops-eqiad, DC-Ops
jcrespo updated the task description for T272559: Unused puppet resources audit, 2021.
Dec 3 2021, 11:01 AM · Infrastructure-Foundations, Patch-For-Review, SRE, Puppet
jcrespo updated the task description for T272559: Unused puppet resources audit, 2021.
Dec 3 2021, 10:59 AM · Infrastructure-Foundations, Patch-For-Review, SRE, Puppet
jcrespo updated the task description for T272559: Unused puppet resources audit, 2021.
Dec 3 2021, 10:58 AM · Infrastructure-Foundations, Patch-For-Review, SRE, Puppet
jcrespo added a comment to T296546: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102.

@Cmjohnson Today would be preferred, as I will be off on Monday and won't be able to put it down and back up. I will shut down the server, and if it cannot be serviced, we can do it any day on or after the 9th.

Dec 3 2021, 10:46 AM · SRE, Data-Persistence-Backup, database-backups, ops-eqiad, DC-Ops
jcrespo created T296992: Remove dbbackups-dashboard project and shutdown its instances.
Dec 3 2021, 10:44 AM · cloud-services-team (Kanban), Cloud-VPS (Project-requests)
jcrespo reopened T296285: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) as "Open".
from:	SYSTEMDTIMER
to:	root@cumin2001.codfw.wmnet
Dec 3 2021, 10:16 AM · Puppet, Data-Engineering, DBA, Infrastructure-Foundations
jcrespo reopened T296285: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics), a subtask of T256972: Refactor mariadb puppet code, as Open.
Dec 3 2021, 10:16 AM · Patch-For-Review, DBA, User-jbond, User-Kormat

Dec 2 2021

jcrespo added a comment to T296930: codfw: relocate servers in rack D6.

But @Kormat will need to bring it back up the following day (or wait till 9th for you).

Dec 2 2021, 1:41 PM · SRE, DBA, ops-codfw
jcrespo added a comment to T296930: codfw: relocate servers in rack D6.

Sadly, I won't be around on the 7th. There is no issue regarding the move (backups should have finished by that time, ip changes should not affect backups), but either the date has to be moved, or someone will have to stop the servers for me. I can put them up on the 9th when I return (no big issue with those being down for an extended time), but I won't be able to shut them down on the 7th.

Dec 2 2021, 1:38 PM · SRE, DBA, ops-codfw
jcrespo added a comment to T205378: Support ECH on Wikimedia servers.

Unsubscribing, as I think I was added to this ticket by mistake. This is traffic/traffic security expertise, and they have already triaged the task and are aware of it.

Dec 2 2021, 1:03 PM · Traffic-Icebox, Upstream, HTTPS, SRE

Dec 1 2021

jcrespo added a comment to T296289: swift-proxy not starting on ms-fe2009 due to missing python-monotonic.

Ah, so I guess not in production, that leaves me less worried. Sorry, I searched for ms-fe and monotonic but couldn't find this ticket.

Dec 1 2021, 9:22 PM · SRE-swift-storage
jcrespo updated the task description for T296883: ms-fe2010, ms-fe2011, ms-fe2012 had its swift-proxy.service failed.
Dec 1 2021, 9:21 PM · SRE-swift-storage
jcrespo created T296883: ms-fe2010, ms-fe2011, ms-fe2012 had its swift-proxy.service failed.
Dec 1 2021, 9:17 PM · SRE-swift-storage
jcrespo reassigned T296285: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics) from jcrespo to BTullis.

Reassigning to btullis, as he was the person to bring up the alias issue, so he can evaluate/review if the patch is a good solution to his concerns or not (I don't use aliases).

Dec 1 2021, 8:16 AM · Puppet, Data-Engineering, DBA, Infrastructure-Foundations

Nov 30 2021

jcrespo added a comment to T296546: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102.

My only concern is it may want the memory to be mirrored

Nov 30 2021, 4:29 PM · SRE, Data-Persistence-Backup, database-backups, ops-eqiad, DC-Ops
jcrespo added a comment to T254646: Reconsidering how we name things.

I think your intentions were good :-). I edited my comment, it started to be supported since MariaDB 10.5 and once we can stop supporting other versions, we should use the new command- for the sake of consistency (primary/replica) :-D.

Nov 30 2021, 2:12 PM · MW-1.38-notes (1.38.0-wmf.12; 2021-12-06), MW-1.37-notes (1.37.0-wmf.23; 2021-09-13), User-brennen, MediaWiki-extensions-General, WMF-General-or-Unknown, MediaWiki-General, Patch-For-Review, Voice & Tone
jcrespo added a comment to T254646: Reconsidering how we name things.

@Dinoguy1000 I don't think you should do that- "SHOW REPLICA STATUS" is not a valid mariadb command (yet- it started being accepted on 10.5), "SHOW SLAVE STATUS" is the working command that has to be sent- we don't have a choice on that. <strike>I think you should file an upstream bug instead to mariadb to ask to rename the command</strike> You can file a ticket to use SHOW REPLICA STATUS once we stop supporting <10.5, but as a separate ticket- otherwise those kinds of edits would be confusing (they are not names we can choose to change- yet, unlike other identifiers).

Nov 30 2021, 2:07 PM · MW-1.38-notes (1.38.0-wmf.12; 2021-12-06), MW-1.37-notes (1.37.0-wmf.23; 2021-09-13), User-brennen, MediaWiki-extensions-General, WMF-General-or-Unknown, MediaWiki-General, Patch-For-Review, Voice & Tone
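
The version constraint discussed in the T254646 comments can be sketched as a small helper that picks the statement a given server accepts. The function names are hypothetical, and the 10.5 threshold is taken from the comment above rather than checked against upstream release notes; this is not how MediaWiki actually chooses the command.

```python
import re
from typing import Tuple


def parse_mariadb_version(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a string such as '10.4.22-MariaDB-log'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def replication_status_command(version: str) -> str:
    """Return the replication status statement for the given server version.

    Assumption (from the comment above): servers older than 10.5 only
    accept the legacy spelling.
    """
    if parse_mariadb_version(version) >= (10, 5):
        return "SHOW REPLICA STATUS"
    return "SHOW SLAVE STATUS"
```

Comparing `(major, minor)` tuples rather than strings avoids ordering "10.11" before "10.5".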
jcrespo added a comment to T296546: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102.

Are you ok leaving this server as is, until the refresh happens in Q3

Nov 30 2021, 11:48 AM · SRE, Data-Persistence-Backup, database-backups, ops-eqiad, DC-Ops

Nov 26 2021

jcrespo added a comment to T296511: Drop wikiadmin@localhost MySQL user from core dbs.

I checked grants on all mariadb::backup_source hosts (db[2097-2101,2139,2141].codfw.wmnet,db[1102,1116,1139-1140,1145,1150,1171].eqiad.wmnet) at the same time I did a rolling upgrade/restart. I think I kept them all in sync with production (as much as I could; s6, s7 and x1 have some custom ones), but I removed the wikiadmin@localhost grants when I found them, as I copied the grants from the model I was told to follow. FYI

Nov 26 2021, 5:28 PM · DBA, User-Ladsgroup
jcrespo added projects to T296546: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102: database-backups, Data-Persistence-Backup.
Nov 26 2021, 3:07 PM · SRE, Data-Persistence-Backup, database-backups, ops-eqiad, DC-Ops
jcrespo created T296546: hw troubleshooting: memory stick failure (uncorrectable error + reduced available memory) for db1102.
Nov 26 2021, 3:06 PM · SRE, Data-Persistence-Backup, database-backups, ops-eqiad, DC-Ops
jcrespo added a comment to T294355: Several Wikidata Grafana boards missing data before October 2021.

I don't have the answer to that question, but whenever any of you have the servers and path(s), you can follow the instructions at https://wikitech.wikimedia.org/wiki/Bacula#Adding_a_new_client to send a preliminary backup proposal to Puppet, and I will assist you to merge it with the proper setup (e.g. schedule, day, etc.) - I think it will be more useful to discuss the details over a patch :-).

Nov 26 2021, 2:44 PM · Data-Persistence, Data-Persistence-Backup, bacula, SRE Observability (FY2021/2022-Q2), Wikidata Analytics, Wikidata, Graphite
jcrespo awarded T96499: dbtree loads third party resources (from google.com/jsapi) a Love token.
Nov 26 2021, 1:59 PM · Privacy Engineering, Privacy, HTTPS, SRE, Patch-For-Review, DBA, WMF-Legal

Nov 25 2021

jcrespo added a comment to T294355: Several Wikidata Grafana boards missing data before October 2021.

One more question, to finally decide between setting up weekly full backups or daily incremental ones: do all files mostly change completely, or only a subset of them? Incrementals can only be done with file granularity (a file is backed up in full whenever its path or hash has changed). If value.wsp changes every minute, and there is only 1 per value, we will do "weekly only full"; otherwise the daily incrementals may be preferred.

Nov 25 2021, 9:59 AM · Data-Persistence, Data-Persistence-Backup, bacula, SRE Observability (FY2021/2022-Q2), Wikidata Analytics, Wikidata, Graphite
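
The file-granularity rule from the T294355 comment above (a file is copied in full whenever its path or hash changes) can be illustrated with a short sketch. This is not bacula's implementation; the manifest format, the function names, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_to_back_up(root: Path, previous: Dict[str, str]) -> List[str]:
    """List relative paths that an incremental run would copy in full.

    `previous` maps relative path -> digest recorded by the last run
    (a hypothetical manifest). A file is selected when its path is new
    or its digest differs; unchanged files are skipped.
    """
    changed = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(root).as_posix()
        if previous.get(relative) != file_digest(path):
            changed.append(relative)
    return changed
```

If every file changes between runs, the returned list is the whole tree, so a daily incremental would cost as much as a full backup. That is the trade-off the question is trying to settle.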

Nov 24 2021

jcrespo added a comment to T291332: Alert when auto-increment fields on any MW-related databases reach a threshold.

if it's higher than a percentage

Nov 24 2021, 7:45 PM · Platform Engineering, Performance-Team (Radar), Sustainability (Incident Followup), DBA
jcrespo added a comment to T291332: Alert when auto-increment fields on any MW-related databases reach a threshold.

Adding this here in case someone else was confused about how some of those values could shrink (e.g. compared to T63111#5782953): apparently, some PKs had their range doubled by being converted to unsigned (e.g. wikidata.rcs).

Nov 24 2021, 7:23 PM · Platform Engineering, Performance-Team (Radar), Sustainability (Incident Followup), DBA
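
The two T291332 comments above (alerting "if it's higher than a percentage", and usage shrinking after a conversion to unsigned) come down to simple arithmetic. The sketch below shows it under stated assumptions: the function names and the 80% threshold are hypothetical, and only the standard MySQL/MariaDB INT and BIGINT ranges are covered.

```python
# Maximum values for the integer column types, keyed by (type, unsigned).
COLUMN_MAX = {
    ("int", False): 2**31 - 1,
    ("int", True): 2**32 - 1,
    ("bigint", False): 2**63 - 1,
    ("bigint", True): 2**64 - 1,
}


def auto_increment_usage(next_value: int, column_type: str, unsigned: bool) -> float:
    """Return the fraction of the key space already consumed."""
    return next_value / COLUMN_MAX[(column_type, unsigned)]


def exceeds_threshold(next_value: int, column_type: str, unsigned: bool,
                      threshold: float = 0.8) -> bool:
    """Return True when usage is above `threshold` (hypothetical default)."""
    return auto_increment_usage(next_value, column_type, unsigned) > threshold
```

For the same counter value, converting a signed INT column to unsigned roughly halves the reported usage, which would explain a value that appears to shrink between two measurements.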
jcrespo added projects to T294355: Several Wikidata Grafana boards missing data before October 2021: bacula, Data-Persistence-Backup, Data-Persistence.

The number of files is (within reason) a non-blocker for bacula, as files are packaged into volumes. It is true that each file is stored as a mysql record, but that should be able to scale up to dozens of (US) billions, although it may be slow to recover when rebuilding metadata.

Nov 24 2021, 4:09 PM · Data-Persistence, Data-Persistence-Backup, bacula, SRE Observability (FY2021/2022-Q2), Wikidata Analytics, Wikidata, Graphite
jcrespo created P17818 (An Untitled Masterwork).
Nov 24 2021, 3:16 PM
jcrespo added a comment to T296285: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics).

Deployment went as expected- but now that I thought a bit, I think btullis brought up a confusing status now:

Nov 24 2021, 3:00 PM · Puppet, Data-Engineering, DBA, Infrastructure-Foundations
jcrespo updated the task description for T289996: Media storage metadata inconsistent with Swift.
Nov 24 2021, 11:55 AM · SRE-swift-storage

Nov 23 2021

jcrespo created P17801 mariadb query planner for mediabackups.
Nov 23 2021, 12:27 PM
jcrespo awarded T259746: Process for granting wmf LDAP access is vulnerable to impersonation (after creating a Wikitech account with an unconfirmed email address) a Yellow Medal token.
Nov 23 2021, 11:52 AM · Infrastructure-Foundations, SRE, Security, Security-Team
jcrespo added a comment to T296285: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics).

it will give us greater flexibility if and when we want the dbstore* and db* configurations to diverge. Is that about right?

Nov 23 2021, 11:06 AM · Puppet, Data-Engineering, DBA, Infrastructure-Foundations
jcrespo added a comment to T296285: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics).

^What do you think Data-Engineering people?

Nov 23 2021, 10:58 AM · Puppet, Data-Engineering, DBA, Infrastructure-Foundations
jcrespo added a subtask for T256972: Refactor mariadb puppet code: T296285: Split mariadb::dbstore_multiinstance into 2 separate roles (backup sources and analytics).
Nov 23 2021, 10:57 AM · Patch-For-Review, DBA, User-jbond, User-Kormat