jcrespo (Jaime Crespo)
Sr Database Administrator

User Details

User Since
May 11 2015, 8:31 AM (223 w, 5 d)
Availability
Available
IRC Nick
jynus
LDAP User
Jcrespo
MediaWiki User
JCrespo (WMF) [ Global Accounts ]

Recent Activity

Yesterday

jcrespo added a comment to T107610: Setup separate logical External Store for Flow in production.

There are two clusters in the normal ES, with writes (if I understand correctly) randomly going into either. I assume that's for handling load (of which there's not much in Flow) so we are fine with just one?

Fri, Aug 23, 3:09 PM · Growth-Team (Current Sprint), DBA, Operations, WorkType-Maintenance, Collaboration-Team-Triage, StructuredDiscussions
jcrespo added a comment to T107610: Setup separate logical External Store for Flow in production.

We will have to do most of this anyway for T226704.

Fri, Aug 23, 3:06 PM · Growth-Team (Current Sprint), DBA, Operations, WorkType-Maintenance, Collaboration-Team-Triage, StructuredDiscussions
jcrespo added a comment to T102099: Fix IPv6 autoconf issues once and for all, across the fleet..

Hi, I am a bit disconnected from the deployment planning for this. Once all hosts (or all hosts that are planned above) are migrated, is the puppet line supposed to go on the profile (or role), or on base.pp with some exclusions? It is not clear from the ticket description and comments, or I may have missed it, as it is a long ticket :-D.

Fri, Aug 23, 11:15 AM · Patch-For-Review, Traffic, netops, Operations, IPv6

Mon, Aug 19

jcrespo edited Description on Wikimedia-database-error.
Mon, Aug 19, 1:51 PM
jcrespo added a comment to T196366: Implement (or refactor) a script to move slaves when the master is not available.

Maybe GTID will become usable in 10.4? https://jira.mariadb.org/browse/MDEV-12012?focusedCommentId=132462#comment-132462

Mon, Aug 19, 9:07 AM · DBA
jcrespo added a comment to T196366: Implement (or refactor) a script to move slaves when the master is not available.

Ah, I see!
Yeah, I was thinking about a very primitive way to do it (for now): a human decides which host is the most suitable to be the new master, and the script then executes the batch of CHANGE MASTER TO commands that point the replicas at that host.
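
A minimal sketch of that primitive approach, not the actual script: it assumes a human has already picked the new master and supplies its binlog coordinates, and it assumes every replica stopped at the same position, which a real failover cannot take for granted. The function name, the second host name and the coordinates are hypothetical; the statements are only generated, never executed.

```python
def build_change_master_statements(new_master, binlog_file, binlog_pos, replicas):
    """Return one (replica, sql) pair for each replica that must be repointed.

    Illustrative only: the operator chooses new_master and provides its
    binlog coordinates; nothing is sent to any database from here.
    """
    if binlog_pos < 0:
        raise ValueError("binlog position must be non-negative")
    statements = []
    for replica in replicas:
        if replica == new_master:
            # The promoted host must not replicate from itself.
            continue
        sql = (
            "STOP SLAVE; "
            f"CHANGE MASTER TO MASTER_HOST='{new_master}', "
            f"MASTER_LOG_FILE='{binlog_file}', MASTER_LOG_POS={binlog_pos}; "
            "START SLAVE;"
        )
        statements.append((replica, sql))
    return statements


if __name__ == "__main__":
    # Hypothetical hosts and coordinates, for illustration only.
    for host, sql in build_change_master_statements(
        "db2069", "db2069-bin.000123", 4, ["db2069", "db2070"]
    ):
        print(host, "->", sql)
```

Printing the batch first keeps the human in the loop: the statements can be reviewed before anything is run against a replica.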

Mon, Aug 19, 9:03 AM · DBA
jcrespo added a comment to T196366: Implement (or refactor) a script to move slaves when the master is not available.

Sadly, switchover.py wouldn't be reusable or helpful for an emergency (the replication and other libraries may be); it has to start from zero. switchover.py assumes all hosts are reachable and have very low lag, replication is working, etc., which won't be the case on a failover. A failover is a much harder case, where every possibility of breakage has to be contemplated separately and some safe compromises have to be made (e.g. what to do if we detect X amount of data has been lost).

Mon, Aug 19, 8:48 AM · DBA

Fri, Aug 16

jcrespo added a comment to T230485: Create replica of napwikisource on labs.

Please note that sanitarium has to be handled first (T210762); this task should not be worked on until that one is done. @Reedy, to prevent issues, that is normally handled on that ticket.

Fri, Aug 16, 10:28 AM · DBA, Data-Services
jcrespo added a comment to T230459: Replace db2044 with db2063.

Not sure what the status of this is, considering T228258 exists. MySQL on db2063 is down, but I ain't touching it, just to avoid breaking something.

Fri, Aug 16, 10:24 AM · DBA

Jul 19 2019

jcrespo placed T143896: MySQL metrics monitoring up for grabs.
Jul 19 2019, 5:31 PM · observability, DBA, Patch-For-Review, Operations, Prometheus-metrics-monitoring
jcrespo placed T216240: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 up for grabs.
Jul 19 2019, 5:30 PM · Operations, DBA
jcrespo placed T224656: A Query takes suddenly really much too long – something corrupt? up for grabs.

Going on vacation; I will not work on this at the moment.

Jul 19 2019, 5:30 PM · Data-Services
jcrespo closed T219631: Create a recovery/provisioning script for database binary backups, a subtask of T206203: Implement database binary backups into the production infrastructure, as Resolved.
Jul 19 2019, 5:29 PM · Goal, DBA
jcrespo closed T219631: Create a recovery/provisioning script for database binary backups as Resolved.

Considered resolved with just T219631#5172069, and documented at https://wikitech.wikimedia.org/wiki/transfer.py. A more complete automation will be done later (on a separate task), where automated provisioning is done, including setting up replication, starting the server, loading grants, etc.

Jul 19 2019, 5:29 PM · DBA
jcrespo reassigned T225378: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 from jcrespo to Marostegui.
Jul 19 2019, 5:27 PM · Operations, DBA
jcrespo reassigned T196055: Remove table `math` from the database from jcrespo to Marostegui.
Jul 19 2019, 5:26 PM · Patch-For-Review, DBA, Math
jcrespo added a comment to T228465: TorBlock maintenance failures on labweb hosts.

For posterity, this issue generates the following error being mailed:

Jul 19 2019, 8:28 AM · MW-1.34-notes (1.34.0-wmf.14; 2019-07-16), Patch-For-Review, MediaWiki-extensions-TorBlock
jcrespo added a parent task for T228465: TorBlock maintenance failures on labweb hosts: T132324: Tracking and Reducing cron-spam to root@ .
Jul 19 2019, 8:27 AM · MW-1.34-notes (1.34.0-wmf.14; 2019-07-16), Patch-For-Review, MediaWiki-extensions-TorBlock
jcrespo added a subtask for T132324: Tracking and Reducing cron-spam to root@ : T228465: TorBlock maintenance failures on labweb hosts.
Jul 19 2019, 8:27 AM · Patch-For-Review, Operations

Jul 18 2019

jcrespo added a comment to T228436: web request timeout after 200 seconds due to Wikimedia\Rdbms\LBFactory->__destruct() > Wikimedia\Rdbms\LBFactory->commitMasterChanges().

See immediate effect: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1067&var-port=9104&from=1563449050608&to=1563468250610&panelId=37&fullscreen

Jul 18 2019, 3:44 PM · Performance-Team, Wikimedia-Rdbms, Wikimedia-production-error
jcrespo added a comment to T228436: web request timeout after 200 seconds due to Wikimedia\Rdbms\LBFactory->__destruct() > Wikimedia\Rdbms\LBFactory->commitMasterChanges().

https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1067&var-port=9104&from=1563453267456&to=1563464067457&panelId=37&fullscreen

Jul 18 2019, 3:34 PM · Performance-Team, Wikimedia-Rdbms, Wikimedia-production-error
jcrespo added a comment to T228436: web request timeout after 200 seconds due to Wikimedia\Rdbms\LBFactory->__destruct() > Wikimedia\Rdbms\LBFactory->commitMasterChanges().

I can see an abnormal number of transactions on the enwiki master in Sleep state with 195+ seconds of connection time. This is not normal. The errors may be the watchdog killing connections to prevent a larger issue.
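
As a rough sketch of how such connections could be picked out, the filter below keeps rows that are in the Sleep command with a Time value at or above a threshold. The row shape (dicts keyed like the columns of SHOW PROCESSLIST), the function name and the sample data are assumptions for illustration; it is not the check that was actually run.

```python
def long_sleepers(processlist_rows, min_seconds=195):
    """Return the rows whose Command is 'Sleep' and whose Time is >= min_seconds.

    Each row is assumed to be a dict with at least 'Command' and 'Time'
    keys, mirroring the columns of SHOW PROCESSLIST.
    """
    return [
        row
        for row in processlist_rows
        if row.get("Command") == "Sleep"
        and int(row.get("Time") or 0) >= min_seconds
    ]


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = [
        {"Id": 1, "Command": "Query", "Time": 300},
        {"Id": 2, "Command": "Sleep", "Time": 197},
        {"Id": 3, "Command": "Sleep", "Time": 3},
    ]
    for row in long_sleepers(sample):
        print(row["Id"], row["Time"])
```

The default threshold of 195 seconds only mirrors the figure mentioned in the comment above; a different cut-off may be more useful in practice.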

Jul 18 2019, 3:30 PM · Performance-Team, Wikimedia-Rdbms, Wikimedia-production-error
jcrespo added a comment to T228436: web request timeout after 200 seconds due to Wikimedia\Rdbms\LBFactory->__destruct() > Wikimedia\Rdbms\LBFactory->commitMasterChanges().

I also received reports of high "lag" between edits and their appearance in recentchanges on at least cswiki, but the databases had virtually no lag at the time. This could be related: transactions taking too long to commit.

Jul 18 2019, 3:26 PM · Performance-Team, Wikimedia-Rdbms, Wikimedia-production-error
jcrespo added a comment to T228360: Narrow scope of MediaWiki-Database workboard.

I personally don't have any opinion about Wikimedia-Rdbms: I don't interact with it, need it, or subscribe to it. It is nice that we are pinged here because of the obvious connection with DBA (thanks!), but I will let any users or potential users decide its usefulness (redefine it, rename it, reorganize it, delete it, merge it, etc.). I believe it was useful in the past because there was no official owner of the RDBMS, but I think that is no longer the case.

Jul 18 2019, 2:43 PM · Project-Admins, Performance-Team, Core Platform Team

Jul 17 2019

jcrespo added a comment to T225131: (OoW) Degraded RAID on es2003.

SAS HD disks of 1.819 TB.

Jul 17 2019, 4:41 PM · Operations, ops-codfw
jcrespo added a comment to T227829: Degraded RAID on db2044.

There are no spare used disks.

Jul 17 2019, 8:16 AM · Operations, ops-codfw
jcrespo added a comment to T227829: Degraded RAID on db2044.

@Marostegui Double checking, should we replace this or is it being decommed now?

Jul 17 2019, 7:54 AM · Operations, ops-codfw

Jul 16 2019

jcrespo added a comment to T198939: Decommission servermon.

Could you also confirm that all puppet grants on the puppet database (the mysql database is understood, of course) are no longer needed? You can find them in the misc production grants.

Jul 16 2019, 12:28 PM · Patch-For-Review, Operations
jcrespo awarded T169440: Pending global renames in need of sysadmin supervision (tracking) a Party Time token.
Jul 16 2019, 11:25 AM · GlobalRename, MediaWiki-extensions-CentralAuth, Tracking-Neverending
jcrespo added a comment to T198939: Decommission servermon.

Also the passwords have to be removed from the private repo (and possibly from labs/private).

Jul 16 2019, 8:40 AM · Patch-For-Review, Operations
jcrespo added a comment to T198939: Decommission servermon.

What about the puppet database on m1?

Jul 16 2019, 8:37 AM · Patch-For-Review, Operations
jcrespo added a comment to T225713: CPU scaling governor audit.

FYI, after applying the above change, I expected a huge shift in reported load (even if performance didn't change) or in temperatures, given that these (wikireplicas on labs) are our busiest databases in terms of CPU resources, due to long-running queries. However, unlike other reporters, I didn't see much difference: only in the temperatures of labsdb1011, and none in the load or the temperatures of the others. Maybe CPU was already a problem in scaling for database load, or something else? https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=1563087571551&to=1563260371552&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&var-instance=labsdb1009&var-instance=labsdb1010&var-instance=labsdb1011

Jul 16 2019, 7:02 AM · User-fgiunchedi, Operations

Jul 15 2019

jcrespo added a comment to T228089: Logstash down for MediaWiki.

Once the backlog is processed (https://grafana.wikimedia.org/d/000000102/production-logging?refresh=5m&panelId=8&fullscreen&orgId=1), this can be lowered to high, but something should be put in place to prevent another logs outage, even if it is rough, such as an alert to identify it and a runbook to drop a source of logs like above.

Jul 15 2019, 8:12 PM · Wikimedia-Incident, observability, Operations, Wikimedia-Logstash
jcrespo added a comment to T228089: Logstash down for MediaWiki.

I never talked about this issue, and had no idea why @Urbanecm thought I was talking about this while I was having a private conversation with another person.

Jul 15 2019, 6:48 PM · Wikimedia-Incident, observability, Operations, Wikimedia-Logstash
jcrespo updated the task description for T225713: CPU scaling governor audit.
Jul 15 2019, 4:44 PM · User-fgiunchedi, Operations
jcrespo added a comment to T196055: Remove table `math` from the database.

So I have left a dump of the math table from public wikis at https://people.wikimedia.org/~jynus/math.tar and an example of the format of the tar at https://people.wikimedia.org/~jynus/aawiki-math.sql (I have also backed up the private and closed ones for the short term). What would be the next steps: comparing the hashes with existing images?

Jul 15 2019, 4:03 PM · Patch-For-Review, DBA, Math
jcrespo removed a project from T227838: Obsessive serverIsReadOnly() checking in MySQL: DBA.

I will remove DBA but remain subscribed, as everything points to this being a decision about MediaWiki optimization for third-party usage, not WMF.

Jul 15 2019, 7:57 AM · Patch-For-Review, Performance-Team, Wikimedia-Rdbms
jcrespo added a comment to T194125: [RFC] Future of charset and collation for mediawiki on mysql .

Because T135969 has spilled over here: I can see the annoyance of someone using a "bad" encoding, and I will try to be a bit conciliatory. However, I don't see any way to solve that other than documenting (maybe it was done already):

Jul 15 2019, 7:49 AM · MediaWiki-Installer, MediaWiki-General, Core Platform Team (Needs Cleaning - Security, stability, performance, and scalability (TEC1)), Wikimedia-Rdbms

Jul 12 2019

jcrespo moved T196055: Remove table `math` from the database from Blocked external/Not db team to In progress on the DBA board.
Jul 12 2019, 2:15 PM · Patch-For-Review, DBA, Math
jcrespo added a comment to T196055: Remove table `math` from the database.

Thanks, that is great info, I will come back with a list of images and/or a dump at the beginning of next week.

Jul 12 2019, 2:15 PM · Patch-For-Review, DBA, Math
jcrespo added a comment to T196055: Remove table `math` from the database.

See, we had a bit of a communication loss here, but we may be able to move forward :-D Each person only has a limited amount of information. I will have a look and do some tests, and (if you can) I may ask additional questions. Asking for help is always ok, and we are here to help. :-) Thanks for yours! We are the people most interested in cleaning up the existing tables.

Jul 12 2019, 2:11 PM · Patch-For-Review, DBA, Math
jcrespo updated the task description for T196055: Remove table `math` from the database.
Jul 12 2019, 2:09 PM · Patch-For-Review, DBA, Math
jcrespo reopened T196055: Remove table `math` from the database, a subtask of T54921: Database tables to be dropped on Wikimedia wikis and other WMF databases (tracking), as Open.
Jul 12 2019, 2:07 PM · Epic, DBA, Tracking-Neverending
jcrespo reopened T196055: Remove table `math` from the database, a subtask of T195847: Clean up artifacts from LaTeX based math rendering, as Open.
Jul 12 2019, 2:07 PM · Operations, Math
jcrespo reopened T196055: Remove table `math` from the database as "Open".

Thanks. If you allow me, I will copy that and reopen so this is still tracked. And maybe I can check or delete on a low-traffic db and see what the results are. If you have time, you can help us with your knowledge and my access to figure it out :-)

Jul 12 2019, 2:07 PM · Patch-For-Review, DBA, Math
jcrespo added a comment to T196055: Remove table `math` from the database.

Sorry, but I may not be understanding your comments. Is the extension on WMF ready to have them deleted? If yes, please say so ("please delete ASAP") in the summary. Deleting things carefully should take less than a week. If you don't have time to do the preparation needed, or there is a blocker, I am ok with closing it, although I would also mention the pending work in the summary, because maybe someone else can do it in the future. :-D

Jul 12 2019, 1:58 PM · Patch-For-Review, DBA, Math
jcrespo added a comment to T226704: Setup es4 and es5 replica sets for new read-write external store service.

Thanks, that was probably it. Thanks for clarifying it!

Jul 12 2019, 1:54 PM · Epic, DBA
jcrespo added a comment to T196055: Remove table `math` from the database.

I don't know what I could do to make the task more ready. I will just close the task. Maybe it's best to keep the table forever.

Jul 12 2019, 1:53 PM · Patch-For-Review, DBA, Math
jcrespo added a comment to T196055: Remove table `math` from the database.

Why decline it? We are waiting for the blocking work before owning this, but nobody seems to be progressing on it.

Jul 12 2019, 1:49 PM · Patch-For-Review, DBA, Math
jcrespo added a comment to T227862: (OoW) db2045 failed battery.
root@db1115.eqiad.wmnet[zarcillo]> update masters set instance='db2069' where section='x1' and dc='codfw';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
Jul 12 2019, 11:16 AM · ops-codfw, DBA, Operations
jcrespo added a comment to T227862: (OoW) db2045 failed battery.

Everything went well except:

Updating tendril...
[WARNING] Old master not found on tendril server list
Updating zarcillo...
[WARNING] Old master not found on zarcillo master list

And here is probably the issue:

Jul 12 2019, 11:13 AM · ops-codfw, DBA, Operations
jcrespo moved T227862: (OoW) db2045 failed battery from Triage to In progress on the DBA board.
Jul 12 2019, 11:11 AM · ops-codfw, DBA, Operations
jcrespo added a comment to T227838: Obsessive serverIsReadOnly() checking in MySQL.

Based on your feedback, my guess is that this effect happens because you are reading from the "master" (there is no other host, really). It probably doesn't happen on WMF production, where reads are from replicas that are already read only, and lag is (I believe) cached. I am guessing the lack of caching infrastructure plus the single-master topology is causing this. That doesn't mean it is not an issue; there is probably a way to optimize this, but I will let others comment on if/how, as I am more familiar with the WMF use case than with MediaWiki in general.

Jul 12 2019, 11:02 AM · Patch-For-Review, Performance-Team, Wikimedia-Rdbms
jcrespo added a comment to T152080: Frequent duplicate key errors by page assessments.

I saw a few occurrences (5) of this on trwiki: https://logstash.wikimedia.org/goto/d9c3d20188039cd82d1d9367270f842c

Jul 12 2019, 10:53 AM · Community-Tech, MW-1.29-release (WMF-deploy-2017-01-03_(1.29.0-wmf.7)), Patch-For-Review, Wikimedia-production-error, MediaWiki-extensions-PageAssessments
jcrespo triaged T227862: (OoW) db2045 failed battery as Normal priority.
Jul 12 2019, 10:27 AM · ops-codfw, DBA, Operations
jcrespo added a comment to T227862: (OoW) db2045 failed battery.

Everything went well except:

Updating tendril...
[WARNING] Old master not found on tendril server list
Updating zarcillo...
[WARNING] Old master not found on zarcillo master list
Jul 12 2019, 10:27 AM · ops-codfw, DBA, Operations
jcrespo committed rOSMDa2c2eaec25fe: switchover.py: Fix small formatting bug when printing ROW format (authored by jcrespo).
switchover.py: Fix small formatting bug when printing ROW format
Jul 12 2019, 10:26 AM
jcrespo added a comment to T227862: (OoW) db2045 failed battery.

Based on https://noc.wikimedia.org/conf/highlight.php?file=db-codfw.php&1 and T184888 I will switchover codfw master to db2069.

Jul 12 2019, 10:14 AM · ops-codfw, DBA, Operations
jcrespo added a comment to T227862: (OoW) db2045 failed battery.

I will try to force a relearn process/reboot, in case that works.

Jul 12 2019, 10:06 AM · ops-codfw, DBA, Operations
jcrespo created T227862: (OoW) db2045 failed battery.
Jul 12 2019, 10:06 AM · ops-codfw, DBA, Operations
jcrespo added a comment to T227717: Drop DB tables for now-deleted zerowiki from production.

I'd say it's safe to store a backup

Jul 12 2019, 7:21 AM · Release-Engineering-Team-TODO, DBA, Product-Infrastructure-Team-Backlog
jcrespo added a subtask for T227829: Degraded RAID on db2044: Unknown Object (Task).
Jul 12 2019, 7:11 AM · Operations, ops-codfw
jcrespo added a comment to T227829: Degraded RAID on db2044.

@Papaul We will ask you to replace a disk here from T226406, when they arrive.

Jul 12 2019, 7:11 AM · Operations, ops-codfw
jcrespo added a comment to T149077: Certain ApiQueryRecentChanges::run api query is too slow, slowing down dewiki.

@Anomie I strongly support closing old tasks like this that can no longer be reproduced; too much changes in the code and infrastructure for them to stay relevant in an open status. I believe MediaWiki improvements, MySQL status improvements (e.g. ANALYZE + reboot, SSDs) and MariaDB code improvements may have left many of these obsolete.

Jul 12 2019, 7:05 AM · Core Platform Team Workboards (Clinic Duty Team), Core Platform Team (Needs Cleaning - Security, stability, performance, and scalability (TEC1)), Wikimedia-production-error, MediaWiki-API, DBA
jcrespo added a comment to T222050: db1107 (eventlogging db master) possibly memory issues.

Chris, you will need to coordinate with @elukey principally, as he is the person in touch directly with users affected to agree on a date.

Jul 12 2019, 7:00 AM · Analytics, Operations, ops-eqiad, Analytics-EventLogging, DBA
jcrespo added a comment to T226704: Setup es4 and es5 replica sets for new read-write external store service.

If I recall correctly, the maintenance was along the lines of running MediaWiki's recompression scripts and other similar scripts, which would first require making sure the scripts still work right.

Jul 12 2019, 6:58 AM · Epic, DBA
jcrespo added a comment to T71222: list=logevents slow for users with last log action long time ago.

@Anomie it is probably the old decision of MediaWiki vs Wikimedia. I don't have visibility into the impact outside of WMF, but I would suggest setting up a lower-priority, generic task to review optimizer hints and document them or remove the unnecessary ones, starting with this one.

Jul 12 2019, 6:54 AM · User-Marostegui, MW-1.33-notes (1.33.0-wmf.22; 2019-03-19), DBA, Performance, MediaWiki-API
jcrespo added a comment to T227739: Contention on User::getActorId ?.

It is ok to close it if it is a duplicate, or if you think it is unlikely to happen again or is a very rare occurrence. I just report when I see something out of the ordinary in the logs, FYI, but I lack the knowledge for a deep analysis.

Jul 12 2019, 6:49 AM · Wikimedia-database-error, Core Platform Team Workboards (Clinic Duty Team), Wikimedia-production-error, MediaWiki-User-management
jcrespo added a comment to T227838: Obsessive serverIsReadOnly() checking in MySQL.

May I ask where you tested this and, if it was on your own installation, for more data about it (version, topology, configuration, etc.)? In any case, how did you profile the queries executed (just set up debug for all queries)? Also, please point us to the code entry for that function (is it using a master or a replica to perform the reads?).

Jul 12 2019, 6:46 AM · Patch-For-Review, Performance-Team, Wikimedia-Rdbms
jcrespo updated the task description for T208323: Predictive failures on disk S.M.A.R.T. status.
Jul 12 2019, 6:37 AM · Operations, DBA
jcrespo added a comment to T227829: Degraded RAID on db2044.

See also recent T217755

Jul 12 2019, 6:36 AM · Operations, ops-codfw

Jul 11 2019

jcrespo edited projects for T88084: Using both rvuser and rvcontinue with prop=revisions causes database error on pages with a lot of revisions, added: Wikimedia-Rdbms; removed Performance.

I think your classification was already right; I was proposing to add a new yellow one on top, as work on those normally requires a combination of performance, DBAs and either core or another team on the product side, depending on the code module. The reasoning is that it would help avoid duplicate reports, in the same spirit as Wikimedia-production-error. It was just a suggestion; I understand if you consider it may not be useful.

Jul 11 2019, 4:27 PM · Wikimedia-database-error, MediaWiki-API
jcrespo added a comment to T88084: Using both rvuser and rvcontinue with prop=revisions causes database error on pages with a lot of revisions.

@Krinkle It took me some time to understand your comment and classification. Wouldn't it be nice to have a specific tag such as #query-performance, #wikimedia-query-performance or #slow-database-queries for discovery reasons (e.g. finding duplicates and previous examples)? What do you think?

Jul 11 2019, 3:38 PM · Wikimedia-database-error, MediaWiki-API
jcrespo triaged T227717: Drop DB tables for now-deleted zerowiki from production as Low priority.

Needs some research to see if it is safe. Low priority because it shouldn't block any other task.

Jul 11 2019, 3:34 PM · Release-Engineering-Team-TODO, DBA, Product-Infrastructure-Team-Backlog
jcrespo added a comment to T222224: Normalize MediaWiki link tables.

Manuel is on vacation ATM. I am glad to answer any questions, although DBAs need more concrete ones (e.g. we can answer how much space and IOPS would be saved for a particular wiki or table), as costs such as development time would be better calculated by the people involved in the Wikimedia-Rdbms code bits.

Jul 11 2019, 2:17 PM · Core Platform Team, MediaWiki-Page-derived-data, Schema-change, Patch-For-Review, TechCom-RFC
jcrespo updated the task description for T143896: MySQL metrics monitoring.
Jul 11 2019, 11:33 AM · observability, DBA, Patch-For-Review, Operations, Prometheus-metrics-monitoring
jcrespo added a comment to T143896: MySQL metrics monitoring.
root@prometheus2003:/srv/prometheus/ops/targets$ ls -la mysql-*
-r--r--r-- 1 root       root  2592 Jul 11 11:27 mysql-core_codfw.yaml
-r--r--r-- 1 root       root   612 Jul 11 11:27 mysql-dbstore_codfw.yaml
-r--r--r-- 1 root       root   544 Jul 10 10:57 mysql-labs_codfw.yaml
-rw-r--r-- 1 root       root   544 Jul 10 10:48 mysql-labsdb_codfw.yaml
-r--r--r-- 1 root       root   621 Jul 11 11:27 mysql-misc_codfw.yaml
-r--r--r-- 1 root       root   275 Jul 11 11:27 mysql-parsercache_codfw.yaml
root@prometheus2003:/srv/prometheus/ops/targets$ date
Thu Jul 11 11:29:19 UTC 2019
root@prometheus2003:/srv/prometheus/ops/targets$ run-puppet-agent 
Warning: Downgrading to PSON for future requests
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for prometheus2003.codfw.wmnet
Info: Applying configuration version '1562844569'
Notice: /Stage[main]/Profile::Prometheus::Ops_mysql/Exec[generate-mysqld-exporter-config]/returns: executed successfully
Notice: Applied catalog in 18.99 seconds
root@prometheus2003:/srv/prometheus/ops/targets$ ls -la mysql-*
-r--r--r-- 1 root root 2592 Jul 11 11:27 mysql-core_codfw.yaml
-r--r--r-- 1 root root  612 Jul 11 11:27 mysql-dbstore_codfw.yaml
-r--r--r-- 1 root root  544 Jul 10 10:57 mysql-labs_codfw.yaml
-rw-r--r-- 1 root root  544 Jul 10 10:48 mysql-labsdb_codfw.yaml
-r--r--r-- 1 root root  621 Jul 11 11:27 mysql-misc_codfw.yaml
-r--r--r-- 1 root root  275 Jul 11 11:27 mysql-parsercache_codfw.yaml
Jul 11 2019, 11:32 AM · observability, DBA, Patch-For-Review, Operations, Prometheus-metrics-monitoring
jcrespo added a reverting change for rLPRIc73947e0daaf: Revert "prometheus: move prometheus secrets back to the original role": rLPRI224100e43026: Revert "Revert "prometheus: move prometheus secrets back to the original role"".
Jul 11 2019, 11:17 AM
jcrespo committed rLPRI224100e43026: Revert "Revert "prometheus: move prometheus secrets back to the original role"" (authored by jcrespo).
Revert "Revert "prometheus: move prometheus secrets back to the original role""
Jul 11 2019, 11:17 AM
jcrespo added a reverting change for rLPRI0cc83bae3ad3: prometheus: move prometheus secrets back to the original role: rLPRIc73947e0daaf: Revert "prometheus: move prometheus secrets back to the original role".
Jul 11 2019, 11:16 AM
jcrespo committed rLPRIc73947e0daaf: Revert "prometheus: move prometheus secrets back to the original role" (authored by jcrespo).
Revert "prometheus: move prometheus secrets back to the original role"
Jul 11 2019, 11:16 AM
jcrespo committed rLPRI0cc83bae3ad3: prometheus: move prometheus secrets back to the original role (authored by jcrespo).
prometheus: move prometheus secrets back to the original role
Jul 11 2019, 9:25 AM
jcrespo created T227739: Contention on User::getActorId ?.
Jul 11 2019, 7:50 AM · Wikimedia-database-error, Core Platform Team Workboards (Clinic Duty Team), Wikimedia-production-error, MediaWiki-User-management
jcrespo committed rLPRI6aa78168423c: prometheus-mysqld-exporter: move variable to profile (authored by jcrespo).
prometheus-mysqld-exporter: move variable to profile
Jul 11 2019, 7:35 AM

Jul 10 2019

jcrespo committed rLPRI3e95207086fa: prometheus: Add fake prometheus labs password (authored by jcrespo).
prometheus: Add fake prometheus labs password
Jul 10 2019, 9:44 AM
jcrespo added a comment to T68025: [Story] Monitor size of some Wikidata database tables.

We already have the sizes of all uncompressed and compressed tables on zarcillo, and those are planned to be shown in a dashboard. The reason they are not more public is that security told us not to put them on public prometheus, as they could compromise the anonymity of certain users on smaller wikis. Please talk to security before doing it. Please talk to us DBAs before reimplementing an existing feature.

Jul 10 2019, 8:55 AM · Patch-For-Review, WMDE-Analytics-Engineering, DBA, Story, Wikidata, Wikidata.org

Jul 9 2019

jcrespo added a comment to T226952: Failover m2 master db1065 to db1132.
$ ./replication_tree.py db1065
db1065, version: 10.1.33, up: 1y, RO: OFF, binlog: MIXED, lag: None, processes: None, latency: 0.0991
+ db1117:3322, version: 10.1.39, up: 32d, RO: ON, binlog: MIXED, lag: 0, processes: 15, latency: 0.0423
+ db1132, version: 10.1.39, up: 14h, RO: ON, binlog: MIXED, lag: 0, processes: 16, latency: 0.0416
+ db2044, version: 10.1.39, up: 4d, RO: ON, binlog: MIXED, lag: 0, processes: None, latency: 0.0046
  + db2078:3322, version: 10.1.39, up: 47d, RO: ON, binlog: MIXED, lag: 0, processes: 14, latency: 0.0056
Jul 9 2019, 5:21 AM · SRE-tools, OTRS, Recommendation-API, Operations, DBA

Jul 8 2019

jcrespo removed a project from T225378: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08: ops-codfw.
Jul 8 2019, 3:40 PM · Operations, DBA
jcrespo reopened T225378: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08, a subtask of T206203: Implement database binary backups into the production infrastructure, as Open.
Jul 8 2019, 3:39 PM · Goal, DBA
jcrespo reopened T225378: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 as "Open".
Mem:         515690

HW seems to be fixed, owning for the followup (software) steps.

Jul 8 2019, 3:39 PM · Operations, DBA
jcrespo created P8723 switchover process.
Jul 8 2019, 3:09 PM

Jul 4 2019

jcrespo created P8709 (An Untitled Masterwork).
Jul 4 2019, 1:58 PM
jcrespo committed rOSMD23173d6419e6: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Add 2 simple scripts: move_replica.py and stop_in_sync.py
Jul 4 2019, 12:14 PM

Jul 3 2019

Restricted Application added a project to T227197: [BUG] Reading Lists Not Syncing — 11 lists synced: Product-Infrastructure-Team-Backlog.

Just to discard any kind of database anomaly, I ran:

Jul 3 2019, 3:36 PM · Product-Infrastructure-Team-Backlog (Kanban), Reading List Service, iOS-app-Bugs, Wikipedia-iOS-App-Backlog
jcrespo committed rOSMD9df778c4ca37: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Add 2 simple scripts: move_replica.py and stop_in_sync.py
Jul 3 2019, 3:16 PM
jcrespo committed rOSMD0e577bd04f3a: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Add 2 simple scripts: move_replica.py and stop_in_sync.py
Jul 3 2019, 3:14 PM
jcrespo committed rOSMD8bf12d499727: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Add 2 simple scripts: move_replica.py and stop_in_sync.py
Jul 3 2019, 2:10 PM
jcrespo committed rOSMDdd810002a640: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Add 2 simple scripts: move_replica.py and stop_in_sync.py
Jul 3 2019, 2:01 PM
jcrespo committed rOSMD1657b539100d: Add 2 simple scripts: move_replica.py and stop_in_sync.py (authored by jcrespo).
Add 2 simple scripts: move_replica.py and stop_in_sync.py
Jul 3 2019, 2:01 PM
jcrespo reassigned T225378: db2097 (codfw s1&s6 source backups) mariadb@s6 *process* (10.1.39) crashed on 2019-06-08 from jcrespo to Papaul.

A memory stick of db2097 is literally broken:

Jul 3 2019, 1:44 PM · Operations, DBA