jcrespo (Jaime Crespo)
Sr Database Administrator

User Details

User Since
May 11 2015, 8:31 AM (166 w, 3 d)
Availability
Available
IRC Nick
jynus
LDAP User
Jcrespo
MediaWiki User
JCrespo (WMF) [ Global Accounts ]

Recent Activity

Yesterday

jcrespo updated the task description for T200039: db1067 /srv usage is at 82%.
Thu, Jul 19, 6:57 PM · DBA
jcrespo created T200039: db1067 /srv usage is at 82%.
Thu, Jul 19, 6:51 PM · DBA
jcrespo committed rOSMDcf5c16d2271a: transfer.py: Make checksum optional (authored by jcrespo).
transfer.py: Make checksum optional
Thu, Jul 19, 6:12 PM
jcrespo committed rOSMD65c0c20d60ef: [WIP] Add replication managing (authored by jcrespo).
[WIP] Add replication managing
Thu, Jul 19, 6:12 PM
jcrespo added a comment to T200035: DB backup restore skip empty databases.

While the software is generically known as mydumper, mydumper itself does dump the empty databases; it is the myloader command that skips them, as it iterates only over existing lower-level objects.

Thu, Jul 19, 6:02 PM · Upstream, DBA
jcrespo updated the task description for T200035: DB backup restore skip empty databases.
Thu, Jul 19, 6:01 PM · Upstream, DBA
jcrespo added a project to T200035: DB backup restore skip empty databases: Upstream.
Thu, Jul 19, 6:00 PM · Upstream, DBA
jcrespo added a comment to T187980: Memcached error "A TIMEOUT OCCURRED" for keys.

some big problem on the 11th

Thu, Jul 19, 3:19 PM · Core-Platform-Team, Performance-Team (Radar), Wikimedia-log-errors, MediaWiki-Cache
jcrespo added a comment to T45647: Sometimes (at peak usage?), dumps.wikimedia.org becomes very slow for users (sometimes unresponsive).

Is anyone still suffering from this issue? If not, it should be closed.

Thu, Jul 19, 3:10 PM · Operations, Datasets-General-or-Unknown
jcrespo added a comment to T195253: Special:Notifications gives a consistent PHP exception on load ("The trash icon is not registered") for users with OpenStackManager notifications.

Is it not fixed for you?

Thu, Jul 19, 7:43 AM · MW-1.32-release-notes (WMF-deploy-2018-06-26 (1.32.0-wmf.10)), Growth-Team, Collaboration-Team-Triage (Collab-Team-This-Quarter), MediaWiki-extensions-OpenStackManager, wikitech.wikimedia.org, Wikimedia-log-errors, Notifications
jcrespo added a comment to T198176: Mediawiki page deletions should happen in batches of revisions.

@tstarling Thank you for your evaluation, it is very useful! My one comment is that, even if the jobqueue is to take care of it, the user would expect some immediate feedback to see a logical delete (not sure how true that is; maybe a UI message would be enough), and that may still have some storage design implications, even if not the full T20493. I guess it could be done without it, with a "Page X is scheduled for deletion, please be patient" message, and the user would be able to see revisions disappearing. Also, this may be so unusual that it may not be necessary.

Thu, Jul 19, 7:35 AM · MediaWiki-Page-deletion
jcrespo added a comment to T195253: Special:Notifications gives a consistent PHP exception on load ("The trash icon is not registered") for users with OpenStackManager notifications.

I am guessing this has been released, but wikitech needs to be upgraded still?

Thu, Jul 19, 7:24 AM · MW-1.32-release-notes (WMF-deploy-2018-06-26 (1.32.0-wmf.10)), Growth-Team, Collaboration-Team-Triage (Collab-Team-This-Quarter), MediaWiki-extensions-OpenStackManager, wikitech.wikimedia.org, Wikimedia-log-errors, Notifications

Wed, Jul 18

jcrespo removed a project from T199812: Update wikimediafoundation.org to foundation.wikimedia.org across numerous repos: monitoring.

Actually, monitoring is correct (not sure about mobile), it is the text on the websites that needs update.

Wed, Jul 18, 1:01 PM · MW-1.32-release-notes (WMF-deploy-2018-07-24 (1.32.0-wmf.14)), Patch-For-Review, MediaWiki-extensions-General, Wikimedia-General-or-Unknown
jcrespo added a project to T199812: Update wikimediafoundation.org to foundation.wikimedia.org across numerous repos: monitoring.

Also on monitoring:

[14:42] <icinga-wm> PROBLEM - Ensure legal html en.wb on en.wikibooks.org is CRITICAL: additional\sterms\smay\sapply\. By\susing\sthis\ssite,\syou\sagree\sto\sthe a\shref=(https:)?\/\/foundation\.wikimedia\.org\/wiki\/Terms_of_UseTerms\sof\sUse/a html not found
[14:43] <icinga-wm> PROBLEM - Ensure legal html en.wp on en.wikipedia.org is CRITICAL: additional\sterms\smay\sapply\. By\susing\sthis\ssite,\syou\sagree\sto\sthe a\shref=(https:)?\/\/foundation\.wikimedia\.org\/wiki\/Terms_of_UseTerms\sof\sUse/a html not found
Wed, Jul 18, 12:51 PM · MW-1.32-release-notes (WMF-deploy-2018-07-24 (1.32.0-wmf.14)), Patch-For-Review, MediaWiki-extensions-General, Wikimedia-General-or-Unknown
jcrespo added a comment to T197073: switchover es1014 to es1017.

https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1797449&oldid=1797439

Wed, Jul 18, 9:05 AM · Patch-For-Review, DBA
jcrespo added a comment to T183585: Rack/cable/configure asw2-b-eqiad switch stack.

As an addendum to T183585#4427995, because of T180918, we need to depool, ahead of the maintenance, the other replica dbs, too, but that should be trivial to do.

Wed, Jul 18, 7:18 AM · cloud-services-team, Cloud-VPS, ops-eqiad, Operations
jcrespo added a comment to T197073: switchover es1014 to es1017.

What do you think of doing this next Wednesday?

Wed, Jul 18, 7:16 AM · Patch-For-Review, DBA
jcrespo awarded T197069: Failover db1052 (s1) db primary master a 100 token.
Wed, Jul 18, 6:43 AM · Patch-For-Review, DBA
jcrespo added a comment to T199861: Decommission db1052.

I thought a bit about how to go over this, and given the importance and history of this host, this would be one proposal, see what you think about it:

Wed, Jul 18, 6:39 AM · Patch-For-Review, DBA

Tue, Jul 17

jcrespo added a comment to T198987: Gather statistics about the backups on a database.

What's the file_date vs backup_date? backup_date is when the backup started and file_date when the file was last modified on the filesystem?

Tue, Jul 17, 1:11 PM · Patch-For-Review, DBA
jcrespo added a comment to T198987: Gather statistics about the backups on a database.

What files do we have related to the potential recovery of enwiki.categorylinks?

root@db1115.eqiad.wmnet[zarcillo]> select backups.id, backups.source, backup_files.file_name, backup_files.size, backup_files.file_date, backups.creation_date as backup_date FROM backup_objects JOIN backup_files ON backup_objects.id = backup_files.backup_object_id JOIN backups ON backup_objects.backup_id = backups.id WHERE backup_objects.db = 'enwiki' and backup_objects.name='categorylinks';
+----+------------------+------------------------------------+-----------+---------------------+---------------------+
| id | source           | file_name                          | size      | file_date           | backup_date         |
+----+------------------+------------------------------------+-----------+---------------------+---------------------+
|  1 | dbstore1001:3311 | enwiki.categorylinks-schema.sql.gz |       377 | 2018-07-03 19:49:03 | 2018-07-03 17:57:43 |
|  1 | dbstore1001:3311 | enwiki.categorylinks.00000.sql.gz  | 557821984 | 2018-07-03 18:22:12 | 2018-07-03 17:57:43 |
|  1 | dbstore1001:3311 | enwiki.categorylinks.00001.sql.gz  | 409172922 | 2018-07-03 18:15:06 | 2018-07-03 17:57:43 |
|  1 | dbstore1001:3311 | enwiki.categorylinks.00002.sql.gz  | 383738072 | 2018-07-03 18:14:13 | 2018-07-03 17:57:43 |
|  1 | dbstore1001:3311 | enwiki.categorylinks.00003.sql.gz  | 375765813 | 2018-07-03 18:14:51 | 2018-07-03 17:57:43 |
|  1 | dbstore1001:3311 | enwiki.categorylinks.00004.sql.gz  | 352323341 | 2018-07-03 18:15:00 | 2018-07-03 17:57:43 |
|  1 | dbstore1001:3311 | enwiki.categorylinks.00005.sql.gz  | 336266280 | 2018-07-03 18:17:21 | 2018-07-03 17:57:43 |
|  2 | dbstore2002:3311 | enwiki.categorylinks-schema.sql.gz |       389 | 2018-07-04 03:52:37 | 2018-07-04 01:24:18 |
|  2 | dbstore2002:3311 | enwiki.categorylinks.00000.sql.gz  | 647009518 | 2018-07-04 02:03:46 | 2018-07-04 01:24:18 |
|  2 | dbstore2002:3311 | enwiki.categorylinks.00001.sql.gz  | 474168689 | 2018-07-04 01:53:37 | 2018-07-04 01:24:18 |
|  2 | dbstore2002:3311 | enwiki.categorylinks.00002.sql.gz  | 455466328 | 2018-07-04 01:52:35 | 2018-07-04 01:24:18 |
|  2 | dbstore2002:3311 | enwiki.categorylinks.00003.sql.gz  | 429877847 | 2018-07-04 01:52:20 | 2018-07-04 01:24:18 |
|  2 | dbstore2002:3311 | enwiki.categorylinks.00004.sql.gz  | 408667555 | 2018-07-04 01:51:54 | 2018-07-04 01:24:18 |
+----+------------------+------------------------------------+-----------+---------------------+---------------------+
13 rows in set (0.01 sec)
Tue, Jul 17, 11:31 AM · Patch-For-Review, DBA
jcrespo added a comment to T191199: Page allocation stalls on scb1001, scb1002.

I believe this, or something similar involving memory-related stalls, happened on scb2006.

Tue, Jul 17, 8:51 AM · SCB, Services (watching), Operations
jcrespo added a comment to T194403: Wikimedia\Rdbms\ChronologyProtector::initPositions: expected but failed to find position index..

Since Jul 11 this is happening very rarely (from 1100 per day to ~8 per day).

Tue, Jul 17, 7:55 AM · MW-1.32-release-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Release-Engineering-Team (Watching / External), Performance-Team, MediaWiki-Database, Wikimedia-log-errors

Mon, Jul 16

jcrespo added a comment to T197134: Announce 30 minutes read-only time for enwiki 18th July 06:00AM UTC.

Can you confirm this?

Mon, Jul 16, 5:36 PM · CommRel-Specialists-Support (Jul-Sep-2018), User-Johan

Sat, Jul 14

jcrespo added a comment to T199614: dbstore1002 MySQL crashed and got restarted.

I had already commented at https://phabricator.wikimedia.org/T198174#4425077

Sat, Jul 14, 12:02 PM · Analytics

Fri, Jul 13

jcrespo closed T199518: Undocumented grants on striker from californium as Resolved.

Done:

root@db1073.eqiad.wmnet[(none)]> select user, host from mysql.user WHERE host='208.80.154.147';
Empty set (0.00 sec)
Fri, Jul 13, 6:34 PM · Striker
jcrespo added a comment to T192092: setup replacements for maintenance_server (terbium, wasat) on Stretch.

No more grants on m5 referencing 10.64.32.13 (terbium):

$ ./software/dbtools/section m5 | while read host port; do mysql.py -BN -h$host:$port -e "select user, host from mysql.user WHERE host='10.64.32.13';"; done
Fri, Jul 13, 9:25 AM · Patch-For-Review, Operations
jcrespo added a comment to T192092: setup replacements for maintenance_server (terbium, wasat) on Stretch.

I have created T199518.

Fri, Jul 13, 9:21 AM · Patch-For-Review, Operations
jcrespo created T199518: Undocumented grants on striker from californium.
Fri, Jul 13, 9:21 AM · Striker
jcrespo updated subscribers of T192092: setup replacements for maintenance_server (terbium, wasat) on Stretch.

There is an undocumented grant from californium.wikimedia.org to striker @bd808 - I will delete it if it is not puppetized. I will create a separate ticket if this is offtopic here.

Fri, Jul 13, 9:12 AM · Patch-For-Review, Operations
jcrespo added a subtask for T177782: Reduce false positives on database pages: T197126: Create tool to handle the state of database configuration in MediaWiki in etcd.
Fri, Jul 13, 8:43 AM · Patch-For-Review, Epic, Wikimedia-Incident, monitoring, DBA
jcrespo added a parent task for T197126: Create tool to handle the state of database configuration in MediaWiki in etcd: T177782: Reduce false positives on database pages.
Fri, Jul 13, 8:43 AM · Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA
jcrespo removed a parent task for T177782: Reduce false positives on database pages: T197126: Create tool to handle the state of database configuration in MediaWiki in etcd.
Fri, Jul 13, 8:43 AM · Patch-For-Review, Epic, Wikimedia-Incident, monitoring, DBA
jcrespo removed a subtask for T197126: Create tool to handle the state of database configuration in MediaWiki in etcd: T177782: Reduce false positives on database pages.
Fri, Jul 13, 8:43 AM · Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA
jcrespo added a parent task for T177782: Reduce false positives on database pages: T197126: Create tool to handle the state of database configuration in MediaWiki in etcd.
Fri, Jul 13, 8:42 AM · Patch-For-Review, Epic, Wikimedia-Incident, monitoring, DBA
jcrespo added a subtask for T197126: Create tool to handle the state of database configuration in MediaWiki in etcd: T177782: Reduce false positives on database pages.
Fri, Jul 13, 8:42 AM · Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA
jcrespo changed the status of T193226: Test MySQL 8.0 with production data and evaluate its fit for WMF databases from Open to Stalled.

This is stalled until we implement a way to mix different GTID implementations on the same section: e.g. T172497#4309959

Fri, Jul 13, 8:40 AM · Patch-For-Review, DBA
jcrespo changed the status of T193226: Test MySQL 8.0 with production data and evaluate its fit for WMF databases, a subtask of T193224: Evaluate and decide the future of relational datastore at WMF after the upgrade of MariaDB 10.1 is finished, from Open to Stalled.
Fri, Jul 13, 8:40 AM · MediaWiki-Database, Operations, DBA
jcrespo moved T59176: ApiQueryExtLinksUsage::run query has crazy limit from Triage to Blocked external/Not db team on the DBA board.
Fri, Jul 13, 8:35 AM · MW-1.32-release-notes (WMF-deploy-2018-05-22 (1.32.0-wmf.5)), MW-1.29-release-notes, Patch-For-Review, Schema-change, DBA, MediaWiki-API, Performance, MediaWiki-Database
jcrespo moved T198755: Log the query that caused a lock timeout from Triage to Meta/Epic on the DBA board.
Fri, Jul 13, 8:16 AM · MediaWiki-Debug-Logger, MediaWiki-Database, DBA
jcrespo added a comment to T198755: Log the query that caused a lock timeout.

I don't think "interactive logging" would be easy to implement- the query knows if it cannot continue or if it reaches a timeout, but it is not notified of who holds the row locks. There could be, however, heuristic guesses at the cause, using the processlist or some of the suggestions mentioned.

Fri, Jul 13, 8:02 AM · MediaWiki-Debug-Logger, MediaWiki-Database, DBA
jcrespo moved T196547: Extension:JADE scalability concerns due to creating a page per revision from Triage to Backlog on the DBA board.
Fri, Jul 13, 7:59 AM · TechCom-RFC, DBA, Scoring-platform-team (Current), User-Joe, Operations, JADE
jcrespo added projects to T199504: Editing of content model other than wikitext fails: MediaWiki-General-or-Unknown, Multi-Content-Revisions (MCR Deployment).

Preventively adding MCR (please remove if that doesn't apply) in case there is some ongoing experiment on beta- apologies if my guess is wrong.

Fri, Jul 13, 7:58 AM · MW-1.32-release-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), MediaWiki-General-or-Unknown
jcrespo added a comment to T198156: Server-side deletion of User:LorenzoMilano/sandbox.

enwiki read only is scheduled for the 18th at 6 am (T197134). If someone has an already tested script that will take less than 5 minutes to run and can be there to run it and check its execution at 6, we can do it there and then; if not, it will have to wait until the next time we go to read only (DBAs cannot do the planned maintenance and this at the same time, but maybe someone else can).

Fri, Jul 13, 7:09 AM · MediaWiki-Database, Wikimedia-Site-requests

Thu, Jul 12

jcrespo added a comment to T20493: Unify various deletion systems.

The archive table is the only one that repeats the same title over and over.

Thu, Jul 12, 8:14 AM · TechCom-RFC, Stewards-and-global-tools, MediaWiki-Page-deletion
jcrespo added a comment to T20493: Unify various deletion systems.

Have a page_deleted field that can be set to archived. Then the question is what should happen when a page with the same name is created

Thu, Jul 12, 7:22 AM · TechCom-RFC, Stewards-and-global-tools, MediaWiki-Page-deletion

Wed, Jul 11

jcrespo added a project to T199353: kafka eqiad cluster keeps crashing: Wikimedia-Incident.
Wed, Jul 11, 6:06 PM · Services (done), Wikimedia-Incident, WMF-JobQueue, Operations, Analytics-EventLogging, Analytics, EventBus
jcrespo updated subscribers of T199353: kafka eqiad cluster keeps crashing.
Wed, Jul 11, 6:05 PM · Services (done), Wikimedia-Incident, WMF-JobQueue, Operations, Analytics-EventLogging, Analytics, EventBus
jcrespo created T199353: kafka eqiad cluster keeps crashing.
Wed, Jul 11, 6:03 PM · Services (done), Wikimedia-Incident, WMF-JobQueue, Operations, Analytics-EventLogging, Analytics, EventBus
jcrespo added a comment to T199325: +2 for Addshore on operations/puppet.

To elaborate on that ("access requests lacks a clear rationale") access requests (or many other tickets) are meant to solve a problem, not just provide a solution- maybe we can help with the original problem you are trying to solve?

Wed, Jul 11, 5:26 PM · SRE-Access-Requests, Operations
jcrespo added a comment to T174047: Hide deprecated/unused fields on toolforge replica [MCR].

Thank you, I will see how we can best communicate this.

Wed, Jul 11, 1:21 PM · Cloud-Services, Multi-Content-Revisions (MCR Deployment), Wikidata
jcrespo added a comment to T199325: +2 for Addshore on operations/puppet.

+2 on operations/puppet does not make much sense if one cannot deploy by oneself (in fact, it would be a bad thing, as it would block other deployments). Not opposed to it, but the request of +2 should come with global root rights, as puppet == root access, so on its own this would not make much sense. Feel free to disagree.

Wed, Jul 11, 1:03 PM · SRE-Access-Requests, Operations
jcrespo added a comment to T174047: Hide deprecated/unused fields on toolforge replica [MCR].

I can take care of sending an email or coordinating with cloud team, but I don't really know all the changes and/or roadmap. Could you point me to a summary of that and I can take care of the rest? E.g. "The following fields will no longer be updated. The following fields/tables will appear. etc." Anything so we can update a FAQ on wikitech.org (better if it is just a link to mediawiki.org you already have).

Wed, Jul 11, 12:38 PM · Cloud-Services, Multi-Content-Revisions (MCR Deployment), Wikidata
jcrespo added a comment to T174047: Hide deprecated/unused fields on toolforge replica [MCR].

More important than this, could you help us document and draft a communication with a summary of MCR changes for cloud users, and I will coordinate with cloud team?

Wed, Jul 11, 12:27 PM · Cloud-Services, Multi-Content-Revisions (MCR Deployment), Wikidata
jcrespo awarded T198974: Rate-limit is too harsh and affects human users a The World Burns token.
Wed, Jul 11, 12:09 PM · Patch-For-Review, Phabricator
jcrespo added a comment to T195515: GUC query performance regressed 100x from <3s to 80-300s.

it was only a suggestion pending it being feasible and desired

Wed, Jul 11, 11:49 AM · Stewards-and-global-tools, Data-Services, cloud-services-team, Tool-Global-user-contributions
jcrespo edited projects for T199316: "sql wikishared" doesn't work on mwmaint1001, added: MediaWiki-Platform-Team; removed Operations, DBA.
Wed, Jul 11, 11:46 AM · Patch-For-Review, Core-Platform-Team, Scap
jcrespo added a comment to T196547: Extension:JADE scalability concerns due to creating a page per revision.

I think the proposed plan has deep architectural problems at the storage layer, so we should discuss the possibilities in depth to be able to move forward- I don't have any problem with the functionality itself- it is the proposed way of implementing it that we should try to agree on. I propose organizing a video meeting to discuss it better.

Wed, Jul 11, 10:26 AM · TechCom-RFC, DBA, Scoring-platform-team (Current), User-Joe, Operations, JADE
zeljkofilipin awarded T195253: Special:Notifications gives a consistent PHP exception on load ("The trash icon is not registered") for users with OpenStackManager notifications a Pterodactyl token.
Wed, Jul 11, 10:11 AM · MW-1.32-release-notes (WMF-deploy-2018-06-26 (1.32.0-wmf.10)), Growth-Team, Collaboration-Team-Triage (Collab-Team-This-Quarter), MediaWiki-extensions-OpenStackManager, wikitech.wikimedia.org, Wikimedia-log-errors, Notifications
jcrespo added a comment to T198974: Rate-limit is too harsh and affects human users.

I believe I know what the source of the issues is: when you write a comment, like this one, it generates a live preview. This creates dozens of requests to phabricator, and I have been banned in the middle of writing one- I think the rate limiting should be based just on the actual number of writes to tickets, and not on other kinds of requests. I understand this may not be possible, but maybe we could exempt a long list of trusted contributors from such a filter.

Wed, Jul 11, 10:11 AM · Patch-For-Review, Phabricator
jcrespo added a comment to T20493: Unify various deletion systems.

There is one small nitpick: it is said that "Database operation for smaller page that move rows between tables is something DBAs would prefer never happens, and should be migrated away from." Actually, from a pure DBA point of view, moving deleted rows to a separate table is good, because it is basically a (bad) way of implementing partitioning and requires less optimization to avoid virtually deleted rows. It is when I put on the database engineer hat that I hate it- it is prone to cause data loss, inconsistencies and more traffic and writes than needed. It doesn't change the overall sentiment, but it at least highlights one of the few good things about moving rows around (instead of virtually deleting them with SET deleted = 1/INSERT latest version with deleted status, which is the standard model of doing it in most scenarios).
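The two deletion models contrasted in the comment above can be sketched side by side. This is only an illustration using an in-memory SQLite database, not MediaWiki's actual schema or code: the `revision` and `archive` tables here, their columns, and the function names are all hypothetical simplifications.

```python
import sqlite3


def soft_delete(conn, rev_id):
    """Virtually delete a row by flagging it in place (a single small write)."""
    conn.execute("UPDATE revision SET deleted = 1 WHERE id = ?", (rev_id,))


def move_to_archive(conn, rev_id):
    """Delete a row by moving it to a separate table (two writes)."""
    # 'with conn' commits both statements together, or rolls both back on error.
    with conn:
        conn.execute(
            "INSERT INTO archive (id, title, body) "
            "SELECT id, title, body FROM revision WHERE id = ?",
            (rev_id,),
        )
        conn.execute("DELETE FROM revision WHERE id = ?", (rev_id,))


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical, heavily simplified tables (not the real MediaWiki schema).
    conn.execute(
        "CREATE TABLE revision (id INTEGER PRIMARY KEY, title TEXT, body TEXT, "
        "deleted INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute(
        "CREATE TABLE archive (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO revision (id, title, body) VALUES (?, ?, ?)",
        [(1, "Page_A", "v1"), (2, "Page_A", "v2"), (3, "Page_B", "v1")],
    )
    conn.commit()

    soft_delete(conn, 1)      # row stays in 'revision', flagged as deleted
    move_to_archive(conn, 2)  # row leaves 'revision' and lands in 'archive'
    conn.commit()

    live = conn.execute(
        "SELECT id FROM revision WHERE deleted = 0 ORDER BY id"
    ).fetchall()
    flagged = conn.execute(
        "SELECT id FROM revision WHERE deleted = 1 ORDER BY id"
    ).fetchall()
    archived = conn.execute("SELECT id FROM archive ORDER BY id").fetchall()
    return live, flagged, archived
```

In the soft-delete model every query on live rows has to carry a `deleted = 0` filter, while the move model keeps the main table small at the cost of copying each row and keeping two tables consistent.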

Wed, Jul 11, 10:07 AM · TechCom-RFC, Stewards-and-global-tools, MediaWiki-Page-deletion
jcrespo renamed T198176: Mediawiki page deletions should happen in batches of revisions from Deletions should happen in batches of revisions to Mediawiki page deletions should happen in batches of revisions.
Wed, Jul 11, 9:50 AM · MediaWiki-Page-deletion
jcrespo added a comment to T197246: Move OpenStreetMaps postgresql cluster servers for datacenter reconfiguration.

Do we need to do a dns update now? I was trying to fix labsdb1006 puppet run, but I found @akosiaris working on it at the moment and didn't want to modify anything without your permission. Is the OSM import running?

Wed, Jul 11, 9:45 AM · Patch-For-Review, cloud-services-team (Kanban)
jcrespo added a comment to T196336: Icinga passive checks go awal and downtime stops working.

@Volans I remember you giving an update on this? Is this still a thing? Could it have been not happening for a while- in which case, it should be closed?

Wed, Jul 11, 9:29 AM · Icinga, monitoring
jcrespo added a project to T195578: Deploy access to performance_schema/sys for the administrative mediawiki account (mediawiki deployers): Security.

@Security-team Do you see any blocker or reason not to enable this on all hosts? This was already available for roots only, and now it will be extended to the admin account (deployers with mediawiki database access). It contains performance information similar to that of tendril, but more detailed. Please have a look at it on db2083 yourselves if in doubt.

Wed, Jul 11, 9:23 AM · Security, Performance, DBA
jcrespo updated subscribers of T195515: GUC query performance regressed 100x from <3s to 80-300s.

@Anomie @daniel I don't think your suggestions to create views to maintain backwards compatibility are the right way to go- instead, they are degrading performance, in some cases leading to high load on the servers that cannot be handled. I think the right way to go is to announce MCR to wikireplica users, and explain why backwards compatibility cannot be done- and expose the real tables. This will certainly break many tools- but that is no worse than having all of them stop working due to performance issues. Unlike API calls, one cannot guarantee a stable interface for the internal tables- and we should reward those that keep them up to date, not the other way around.

Wed, Jul 11, 9:12 AM · Stewards-and-global-tools, Data-Services, cloud-services-team, Tool-Global-user-contributions
jcrespo added a comment to T195293: 503 error attempting to open multiple projects (Wikipedia and meta wiki are loading very slowly) .

So full outage but no actionable yet?

Wed, Jul 11, 9:03 AM · Language-2018-July-September, User-Nikerabbit, MediaWiki-extensions-Translate, Wikimedia-Incident, Wikimedia-log-errors, Operations
jcrespo added a comment to T184832: Decommission labsdb1001 and labsdb1003.

BTW, I can still see a labsdb1002-array1 on racktables- not sure if it is a mistake in the application or if it really is still there in reality, but that should be removed too (along with labsdb1001/3-array).

Wed, Jul 11, 8:57 AM · decommission, ops-eqiad, Operations, cloud-services-team (Kanban)
jcrespo added a comment to T194403: Wikimedia\Rdbms\ChronologyProtector::initPositions: expected but failed to find position index..

@Krinkle may I ask to update the description with your sensible analysis? It still has my ignorant comments about the issue, and that may mislead readers.

Wed, Jul 11, 8:43 AM · MW-1.32-release-notes (WMF-deploy-2018-07-17 (1.32.0-wmf.13)), Release-Engineering-Team (Watching / External), Performance-Team, MediaWiki-Database, Wikimedia-log-errors
jcrespo added a parent task for T163495: Mediawiki revision-related queries are failing with high rate for enwiki on codfw: T199073: Perform a datacenter switchover (2018-19 Q1).
Wed, Jul 11, 7:59 AM · Core-Platform-Team, Patch-For-Review, codfw-rollout, Wikimedia-Incident, MediaWiki-General-or-Unknown
jcrespo added a subtask for T199073: Perform a datacenter switchover (2018-19 Q1): T163495: Mediawiki revision-related queries are failing with high rate for enwiki on codfw.
Wed, Jul 11, 7:59 AM · Operations, Goal

Tue, Jul 10

jcrespo added a subtask for T197069: Failover db1052 (s1) db primary master: T199224: Test database master switchover script on codfw.
Tue, Jul 10, 1:35 PM · Patch-For-Review, DBA
jcrespo added parent tasks for T199224: Test database master switchover script on codfw: T197069: Failover db1052 (s1) db primary master, T197073: switchover es1014 to es1017.
Tue, Jul 10, 1:35 PM · DBA
jcrespo added a subtask for T197073: switchover es1014 to es1017: T199224: Test database master switchover script on codfw.
Tue, Jul 10, 1:35 PM · Patch-For-Review, DBA
jcrespo triaged T199224: Test database master switchover script on codfw as High priority.
Tue, Jul 10, 1:35 PM · DBA
jcrespo created T199224: Test database master switchover script on codfw.
Tue, Jul 10, 1:35 PM · DBA
jcrespo added a comment to T198987: Gather statistics about the backups on a database.

more useful stats (size is after compression):

root@db1115.eqiad.wmnet[zarcillo]> select source, section, backup_date, sum(size) from backup_files GROUP BY source, section, backup_date;
+------------------+---------+---------------------+--------------+
| source           | section | backup_date         | sum(size)    |
+------------------+---------+---------------------+--------------+
| dbstore1001:3311 | s1      | 2018-07-03 17:57:43 | 108262361080 |
| dbstore2001:3315 | s5      | 2018-07-03 23:40:19 |  50813370411 |
| dbstore2001:3316 | s6      | 2018-07-03 21:26:44 |  65586588489 |
| dbstore2001:3317 | s7      | 2018-07-03 17:00:01 |  82638220983 |
| dbstore2001:3318 | s8      | 2018-07-04 00:23:27 |  72221593133 |
| dbstore2002:3311 | s1      | 2018-07-04 01:24:18 | 108267561324 |
| dbstore2002:3312 | s2      | 2018-07-04 04:01:34 |  91698946356 |
| dbstore2002:3313 | s3      | 2018-07-03 17:00:01 | 102135306565 |
| dbstore2002:3314 | s4      | 2018-07-03 20:33:00 |  95039731345 |
| dbstore2002:3320 | x1      | 2018-07-04 06:50:21 |  20686018930 |
+------------------+---------+---------------------+--------------+
10 rows in set (0.01 sec)
Tue, Jul 10, 12:01 PM · Patch-For-Review, DBA
jcrespo added a comment to T198987: Gather statistics about the backups on a database.

False alarm- empty tables do not get a data dump- we have to compare with schema dumps only, and they match:

mysql.py -BN -h db1115 zarcillo -e "SELECT DISTINCT SUBSTRING_INDEX(SUBSTRING_INDEX(file_name, '-schema.', 1), '.', -1) as tables FROM backup_files WHERE file_name like '%-schema.sql.gz' and source = 'dbstore1001:3311' and type = 'dump' and section = 's1' and backup_date = '2018-07-03 17:57:43' ORDER BY tables" > dbstore1001\:3311.backup.txt
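The same comparison can be sketched in a few lines of Python, as an illustration only and not the tooling actually used here. It mimics the nested SUBSTRING_INDEX expression from the query above to extract table names from schema dump file names, then diffs the two sets. The function names and the `enwiki.empty_table` file name are hypothetical; the `categorylinks` file names follow the pattern seen in the backup listings.

```python
def table_from_schema_file(file_name):
    """Mimic SUBSTRING_INDEX(SUBSTRING_INDEX(file_name, '-schema.', 1), '.', -1)."""
    # Everything before the first '-schema.' (e.g. 'enwiki.categorylinks').
    before_suffix = file_name.split("-schema.", 1)[0]
    # Everything after the last '.' (e.g. 'categorylinks').
    return before_suffix.rsplit(".", 1)[-1]


def schema_tables(file_names):
    """Set of table names that have a schema dump, ignoring data dump files."""
    return {
        table_from_schema_file(name)
        for name in file_names
        if name.endswith("-schema.sql.gz")
    }


def compare_backups(files_a, files_b):
    """Return (tables only in a, tables only in b), based on schema dumps only."""
    tables_a, tables_b = schema_tables(files_a), schema_tables(files_b)
    return sorted(tables_a - tables_b), sorted(tables_b - tables_a)


# Hypothetical file lists: the empty table has a schema dump but no data dump,
# and the number of data chunks differs between the two backups.
backup_a = [
    "enwiki.categorylinks-schema.sql.gz",
    "enwiki.categorylinks.00000.sql.gz",
    "enwiki.empty_table-schema.sql.gz",
]
backup_b = [
    "enwiki.categorylinks-schema.sql.gz",
    "enwiki.categorylinks.00000.sql.gz",
    "enwiki.categorylinks.00001.sql.gz",
    "enwiki.empty_table-schema.sql.gz",
]
only_a, only_b = compare_backups(backup_a, backup_b)
```

Comparing only the schema files keeps differences in the number of data chunks, or the absence of a data dump for an empty table, from showing up as a mismatch.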
Tue, Jul 10, 10:37 AM · Patch-For-Review, DBA
jcrespo added a comment to T198987: Gather statistics about the backups on a database.

Interesting:

root@db1115.eqiad.wmnet[zarcillo]> select source, count(*) FROM backup_files GROUP BY source;
+------------------+----------+
| source           | count(*) |
+------------------+----------+
| dbstore1001:3311 |      468 |
| dbstore2002:3311 |      449 |
+------------------+----------+
2 rows in set (0.00 sec)
Tue, Jul 10, 10:05 AM · Patch-For-Review, DBA
jcrespo added a comment to T198987: Gather statistics about the backups on a database.

So I have the first backup-specific statistics:

Tue, Jul 10, 9:53 AM · Patch-For-Review, DBA
jcrespo added a comment to T198483: Save Timing increased 50% since 2018-06-28 20:53.

watch the save timing and edit stash dashboards proactively

Tue, Jul 10, 9:00 AM · MW-1.32-release-notes (WMF-deploy-2018-06-26 (1.32.0-wmf.10)), Patch-For-Review, Performance-Team, Release-Engineering-Team
jcrespo closed T176043: Prepare and check storage layer for amwikimedia (including dropping s7 version of the wiki) as Resolved.

Only multisource hosts now have amwikimedia (because they also host s3):

 ./section s7 | while read host port; do ./mysql.py -BN -h $host:$port amwikimedia -e "SELECT @@GLOBAL.hostname, @@GLOBAL.port"; done
labsdb1011      3306
labsdb1010      3306
labsdb1009      3306
ERROR 1049 (42000): Unknown database 'amwikimedia'
dbstore1002     3306
ERROR 1049 (42000): Unknown database 'amwikimedia'
ERROR 1049 (42000): Unknown database 'amwikimedia'
ERROR 1049 (42000): Unknown database 'amwikimedia'
ERROR 1049 (42000): Unknown database 'amwikimedia'
ERROR 1049 (42000): Unknown database 'amwikimedia'
ERROR 1049 (42000): Unknown database 'amwikimedia'
ERROR 1049 (42000): Unknown database 'amwikimedia'
ERROR 1049 (42000): Unknown database 'amwikimedia'
ERROR 1049 (42000): Unknown database 'amwikimedia'
ERROR 1049 (42000): Unknown database 'amwikimedia'
ERROR 1049 (42000): Unknown database 'amwikimedia'
ERROR 1049 (42000): Unknown database 'amwikimedia'
ERROR 1049 (42000): Unknown database 'amwikimedia'
ERROR 1049 (42000): Unknown database 'amwikimedia'
ERROR 1049 (42000): Unknown database 'amwikimedia'
ERROR 1049 (42000): Unknown database 'amwikimedia'
ERROR 1049 (42000): Unknown database 'amwikimedia'
Tue, Jul 10, 6:47 AM · Patch-For-Review, cloud-services-team (Kanban), Data-Services, DBA
jcrespo closed T176043: Prepare and check storage layer for amwikimedia (including dropping s7 version of the wiki), a subtask of T176042: Create amwikimedia, as Resolved.
Tue, Jul 10, 6:47 AM · User-Urbanecm, Patch-For-Review, User-Ladsgroup, Wiki-Setup (Create)
jcrespo claimed T176043: Prepare and check storage layer for amwikimedia (including dropping s7 version of the wiki).
Tue, Jul 10, 6:36 AM · Patch-For-Review, cloud-services-team (Kanban), Data-Services, DBA
jcrespo moved T176043: Prepare and check storage layer for amwikimedia (including dropping s7 version of the wiki) from Done to In progress on the DBA board.
Tue, Jul 10, 6:36 AM · Patch-For-Review, cloud-services-team (Kanban), Data-Services, DBA
jcrespo reopened T176043: Prepare and check storage layer for amwikimedia (including dropping s7 version of the wiki) as "Open".

amwikimedia is still on s7 in some places, at least dbstore2001. CC @Marostegui

Tue, Jul 10, 6:36 AM · Patch-For-Review, cloud-services-team (Kanban), Data-Services, DBA
jcrespo reopened T176043: Prepare and check storage layer for amwikimedia (including dropping s7 version of the wiki), a subtask of T176042: Create amwikimedia, as Open.
Tue, Jul 10, 6:36 AM · User-Urbanecm, Patch-For-Review, User-Ladsgroup, Wiki-Setup (Create)

Mon, Jul 9

jcrespo triaged T199124: Remove all usages of $::mw_primary on puppet as High priority.
Mon, Jul 9, 5:08 PM · Puppet, DBA, Operations
jcrespo created T199124: Remove all usages of $::mw_primary on puppet.
Mon, Jul 9, 4:11 PM · Puppet, DBA, Operations
jcrespo added a comment to T198093: Add a safe failover for analytics1003.

This should be ok since we already have some dbproxies whitelisted in the analytics vlan's firewall, so it should be a matter of adding another one.

Mon, Jul 9, 7:39 AM · User-Elukey, Analytics

Fri, Jul 6

jcrespo added a comment to T198987: Gather statistics about the backups on a database.
--
-- Table structure for table `instances`
--
Fri, Jul 6, 6:26 PM · Patch-For-Review, DBA
jcrespo moved T198987: Gather statistics about the backups on a database from Triage to In progress on the DBA board.
root@neodymium:~$ ./section s1
db1052.eqiad.wmnet      3306
db1067.eqiad.wmnet      3306
db1080.eqiad.wmnet      3306
db1083.eqiad.wmnet      3306
db1089.eqiad.wmnet      3306
db1099.eqiad.wmnet      3311
db1105.eqiad.wmnet      3311
db1106.eqiad.wmnet      3306
db1114.eqiad.wmnet      3306
db1118.eqiad.wmnet      3306
db1119.eqiad.wmnet      3306
db1124.eqiad.wmnet      3311
db2048.codfw.wmnet      3306
db2055.codfw.wmnet      3306
db2062.codfw.wmnet      3306
db2070.codfw.wmnet      3306
db2071.codfw.wmnet      3306
db2072.codfw.wmnet      3306
db2085.codfw.wmnet      3311
db2088.codfw.wmnet      3311
db2092.codfw.wmnet      3306
db2094.codfw.wmnet      3311
dbstore1001.eqiad.wmnet 3311
dbstore1002.eqiad.wmnet 3306
dbstore2002.codfw.wmnet 3311
labsdb1009.eqiad.wmnet  3306
labsdb1010.eqiad.wmnet  3306
labsdb1011.eqiad.wmnet  3306
root@neodymium:~$ ./section es1
es1012.eqiad.wmnet      3306
es1016.eqiad.wmnet      3306
es1018.eqiad.wmnet      3306
es2011.codfw.wmnet      3306
es2012.codfw.wmnet      3306
es2013.codfw.wmnet      3306
Fri, Jul 6, 6:23 PM · Patch-For-Review, DBA
jcrespo triaged T198987: Gather statistics about the backups on a database as Normal priority.
Fri, Jul 6, 6:23 PM · Patch-For-Review, DBA
jcrespo closed T198937: Setup database on tendril hosts to gather backup statistics as Resolved.

In the end, this is being set up on the same instances as tendril.

Fri, Jul 6, 6:20 PM · Patch-For-Review, DBA
jcrespo closed T198937: Setup database on tendril hosts to gather backup statistics, a subtask of T198447: Monitor backup generation for failure or incorrect generation, as Resolved.
Fri, Jul 6, 6:20 PM · Goal, DBA
jcrespo updated subscribers of T198960: Delete/Rename my WikiTech account.

I think we should be able to rename it on LDAP (https://wikitech.wikimedia.org/wiki/Renaming_users) and disable and/or remove the Phabricator and Gerrit accounts, but heads-up about bug T198588. CC @aborrero @Aklapper @Andrew Thoughts?

Fri, Jul 6, 1:30 PM · wikitech.wikimedia.org
jcrespo updated subscribers of T146591: Add a primary key to l10n_cache.

But don't add him/her again :-)

Fri, Jul 6, 1:04 PM · Blocked-on-schema-change, Patch-For-Review, MediaWiki-Database
jcrespo edited projects for T198948: Toolforge Tools listing is now very slow, added: cloud-services-team, Performance; removed Performance-Team.

Hey, I'm not one of the tools admins, but I guess you mean https://tools.wmflabs.org/admin/tools ? It does seem slow, probably because it is trying to show all tools at once. I'd suggest trying https://tools.wmflabs.org/hay/directory/ to see if that is faster for you (sorry, I'm not too familiar with tools, so I may not be that helpful).

Fri, Jul 6, 10:35 AM · Tool-admin
jcrespo triaged T198937: Setup database on tendril hosts to gather backup statistics as Normal priority.
Fri, Jul 6, 7:27 AM · Patch-For-Review, DBA