jcrespo (Jaime Crespo)
Sr Database Administrator

User Details

User Since
May 11 2015, 8:31 AM (123 w, 2 d)
Availability
Available
IRC Nick
jynus
LDAP User
Jcrespo
MediaWiki User
JCrespo (WMF)

Recent Activity

Today

jcrespo added a comment to T176273: Move the wbc_entity_usage table onto a dedicated DB shard.

I know- it is only related because the wikidata migration requires moving replication channels, and that consumes DBA time, not because it contains wikidata.

Wed, Sep 20, 10:57 AM · MediaWiki-extensions-WikibaseClient, Wikidata
jcrespo added a comment to T172882: s51187 and p50380g50692 database users are generating excessive lag on replica service.

dplbot/s51290 seems to keep creating issues. In this case, the lag was caused by a different issue (another user creating heavy queries, not by itself), but it was making it difficult for replication to catch up right now. I have banned s51290 from labsdb1001 (not from other replicas, which can still be used, or other database hosts) until the lag goes back to 0 or close. Once I did that, replication lag started decreasing.

Wed, Sep 20, 9:38 AM · Data-Services, XTools
jcrespo added a comment to T131956: Disabling general.confirmeduser from dbreports for using up too much db resources.

p50380g50440 was running several queries that were never going to stop executing, and causing 1 day of lag on labsdb1001:

Wed, Sep 20, 8:08 AM · Toolforge, Cloud-Services, DBA
jcrespo added a comment to T176273: Move the wbc_entity_usage table onto a dedicated DB shard.

@hoo Regarding the Wikimedia setup, you must know that it is our priority right now to move wikidata to a dedicated server group, which means that from the ops side no other structural change can happen at the same time.

Wed, Sep 20, 7:33 AM · MediaWiki-extensions-WikibaseClient, Wikidata

Yesterday

jcrespo added a comment to T176243: Decommission database hosts < db2030 (tracking).

Codfw version of T134476

Tue, Sep 19, 6:27 PM · DBA
jcrespo added a subtask for T176243: Decommission database hosts < db2030 (tracking): T175685: Decommission db2010 and move m1 codfw to db2078.
Tue, Sep 19, 6:26 PM · DBA
jcrespo added a parent task for T175685: Decommission db2010 and move m1 codfw to db2078: T176243: Decommission database hosts < db2030 (tracking).
Tue, Sep 19, 6:26 PM · Patch-For-Review, DBA
jcrespo created T176243: Decommission database hosts < db2030 (tracking).
Tue, Sep 19, 6:26 PM · DBA
jcrespo updated the task description for T170662: Productionize 22 new codfw database servers.
Tue, Sep 19, 5:28 PM · Patch-For-Review, DBA
jcrespo closed T175228: Degraded RAID on db2010 as Resolved.

Let's consider this fixed and let's focus on T175685.

Tue, Sep 19, 5:26 PM · DBA, Operations, ops-codfw
jcrespo added a comment to T104699: Firewall configurations for database hosts.

I think after the above patch, only the proxies are missing?

Tue, Sep 19, 3:48 PM · DBA, Operations, Patch-For-Review
jcrespo added a comment to T175973: db1100 crashed.

Will do! Thanks. Please give me a heads up if any maintenance happens here; unless you tell me otherwise, I will put it back into production. We can put it down at any time later, but I do not want it down for a long time (replication keeps going forward :-). I just need to depool it beforehand.

Tue, Sep 19, 2:40 PM · DBA, ops-eqiad, Operations
jcrespo created T176215: decommission db1018.
Tue, Sep 19, 2:14 PM · Patch-For-Review, Operations, DBA
jcrespo updated the task description for T172679: Productionize 11 new eqiad database servers.
Tue, Sep 19, 9:51 AM · Patch-For-Review, DBA
jcrespo closed T175487: Significant replication lag for the s1, s2, and s4 wikis on labsdb100[13] as Resolved.
Tue, Sep 19, 8:01 AM · DBA, Data-Services
jcrespo added a comment to T175973: db1100 crashed.

The error on the description was on the lifecycle log. It gave the same description that you googled.

Tue, Sep 19, 7:17 AM · DBA, ops-eqiad, Operations

Mon, Sep 18

jcrespo added a comment to T175973: db1100 crashed.

And put it down CC @Cmjohnson.

Mon, Sep 18, 5:01 PM · DBA, ops-eqiad, Operations
jcrespo added a comment to T176055: Update of QueryPages failing on commons with "MASTER_POS_WAIT() or MASTER_GTID_WAIT() failed: MySQL server has gone away".

there will be read queries here against vslow slaves that are very long (on the order of an hour)

Mon, Sep 18, 4:52 PM · DBA, Wikimedia-General-or-Unknown, Commons, MediaWiki-Special-pages
jcrespo added a comment to T175973: db1100 crashed.

db1100 is depooled, I have downtime'ed it for a week so the BIOS update can happen at any time.

Mon, Sep 18, 4:49 PM · DBA, ops-eqiad, Operations
jcrespo moved T176055: Update of QueryPages failing on commons with "MASTER_POS_WAIT() or MASTER_GTID_WAIT() failed: MySQL server has gone away" from Triage to Blocked external/Not db team on the DBA board.

There was no lag on the last occurrence of that error:
https://grafana.wikimedia.org/dashboard/db/mysql-replication-lag?panelId=4&fullscreen&orgId=1&from=1505552411468&to=1505562592019
https://grafana.wikimedia.org/dashboard/db/mediawiki-mysql-loadbalancer?panelId=1&fullscreen&orgId=1&from=1505557748993&to=1505561233557

Mon, Sep 18, 3:40 PM · DBA, Wikimedia-General-or-Unknown, Commons, MediaWiki-Special-pages
jcrespo added a comment to T176043: Prepare and check storage layer for amwikimedia.

Ok, then this is not a blocker for the above ticket. We would appreciate a ping when the actual database is deployed.

Mon, Sep 18, 9:55 AM · Cloud-Services, DBA
jcrespo added a comment to T176043: Prepare and check storage layer for amwikimedia.

Is this going to be a public or a private wiki?

Mon, Sep 18, 9:52 AM · Cloud-Services, DBA

Sun, Sep 17

jcrespo added a comment to T163551: Huge number of duplicate rows in wb_terms.

Not sure if with "you", you mean me, but if it is safe, yes. We may have to defragment the table later to reclaim disk space, but that can be done later and it is not a blocker.

Sun, Sep 17, 2:26 PM · Patch-For-Review, User-aude, Wikidata-Sprint, Wikidata, MediaWiki-extensions-WikibaseRepository
jcrespo added a comment to T163551: Huge number of duplicate rows in wb_terms.

We identify and delete duplicate rows (not trivial, but not difficult either), then we add a UNIQUE constraint over that combination of columns so that it never happens again.
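
As a rough illustration of the two steps described above (remove duplicates, then add a UNIQUE constraint), here is a minimal sketch using Python's built-in sqlite3 module. The table name `terms`, its columns, and the helper names are hypothetical stand-ins, not the real wb_terms schema, and a production MySQL table of that size would most likely need a batched, replication-aware approach instead.

```python
import sqlite3

# Hypothetical column combination that should be unique (not the real wb_terms schema).
UNIQUE_COLUMNS = "entity_id, language, term_type, term_text"


def build_terms_demo():
    """Create an in-memory table that contains some duplicate rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE terms ("
        "row_id INTEGER PRIMARY KEY, "
        "entity_id TEXT, language TEXT, term_type TEXT, term_text TEXT)"
    )
    rows = [
        ("Q1", "en", "label", "universe"),
        ("Q1", "en", "label", "universe"),
        ("Q1", "en", "label", "universe"),
        ("Q1", "es", "label", "universo"),
        ("Q2", "en", "label", "Earth"),
        ("Q2", "en", "label", "Earth"),
    ]
    conn.executemany(
        "INSERT INTO terms (entity_id, language, term_type, term_text) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def dedupe_and_constrain(conn):
    """Delete duplicates (keeping the lowest row_id per group), then add a UNIQUE index.

    Returns the number of rows deleted.
    """
    cur = conn.cursor()
    # Step 1: keep one row per combination of columns, delete the rest.
    cur.execute(
        "DELETE FROM terms WHERE row_id NOT IN ("
        f"SELECT MIN(row_id) FROM terms GROUP BY {UNIQUE_COLUMNS})"
    )
    deleted = cur.rowcount
    # Step 2: prevent the same combination from being inserted again.
    cur.execute(f"CREATE UNIQUE INDEX terms_unique ON terms ({UNIQUE_COLUMNS})")
    conn.commit()
    return deleted
```

After `dedupe_and_constrain(build_terms_demo())`, inserting an already existing combination raises `sqlite3.IntegrityError`, which is the "never happens again" part.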

Sun, Sep 17, 1:56 PM · Patch-For-Review, User-aude, Wikidata-Sprint, Wikidata, MediaWiki-extensions-WikibaseRepository

Sat, Sep 16

jcrespo added a comment to T169516: Implement cron-based mydumper backups on the dbstore role.

We need this ASAP: dbstore1001 crashed and it is not in a good state; plus it can no longer catch up with replication reasonably well.

Sat, Sep 16, 6:40 PM · Patch-For-Review, DBA
jcrespo added a comment to T175970: Lost access to x1-analytics-slave .

@Jayprakash12345 Unless I am wrong, that is a different issue, not related to the analytics db servers- please file a separate ticket so @Analytics ops can have a look at it (it is probably not related to mysql).

Sat, Sep 16, 12:07 PM · DBA, Operations
jcrespo added a comment to T175946: Global rename for RadioFan.

Please tell me your timezone or roughly the hours when you will be available so we can find the best match. We do not need to do it at that exact time, as long as there are admins or developers around.

Sat, Sep 16, 12:03 PM · DBA

Fri, Sep 15

jcrespo added a comment to T175970: Lost access to x1-analytics-slave .

@Etonkovidova Can you provide more information about how you plan to use that data? It was not initially included in the analytics data because it was both more difficult technologically and not needed at the time. If you need one-time access, I can quickly provide you temporary access to a host. If it is an ongoing project (long term), I will import those tables into dbstore1002, which may take a few days.

Fri, Sep 15, 6:32 PM · DBA, Operations
jcrespo moved T175946: Global rename for RadioFan from Triage to Next on the DBA board.
Fri, Sep 15, 6:21 PM · DBA
jcrespo added a comment to T175946: Global rename for RadioFan.

If it is enwiki, then this can go at any time someone is around in case of a problem. I propose 15h UTC on Monday, as admins and developers from both Europe and the Americas should be around (and me in particular), but that is only because I do not know your schedule- feel free to propose a different schedule if you or any other "renamer" prefers it. Please let me know about that :-)

Fri, Sep 15, 6:20 PM · DBA
jcrespo added a comment to T175946: Global rename for RadioFan.

Which wikis was he mainly editing? We have some ongoing performance issues on wikidata and commons, and I would like to delay it if it impacts those.

Fri, Sep 15, 11:17 AM · DBA
jcrespo moved T175679: Decommission db1048 (was Move m3 slave to db1059) from In progress to Blocked external/Not db team on the DBA board.
Fri, Sep 15, 11:13 AM · Operations, ops-eqiad, Phabricator, DBA
jcrespo moved T175973: db1100 crashed from Triage to In progress on the DBA board.
Fri, Sep 15, 11:13 AM · DBA, ops-eqiad, Operations
jcrespo added a comment to T175962: Issue with maintenance script: SELECTing revisions with high rev_id is painfully slow.

@MusikAnimal So this is a couple of things- MySQL here may be confusing, but it is not the one at fault: it is executing the queries in the fastest way possible, and the rev_id used has little effect on the performance (although indirectly, there could be a correlation on high vs low numbers due to edit patterns), so it is not the issue here.

Fri, Sep 15, 10:02 AM · MW-1.30-release-notes (WMF-deploy-2017-09-19 (1.30.0-wmf.19)), MW-1.31-release-notes (WMF-deploy-2017-09-26 (1.31.0-wmf.1)), Patch-For-Review, Community-Tech-Sprint, DBA
jcrespo added a comment to T175962: Issue with maintenance script: SELECTing revisions with high rev_id is painfully slow.

@IKhitron revision_userindex is a made-up view that is cloud wikireplicas only- production has very similar content, but has different objects on the database (columns, indexes, etc.), so it needs a bit of a separate assessment.

Fri, Sep 15, 9:07 AM · MW-1.30-release-notes (WMF-deploy-2017-09-19 (1.30.0-wmf.19)), MW-1.31-release-notes (WMF-deploy-2017-09-26 (1.31.0-wmf.1)), Patch-For-Review, Community-Tech-Sprint, DBA
jcrespo triaged T175973: db1100 crashed as Low priority.

Low after being depooled.

Fri, Sep 15, 8:16 AM · DBA, ops-eqiad, Operations
jcrespo added a comment to T175973: db1100 crashed.

However, it is quite a coincidence that it crashed just hours after being pooled and getting some load: https://gerrit.wikimedia.org/r/378003 (it had been idle for weeks before). I would like to generate some cpu load to make sure this isn't repeatable.

Fri, Sep 15, 8:13 AM · DBA, ops-eqiad, Operations
jcrespo updated subscribers of T175973: db1100 crashed.

@Cmjohnson @RobH I assume there is not much left to do here at dc/provider level except keeping a record of the crash and complaining if it repeats? This is one of the latest models bought.

Fri, Sep 15, 8:07 AM · DBA, ops-eqiad, Operations
jcrespo moved T175970: Lost access to x1-analytics-slave from Triage to Done on the DBA board.
Fri, Sep 15, 7:49 AM · DBA, Operations
jcrespo edited projects for T175970: Lost access to x1-analytics-slave , added: DBA; removed Ops-Access-Requests.
Fri, Sep 15, 7:49 AM · DBA, Operations
jcrespo changed the visibility for T175973: db1100 crashed.
Fri, Sep 15, 7:29 AM · DBA, ops-eqiad, Operations
jcrespo changed the visibility for T175973: db1100 crashed.
Fri, Sep 15, 7:28 AM · DBA, ops-eqiad, Operations
jcrespo added a comment to T175973: db1100 crashed.

Thanks, I just wanted to doublecheck.

Fri, Sep 15, 7:28 AM · DBA, ops-eqiad, Operations
jcrespo updated subscribers of T175973: db1100 crashed.

Making NDA-only for now, based on extreme, paranoid-level caution, until @MoritzMuehlenhoff or @Cmjohnson consider if we should worry about https://en.wikipedia.org/wiki/Intel_Active_Management_Technology#Known_vulnerabilities_and_exploits

Fri, Sep 15, 1:08 AM · DBA, ops-eqiad, Operations
jcrespo created T175973: db1100 crashed.
Fri, Sep 15, 1:05 AM · DBA, ops-eqiad, Operations

Thu, Sep 14

jcrespo removed a project from T175264: Decommission db1049: Patch-For-Review.

Decomm. done, only references left are spare on site.pp and admin_install.

Thu, Sep 14, 11:27 AM · ops-eqiad, DBA, Operations
jcrespo updated the task description for T175264: Decommission db1049.
Thu, Sep 14, 11:25 AM · ops-eqiad, DBA, Operations
jcrespo added a parent task for T175264: Decommission db1049: T134476: Decommission old coredb machines (<=db1050).
Thu, Sep 14, 10:50 AM · ops-eqiad, DBA, Operations
jcrespo added a subtask for T134476: Decommission old coredb machines (<=db1050): T175264: Decommission db1049.
Thu, Sep 14, 10:50 AM · Patch-For-Review, Operations, DBA
jcrespo added a comment to T165348: Check long-running screen/tmux sessions.

are you OK with all mariadb:: roles

Thu, Sep 14, 9:20 AM · Patch-For-Review, monitoring, Operations

Wed, Sep 13

jcrespo added a comment to T175679: Decommission db1048 (was Move m3 slave to db1059).

Let's wait a bit more. I may have to talk to you about setting up TLS for php and changing passwords; let's talk and aim for next week (but we shouldn't delay it much).

Wed, Sep 13, 5:53 PM · Operations, ops-eqiad, Phabricator, DBA
jcrespo placed T175685: Decommission db2010 and move m1 codfw to db2078 up for grabs.

I think the assignment is an accident because it was created as a subticket of another ticket; nothing to do here yet for you. Sorry for the distraction.

Wed, Sep 13, 2:17 PM · Patch-For-Review, DBA
jcrespo added a comment to T175790: dbstore1002 (analytics store) enwiki lag due to blocking query.

I believe this, or something similar in spirit, could be happening now for s5. I need to look more into it to identify it.

Wed, Sep 13, 11:21 AM · User-Addshore, Research, WMDE-Analytics-Engineering, Analytics
jcrespo added a comment to T175790: dbstore1002 (analytics store) enwiki lag due to blocking query.

Actually, I converted the tables to innodb already- so nothing is to be done unless there is still some non-obvious interaction that makes the lag not go away (metadata locking, inter-query locking, or something else). In theory nothing is to be done until we check that the lag doesn't happen again. This was just a heads up to not create Aria tables, ever! :-D

Wed, Sep 13, 10:11 AM · User-Addshore, Research, WMDE-Analytics-Engineering, Analytics
jcrespo added a comment to T175790: dbstore1002 (analytics store) enwiki lag due to blocking query.

Is it possible to see what query / script is running in the query mentioned in your comment?

Wed, Sep 13, 9:59 AM · User-Addshore, Research, WMDE-Analytics-Engineering, Analytics
jcrespo added a comment to T175790: dbstore1002 (analytics store) enwiki lag due to blocking query.

Missing word: "reads will get [blocked] by writes (from replication) [on non transactional engines]."

Wed, Sep 13, 9:58 AM · User-Addshore, Research, WMDE-Analytics-Engineering, Analytics
jcrespo moved T175679: Decommission db1048 (was Move m3 slave to db1059) from Backlog to Decommission on the ops-eqiad board.
Wed, Sep 13, 4:37 AM · Operations, ops-eqiad, Phabricator, DBA
jcrespo edited projects for T175679: Decommission db1048 (was Move m3 slave to db1059), added: ops-eqiad; removed Patch-For-Review.
Wed, Sep 13, 4:37 AM · Operations, ops-eqiad, Phabricator, DBA
jcrespo lowered the priority of T175679: Decommission db1048 (was Move m3 slave to db1059) from Normal to Low.
Wed, Sep 13, 4:37 AM · Operations, ops-eqiad, Phabricator, DBA
jcrespo added a comment to T175679: Decommission db1048 (was Move m3 slave to db1059).

db1048 is now ready to be decommissioned, it is set as spare, but it still needs to be fully deleted from the configuration and infrastructure (installer, site.pp).

Wed, Sep 13, 4:36 AM · Operations, ops-eqiad, Phabricator, DBA
jcrespo merged task T160731: Decom db1048 (BBU Faulty - slave lagging) into T175679: Decommission db1048 (was Move m3 slave to db1059).
Wed, Sep 13, 4:27 AM · Operations, Phabricator, DBA
jcrespo merged T160731: Decom db1048 (BBU Faulty - slave lagging) into T175679: Decommission db1048 (was Move m3 slave to db1059).
Wed, Sep 13, 4:27 AM · Operations, ops-eqiad, Phabricator, DBA
jcrespo renamed T175679: Decommission db1048 (was Move m3 slave to db1059) from Move m3 slave to db1059 to Decommission db1048 (was Move m3 slave to db1059).
Wed, Sep 13, 4:26 AM · Operations, ops-eqiad, Phabricator, DBA
jcrespo added a project to T175790: dbstore1002 (analytics store) enwiki lag due to blocking query: Research.

Actually, I am not sure if it is that script; there is another one happening at 3am, too, that seems to block the queries:

Wed, Sep 13, 4:19 AM · User-Addshore, Research, WMDE-Analytics-Engineering, Analytics
jcrespo created T175790: dbstore1002 (analytics store) enwiki lag due to blocking query.
Wed, Sep 13, 3:35 AM · User-Addshore, Research, WMDE-Analytics-Engineering, Analytics
jcrespo archived P5998 test.
Wed, Sep 13, 3:05 AM
jcrespo created P5998 test.
Wed, Sep 13, 3:05 AM
jcrespo claimed T175679: Decommission db1048 (was Move m3 slave to db1059).

@mmodell We have to upgrade the hardware for the phabricator databases. What do you think of also doing, this Thursday, a master switchover and upgrade to stretch/mariadb 10.1, enabling TLS and setting up the firewall? It should take a few seconds of restarting phabricator to get the new connections; if something goes bad, we revert to the current server.

Wed, Sep 13, 2:48 AM · Operations, ops-eqiad, Phabricator, DBA
jcrespo added a comment to T175778: Index on oresc_probability, temporarily or permanently.

Not directly related, but for background, this could be relevant regarding recentchanges scanning, and why it has become a problem for many queries lately: T171027#3599821

Wed, Sep 13, 2:40 AM · Schema-change, Scoring-platform-team, Collaboration-Team-Triage (Collab-Team-Q1-Jul-Sep-2017), Edit-Review-Improvements-RC-Page, MediaWiki-extensions-ORES

Tue, Sep 12

jcrespo added a comment to T175228: Degraded RAID on db2010.

0 is Online, Spun UP. Next one should be Span: 1

Tue, Sep 12, 4:39 PM · DBA, Operations, ops-codfw
jcrespo moved T165756: Create summary templates on Wikitech wiki to stop writing the same things everywhere, everytime from Triage to Backlog on the DBA board.
Tue, Sep 12, 4:29 PM · MediaWiki-SWAT-deployments, DBA, Wikimedia-Site-requests, Documentation, Wikimedia-Hackathon-2017
jcrespo lowered the priority of T165756: Create summary templates on Wikitech wiki to stop writing the same things everywhere, everytime from Normal to Low.

I am not saying this is not important, but nothing will break if this is not done (unlike many other "normal" tickets), which for me is low priority. Feel free to put it higher if you can help with this.

Tue, Sep 12, 4:28 PM · MediaWiki-SWAT-deployments, DBA, Wikimedia-Site-requests, Documentation, Wikimedia-Hackathon-2017
jcrespo moved T170508: The "show ip" action should also provide a distinct list of user-agents for each IP from Triage to Backlog on the DBA board.
Tue, Sep 12, 4:26 PM · DBA, Patch-For-Review, CheckUser
jcrespo added a comment to T170508: The "show ip" action should also provide a distinct list of user-agents for each IP.

See comment on gerrit, it helps with speeding up reviews :-).

Tue, Sep 12, 4:26 PM · DBA, Patch-For-Review, CheckUser
jcrespo moved T168349: enwiki_p logging vs logging_userindex returning dramatically different results from Triage to Done on the DBA board.
Tue, Sep 12, 4:22 PM · Data-Services, DBA
jcrespo moved T165625: Evaluate future of wmf puppet module "mysql" from Triage to Meta/Epic on the DBA board.
Tue, Sep 12, 4:21 PM · Icinga, Quarry, Community-Wikimetrics, DBA, Cloud-Services, Operations
jcrespo moved T175096: Identify tools hosting databases on labsdb100[13] and notify maintainers from Triage to Blocked external/Not db team on the DBA board.
Tue, Sep 12, 4:21 PM · cloud-services-team, Data-Services, DBA
jcrespo moved T175487: Significant replication lag for the s1, s2, and s4 wikis on labsdb100[13] from Triage to Done on the DBA board.
Tue, Sep 12, 4:21 PM · DBA, Data-Services
jcrespo moved T175672: Make client certs available for apache/maintenance hosts for TLS connections to mariadb from Triage to Backlog on the DBA board.
Tue, Sep 12, 4:21 PM · Performance-Team (Radar), Availability (Multiple-active-datacenters), DBA, Operations
jcrespo moved T174648: Create MW Schema Diff maintenance script from Done to Blocked external/Not db team on the DBA board.
Tue, Sep 12, 4:20 PM · DBA, MediaWiki-Database, MediaWiki-Maintenance-scripts
jcrespo moved T174648: Create MW Schema Diff maintenance script from Triage to Done on the DBA board.
Tue, Sep 12, 4:20 PM · DBA, MediaWiki-Database, MediaWiki-Maintenance-scripts
jcrespo moved T175086: Create and announce timeline for shutting down labsdb100[13] from Triage to Blocked external/Not db team on the DBA board.
Tue, Sep 12, 4:20 PM · cloud-services-team (Kanban), Data-Services, DBA
jcrespo added a comment to T166344: db1016 m1 master: Possibly faulty BBU.

db1069 has been reused on s7; we should probably choose db1066 instead.

Tue, Sep 12, 4:11 PM · Operations, ops-eqiad, DBA
jcrespo moved T169517: Research backup storage options and prepare a design document from Meta/Epic to In progress on the DBA board.
Tue, Sep 12, 4:08 PM · Documentation, DBA
jcrespo closed T168409: Migrate dbstore2001 to multi instance as Resolved.
Tue, Sep 12, 4:07 PM · Patch-For-Review, DBA
jcrespo closed T168409: Migrate dbstore2001 to multi instance, a subtask of T159423: Meta ticket: Migrate multi-source database hosts to multi-instance, as Resolved.
Tue, Sep 12, 4:07 PM · Epic, DBA
jcrespo moved T175679: Decommission db1048 (was Move m3 slave to db1059) from Triage to In progress on the DBA board.
Tue, Sep 12, 4:07 PM · Operations, ops-eqiad, Phabricator, DBA
jcrespo moved T175685: Decommission db2010 and move m1 codfw to db2078 from Triage to Backlog on the DBA board.
Tue, Sep 12, 4:06 PM · Patch-For-Review, DBA
jcrespo added a comment to T175228: Degraded RAID on db2010.

Still on Firmware state: Rebuild, we will wait a bit for the next one. (I am a bit more cautious than I have to be with RAID 10 because the disks are not new, so there is a chance for those to fail, too).

Tue, Sep 12, 3:10 PM · DBA, Operations, ops-codfw
jcrespo merged task T175704: Degraded RAID on db2010 into T175228: Degraded RAID on db2010.
Tue, Sep 12, 3:09 PM · Operations, ops-codfw
jcrespo merged T175704: Degraded RAID on db2010 into T175228: Degraded RAID on db2010.
Tue, Sep 12, 3:09 PM · DBA, Operations, ops-codfw
jcrespo added a subtask for T170662: Productionize 22 new codfw database servers: T175685: Decommission db2010 and move m1 codfw to db2078.
Tue, Sep 12, 1:15 PM · Patch-For-Review, DBA
jcrespo added a parent task for T175685: Decommission db2010 and move m1 codfw to db2078: T170662: Productionize 22 new codfw database servers.
Tue, Sep 12, 1:15 PM · Patch-For-Review, DBA
jcrespo created T175685: Decommission db2010 and move m1 codfw to db2078.
Tue, Sep 12, 1:14 PM · Patch-For-Review, DBA
jcrespo added a comment to T171027: "2062 Read timeout is reached" DBQueryError when trying to load specific users' watchlists (with +1000 articles) on several wikis.

After testing some indexes, I do not see a huge improvement- we can reduce from scanning 100M rows to 18M, but there can always be a combination of query parameters that does not filter many rows on recentchanges. Paging by id (or timestamp) is the only reliable solution, running the queries in smaller batches so they do not fail:
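
The example that followed this comment is not preserved here. Purely as an illustration of the general technique (paging by id, sometimes called keyset pagination), here is a minimal sketch using Python's built-in sqlite3 module. The table `changes`, its columns, and the function names are hypothetical and are not the recentchanges schema or the MediaWiki code path.

```python
import sqlite3


def build_changes_demo(n=10):
    """Create an in-memory table with n rows and auto-assigned integer ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE changes (change_id INTEGER PRIMARY KEY, title TEXT)"
    )
    conn.executemany(
        "INSERT INTO changes (title) VALUES (?)",
        [(f"page {i}",) for i in range(n)],
    )
    conn.commit()
    return conn


def iter_batches(conn, batch_size=1000):
    """Yield lists of (change_id, title) rows, paging by the primary key.

    Each query starts right after the last id already seen, so every
    statement reads at most batch_size rows through the index, no matter
    how far into the table the scan has progressed.
    """
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT change_id, title FROM changes "
            "WHERE change_id > ? ORDER BY change_id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        # Remember where this batch ended; the next query resumes from there.
        last_id = rows[-1][0]
```

Compared with `LIMIT ... OFFSET ...`, this keeps every individual query short, which is the property that matters when long-running statements hit a read timeout.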

Tue, Sep 12, 12:48 PM · Wikidata, Commons, Contributors-Team, User-notice, Wikimedia-log-errors, MW-1.30-release-notes (WMF-deploy-2017-08-08_(1.30.0-wmf.13)), Russian-Sites, Wikimedia-General-or-Unknown, Performance, MediaWiki-Watchlist
jcrespo added a subtask for T134476: Decommission old coredb machines (<=db1050): T175679: Decommission db1048 (was Move m3 slave to db1059).
Tue, Sep 12, 12:05 PM · Patch-For-Review, Operations, DBA
jcrespo added a parent task for T175679: Decommission db1048 (was Move m3 slave to db1059): T134476: Decommission old coredb machines (<=db1050).
Tue, Sep 12, 12:05 PM · Operations, ops-eqiad, Phabricator, DBA
jcrespo added a parent task for T162593: Run pt-table-checksum on s4 (commonswiki): T175679: Decommission db1048 (was Move m3 slave to db1059).
Tue, Sep 12, 12:05 PM · DBA
jcrespo added a subtask for T175679: Decommission db1048 (was Move m3 slave to db1059): T162593: Run pt-table-checksum on s4 (commonswiki).
Tue, Sep 12, 12:05 PM · Operations, ops-eqiad, Phabricator, DBA
jcrespo created T175679: Decommission db1048 (was Move m3 slave to db1059).
Tue, Sep 12, 12:04 PM · Operations, ops-eqiad, Phabricator, DBA