jcrespo (Jaime Crespo)
Sr Database Administrator

User Details

User Since
May 11 2015, 8:31 AM (183 w, 3 d)
Availability
Available
IRC Nick
jynus
LDAP User
Jcrespo
MediaWiki User
JCrespo (WMF) [ Global Accounts ]

Recent Activity

Mon, Nov 12

jcrespo added a comment to T166733: Deploy refactored comment storage.

I will need some more time to digest the rest of your comment, but it seems it is even more I need to answer my question. I can quickly answer:

Mon, Nov 12, 4:38 PM · MediaWiki-Commenting, Patch-For-Review, Core Platform Team Kanban (Doing), Core Platform Team ( Code Health (TEC13)), User-notice, Epic, Release-Engineering-Team (Watching / External)
jcrespo added a comment to T209031: Not able to scoop comment table in labs for mediawiki reconstruction process.

Nuria apparently subscribed 120 cloud users to this task by mistake- please be careful when using Phabricator to not annoy (with spam) our valuable contributors just to get the desired attention.

Mon, Nov 12, 4:01 PM · Analytics-Kanban, DBA, Data-Services, Analytics
jcrespo added a comment to T203709: Schema change for adding indexes of ct_tag_id.

Could you check the list of schema changes and maintenance to be run during the switchover, to test whether they were also undone?

Mon, Nov 12, 2:13 PM · Patch-For-Review, Wikidata, Blocked-on-schema-change, User-Ladsgroup, Wikidata-Campsite, MediaWiki-Database, MediaWiki-Change-tagging
jcrespo added a comment to T166733: Deploy refactored comment storage.

There is now a need to check the consistency configuration on all codfw hosts, which were altered to prevent them from lagging behind a lot; I would also recommend some light checking of the consistency of the affected tables.

Mon, Nov 12, 7:47 AM · MediaWiki-Commenting, Patch-For-Review, Core Platform Team Kanban (Doing), Core Platform Team ( Code Health (TEC13)), User-notice, Epic, Release-Engineering-Team (Watching / External)
jcrespo added a comment to T209031: Not able to scoop comment table in labs for mediawiki reconstruction process.

This is not something we handle- we don't decide on the table structure (this refactoring, comment storage, was owned by Platform team), while the actual view structure changes are handled by Cloud. I don't even know which structure there is on wikireplicas- we only handle the production filtering. When this was set up (views) we weren't very happy about it, and we did only "accept" it with the condition that cloud would handle wikireplicas filtering on its own. They have been working together on T181650- you should comment there with your needs and one of the 2 teams may serve you better :-)

Mon, Nov 12, 7:43 AM · Analytics-Kanban, DBA, Data-Services, Analytics

Thu, Nov 8

jcrespo added a comment to T208909: [Bug] Update old nonuniformly distributed page_random values.

Update from the airport for the practical bits- send a patch with a foreachwiki php script (check other past cases with similar tasks) so many people can review it and +1 it- if it takes less than 1 hour to run, !log on IRC #wikimedia-operations and run it during its own deployment window; if it is longer, write it on the [[wikitech:Deployments]] page in the "week of:" section, as recommended on "Long running tasks/scripts". Make sure no other long-running tasks such as T166733 are writing to the same tables to prevent locking (eg. deadlocks). Even if the reporter does not have access to production, once it is deployed into production, anyone with server access will be able to run it (not only ops- releng and several developers, too).

Thu, Nov 8, 5:41 PM · MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), Patch-For-Review, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), DBA, MediaWiki-General-or-Unknown
jcrespo added a comment to T208909: [Bug] Update old nonuniformly distributed page_random values.

I was asked to participate on this task explicitly:

Thu, Nov 8, 5:21 PM · MW-1.33-notes (1.33.0-wmf.3; 2018-11-06), Patch-For-Review, Readers-Web-Backlog (Readers-Web-Kanbanana-Board-2018-19-Q2), DBA, MediaWiki-General-or-Unknown
jcrespo added a comment to T207941: Spike of DBTransactionSizeError exceptions from /w/api.php from Special:Watchlist.

I have not done any research that suggests they are related, but we had Watchlist regressions in the past on certain wikis when rcs grew a lot on long watchlists at T171027#3667090. Not assuming it is Wikidata at all, but it could be any other process making rcs grow (a thing that is easy to check or discard).

Thu, Nov 8, 4:56 PM · Performance-Team, Wikimedia-production-error
jcrespo added a comment to T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456].

I voted +1 but check if you can fix some minor style things I suggest so all mariadb roles look similar. Otherwise, your deploy strategy looks sane to me- disable puppet everywhere, test on the new host, then on codfw, then on eqiad, one by one making sure it is noop. Thanks for your hard work!

Thu, Nov 8, 4:47 PM · Patch-For-Review, User-Banyek, DBA, Operations
jcrespo added a comment to T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456].

While it looks good, please wait to have at least one positive review unless there is an emergency- as all the work will be done on your side, you can move to another ticket while you wait. I think I will be able to review it today.

Thu, Nov 8, 4:38 PM · Patch-For-Review, User-Banyek, DBA, Operations
jcrespo updated subscribers of T208954: Missing row in enwiki.archive on sanitarium.

@Anomie I do strongly believe that mediawiki has recently gone unsafe (regression) in the latest releases- this is one of the many replication issues we had recently- we should search recent commits for unsafe statements or this will continue happening.

Thu, Nov 8, 10:23 AM · User-Banyek, Patch-For-Review, DBA

Tue, Nov 6

jcrespo added a comment to T208622: Import recommendations into production database.

@faidon They don't need a DBA, they are searching for someone to support them with puppetization. We already attended the DBA tasks at T205294, the rest can be handled by anyone with production access or knowledge- they can wait for me to have the time, of course, but if they do, they may be waiting forever :-)

Tue, Nov 6, 6:17 PM · Operations, Research
jcrespo placed T208622: Import recommendations into production database up for grabs.

Sorry, I don't have bandwidth to work on this; search for another op.

Tue, Nov 6, 1:50 PM · Operations, Research

Mon, Nov 5

jcrespo added a comment to T208695: Duplicate key on several s8 replicas breaking replication.

it got missed somehow

Mon, Nov 5, 8:28 PM · Wikidata, Wikimedia-Incident, DBA
jcrespo added a comment to T208672: Duplicate rows error in db2095 replication @s7.

hm... replication broken again, now on metawiki.archive

Mon, Nov 5, 1:23 PM · DBA
jcrespo added a comment to T208672: Duplicate rows error in db2095 replication @s7.

could T208565 and this ticket be related to T208695 either in root cause or in surface reason (writing a lot of rarely-written rows?).

Mon, Nov 5, 1:02 PM · DBA
jcrespo added a project to T208695: Duplicate key on several s8 replicas breaking replication: DBA.
Mon, Nov 5, 12:19 AM · Wikidata, Wikimedia-Incident, DBA
jcrespo created T208695: Duplicate key on several s8 replicas breaking replication.
Mon, Nov 5, 12:18 AM · Wikidata, Wikimedia-Incident, DBA

Sun, Nov 4

jcrespo added a comment to T208672: Duplicate rows error in db2095 replication @s7.

This is what I did for T208565: stop replication, add db.table to the long list of filters (carefully), let it catch up, then stop the replication on its master (beware of icinga alerts, downtime all possible alerts first), truncate the table without altering its definition and reimport it logically -eg mysqldump- (the triggers should take care of the sanitization), restore the replication filter to the original state and restart replication on the master.

Sun, Nov 4, 3:21 PM · DBA

Fri, Nov 2

jcrespo added a comment to T205294: Request to create database and account for recommendation API.

"extended 1:1" are not my words-- obviously I cannot guarantee such a thing, only a manager can do so, but I believe assistance on writing a puppet patch would be within the functions, as I understand them.

Fri, Nov 2, 3:37 PM · Patch-For-Review, DBA, Research
jcrespo added a comment to T205294: Request to create database and account for recommendation API.

Do you know who's the right person to contact today?

Fri, Nov 2, 3:29 PM · Patch-For-Review, DBA, Research
jcrespo closed T208565: db2094 s3 replication broke as Resolved.

This is technically fixed, but we should do a deeper check on the causes of this, there could be some drift on this or other close dbs that only manifests due to the ROW-based replication.

Fri, Nov 2, 3:14 PM · DBA
jcrespo moved T208526: Database timeout error + significant lag when modifying a Partial Block with 10 items from Triage to Blocked external/Not db team on the DBA board.
Fri, Nov 2, 11:42 AM · User-Ryasmeen, MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Anti-Harassment (AHT Sprint 32), Patch-For-Review, DBA, Operations
jcrespo lowered the priority of T208462: Error Unknown column ipb_sitewide in field list on query from High to Normal.
Fri, Nov 2, 9:34 AM · DBA, Anti-Harassment, Operations
jcrespo moved T208462: Error Unknown column ipb_sitewide in field list on query from Triage to Done on the DBA board.

Work by dbas here is done except making the 2 altered hosts back fully consistent once they catch up.

Fri, Nov 2, 9:33 AM · DBA, Anti-Harassment, Operations
jcrespo added a project to T208462: Error Unknown column ipb_sitewide in field list on query: DBA.
Fri, Nov 2, 9:33 AM · DBA, Anti-Harassment, Operations
jcrespo claimed T208565: db2094 s3 replication broke.
Fri, Nov 2, 9:32 AM · DBA
jcrespo triaged T208565: db2094 s3 replication broke as High priority.
Fri, Nov 2, 8:12 AM · DBA
jcrespo created T208565: db2094 s3 replication broke.
Fri, Nov 2, 8:12 AM · DBA
jcrespo updated subscribers of T208462: Error Unknown column ipb_sitewide in field list on query.

So I am going to give a shot in the dark and say that when schema changes are done under pressure and in a hurry, as this was done only to have the features out early and without proper checks, they are prone to errors.

Fri, Nov 2, 7:48 AM · DBA, Anti-Harassment, Operations
jcrespo claimed T208462: Error Unknown column ipb_sitewide in field list on query.
Fri, Nov 2, 7:41 AM · DBA, Anti-Harassment, Operations

Thu, Nov 1

jcrespo added a comment to T208272: codfw row C recable and add QFX.

Note I didn't ask for a delay- and neither Manuel (vacations) nor I (training) will be around that new day either. Balasz will be, however.

Thu, Nov 1, 10:57 AM · Patch-For-Review, ops-codfw, Operations, netops

Wed, Oct 31

jcrespo added a comment to T207006: Set wb_changes_dispatch ROW_FORMAT=COMPRESSED on install and update.

loading a table with ROW_FORMAT=COMPRESSED but I am not sure compression is actually the cause, but maybe (I would actually be more sure about that) just a simple rebuild from scratch after many edits are suffered (something common after >1000M edits).

Wed, Oct 31, 5:46 PM · DBA, Wikidata-Campsite, wikidata-tech-focus, Wikidata
jcrespo added a comment to T207259: rack/setup/install pc2007-pc2010.

A larger stripe size should not be a huge issue (unlike a smaller one, which affected performance significantly and we didn't like it). We were thinking of increasing the one we used due to increased capacity anyway, so this would be a nice test (these are 1.6TB disks anyway). Redoing the RAID and reformatting may take a long time and it may be a waste of time.

Wed, Oct 31, 2:45 PM · User-Banyek, Patch-For-Review, Operations, ops-codfw, DBA
jcrespo awarded T198838: Turn off 'blame' by default on Diffusion a Like token.
Wed, Oct 31, 1:31 PM · Upstream, Phabricator (Upstream), Diffusion
jcrespo created P7744 parsercache entries.
Wed, Oct 31, 11:21 AM
jcrespo moved T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] from Triage to Next on the DBA board.
Wed, Oct 31, 10:36 AM · Patch-For-Review, User-Banyek, DBA, Operations
jcrespo assigned T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] to Banyek.
Wed, Oct 31, 10:35 AM · Patch-For-Review, User-Banyek, DBA, Operations
jcrespo placed T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] up for grabs.
Wed, Oct 31, 10:34 AM · Patch-For-Review, User-Banyek, DBA, Operations
jcrespo closed T207934: Reimage pc2006 with stretch as Declined.

We should work on T208383 instead.

Wed, Oct 31, 10:21 AM · User-Banyek, DBA
jcrespo triaged T208383: Implement parsercache service on pc[12]0(07|08|09|10) and replace leased pc[12]00[456] as High priority.
Wed, Oct 31, 10:18 AM · Patch-For-Review, User-Banyek, DBA, Operations

Tue, Oct 30

jcrespo moved T208323: Predictive failures on disk S.M.A.R.T. status from Triage to Backlog on the DBA board.
Tue, Oct 30, 3:02 PM · Operations, DBA
jcrespo moved T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") from In progress to Done on the DBA board.

In particular, I see https://www.wikidata.org/wiki/Q2058295 properly merged.

Tue, Oct 30, 2:57 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikimedia-Incident, User-notice, Patch-For-Review, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
jcrespo moved T208320: BBU Fail on dbstore2002 from Triage to Backlog on the DBA board.
Tue, Oct 30, 2:51 PM · DBA, Operations
jcrespo lowered the priority of T208150: db1117 went away from High to Normal.
Tue, Oct 30, 2:38 PM · Operations, ops-eqiad, DBA
jcrespo moved T208150: db1117 went away from In progress to Blocked external/Not db team on the DBA board.

There is nothing else left for DBAs here except waiting for errors.

Tue, Oct 30, 2:38 PM · Operations, ops-eqiad, DBA
jcrespo closed T208151: prometheus-mysqld-exporter not starts on db1117 as Resolved.
Tue, Oct 30, 2:36 PM · User-Banyek, DBA
jcrespo added a comment to T208272: codfw row C recable and add QFX.

No problem on my side, a short network outage is not a huge issue on codfw for dbs, but I cannot guarantee they will not page, and I won't be around to attend it- someone else will have to.

Tue, Oct 30, 2:19 PM · Patch-For-Review, ops-codfw, Operations, netops
jcrespo closed T184805: Move some wikis to s5 as Resolved.

Everything at T184805#4654953 done, except the GTID handling, which has to be checked separately for other reasons.

Tue, Oct 30, 12:39 PM · Patch-For-Review, Release-Engineering-Team (Watching / External), wikitech.wikimedia.org, cloud-services-team, DBA, Operations
jcrespo closed T184805: Move some wikis to s5, a subtask of T189107: DB meta task for next DC failover issues, as Resolved.
Tue, Oct 30, 12:39 PM · Patch-For-Review, Epic, Operations, DBA
jcrespo added a comment to T184805: Move some wikis to s5.

No filters left that I can see:

./software/dbtools/section s5 | while read host; do echo $host; mysql.py -h $host -e "SHOW ALL SLAVES STATUS\G" | grep 'Wild' ; done
./software/dbtools/section s3 | while read host; do echo $host; mysql.py -h $host -e "SHOW ALL SLAVES STATUS\G" | grep 'Wild' ; done
Tue, Oct 30, 12:37 PM · Patch-For-Review, Release-Engineering-Team (Watching / External), wikitech.wikimedia.org, cloud-services-team, DBA, Operations
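
The check in the comment above (grepping 'Wild' in the output of SHOW ALL SLAVES STATUS\G on every host of a section) only confirms "no filters left" if every matching field is empty. A minimal sketch of that last step, assuming the status text has already been captured as a string; the function name and the sample database names are hypothetical and not part of the original tooling:

```python
def wild_filters(status_text):
    """Return (field, value) pairs for Replicate_Wild_* fields that are not empty.

    status_text is assumed to be the vertical (\\G) output of
    SHOW ALL SLAVES STATUS, one "Field: value" pair per line.
    """
    found = []
    for line in status_text.splitlines():
        field, sep, value = line.strip().partition(":")
        # Keep only "Field: value" lines whose field mentions Wild and
        # whose value is non-empty, i.e. a filter that is still set.
        if sep and "Wild" in field and value.strip():
            found.append((field.strip(), value.strip()))
    return found


# Hypothetical sample output; the database names are made up.
sample = "\n".join([
    "*************************** 1. row ***************************",
    "Connection_name:",
    "Replicate_Do_Table:",
    "Replicate_Wild_Do_Table:",
    "Replicate_Wild_Ignore_Table: testwiki.%,test2wiki.%",
])

print(wild_filters(sample))
```

An empty list for every host would correspond to the "no filters left" result reported above.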
jcrespo added a comment to T184805: Move some wikis to s5.

In theory the drops finished, but I need to do an additional pass to check for missing hosts/dbs as well as check/remove filters.

Tue, Oct 30, 12:03 PM · Patch-For-Review, Release-Engineering-Team (Watching / External), wikitech.wikimedia.org, cloud-services-team, DBA, Operations
jcrespo added a comment to T184805: Move some wikis to s5.

All those wikis sanitarium set up on db2094 too

Tue, Oct 30, 9:28 AM · Patch-For-Review, Release-Engineering-Team (Watching / External), wikitech.wikimedia.org, cloud-services-team, DBA, Operations
jcrespo added a comment to T208003: WatchedItemStore::addWatchBatchForUser does not have outer scope..

There is also the, I guess related:

Wikimedia\Rdbms\LBFactory::getEmptyTransactionTicket: WatchedItemStore::removeWatchBatchForUser does not have outer scope. #0 /srv/mediawiki/php-1.33.0-wmf.1/includes/watcheditem/WatchedItemStore.php(392): Wikimedia\Rdbms\LBFactory->getEmptyTransactionTic
Tue, Oct 30, 8:15 AM · MW-1.33-notes (1.33.0-wmf.4; 2018-11-13), Patch-For-Review, Growth-Team (Current Sprint), MediaWiki-Watchlist, Regression, Wikimedia-production-error

Mon, Oct 29

jcrespo merged T208250: Wikimedia\Rdbms\LBFactory::getEmptyTransactionTicket: WatchedItemStore::addWatchBatchForUser does not have outer scope. into T208003: WatchedItemStore::addWatchBatchForUser does not have outer scope..
Mon, Oct 29, 5:55 PM · MW-1.33-notes (1.33.0-wmf.4; 2018-11-13), Patch-For-Review, Growth-Team (Current Sprint), MediaWiki-Watchlist, Regression, Wikimedia-production-error
jcrespo merged task T208250: Wikimedia\Rdbms\LBFactory::getEmptyTransactionTicket: WatchedItemStore::addWatchBatchForUser does not have outer scope. into T208003: WatchedItemStore::addWatchBatchForUser does not have outer scope..
Mon, Oct 29, 5:55 PM · MediaWiki-Watchlist, Wikimedia-production-error, Growth-Team
jcrespo created T208250: Wikimedia\Rdbms\LBFactory::getEmptyTransactionTicket: WatchedItemStore::addWatchBatchForUser does not have outer scope..
Mon, Oct 29, 5:54 PM · MediaWiki-Watchlist, Wikimedia-production-error, Growth-Team
jcrespo added a comment to T207530: Deleting pages on the English Wikipedia is very slow.

I logged a deletion on en.wikipedia.org using X-Wikimedia-Debug, you can see it in mwlog1001.eqiad.wmnet:/srv/mw-log/XWikimediaDebug.log . You can see that the row count query was indeed very slow. The query was:

SELECT  COUNT(*) AS `rowcount`  FROM (SELECT  1  FROM `archive`    WHERE ar_page_id = '47773335' ...
Mon, Oct 29, 5:42 PM · Performance-Team (Radar), MW-1.32-notes, MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), MW-1.32-release, Patch-For-Review, Operations, MediaWiki-Page-deletion
jcrespo moved T206592: Significant (17x) increase in time spent by updateSpecialPages.php script since datacenter switch over updating commons special pages from Blocked external/Not db team to Backlog on the DBA board.
Mon, Oct 29, 5:27 PM · DBA, Datacenter-Switchover-2018, MediaWiki-Special-pages
jcrespo added a comment to T206592: Significant (17x) increase in time spent by updateSpecialPages.php script since datacenter switch over updating commons special pages.

Please give us the times on eqiad again so we can verify it was the switch indeed.

Mon, Oct 29, 5:24 PM · DBA, Datacenter-Switchover-2018, MediaWiki-Special-pages
jcrespo added a comment to T207934: Reimage pc2006 with stretch.

We should consider declining that and do the work directly on the new hardware: T207259

Mon, Oct 29, 4:51 PM · User-Banyek, DBA
jcrespo added a comment to T207259: rack/setup/install pc2007-pc2010.

@Papaul: @Banyek will be your contact point as he will be the person in charge of the related goal while Manuel is out.

Mon, Oct 29, 3:50 PM · User-Banyek, Patch-For-Review, Operations, ops-codfw, DBA
jcrespo added a comment to T208150: db1117 went away.

It is not clearly a RAM issue

Mon, Oct 29, 3:13 PM · Operations, ops-eqiad, DBA
jcrespo added a comment to T197486: prop=revisions API timing out for a specific user and pages they edited.

https://downloads.mariadb.org/ 10.1.37 not yet considered stable at the time of writing this. While we could deploy something from the tree, that is a big no for database code (unlike other kinds of code I don't have a problem with doing that) unless there is an unbreak now bug. Once it is officially released I will build it and test it.

Mon, Oct 29, 9:56 AM · DBA, MediaWiki-Database, MediaWiki-API
jcrespo updated subscribers of T205294: Request to create database and account for recommendation API.

We don't share passwords publicly, and you shouldn't need it to actually use it- you should create puppet code that reads it and writes it to a config file, or reads it and performs the load for you. Giving you the password would not only be dangerous, it would also not help if for any reason it has to be changed and the loading code does not work anymore. Also passwords can be stolen and lost and I personally consider them sensitive data.

Mon, Oct 29, 9:36 AM · Patch-For-Review, DBA, Research
jcrespo added a comment to T170508: The "show ip" action should also provide a distinct list of user-agents for each IP.

Please note that while I have been asking you to wait, I am genuinely concerned about the performance of the query- even if only a few people can run it, group by over long periods of time can be bad for the servers- so not blocked just on me not responding - however it is not easy to provide alternatives. Putting a limit on the timespan, or testing the waters by trying to count the number of matching entries beforehand would help to build confidence on the query- note that even if you do a LIMIT 5001, the group by forces to go over all matching entries.
What I was asking was to provide pure SQL and a tip such as "run this on the ip with more edits" or so (it is not easy for me to do the mw syntax -> SQL transformation, I am not that familiar with mediawiki code, sorry).

Mon, Oct 29, 9:08 AM · DBA, Patch-For-Review, CheckUser
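
The two mitigations suggested in the comment above (bounding the timespan, and counting matching entries before running the GROUP BY) can be sketched as follows. This is only an illustration under stated assumptions: it uses an in-memory SQLite database as a stand-in for the production servers, and the table name, column names, threshold and sample rows are all hypothetical, not the real CheckUser schema or query.

```python
import sqlite3

MAX_ROWS = 5000  # hypothetical safety threshold


def distinct_agents(conn, ip, since, until, max_rows=MAX_ROWS):
    """Run the GROUP BY only after a capped count shows the match set is small."""
    where = "WHERE ip = ? AND ts >= ? AND ts < ?"
    params = (ip, since, until)
    # The LIMIT sits inside the subquery, so the count stops after
    # max_rows + 1 rows instead of visiting every matching entry.
    (capped,) = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT 1 FROM changes {where} LIMIT ?)",
        params + (max_rows + 1,),
    ).fetchone()
    if capped > max_rows:
        raise ValueError(f"more than {max_rows} matching rows; narrow the timespan")
    # Safe to aggregate: at most max_rows rows feed the GROUP BY.
    return conn.execute(
        f"SELECT agent, COUNT(*) FROM changes {where} GROUP BY agent ORDER BY agent",
        params,
    ).fetchall()


# Hypothetical data, using documentation-range IP addresses.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE changes (ip TEXT, agent TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO changes VALUES (?, ?, ?)",
    [
        ("192.0.2.1", "agent-a", "20181001000000"),
        ("192.0.2.1", "agent-a", "20181002000000"),
        ("192.0.2.1", "agent-b", "20181003000000"),
        ("192.0.2.2", "agent-c", "20181003000000"),
    ],
)
print(distinct_agents(conn, "192.0.2.1", "20181001000000", "20181101000000"))
```

The capped count is cheap because its LIMIT applies before any aggregation; a LIMIT placed on the GROUP BY query itself would not give that guarantee, which is the point made in the comment.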
jcrespo awarded T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate) a Love token.
Mon, Oct 29, 8:49 AM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
jcrespo updated the task description for T208150: db1117 went away.
Mon, Oct 29, 8:34 AM · Operations, ops-eqiad, DBA
jcrespo added a project to T208150: db1117 went away: ops-eqiad.

so the error happens because it is tried to be run manually, which is not a big deal if it errors out- just delete any file you may have added. I ran systemctl disable <service manually started that failed> and then systemctl reset-failed and it should never fail again.

Mon, Oct 29, 8:33 AM · Operations, ops-eqiad, DBA
jcrespo moved T145072: Create a script to regenerate prometheus mysqld exporter listing that works with puppetdb from Triage to Backlog on the DBA board.

I think this should be moved to zarcillo mariadb metadata database, and centralize there the active database control (substituting tendril, puppet, per-section lists and prometheus mysql exported list).

Mon, Oct 29, 8:11 AM · monitoring, DBA, Operations, Prometheus-metrics-monitoring
jcrespo triaged T164382: Evaluate the need for FORCE INDEX (ls_field_val) [now IGNORE INDEX (ls_log_id)], delete the index hint if not needed anymore as Low priority.
Mon, Oct 29, 8:09 AM · MediaWiki-Logging, DBA
jcrespo moved T207881: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 from Triage to Blocked external/Not db team on the DBA board.
Mon, Oct 29, 8:06 AM · wikidata-tech-focus, Wikidata, MediaWiki-extensions-WikibaseRepository, MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), Patch-For-Review, MediaWiki-Database, DBA, Wikimedia-production-error
jcrespo added a comment to T208150: db1117 went away.

mysql-prometheus-exporter should not run in a multiinstances host, there is mysql-prometheus@m1, @m2 ... That is specified on puppet.

Mon, Oct 29, 7:54 AM · Operations, ops-eqiad, DBA

Thu, Oct 25

jcrespo added a comment to T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared").

@Pigsonthewing I hope my comment at Wikidata Village Pump was helpful- if you think that is ok, I would suggest closing this task, and open a different one to track the merges of old history (this was to track the recovery from backups)?

Thu, Oct 25, 4:00 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikimedia-Incident, User-notice, Patch-For-Review, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
jcrespo added a comment to T198176: Mediawiki page deletions should happen in batches of revisions.

MarcoAurelio- great suggestion. I would also add to check the effect if a page is retried to be deleted several times (a common occurrence in the past due to the way deletion requests are handled)- I know there was work on preventing issues- but it would be nice to recheck on production.

Thu, Oct 25, 1:57 PM · Core Platform Team Kanban (Done with CPT), MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Core Platform Team ( Code Health (TEC13)), Wikimedia-production-error, Patch-For-Review, MediaWiki-Page-deletion
jcrespo awarded T198176: Mediawiki page deletions should happen in batches of revisions a Love token.
Thu, Oct 25, 1:55 PM · Core Platform Team Kanban (Done with CPT), MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Core Platform Team ( Code Health (TEC13)), Wikimedia-production-error, Patch-For-Review, MediaWiki-Page-deletion
jcrespo awarded T198156: Server-side deletion of User:LorenzoMilano/sandbox a Love token.
Thu, Oct 25, 1:54 PM · User-MarcoAurelio, Wikimedia-Site-requests
jcrespo updated the task description for T207941: Spike of DBTransactionSizeError exceptions from /w/api.php from Special:Watchlist.
Thu, Oct 25, 11:58 AM · Performance-Team, Wikimedia-production-error
jcrespo created T207941: Spike of DBTransactionSizeError exceptions from /w/api.php from Special:Watchlist.
Thu, Oct 25, 11:57 AM · Performance-Team, Wikimedia-production-error
jcrespo added a parent task for T198176: Mediawiki page deletions should happen in batches of revisions: T207940: Large transaction-related errors and other problems (tracking).
Thu, Oct 25, 11:52 AM · Core Platform Team Kanban (Done with CPT), MW-1.32-notes (WMF-deploy-2018-10-16 (1.32.0-wmf.26)), Core Platform Team ( Code Health (TEC13)), Wikimedia-production-error, Patch-For-Review, MediaWiki-Page-deletion
jcrespo added a subtask for T207940: Large transaction-related errors and other problems (tracking): T198176: Mediawiki page deletions should happen in batches of revisions.
Thu, Oct 25, 11:52 AM · Tracking, MediaWiki-Database
jcrespo added a parent task for T171898: Deleting large watchlist takes > 4 seconds causing rollback due to write time limit: T207940: Large transaction-related errors and other problems (tracking).
Thu, Oct 25, 11:50 AM · MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), Patch-For-Review, Growth-Team (Current Sprint), DBA, MediaWiki-Watchlist, Wikimedia-production-error
jcrespo added a subtask for T207940: Large transaction-related errors and other problems (tracking): T171898: Deleting large watchlist takes > 4 seconds causing rollback due to write time limit.
Thu, Oct 25, 11:50 AM · Tracking, MediaWiki-Database
jcrespo created T207940: Large transaction-related errors and other problems (tracking).
Thu, Oct 25, 11:50 AM · Tracking, MediaWiki-Database
jcrespo raised the priority of T207881: excessive "lock wait timeout exceeded " error rate after deploying 1.33.0-wmf.1 to group1 from Normal to Unbreak Now!.
Thu, Oct 25, 8:32 AM · wikidata-tech-focus, Wikidata, MediaWiki-extensions-WikibaseRepository, MW-1.33-notes (1.33.0-wmf.1; 2018-10-23), Patch-For-Review, MediaWiki-Database, DBA, Wikimedia-production-error
jcrespo added a comment to T207901: dbproxy1005 reports database failover.

T207881 is mediawiki, db1072 is m5, nothing to do.

Thu, Oct 25, 8:02 AM · cloud-services-team, DBA
jcrespo added a comment to T207901: dbproxy1005 reports database failover.

Please reload the proxy and work with @Bstorm or whoever may help to identify next steps.

Thu, Oct 25, 7:57 AM · cloud-services-team, DBA
jcrespo added a project to T207901: dbproxy1005 reports database failover: cloud-services-team.

Network likely went down at 19:23 https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1073&var-port=9104&from=1540407778981&to=1540410443852, or more likely, connections reached max_connections CC cloud-services-team

Thu, Oct 25, 7:20 AM · cloud-services-team, DBA
jcrespo added a comment to T207901: dbproxy1005 reports database failover.

The host is not up and running, it says: db1073,0,0,0,0,,0,0,0,,0,,0,0,0,0,DOWN

Thu, Oct 25, 7:11 AM · cloud-services-team, DBA

Wed, Oct 24

jcrespo closed T126252: Populate the wikishared db on all dbstores as Resolved.

This was done a long time ago on dbstore1002, and doesn't apply anymore on dbstores due to multiinstance.

Wed, Oct 24, 1:59 PM · Operations, DBA
jcrespo added a comment to T207253: Compare a few tables per section between hosts and DC.

To not just be a pain, this is how you can discover the master for a particular section automatically:

Wed, Oct 24, 11:16 AM · Patch-For-Review, User-Banyek, Wikimedia-Incident, DBA
jcrespo added a comment to T207253: Compare a few tables per section between hosts and DC.

I wouldn't use tables_to_check.txt for now

Wed, Oct 24, 11:02 AM · Patch-For-Review, User-Banyek, Wikimedia-Incident, DBA
jcrespo added a comment to T207253: Compare a few tables per section between hosts and DC.

@Banyek Hardcoding the masters in configurations seems to me like a bad idea- they are already defined redundantly 4 times on mediawiki, on puppet, on tendril and on prometheus. We should reduce the redundancy, not increase it.

Wed, Oct 24, 10:59 AM · Patch-For-Review, User-Banyek, Wikimedia-Incident, DBA
jcrespo added a comment to T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared").

@Addshore I thought you had communicated to wikidata users about that? Apparently not, or @Pigsonthewing didn't see it, could you link your messages to him?

Wed, Oct 24, 10:05 AM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikimedia-Incident, User-notice, Patch-For-Review, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata

Tue, Oct 23

jcrespo added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

COMMIT takes a few seconds

Tue, Oct 23, 2:45 PM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
jcrespo added a comment to T202715: "Lock wait timeout exceeded" when a user edits fast (from User::incEditCountImmediate).

I believe this affects mostly commonswiki, it regularly shows:

Tue, Oct 23, 2:42 PM · MW-1.33-notes (1.33.0-wmf.2; 2018-10-30), Performance-Team-notice, Performance-Team, MediaWiki-Page-editing, Commons, Wikimedia-production-error, MediaWiki-Database
jcrespo added a comment to T205452: Setup access from service to mysql.

Sorry, there is an ops clinic duty to answer this kind of request- I did my part which was creating the user account on production. I am not responsible for anything else- anybody can do an RC on beta repos, and I am definitely not in charge of those.

Tue, Oct 23, 2:04 PM · Core Platform Team Kanban (Done with CPT), Services (done), Recommendation-API, SCB, Operations, Research
jcrespo added a comment to T207253: Compare a few tables per section between hosts and DC.

that task is mostly append only

Tue, Oct 23, 1:20 PM · Patch-For-Review, User-Banyek, Wikimedia-Incident, DBA
jcrespo added a comment to T207253: Compare a few tables per section between hosts and DC.

So this has to be done (I will check in case there is a duplicate task already), not arguing against that.

Tue, Oct 23, 1:13 PM · Patch-For-Review, User-Banyek, Wikimedia-Incident, DBA
jcrespo lowered the priority of T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") from High to Normal.

We believe this to be fully fixed both on wikireplicas and on production, but will not close until extra checks confirm so.

Tue, Oct 23, 12:56 PM · Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikimedia-Incident, User-notice, Patch-For-Review, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata