Marostegui (Manuel Aróstegui)
User

Today

  • Clear sailing ahead.

Tomorrow

  • Clear sailing ahead.

Sunday

  • Clear sailing ahead.

User Details

User Since
Sep 1 2016, 6:48 AM (111 w, 22 h)
Availability
Available
IRC Nick
marostegui
LDAP User
Marostegui
MediaWiki User
MArostegui (WMF) [ Global Accounts ]

TZ: UTC +1/+2

Recent Activity

Yesterday

Marostegui updated the task description for T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared").
Thu, Oct 18, 4:09 PM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
Marostegui added a comment to T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared").

Update from Jaime 18th Oct 16:05: s8 core hosts all finished getting fixed (pending labs)

Thu, Oct 18, 4:09 PM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
Marostegui added a comment to T202051: db2042 (m3) master RAID battery failed.

If it fails again, I suggest we go for a DC failover.

Thu, Oct 18, 4:01 PM · User-Banyek, Operations, ops-codfw, DBA
Marostegui added a comment to T207385: Create a check on the DC failover script to see if codfw -> eqiad replication is working before failing over to codfw (considering eqiad as the active DC by default).

So I guess it should go in Phase 0 before #4, somewhere between #3 and #4?
Or maybe even before #2, as, if it is not connected, it may be better to abort before wasting time on the next steps?
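
For illustration, the kind of pre-flight check this boils down to (a sketch, not the actual cookbook code), run against the designated eqiad master:

-- Verify the eqiad master is actually replicating from codfw
-- before starting the failover:
SHOW SLAVE STATUS\G
-- Abort unless the output exists at all and shows:
--   Slave_IO_Running: Yes
--   Slave_SQL_Running: Yes
--   Seconds_Behind_Master: (small)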

Thu, Oct 18, 3:19 PM · Operations-Software-Development, Datacenter-Switchover-2018
Marostegui reopened T207212: Degraded RAID on db2051 as "Open".

Leave it open until it finally gets rebuilt. They fail quite often unfortunately, especially on old hosts, and they need Papaul or Chris to pull the disk out and then back in.

Thu, Oct 18, 2:55 PM · DBA, Operations, ops-codfw
Marostegui added a comment to T207385: Create a check on the DC failover script to see if codfw -> eqiad replication is working before failing over to codfw (considering eqiad as the active DC by default).

Sure, we can add a step that checks the parser cache replication/heartbeat.
Could you precisely outline in which phase we need to check what, and also update the SwitchDatacenter wiki page so that it is clear that the step is needed even before we automate it in the cookbooks?

Off the top of my head I think it should go where the check that all the masters are up to date is.
Can you give me a link to the switch wiki page?

Thank you!

Mhhh, but that is done during the read-only period, while this one seems to me like it should be done beforehand, unless I'm missing something.
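
For reference, a minimal sketch of such a heartbeat freshness check (assuming the usual pt-heartbeat table layout and UTC timestamps; the deployed schema may differ):

-- How fresh is the newest heartbeat row on a parsercache replica?
SELECT TIMESTAMPDIFF(SECOND, MAX(ts), UTC_TIMESTAMP()) AS lag_seconds
FROM heartbeat.heartbeat;
-- Anything beyond a few seconds means replication is stopped or
-- badly delayed, and the switchover should not proceed.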

Thu, Oct 18, 1:40 PM · Operations-Software-Development, Datacenter-Switchover-2018
Marostegui added a comment to T207385: Create a check on the DC failover script to see if codfw -> eqiad replication is working before failing over to codfw (considering eqiad as the active DC by default).

Sure, we can add a step that checks the parser cache replication/heartbeat.
Could you precisely outline in which phase we need to check what, and also update the SwitchDatacenter wiki page so that it is clear that the step is needed even before we automate it in the cookbooks?

Thu, Oct 18, 1:22 PM · Operations-Software-Development, Datacenter-Switchover-2018
Marostegui added a comment to T206992: Create replication icinga check for the Parsercache hosts.

I have created T207385 so we can follow the discussion there.

Thu, Oct 18, 12:45 PM · Patch-For-Review, Wikimedia-Incident, User-Banyek, DBA
Marostegui created T207385: Create a check on the DC failover script to see if codfw -> eqiad replication is working before failing over to codfw (considering eqiad as the active DC by default).
Thu, Oct 18, 12:45 PM · Operations-Software-Development, Datacenter-Switchover-2018
Marostegui added a comment to T206992: Create replication icinga check for the Parsercache hosts.

@Volans ^ is that something we can do on the dc switchover script?

Thu, Oct 18, 12:36 PM · Patch-For-Review, Wikimedia-Incident, User-Banyek, DBA
Marostegui added a comment to T206992: Create replication icinga check for the Parsercache hosts.

I did check the parsercache hosts before the failover, to make sure they were all green - I would have seen that check and I would have remembered that they have to have replication enabled.

That is my point- the check is green now, even if replication isn't working. Even if it wasn't (there may be a parameter for that), the error is not the replication, but the "freshness" of the data (hit ratio if it was active). We stop replication all the time- we need to check replication is working and recent - e.g. maybe add it to the replication checks on switchover (but not the read only phase), in addition to this check.

Thu, Oct 18, 12:33 PM · Patch-For-Review, Wikimedia-Incident, User-Banyek, DBA
Marostegui added a comment to T206992: Create replication icinga check for the Parsercache hosts.

I am not sure how useful this is, honestly- this alert would not have prevented the issue at all:

MariaDB Slave IO: pc1	OK 	2018-10-18 12:26:14 	0d 3h 7m 42s 	1/3 	OK slave_io_state not a slave
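
A minimal illustration of why (standard MariaDB behaviour, nothing host-specific):

-- On a host where replication has never been configured there is no
-- replica status at all:
SHOW SLAVE STATUS;
-- returns "Empty set", so a check that maps "not a slave" to OK can
-- never fire; for the parsercache topology that state needs to be
-- treated as CRITICAL instead.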
Thu, Oct 18, 12:29 PM · Patch-For-Review, Wikimedia-Incident, User-Banyek, DBA
Marostegui added a comment to T206992: Create replication icinga check for the Parsercache hosts.

Is this alert fully deployed?

Thu, Oct 18, 12:25 PM · Patch-For-Review, Wikimedia-Incident, User-Banyek, DBA
Marostegui moved T207359: compress wb_changes_dispatch on testwikidatawiki from Triage to Backlog on the DBA board.

For the record, this table only has 4 rows, so it can probably be done directly on the master with replication (once we are out of the woods with s8).
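
A sketch of the direct-on-master alter (options matching the compressed CREATE_OPTIONS visible further down this feed for db2083):

-- With only 4 rows this is cheap enough to just let it replicate:
ALTER TABLE wb_changes_dispatch
  ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;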

Thu, Oct 18, 8:08 AM · wikidata-tech-focus, DBA, MediaWiki-extensions-WikibaseRepository, Wikidata
Marostegui removed a project from T162558: Use memcached (or something similar) to keep the latest chd_seen state, only flush to table every once in a while: DBA.
Thu, Oct 18, 7:53 AM · User-Daniel, Performance, Wikidata
Marostegui added a comment to T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared").

On db1124 with instance s8 we have a replication error:

Last_Error: Could not execute Delete_rows_v1 event on table wikidatawiki.pagelinks; Can't find record in 'pagelinks', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log db1087-bin.003073, end_log_pos 582738698

Where was this error message logged? I have been looking for an error log on db1124 without success.

Thu, Oct 18, 6:37 AM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
Marostegui added a comment to T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared").

Update from 17th at 19:04 from Jaime:
all tables except wb_terms, which is half done, should be equal on the s8 master

Thu, Oct 18, 5:43 AM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
RandomDSdevel awarded T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") a Baby Tequila token.
Thu, Oct 18, 1:23 AM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata

Wed, Oct 17

Marostegui added a comment to T206992: Create replication icinga check for the Parsercache hosts.
Wed, Oct 17, 4:52 PM · Patch-For-Review, Wikimedia-Incident, User-Banyek, DBA
Marostegui added a comment to T206992: Create replication icinga check for the Parsercache hosts.

Where is the compiler run link?

Wed, Oct 17, 4:31 PM · Patch-For-Review, Wikimedia-Incident, User-Banyek, DBA
Marostegui moved T196378: Investigate solutions for MySQL connection pooling from In progress to Next on the DBA board.
Wed, Oct 17, 1:48 PM · DBA, Availability (MediaWiki-MultiDC), Performance-Team (Radar), Operations
Marostegui triaged T207273: Parser cache hit ratio alerting as Normal priority.

This needs some thought in order to make it effective:

Wed, Oct 17, 1:19 PM · monitoring, DBA
Marostegui closed T206740: parsercache used disk space increase as Resolved.

Let's close it and if something breaks we can reopen.
Good job!

Wed, Oct 17, 1:18 PM · MediaWiki-Cache, Performance-Team (Radar), User-Banyek, Performance-Team-notice, Datacenter-Switchover-2018, Operations, DBA
Marostegui updated the task description for T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared").
Wed, Oct 17, 1:05 PM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
Marostegui added a comment to T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared").

Update from Jaime at 12:10 UTC
All tables on s8 master fixed except pagelinks and wb_terms

Wed, Oct 17, 1:05 PM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
Marostegui added a comment to T146585: Add a primary key to user_newtalk.

So user_id,user_ip as PK would require changes to MW?
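
For clarity, the change under discussion would be something like this sketch, which is only valid if (user_id, user_ip) is in fact unique - exactly the open question:

-- Add the composite PK being discussed for user_newtalk:
ALTER TABLE user_newtalk
  ADD PRIMARY KEY (user_id, user_ip);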

Wed, Oct 17, 12:54 PM · Patch-For-Review, MediaWiki-Database
Marostegui raised the priority of T146585: Add a primary key to user_newtalk from Normal to High.
Wed, Oct 17, 9:35 AM · Patch-For-Review, MediaWiki-Database
Marostegui added a comment to T146585: Add a primary key to user_newtalk.

@Krinkle @Reedy could you give this some priority so we can get a PK in place for user_newtalk? Right now we are dealing with recoveries on T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared"), and not having a PK here slows down the recoveries on that table a bit, so that's another reason to have a PK for it :-)

Wed, Oct 17, 9:35 AM · Patch-For-Review, MediaWiki-Database
Marostegui added a comment to T206740: parsercache used disk space increase.

The 'parsercache' database takes around 1.5T of data on pc1004.
The binlogs are 7GB on pc1004, but they are cleaned up every hour.
On the pc2004 host the binlogs take 155GB now, and the expire_logs_days variable is set to 1, so if we stop the purger on pc1004, the same amount of logs will accumulate there. So let's say 1.5T + 0.2T = 1.7T on a 2.2T disk (I rounded up the numbers for safety).
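
For reference, the retention described above maps to standard MariaDB knobs (a sketch, nothing host-specific):

SHOW VARIABLES LIKE 'expire_logs_days';  -- 1 on pc2004, per the above
SHOW BINARY LOGS;                        -- per-file binlog sizes
-- Manual purge, should disk pressure ever demand it:
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 1 DAY;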

Wed, Oct 17, 7:59 AM · MediaWiki-Cache, Performance-Team (Radar), User-Banyek, Performance-Team-notice, Datacenter-Switchover-2018, Operations, DBA
Marostegui added a comment to T206740: parsercache used disk space increase.

Keep in mind that the disk space is always constant on the pc hosts: https://grafana.wikimedia.org/dashboard/file/server-board.json?panelId=17&fullscreen&orgId=1&var-server=pc1005&var-network=eth0&from=1532045819383&to=1535494432566
If it grows then that is normally an indication of an issue.

Wed, Oct 17, 7:51 AM · MediaWiki-Cache, Performance-Team (Radar), User-Banyek, Performance-Team-notice, Datacenter-Switchover-2018, Operations, DBA
Marostegui added a subtask for T207258: rack/setup/install pc1007-pc1010: Unknown Object (Task).
Wed, Oct 17, 7:46 AM · Patch-For-Review, Operations, ops-eqiad, DBA
Marostegui added a subtask for T207259: rack/setup/install pc2007-pc2010: Unknown Object (Task).
Wed, Oct 17, 7:46 AM · Patch-For-Review, Operations, ops-codfw, DBA
Marostegui triaged T207258: rack/setup/install pc1007-pc1010 as Normal priority.
Wed, Oct 17, 7:45 AM · Patch-For-Review, Operations, ops-eqiad, DBA
Marostegui triaged T207259: rack/setup/install pc2007-pc2010 as Normal priority.
Wed, Oct 17, 7:45 AM · Patch-For-Review, Operations, ops-codfw, DBA
Marostegui created T207259: rack/setup/install pc2007-pc2010.
Wed, Oct 17, 7:45 AM · Patch-For-Review, Operations, ops-codfw, DBA
Marostegui created T207258: rack/setup/install pc1007-pc1010.
Wed, Oct 17, 7:44 AM · Patch-For-Review, Operations, ops-eqiad, DBA
Marostegui added a project to T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared"): Wikimedia-Incident.
Wed, Oct 17, 6:39 AM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
Marostegui moved T207253: Compare a few tables per section between hosts and DC from To Triage to Follow-up/Actionables on the Wikimedia-Incident board.
Wed, Oct 17, 6:38 AM · Wikimedia-Incident, DBA
Marostegui triaged T207253: Compare a few tables per section between hosts and DC as Normal priority.
Wed, Oct 17, 6:38 AM · Wikimedia-Incident, DBA
Marostegui created T207253: Compare a few tables per section between hosts and DC.
Wed, Oct 17, 6:38 AM · Wikimedia-Incident, DBA
Marostegui added a comment to T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared").

Update from yesterday at around 14:00 UTC:
@jcrespo has done an amazing job of manually checking and fixing most of the tables on db1087 (which is the labs master, and not as easy to reclone as the others). He's gone through all the tables to check and fix them.
Right now all the tables apart from pagelinks and wb_terms are supposed to be fixed (although replication is broken on the db1124 sanitarium master for wb_items_per_site, which is kind of expected with such an amount of rows to check - we are looking at it).

Wed, Oct 17, 6:16 AM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
Marostegui added a comment to T206593: Productionize db2096 on x1.

I have added it to tendril.

Wed, Oct 17, 5:53 AM · Patch-For-Review, User-Banyek, DBA
Marostegui closed T201133: db1069 (x1 master) memory errors as Resolved.

As it happened before, this recovered itself - closing for now:

04:26 < icinga-wm> RECOVERY - Memory correctable errors -EDAC- on db1069 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops
Wed, Oct 17, 5:10 AM · ops-eqiad, Operations, DBA
Marostegui closed T201133: db1069 (x1 master) memory errors, a subtask of T189107: DB meta task for next DC failover issues, as Resolved.
Wed, Oct 17, 5:10 AM · Patch-For-Review, Epic, Operations, DBA

Tue, Oct 16

Marostegui closed T141095: Explore 'Analyze' statement as substitute for Explain as Declined.

I am going to decline this, as there are no plans to use ANALYZE as a substitute for EXPLAIN. Instead, SHOW EXPLAIN, maybe in combination with https://tools.wmflabs.org/tools-info/optimizer.py, should be used. T141095#2895716
If anyone feels this needs to be re-opened, please do so.
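
For reference, a minimal sketch of how SHOW EXPLAIN is used (MariaDB; the thread id below is hypothetical):

-- SHOW EXPLAIN inspects the plan of a query that is already running,
-- without re-executing it:
SHOW PROCESSLIST;        -- find the thread id of the running query
SHOW EXPLAIN FOR 12345;  -- 12345 being that thread id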

Tue, Oct 16, 12:36 PM · DBA, Cloud-VPS, Cloud-Services
Marostegui closed T141095: Explore 'Analyze' statement as substitute for Explain, a subtask of T140788: Labs databases rearchitecture (tracking), as Declined.
Tue, Oct 16, 12:36 PM · Epic, Tracking, DBA, Cloud-Services, Cloud-VPS
Marostegui updated the task description for T207165: eventlogging_db_sanitization script failed.
Tue, Oct 16, 12:23 PM · Analytics-Kanban, Analytics
Marostegui renamed T207165: eventlogging_db_sanitization script failed from eventloggiong_db_sanitization script failed to eventlogging_db_sanitization script failed.
Tue, Oct 16, 12:23 PM · Analytics-Kanban, Analytics
Marostegui added a comment to T207165: eventlogging_db_sanitization script failed.

I've ack'ed the alerts for 3 hours on db1107 and db1108.

Tue, Oct 16, 12:22 PM · Analytics-Kanban, Analytics
Marostegui created T207165: eventlogging_db_sanitization script failed.
Tue, Oct 16, 12:20 PM · Analytics-Kanban, Analytics
Marostegui updated the task description for T206740: parsercache used disk space increase.
Tue, Oct 16, 10:12 AM · MediaWiki-Cache, Performance-Team (Radar), User-Banyek, Performance-Team-notice, Datacenter-Switchover-2018, Operations, DBA
Marostegui added a comment to T206740: parsercache used disk space increase.

You can probably disconnect pc1005 and pc1006 today already, but keep executing the truncation without binlog, just in case.
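
A sketch of the "truncating without binlog" pattern (the pc000 table name is illustrative; parsercache is sharded into many pcNNN tables):

-- Disable binary logging for this session only, so the statement
-- never replicates to the other DC:
SET SESSION sql_log_bin = 0;
TRUNCATE TABLE parsercache.pc000;
SET SESSION sql_log_bin = 1;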

Tue, Oct 16, 10:10 AM · MediaWiki-Cache, Performance-Team (Radar), User-Banyek, Performance-Team-notice, Datacenter-Switchover-2018, Operations, DBA
Marostegui added a comment to T206740: parsercache used disk space increase.

Considering we are at around 70% in eqiad, and should remain like that, I don't think it is worth the hassle and the risk of depooling hosts at this point. So probably just truncating the codfw tables and leaving the codfw -> eqiad replication disconnected is the way to go here.

Tue, Oct 16, 9:57 AM · MediaWiki-Cache, Performance-Team (Radar), User-Banyek, Performance-Team-notice, Datacenter-Switchover-2018, Operations, DBA
Marostegui moved T71127: Discrepancies with logging table on different wikis from Triage to Backlog on the DBA board.
Tue, Oct 16, 9:46 AM · Data-Services, DBA
Marostegui added a subtask for T104459: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves: T71127: Discrepancies with logging table on different wikis.
Tue, Oct 16, 9:46 AM · Release-Engineering-Team (Watching / External), Wikimedia-Incident, Datasets-General-or-Unknown, Patch-For-Review, WorkType-NewFunctionality, DBA
Marostegui added a parent task for T71127: Discrepancies with logging table on different wikis: T104459: Automate the check and fix of object, schema and data drifts between mediawiki HEAD, production masters and slaves.
Tue, Oct 16, 9:46 AM · Data-Services, DBA
Marostegui added a comment to T206740: parsercache used disk space increase.

Luckily all the tables there are around 6.5GB, so yeah, I guess picking one is good enough.

Tue, Oct 16, 9:43 AM · MediaWiki-Cache, Performance-Team (Radar), User-Banyek, Performance-Team-notice, Datacenter-Switchover-2018, Operations, DBA
Marostegui added a comment to T206740: parsercache used disk space increase.

Pick some of the biggest tables and try to alter them to see how much you gain, and then we can probably extrapolate.
If it is not worth running on eqiad, that is good to know. Codfw should still be truncated after your tests, I would suggest.

Tue, Oct 16, 9:41 AM · MediaWiki-Cache, Performance-Team (Radar), User-Banyek, Performance-Team-notice, Datacenter-Switchover-2018, Operations, DBA
Marostegui added a comment to T147169: Make sure Wikibase dump maintenance scripts solely use the "dump" db group.

@jcrespo Do you have any more information on what to look out for here?

The problem is most likely the loadbalancer- working with old data, even if the server is not needed, generates lag warnings due to some complex dependency of how the lag check works (generating 40K logs per minute). Abandon ship- this is not going to work, and nothing that wikibase developers or DBAs do can fix it.

Although you could do a new database request every 10 minutes or so to mitigate that: kill the process and restart it so it reloads the new config.

We currently run the dumper instances for about 1h... would it be helpful to shorten this to maybe 30m?

Tue, Oct 16, 9:29 AM · Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), User-Addshore, MW-1.32-notes (WMF-deploy-2018-07-10 (1.32.0-wmf.12)), Wikidata-Campsite, Wikidata-Ministry-Of-Magic, Wikidata, MediaWiki-extensions-WikibaseRepository
Marostegui updated subscribers of T206593: Productionize db2096 on x1.

Is there a tool for updating zarcillo about the host, or should I 'INSERT INTO ...'?
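
For illustration only, the manual route would be a plain insert along these lines (the table and column names here are hypothetical, not the real zarcillo schema):

-- Illustrative shape of the manual INSERT being asked about:
INSERT INTO zarcillo.instances (name, server, port, `group`)
VALUES ('db2096', 'db2096.codfw.wmnet', 3306, 'x1');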

Tue, Oct 16, 8:34 AM · Patch-For-Review, User-Banyek, DBA
Marostegui added a comment to T207095: Prepare and check storage layer for vnwikimedia.

I think the wiki will be public: publicly readable, editing restricted. But I will leave it to @Urbanecm to confirm whether this can be replicated to labs.
This one is replicated to labs: T168788#3378730

Tue, Oct 16, 8:28 AM · User-Banyek, DBA, Cloud-Services
Marostegui added a comment to T206593: Productionize db2096 on x1.

Make sure you enable notifications first and update zarcillo DB before repooling.

Tue, Oct 16, 8:22 AM · Patch-For-Review, User-Banyek, DBA
Marostegui added a comment to T206593: Productionize db2096 on x1.

What's pending here?

Tue, Oct 16, 7:27 AM · Patch-For-Review, User-Banyek, DBA
Marostegui added a comment to T206740: parsercache used disk space increase.

Incident report: https://wikitech.wikimedia.org/wiki/Incident_documentation/20181016-eqiad_parsercache_empty_post-switchover

Tue, Oct 16, 6:39 AM · MediaWiki-Cache, Performance-Team (Radar), User-Banyek, Performance-Team-notice, Datacenter-Switchover-2018, Operations, DBA
Marostegui added a comment to T206841: Evaluate the consequences of the parsercache being empty post-switchover.

Incident report (please feel free to add or modify whatever you feel it needs some changes!): https://wikitech.wikimedia.org/wiki/Incident_documentation/20181016-eqiad_parsercache_empty_post-switchover

Tue, Oct 16, 6:38 AM · User-Joe, Datacenter-Switchover-2018, DBA, Operations
Marostegui triaged T206992: Create replication icinga check for the Parsercache hosts as Normal priority.
Tue, Oct 16, 5:42 AM · Patch-For-Review, Wikimedia-Incident, User-Banyek, DBA

Mon, Oct 15

Marostegui added a comment to T206623: /usr/local/sbin/wikireplica_dns timeouts.

Yeah, S3 has like 90 wikis!
Thank you!
Feel free to resolve this ticket if this is fully done from your side!
Thanks again!

Mon, Oct 15, 6:24 PM · cloud-services-team (Kanban), Data-Services
Marostegui reopened T206623: /usr/local/sbin/wikireplica_dns timeouts as "Open".

It needs to also run for the S3 shard.

Mon, Oct 15, 5:31 PM · cloud-services-team (Kanban), Data-Services
Marostegui closed T205514: db1092 crashed - BBU broken as Resolved.

Battery replaced by Chris - thank you!:

Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
Mon, Oct 15, 3:55 PM · User-Banyek, Operations, ops-eqiad, Patch-For-Review, DBA
Marostegui updated the task description for T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared").
Mon, Oct 15, 1:28 PM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
Marostegui added a comment to T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared").

So as Jaime said at T206743#4666169, all the pooled replicas in eqiad now have the content from codfw, so that is now consistent.
What is pending is db1087 (depooled), which is the master for db1124 (the labsdb master) and labsdb, and of course the master, db1071, which doesn't receive reads.
So labsdb still has inconsistencies.

Mon, Oct 15, 1:28 PM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
Marostegui added a comment to T205865: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch.

All the pooled replicas now have compressed tables; can you confirm from your end that this is good?

Mon, Oct 15, 1:24 PM · User-Addshore, Wikidata-Campsite, Patch-For-Review, DBA, Operations, MediaWiki-General-or-Unknown, Performance, wikidata-tech-focus, Wikidata, Datacenter-Switchover-2018
Marostegui created P7679 (An Untitled Masterwork).
Mon, Oct 15, 9:47 AM
Marostegui added a comment to T204006: Execute the schema change for Partial Blocks.

Now that I look at it, we are even more than 50% done, as we have also done lots of sections in both DCs already, with only 4 sections pending in eqiad. So I don't think we have delayed this that much, considering the ticket was ready for us on Sept 13th.

Mon, Oct 15, 8:42 AM · Patch-For-Review, DBA, Blocked-on-schema-change, Anti-Harassment
Marostegui updated the task description for T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared").
Mon, Oct 15, 8:35 AM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
Marostegui added a comment to T206740: parsercache used disk space increase.

Sounds good!
Also, as per T206740#4659202, let's create a separate task for the replication check addition, so we can just focus on the immediate triage on this task, I would say.

Mon, Oct 15, 8:02 AM · MediaWiki-Cache, Performance-Team (Radar), User-Banyek, Performance-Team-notice, Datacenter-Switchover-2018, Operations, DBA
Marostegui created P7678 (An Untitled Masterwork).
Mon, Oct 15, 7:20 AM
Marostegui added a comment to T205865: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch.

Another test with db1104 before and after compressing:

root@db1104.eqiad.wmnet[wikidatawiki]> show create table wb_changes_dispatch\G
*************************** 1. row ***************************
       Table: wb_changes_dispatch
Create Table: CREATE TABLE `wb_changes_dispatch` (
  `chd_site` varbinary(32) NOT NULL,
  `chd_db` varbinary(32) NOT NULL,
  `chd_seen` int(11) NOT NULL DEFAULT '0',
  `chd_touched` varbinary(14) NOT NULL DEFAULT '00000000000000',
  `chd_lock` varbinary(64) DEFAULT NULL,
  `chd_disabled` tinyint(3) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`chd_site`),
  KEY `wb_changes_dispatch_chd_seen` (`chd_seen`),
  KEY `wb_changes_dispatch_chd_touched` (`chd_touched`)
) ENGINE=InnoDB DEFAULT CHARSET=binary
1 row in set (0.00 sec)
Mon, Oct 15, 7:15 AM · User-Addshore, Wikidata-Campsite, Patch-For-Review, DBA, Operations, MediaWiki-General-or-Unknown, Performance, wikidata-tech-focus, Wikidata, Datacenter-Switchover-2018
Marostegui added a comment to T205865: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch.

@hoo after db1109 has been recloned (and now has compressed tables):

root@db1109.eqiad.wmnet[wikidatawiki]> SELECT COUNT(*) FROM wb_changes_dispatch;
+----------+
| COUNT(*) |
+----------+
|      577 |
+----------+
1 row in set (0.00 sec)
Mon, Oct 15, 6:35 AM · User-Addshore, Wikidata-Campsite, Patch-For-Review, DBA, Operations, MediaWiki-General-or-Unknown, Performance, wikidata-tech-focus, Wikidata, Datacenter-Switchover-2018
Marostegui added a comment to T206740: parsercache used disk space increase.

@Banyek please double check that the key purge has finished on mwmaint1002 and keep on with the rest of the pending things to do here.
Probably after re-enabling the normal config options you might want to check whether the table rebuild is worth the time by doing it on codfw (make sure not to replicate the changes) and comparing before/after space, so we can evaluate it (as we are in good shape now).

Mon, Oct 15, 5:38 AM · MediaWiki-Cache, Performance-Team (Radar), User-Banyek, Performance-Team-notice, Datacenter-Switchover-2018, Operations, DBA
Marostegui moved T206592: Significant (17x) increase in time spent by updateSpecialPages.php script since datacenter switch over updating commons special pages from Triage to Blocked external/Not db team on the DBA board.

I'm away for the next 1.5 weeks for a conference. I will follow up when I get back.

Mon, Oct 15, 5:36 AM · DBA, Datacenter-Switchover-2018, MediaWiki-Special-pages
Marostegui moved T205865: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch from Triage to In progress on the DBA board.
Mon, Oct 15, 5:32 AM · User-Addshore, Wikidata-Campsite, Patch-For-Review, DBA, Operations, MediaWiki-General-or-Unknown, Performance, wikidata-tech-focus, Wikidata, Datacenter-Switchover-2018
Marostegui added a comment to T205865: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch.
wikiadmin@db1109(wikidatawiki)> SELECT * FROM information_schema.tables WHERE table_name = 'wb_changes_dispatch'\G
 *************************** 1. row ***************************
   TABLE_CATALOG: def
    TABLE_SCHEMA: wikidatawiki
      TABLE_NAME: wb_changes_dispatch
      TABLE_TYPE: BASE TABLE
          ENGINE: InnoDB
         VERSION: 10
      ROW_FORMAT: Compact
      TABLE_ROWS: 577
  AVG_ROW_LENGTH: 141
     DATA_LENGTH: 81920
 MAX_DATA_LENGTH: 0
    INDEX_LENGTH: 4907859968
       DATA_FREE: 7340032
  AUTO_INCREMENT: NULL
     CREATE_TIME: 2018-04-26 09:16:42
     UPDATE_TIME: NULL
      CHECK_TIME: NULL
 TABLE_COLLATION: binary
        CHECKSUM: NULL
  CREATE_OPTIONS: 
   TABLE_COMMENT: 
 1 row in set (0.00 sec)
 
 wikiadmin@db2083(wikidatawiki)> SELECT * FROM information_schema.tables WHERE table_name = 'wb_changes_dispatch'\G
 *************************** 1. row ***************************
   TABLE_CATALOG: def
    TABLE_SCHEMA: wikidatawiki
      TABLE_NAME: wb_changes_dispatch
      TABLE_TYPE: BASE TABLE
          ENGINE: InnoDB
         VERSION: 10
      ROW_FORMAT: Compressed
      TABLE_ROWS: 577
  AVG_ROW_LENGTH: 70
     DATA_LENGTH: 40960
 MAX_DATA_LENGTH: 0
    INDEX_LENGTH: 540672
       DATA_FREE: 9961472
  AUTO_INCREMENT: NULL
     CREATE_TIME: 2018-05-31 10:31:11
     UPDATE_TIME: NULL
      CHECK_TIME: NULL
 TABLE_COLLATION: binary
        CHECKSUM: NULL
  CREATE_OPTIONS: row_format=COMPRESSED key_block_size=8
   TABLE_COMMENT: 
 1 row in set (0.03 sec)

The index length differences look interesting…
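
For context, an in-place rebuild is the standard way to collapse that kind of index bloat (a sketch; the recloning/compression done above achieves the same):

-- A null-alter rebuilds the table and its indexes in place, which
-- would collapse the ~4.9 GB INDEX_LENGTH seen on db1109:
ALTER TABLE wb_changes_dispatch ENGINE=InnoDB;
-- or equivalently: OPTIMIZE TABLE wb_changes_dispatch;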

Mon, Oct 15, 5:32 AM · User-Addshore, Wikidata-Campsite, Patch-For-Review, DBA, Operations, MediaWiki-General-or-Unknown, Performance, wikidata-tech-focus, Wikidata, Datacenter-Switchover-2018
Marostegui closed T205257: BBU problems dbstore2002 as Resolved.

This is no longer about dbstore2002 but about db2042, so let's follow up on that task: T202051
dbstore2002 is good for now, so let's close this and re-open if necessary:

Mon, Oct 15, 5:28 AM · User-Banyek, DBA
Marostegui added a comment to T204006: Execute the schema change for Partial Blocks.

@Marostegui - I've heard from multiple people about unexpected fires delaying Ops/DBA work, but no additional information. Is there a phab ticket we can follow for these fires?

Mon, Oct 15, 5:27 AM · Patch-For-Review, DBA, Blocked-on-schema-change, Anti-Harassment
Marostegui created P7677 (An Untitled Masterwork).
Mon, Oct 15, 5:13 AM
Marostegui added a project to T206965: Degraded RAID on dbstore1002: Analytics.
Mon, Oct 15, 5:08 AM · Analytics, ops-eqiad, Operations

Sun, Oct 14

Marostegui added a comment to T202764: Wikidata produces a lot of failed requests for recentchanges API.

eqiad db API hosts return the query in less than a second as they have the correct schema:

root@db1104.eqiad.wmnet[wikidatawiki]> select @@hostname;
+------------+
| @@hostname |
+------------+
| db1104     |
+------------+
1 row in set (0.00 sec)
Sun, Oct 14, 6:23 AM · User-Addshore, Operations, Wikidata-Query-Service, Wikidata

Sat, Oct 13

Marostegui moved T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared") from Triage to In progress on the DBA board.
Sat, Oct 13, 7:46 AM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
Marostegui moved T202051: db2042 (m3) master RAID battery failed from Blocked external/Not db team to In progress on the DBA board.
Sat, Oct 13, 7:46 AM · User-Banyek, Operations, ops-codfw, DBA
Marostegui moved T201133: db1069 (x1 master) memory errors from Next to In progress on the DBA board.
Sat, Oct 13, 7:45 AM · ops-eqiad, Operations, DBA

Fri, Oct 12

Marostegui added a comment to T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared").
root@neodymium:/home/marostegui# ./section s8 | while read host port; do echo $host; mysql.py -h$host:$port wikidatawiki -BN -e "select page_id, page_title, page_latest from page where page_id = 99480;";done
labsdb1011.eqiad.wmnet
99480	Q97215	745463455
labsdb1010.eqiad.wmnet
99480	Q97215	745463455
labsdb1009.eqiad.wmnet
99480	Q97215	745463455
dbstore2001.codfw.wmnet
99480	Q97215	745463455
dbstore1002.eqiad.wmnet
99480	Q97215	745463455
db2094.codfw.wmnet
99480	Q97215	745463455
db2086.codfw.wmnet
99480	Q97215	745463455
db2085.codfw.wmnet
99480	Q97215	745463455
db2083.codfw.wmnet
99480	Q97215	745463455
db2082.codfw.wmnet
99480	Q97215	745463455
db2081.codfw.wmnet
99480	Q97215	745463455
db2080.codfw.wmnet
99480	Q97215	745463455
db2079.codfw.wmnet
99480	Q97215	745463455
db2045.codfw.wmnet
99480	Q97215	745463455
db1124.eqiad.wmnet
99480	Q97215	745463455
db1116.eqiad.wmnet
99480	Q97215	745463455
db1109.eqiad.wmnet
99480	Q97215	745463455
db1104.eqiad.wmnet
99480	Q97215	745463455
db1101.eqiad.wmnet
99480	Q97215	745463455
db1099.eqiad.wmnet
99480	Q97215	745463455
db1092.eqiad.wmnet
99480	Q97215	745463455
db1087.eqiad.wmnet
99480	Q97215	745463455
db1071.eqiad.wmnet
99480	Q97215	745463455
Fri, Oct 12, 6:43 PM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
Marostegui added a comment to T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared").

I'll run a maint script to fix the page_latest of the 9000ish pages that have an incorrect one so that we don't end up with more inconsistencies / broken stuff moving forward.
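
Conceptually, the fix per page looks like this sketch (the real maintenance script goes through MediaWiki rather than raw SQL; page 99480 is the example checked further down this feed):

-- Repoint page_latest at the newest surviving revision:
UPDATE page
SET page_latest = (SELECT MAX(rev_id)
                   FROM revision
                   WHERE rev_page = page_id)
WHERE page_id = 99480;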

Fri, Oct 12, 6:28 PM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
Marostegui added a comment to T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared").

Which means there are ~9433 pages that now probably have the wrong page_latest, for example:

Fri, Oct 12, 6:23 PM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata
Marostegui added a comment to T206743: S8 replication issues leading to rows missing during eqiad -> codfw switch (Was: "A few lexemes disappeared").

We are also fully restoring the eqiad hosts from codfw, which has all the good data.
The fix done on Friday was a quick fix to get all the data in there.

Fri, Oct 12, 6:05 PM · Wikimedia-Incident, User-notice, Patch-For-Review, Wikidata-Campsite (Wikidata-Campsite-Iteration-∞), Wikidata-Campsite, DBA, User-Addshore, Datacenter-Switchover-2018, Lexicographical data, Wikidata

Thu, Oct 11

Marostegui updated subscribers of T201343: rack/setup/install mwmaint1002.eqiad.wmnet.

Yes.

mariadb: remove mwmaint1001 from prod-m5 SQL grants - https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465645/
But I can pretty much self-merge those, except the mariadb one, which needs deployment. Some need rebasing.

Thu, Oct 11, 6:21 PM · Patch-For-Review, Datacenter-Switchover-2018, ops-eqiad, Operations
Marostegui added a comment to T206740: parsercache used disk space increase.

Forgetting the codfw -> eqiad replication was the most likely cause of overload on the application servers (and on External storage hosts).

I think we should track this (and the respective actionables) in a different task

Thu, Oct 11, 6:16 PM · MediaWiki-Cache, Performance-Team (Radar), User-Banyek, Performance-Team-notice, Datacenter-Switchover-2018, Operations, DBA
Marostegui created P7666 (An Untitled Masterwork).
Thu, Oct 11, 3:15 PM
Marostegui added a comment to T204006: Execute the schema change for Partial Blocks.

We have had unexpected fires that might last a while, so this might be delayed beyond next week. We will do whatever we can, but I cannot promise this will get done next week or the week after.
Sorry.

Thu, Oct 11, 2:53 PM · Patch-For-Review, DBA, Blocked-on-schema-change, Anti-Harassment
Marostegui lowered the priority of T206740: parsercache used disk space increase from Unbreak Now! to High.

This is no longer "unbreak now"

pc1004
Filesystem                 Type  Size  Used Avail Use% Mounted on
/dev/mapper/pc1004--vg-srv xfs   2.2T  1.6T  644G  71% /srv
Thu, Oct 11, 2:23 PM · MediaWiki-Cache, Performance-Team (Radar), User-Banyek, Performance-Team-notice, Datacenter-Switchover-2018, Operations, DBA