MCR schema migration stage 2: populate new fields
Closed, Resolved · Public

Description

  • Set MediaWiki to write content meta-data to the old AND the new columns (via config[**]). Don't forget to also do this for new entries in the archive table.
  • Wait a bit and watch for performance issues caused by writing to the new table.
  • Run maintenance/populateContentTables.php to populate the content table. The script needs to support chunking (and maybe also sharding, for parallel operation).
  • Keep watching for performance issues while the new table grows.

Operation of populateContentTables.php:

  • Select n rows from the revision table that do not have a corresponding entry in the content table (a WHERE NOT EXISTS subquery is probably better than a LEFT JOIN for this, because of LIMIT). A query sketch follows below this list.
  • For each such row, construct a corresponding row for the content and slots tables[*][**]. The rows can either be collected in an array for later mass-insert, or inserted individually, possibly buffered in a transaction.
  • The content_models, content_formats, and content_roles tables will be populated as a side effect of calling the assignId() function to get numeric IDs for content models, formats, and roles.
  • When all rows in one chunk have been processed, insert/commit the new rows in the content table and wait for slaves to catch up.
  • Repeat until there are no more rows in revision that have no corresponding row in content. This will eventually be the case, since web requests are already populating the content table when creating new rows in revision.
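
As a rough illustration of the selection step, a chunked query along these lines could be used (a sketch only, not the script's actual query; it assumes the main-slot-only state, pre-MCR revision columns, and that the link from revisions to content rows goes via the slots table):

SELECT rev_id, rev_text_id, rev_content_model, rev_content_format, rev_len, rev_sha1
FROM revision
WHERE NOT EXISTS (SELECT 1 FROM slots WHERE slot_revision_id = rev_id)
ORDER BY rev_id
LIMIT 1000;

The LIMIT provides the chunking; a sharded run could additionally constrain rev_id to disjoint ranges, one per worker.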

https://www.mediawiki.org/wiki/Multi-Content_Revisions/Content_Meta-Data#Phase_II:_Population
https://www.mediawiki.org/wiki/Multi-Content_Revisions/Schema_Migration

Event Timeline

@Addshore, the MCR pert says you and Katie are working on this, and I see you wrote a related gerrit patch. Should this be assigned to you?

So I am writing some of the underlying code, but that is probably best tracked in T174024.
I won't be doing the actual migration, but I believe that is what this ticket is targeted towards.

@Addshore, the MCR pert says you and Katie are working on this, and I see you wrote a related gerrit patch. Should this be assigned to you?

So I am writing some of the underlying code, but that is probably best tracked in T174024.
I won't be doing the actual migration, but I believe that is what this ticket is targeted towards.

Good to know, thank you!

Change 406595 merged by jenkins-bot:
[mediawiki/core@master] [MCR] RevisionStore, enable insertions for new schema

https://gerrit.wikimedia.org/r/406595

Mentioned in SAL (#wikimedia-operations) [2018-07-25T15:31:45Z] <anomie> running populateContentTables.php on testwiki for T183488

FYI, mwscript populateContentTables.php --wiki=testwiki --batch-size 2000 processed 355991 rows in 170.07610702515 seconds. That was significantly faster than was expected based on the test DB results from T196172, but on the other hand it's a very small sample size and testwiki's database may be unique in other ways.

It was also too small of a sample to really tell if this is going to cause codfw slave lag when run on larger wikis (maintenance scripts wait for slaves in the same DC, but cross-DC can't be checked so easily).

Daniel proposed the following schedule:

  • Today: test2wiki, testwikidatawiki
  • This week (WN31): mediawikiwiki, metawiki
  • Next week (WN32): wikidatawiki, commonswiki
  • Week after (WN33): everything

I'm going to start on this now.

Mentioned in SAL (#wikimedia-operations) [2018-08-02T05:30:43Z] <TimStarling> on mwmaint1001 running populateContentTables.php as described in T183488

test2wiki and testwikidatawiki are complete.

Daniel proposed the following schedule:

  • Today: test2wiki, testwikidatawiki
  • This week (WN31): mediawikiwiki, metawiki
  • Next week (WN32): wikidatawiki, commonswiki
  • Week after (WN33): everything

Given it's Friday I don't think the rest of this week (WN31) will happen :) Should this schedule be updated?

Please do add the relevant wiki runs to the "week of" section of the deployment calendar for the corresponding week: https://wikitech.wikimedia.org/wiki/Deployments

Mentioned in SAL (#wikimedia-operations) [2018-08-06T04:09:55Z] <TimStarling> on mwmaint1001 running populateContentTables.php on mediawikiwiki T183488

Mentioned in SAL (#wikimedia-operations) [2018-08-06T04:29:45Z] <TimStarling> on mwmaint1001 running populateContentTables.php on metawiki T183488

@greg The WN31 things are done now, only 1081 seconds for mediawikiwiki and 9252 seconds for metawiki. For metawiki the rate was about the same as anomie got for testwiki, 2000 rows per second for the revision table and 600 rows per second for the archive table. At that rate, we can expect wikidatawiki to take about 91 hours and commonswiki to take about 48 hours. We can run them concurrently since they are on different DB clusters, and that way maybe get them done by the end of the week.

Mentioned in SAL (#wikimedia-operations) [2018-08-07T03:46:00Z] <TimStarling> on mwmaint1001 running populateContentTables.php concurrently on wikidatawiki and commonswiki (T183488)

Mentioned in SAL (#wikimedia-operations) [2018-08-13T01:45:10Z] <TimStarling> on mwmaint1001 running populateContentTables.php on all wikis using ~tstarling/pct-list T183488

Let's keep an eye on performance metrics, especially https://grafana.wikimedia.org/dashboard/db/save-timing. Although with the estimates given in T183488#4480579, it may be desirable to run things faster and for a shorter time rather than stretched out over a long period.

Meanwhile, I will keep an eye on the disk usage and size.

@tstarling Please stop writes going to *s2* unless they have already finished; there are plans to put s2 into read-only mode without much advance warning (unscheduled read only), see T201694. I don't know yet exactly what the plan will be (a logical failover or a network maintenance), but we should preventively pause all long-running write traffic to s2 until I know the details and it is safe to continue writing in batches.

@tstarling Please stop writes going to *s2* unless they have already finished

Done. s2 was up to itwiki rev_id 3012040.

enwiki should take about another 6 days at the current rate.

Log summary:

  • aawikibooks failed with "Error: 1062 Duplicate entry '3003-1' for key 'PRIMARY' (10.64.0.205)"
  • cawiki failed with "Replication wait failed: Server shutdown in progress" and will need to be restarted.
  • gotwikibooks also failed with a duplicate key error
  • s5 (dewiki) is complete
  • s2 was killed

So 4 concurrent processes are currently running: s1, s3, s6 and s7.

aawikibooks failed with "Error: 1062 Duplicate entry '3003-1' for key 'PRIMARY' (10.64.0.205)"

It's annoying that this error message doesn't say which *table* this was on. My guess is that it's the slots table, which has primary key (slot_revision_id, slot_role_id). slot_role_id is always 1 for now (for the main slot). So how do we end up trying to insert a row for revision 3003 twice?

My first guess was that we have the same revision ID in the revision and the archive table, from an incomplete deletion. This is however not the case. Instead, there are two rows in the archive table with the same ar_rev_id, for what seem to be completely different revisions:

MariaDB [aawikibooks_p]> select ar_id, ar_rev_id, ar_page_id, ar_namespace, ar_title, ar_timestamp, ar_text_id from archive where ar_rev_id = 3003\G
*************************** 1. row ***************************
       ar_id: 29
   ar_rev_id: 3003
  ar_page_id: NULL
ar_namespace: 8
    ar_title: Accesskey-watch
ar_timestamp: 20060701183713
  ar_text_id: 2956
*************************** 2. row ***************************
       ar_id: 987
   ar_rev_id: 3003
  ar_page_id: NULL
ar_namespace: 8
    ar_title: Group-sysop
ar_timestamp: 20060701183714
  ar_text_id: 3003
2 rows in set (0.00 sec)

That's a bit worrying. How did that happen?

Anyway, one of them needs to get a fresh revision id, probably best by setting ar_rev_id to NULL, and then running populateArchiveRevId.php to make sure the new revision ID is assigned safely.
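
A minimal sketch of that fix for the aawikibooks rows above, assuming ar_rev_id is still nullable at this point and that we (arbitrarily) keep ar_id 29 and reset ar_id 987:

-- clear the duplicate so a fresh, safe revision ID can be assigned
UPDATE archive SET ar_rev_id = NULL WHERE ar_id = 987 AND ar_rev_id = 3003;

followed by something like mwscript populateArchiveRevId.php --wiki=aawikibooks to fill in the new ar_rev_id.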

You can see the full logs at mwmaint1001:/var/log/mediawiki/populateContentTables/ . On both aawikibooks and gotwikibooks, the error occurred on the second batch of the archive table, starting at ar_rev_id 2001. In both cases it was also the last batch, with the maximum ar_rev_id being 3275 and 3175 respectively.

Mentioned in SAL (#wikimedia-operations) [2018-08-14T11:40:41Z] <TimStarling> restarted populateContentTables.php on s2 (T183488)

So how do we end up trying to insert a row for revision 3003 twice?

max(rev_id) on this wiki is 3002, so the ar_rev_id certainly came from populateArchiveRevId.php, which allocated ar_rev_id values by inserting dummy rows into revision and deleting them in the same transaction. It may be that MySQL has some method to reallocate autoincremented IDs in this case which we didn't know about.

I mentioned this to Tim, and he suggested I mention it to Brad (and Daniel?): it would be nice to run a read-only process to check the consistency of the migration; it should take much less time than the writes and would avoid headaches, given the issues with the existing data and other possible complications that could have happened during the migration. I don't think anything is problem-free, but we can try to detect as many issues as we can. What do you think?

it would be nice to run a read-only process to check the consistency of the migration

We did pretty extensive testing of this on the test copies (db1111 and db1112). We could have a script that does basic sanity checks, but I'm not sure what to test, beyond the trivial. All I can think of is:

  1. all revision rows have an associated slot row.
  2. all archive rows have an associated slot row.
  3. all slot rows have an associated content row.
  4. ar_rev_id has no conflicts with rev_id
  5. ar_rev_id has no conflicts with itself

Is that what you had in mind? (1) and (2) are actually what happens when you re-run the populateContentTables.php script. It's very quick if there's nothing to do. (4) and (5) are covered by deduplicateArchiveRevId.php, which should also be quick enough.

So, I'd say we should just re-run these scripts, at least for a handful of wikis. Do you think that would be sufficient?
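
For illustration, checks (1) and (4) could be spot-checked with queries roughly like these (sketches only, assuming the standard MCR table and column names and the main-slot-only state):

-- (1) revision rows without a slot row
SELECT rev_id
FROM revision LEFT JOIN slots ON slot_revision_id = rev_id
WHERE slot_revision_id IS NULL;

-- (4) ar_rev_id values colliding with rev_id
SELECT ar_rev_id
FROM archive JOIN revision ON rev_id = ar_rev_id;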

We did pretty extensive testing of this on the test copies (db1111 and db1112). We could have a script that does basic sanity checks, but I'm not sure what to test, beyond the trivial. All I can think of is:

  • all revision rows have an associated slot row.
  • all archive rows have an associated slot row.
  • all slot rows have an associated content row.
  • ar_rev_id has no conflicts with rev_id
  • ar_rev_id has no conflicts with itself

Is that what you had in mind?

Yes, also the inverse checks, assuming it is 1:1 at the moment.

@jcrespo Ooops, I updated my comment after you already answered, sorry.

So the script will make sure not to miss any content, but it doesn't assure that it hasn't introduced extra rows or duplicated some; that is why I mention running extra checks. They take very little time (one long-running query per check) and validate the whole process. It's the same idea as "all code should have tests": all tasks should have a way to validate that they completed correctly. I think it would be easy to miss something without that, don't you think?

Duplicates are prevented by the unique index on slot_revision_id/slot_role_id. Orphan slot rows are possible, just as orphan revision rows have been possible before.

We can run a query to detect those on some wikis as a spot check. If we find nothing, I don't think full validation is needed.

You are right that all code should have tests - so ideally, the migration script should have a unit test that validates the database content. We opted for manual testing with "real" data instead, since this is a one-off conversion, and "fake" data is unlikely to expose problems anyway.
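
As an illustration of that spot check, an orphan-detection query might look roughly like this (a sketch, not a production query):

-- slot rows pointing at a revision ID that exists in neither revision nor archive
SELECT slot_revision_id
FROM slots
LEFT JOIN revision ON rev_id = slot_revision_id
LEFT JOIN archive ON ar_rev_id = slot_revision_id
WHERE rev_id IS NULL AND ar_rev_id IS NULL;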

So how do we end up trying to insert a row for revision 3003 twice?

max(rev_id) on this wiki is 3002, so the ar_rev_id certainly came from populateArchiveRevId.php, which allocated ar_rev_id values by inserting dummy rows into revision and deleting them in the same transaction.

Or deduplicateArchiveRevId.php, which does the same thing.

It may be that MySQL has some method to reallocate autoincremented IDs in this case which we didn't know about.

wikiadmin@10.64.0.205(aawikibooks)> select max(ar_rev_id) from archive;
+----------------+
| max(ar_rev_id) |
+----------------+
|           3275 |
+----------------+

wikiadmin@10.64.0.205(aawikibooks)> SELECT auto_increment FROM information_schema.tables WHERE table_schema = 'aawikibooks' and table_name = 'revision';
+----------------+
| auto_increment |
+----------------+
|           3152 |
+----------------+

That's certainly not right.

https://dev.mysql.com/doc/refman/8.0/en/innodb-auto-increment-handling.html#innodb-auto-increment-initialization may be relevant here: MySQL until 8.0 (and MariaDB until some version after 10.1.29, apparently, based on local testing) doesn't actually save the auto-increment value to disk. Instead on server restart it recalculates the value via the equivalent of SELECT MAX(...). If MAX(rev_id) < MAX(ar_rev_id) on a restart, we'll likely wind up with reused revision IDs for new revisions.

Once MCR is rolled out and the slots table is populated, and as long as every revision has a main slot, future instances of this problem on restart would cause the next MAX(ar_rev_id) - MAX(rev_id) revision inserts to fail thanks to the slots table PK. Each failure would increment the rev_id auto-increment counter, so the problem would eventually go away until the next master restart that happens to have problematic state. We might see the same sort of thing on a master failover, if the new master's rev_id auto-increment counter is somehow out of sync in a similar manner.
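
One rough way to spot (and manually correct) that state after a restart, assuming MariaDB and the production table names; the ALTER is only an illustrative manual correction, not an established procedure:

-- compare the in-memory counter with the highest revision ID actually in use
SELECT auto_increment FROM information_schema.tables
WHERE table_schema = DATABASE() AND table_name = 'revision';
SELECT GREATEST(MAX(rev_id), (SELECT MAX(ar_rev_id) FROM archive)) FROM revision;

-- if the counter is lower, bump it past the highest used ID
-- (3276 is the aawikibooks value from above; purely illustrative)
ALTER TABLE revision AUTO_INCREMENT = 3276;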

Current status: everything is done except enwiki and the T202032 wikis. enwiki has about another 49 hours to run.

enwiki is complete now, so only the T202032 wikis remain.

Mentioned in SAL (#wikimedia-operations) [2018-08-22T19:01:36Z] <anomie> re-running populateContentTables.php on cawiki for T183488

Mentioned in SAL (#wikimedia-operations) [2018-08-27T15:49:02Z] <anomie> Re-running populateContentTables.php on aawikibooks, gotwikibooks, kswikiquote, lvwikibooks, nostalgiawiki, wawikibooks and wikimania2005wiki for T183488

@Anomie What do you think of re-running the script on a few more wikis, as a spot check to verify that it finds nothing to do? If otoh the script finds stuff to migrate, something is wrong...

I'm running queries on all wikis to see if there are any revision or archive rows lacking a slots row. That found T202904 so far.

I'm running queries on all wikis to see if there are any revision or archive rows lacking a slots row. That found T202904 so far.

Wow, nice find!

Mentioned in SAL (#wikimedia-operations) [2018-08-27T16:31:53Z] <anomie> Running populateContentTables.php on advisorswiki for T183488 and T202904

Also it looks like undeletions can somehow manage to escape the script, if the undeletion happens between when the rev_id is reached in revision and when the ar_rev_id is reached in archive. Found a few on commonswiki so far, and I won't be too surprised if there are more on enwiki or other big wikis.

It looks like the "fake" SlotRecord created by RevisionStore::newRevisionFromArchiveRow() passes the $slot->hasRevision() check in RevisionStore::insertRevisionInternal() so it doesn't actually insert it.

It looks like the "fake" SlotRecord created by RevisionStore::newRevisionFromArchiveRow() passes the $slot->hasRevision() check in RevisionStore::insertRevisionInternal() so it doesn't actually insert it.

That's as designed (well, partially): when undeleting, it should not be necessary to insert slot rows, since they should already be there. That's of course not the case when undeleting from an unmigrated archive row. Nasty edge case; lucky we still write the "old" info to the revision table, otherwise we would have lost these revisions!

This will never happen when the database is in a consistently migrated or unmigrated state, but we probably want to properly deal with this in RevisionStore. How urgent do you think this is?

We should get it fixed before 1.32 is released (or did MCR undeletion make it into 1.31? I forget). For Wikimedia sites I can just make sure everything got migrated as I'm already doing.

Change 455630 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] Fix undeletion write-both/read-old mode.

https://gerrit.wikimedia.org/r/455630

commonswiki, enwiki, wikidatawiki, and zhwiki had such undeletions. I hacked a copy of the maintenance script to process only the revision IDs printed by the check query and updated them all, so we should be good now.

I'll re-run the check queries to be sure, ask me again in 5 or 6 hours.

Thank you, anomie, for running those queries, which I had suggested (or even offered to run) to check that things look consistent.

Change 455630 merged by jenkins-bot:
[mediawiki/core@master] Fix undeletion write-both/read-old mode.

https://gerrit.wikimedia.org/r/455630