Paste P6486

Chat Log

Authored by daniel on Dec 19 2017, 4:21 PM.
[16:35] <RoanKattouw> addshore: DanielK_WMDE_: (Moving here because both -tech and -dev are noisy) During my walk I figured out WHY the "revision does not exist" message happened. It was related to ChronologyProtector, but not in the way we thought: it happened BECAUSE it was doing its job
[16:36] <addshore> :D
[16:36] <RoanKattouw> For newly created pages/revisions, the text table rows were written to the replica (and so weren't on the master), but the revision table rows were written to the master (and replicated to the replica)
[16:36] <RoanKattouw> When you first save the page, the replica hasn't caught up yet, so CP ensures that your next page view reads from the master (which never happens otherwise), because it's the only up-to-date server
[16:37] <RoanKattouw> The master doesn't have the text row, so trying to get the text fails
[16:37] <RoanKattouw> Then when you refresh, the replica has caught up, so your page view reads from the replica, and it has both rows so it works fine
[16:37] <addshore> aaah, and then the second refresh reads from the replica
[16:37] <addshore> Sounds like a thoughtful walk :)
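The read path RoanKattouw describes above can be sketched as a toy model. This is only an illustration under stated assumptions: the dict-based "servers" and the helpers `new_server`, `save_page`, `replicate` and `load_text` are hypothetical and are not MediaWiki code. ChronologyProtector itself is not modelled; the caller simply chooses which server to read from.

```python
def new_server():
    # Hypothetical stand-in for one database server holding two tables.
    return {"text": {}, "revision": {}}


def save_page(master, replica, rev_id, content):
    """Model the buggy save: text row goes to the replica, revision row to the master."""
    old_id = max(replica["text"], default=0) + 1
    replica["text"][old_id] = content      # bug: written to the replica only
    master["revision"][rev_id] = old_id    # correct: written to the master
    return old_id


def replicate(master, replica):
    """Replication only flows master -> replica, so only revision rows move here."""
    replica["revision"].update(master["revision"])


def load_text(server, rev_id):
    """Return the revision text, or None if either row is missing on this server."""
    old_id = server["revision"].get(rev_id)
    if old_id is None:
        return None
    return server["text"].get(old_id)


if __name__ == "__main__":
    master, replica = new_server(), new_server()
    save_page(master, replica, rev_id=1, content="hello")

    # First view after saving: the replica is lagging, so the read goes to the master.
    print(load_text(master, 1))   # the text row is not on the master

    # After replication catches up, reads go back to the replica.
    replicate(master, replica)
    print(load_text(replica, 1))  # both rows are present on the replica
```

In this model the master never receives the text row at all, which matches the point that the first page view failed precisely because it was routed to the master.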
[16:37] <RoanKattouw> The real problems began when an AbuseFilter rule was hit, because AF was still writing to the master
[16:38] <RoanKattouw> So the master assigns that text row an old_id which it thinks is the next available old_id, but the replica has already used that ID for something else
[16:38] <addshore> and then that is the point the replication exploded
[16:38] <RoanKattouw> Then when the replica tries to replicate that insertion, it fails because of an ID collision, and replication stops
[16:39] <RoanKattouw> Leaving both the replica and the master in a broken state: the master has revision rows pointing to old_ids that don't exist, or if they do, point to AbuseFilter data
[16:39] <addshore> RoanKattouw: so wikidatawiki on beta is also broken
[16:40] <RoanKattouw> And the replica has one AbuseFilter log entry that points to an old_id that points to revision text (and no more, because replication stops at this point)
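The old_id collision described above can be illustrated with a small simulation. It is a simplified sketch, not how MySQL replication is implemented: the `TextTable` class, the `DuplicateKey` exception and `apply_replicated_insert` are hypothetical names, and auto-increment is approximated as "highest existing ID plus one" on each server.

```python
class DuplicateKey(Exception):
    """Raised when an insert reuses an existing old_id (stand-in for a duplicate-key error)."""


class TextTable:
    """Hypothetical model of one server's text table: old_id -> blob."""

    def __init__(self):
        self.rows = {}

    def insert(self, blob, old_id=None):
        # With no explicit ID, each server picks its own next ID independently.
        if old_id is None:
            old_id = max(self.rows, default=0) + 1
        if old_id in self.rows:
            raise DuplicateKey(old_id)
        self.rows[old_id] = blob
        return old_id


def apply_replicated_insert(replica, old_id, blob):
    """Replay a master insert on the replica using the ID the master assigned."""
    replica.insert(blob, old_id=old_id)


if __name__ == "__main__":
    master, replica = TextTable(), TextTable()

    # Revision text was (wrongly) written straight to the replica.
    rev_old_id = replica.insert("revision text")

    # AbuseFilter still writes to the master, which hands out the same ID.
    af_old_id = master.insert("AbuseFilter data")
    print(rev_old_id, af_old_id)

    try:
        apply_replicated_insert(replica, af_old_id, "AbuseFilter data")
    except DuplicateKey as exc:
        print("replication stops: old_id already used on the replica:", exc)
```

The two servers end up disagreeing about what the same old_id means, which is the broken state described in the messages at 16:39 and 16:40.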
[16:40] <RoanKattouw> Yeah I can imagine
[16:40] <RoanKattouw> I'm just about to check the others
[16:40] <addshore> https://phabricator.wikimedia.org/T183232
[16:40] <addshore> well that ticket actually talks about enwiki
[16:40] <RoanKattouw> I fixed enwiki and deploymentwiki by transferring the text rows from the replica to the master and updating the references to them in the revision table for their new IDs
[16:40] <addshore> you can probably write a query to find all wikis that have edits between now and the time the patch was first merged / landed on beta
[16:41] <RoanKattouw> I could do that but it's easier to just compare the text tables on the master and replica
[16:41] <RoanKattouw> If there are rows they disagree on, that means I need to fix things
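The comparison idea above could look roughly like the following. This is a sketch under assumptions: it takes the two text tables as already-fetched dicts mapping old_id to blob, and `diff_text_tables` is a hypothetical helper; the log does not show the queries or tooling that were actually used.

```python
def diff_text_tables(master_rows, replica_rows):
    """Return {old_id: (master_blob, replica_blob)} for every row the servers disagree on.

    A blob of None means the row is missing on that server.
    """
    disagreements = {}
    for old_id in sorted(master_rows.keys() | replica_rows.keys()):
        master_blob = master_rows.get(old_id)
        replica_blob = replica_rows.get(old_id)
        if master_blob != replica_blob:
            disagreements[old_id] = (master_blob, replica_blob)
    return disagreements


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    master_rows = {1: "old text", 2: "AbuseFilter data"}
    replica_rows = {1: "old text", 2: "revision text A", 3: "revision text B"}
    for old_id, (m, r) in diff_text_tables(master_rows, replica_rows).items():
        print(old_id, "master:", m, "replica:", r)
```

An empty result would mean the wiki needs no repair; any entry marks a row to transfer or re-point.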
[16:42] <RoanKattouw> Going to start doing that with wikidatawiki now
[16:42] <addshore> well, sorry for this fallout, and thanks for helping!
[16:43] <RoanKattouw> No worries!
[16:43] <RoanKattouw> 4 layers failed here
[16:43] <RoanKattouw> (Author, reviewer, MW DB abstraction, read-only flag on the DB server)
[16:43] <RoanKattouw> So I can hardly blame any individual one