Deploy refactored actor storage
Open, Needs Triage, Public

Description

The high-level checklist:

  • 1. Merge the first patch for T167246: Refactor "user" & "user_text" fields into "actor" reference table (adding new schemas and code)
  • 1.1. Check deployed extensions for needed updates.
  • 2. Perform schema change (T188299)
  • Interrupt: Remove read-both (gerrit:461440)
  • 3. Turn the feature flag to "write both, read old". See if stuff breaks.
  • 4. Run the maintenance script(s) to migrate all the old stuff to new stuff.
    • s1
    • s2
    • s3
    • s4
    • s5
    • s6
    • s7
    • s8
    • wikitech
  • 5. Turn the feature flag to "write both, read new". See if stuff breaks.
    • 5.1. Announce the pending change to wikitech-l@ and cloud@, and give time for people to update.
    • 5.2. Make sure all deployed extensions are updated.
    • Beta Cluster
    • Test wikis
    • Group0 wikis
    • Group1 wikis
    • All wikis
  • 6. Turn the feature flag to "new only".
    • Beta Cluster
    • Test wikis
    • Group0 wikis
    • Group1 wikis
    • All wikis
  • 7. Remove old schemas and code
    • 7.1. Update WMCS replicas to no longer reference old schemas (T223406).
    • 7.2. Write and merge patches to remove $wgActorTableSchemaMigrationStage, supporting code, and old schemas.
    • 7.3. Submit Schema-change task for WMF production.

For the cleaning up of revision_actor_temp, see T215466. This and T166733 both block T161671, which in turn is the first step of that task.
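
The feature-flag stages in the checklist ("write both, read old", "write both, read new", "new only") can be modelled as combinations of read/write bits. The following is a minimal sketch under stated assumptions: the `Stage` class, the bit values, and the `plan` helper are hypothetical names chosen for illustration and are not taken from this task or from MediaWiki's code.

```python
from enum import IntFlag


class Stage(IntFlag):
    # Illustrative bit values (an assumption, not taken from the task).
    WRITE_OLD = 0x01
    READ_OLD = 0x02
    WRITE_NEW = 0x10
    READ_NEW = 0x20


# The three stages named in the checklist, built from the bits above.
WRITE_BOTH_READ_OLD = Stage.WRITE_OLD | Stage.WRITE_NEW | Stage.READ_OLD
WRITE_BOTH_READ_NEW = Stage.WRITE_OLD | Stage.WRITE_NEW | Stage.READ_NEW
NEW_ONLY = Stage.WRITE_NEW | Stage.READ_NEW


def plan(stage):
    """Return which schema(s) get written and which one is read for a stage."""
    writes = []
    if stage & Stage.WRITE_OLD:
        writes.append("old")
    if stage & Stage.WRITE_NEW:
        writes.append("new")
    read = "new" if stage & Stage.READ_NEW else "old"
    return {"write": writes, "read": read}
```

Because both schemas keep being written until the final stage, stepping back from "read new" to "read old" stays possible if something breaks.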

Related Objects

Status    Assigned          Task
Open      Anomie
Resolved  Marostegui
Open      Bstorm
Open      Anomie
Resolved  Jdforrester-WMF
Open      None
Open      None
Resolved  Marostegui
Resolved  Anomie
Resolved  Catrope
Resolved  Anomie
Resolved  Anomie
Resolved  Anomie
Resolved  Anomie
Resolved  Anomie
Open      Anomie
Open      None
Open      MaxSem
Open      MusikAnimal
Open      Bstorm
Open      Milimetric

Event Timeline

Probably because I depooled db1113:3315 and db1113:3316.
db1113:3316 is now up, so that should have stopped for ruwiki.
db1113:3315 will remain down for a couple more hours, so that still affects dewiki and shwiki.

From IRC (#mediawiki-core) yesterday:

[16:09:43] <anomie> AaronSchulz: Do you know of any code path in the Database or LoadBalancer that would ignore exceptions about failed connections? Maybe in waitForReplication(), trying to reconnect every time that's called? See T188327#4887781 and later comments.
[21:50:38] <AaronSchulz> anomie: LoadMonitor::getServerStates should log such errors rarely due to caching. safeWaitForMasterPos() seems like it would fatal and probably is missing a !$masterConn check. LB::waitForReplication() does look plausible. I guess if https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/394430/26/includes/libs/rdbms/loadbalancer/LoadBalancer.php was merged, the log spam would be a different message instead with less connection attempts.

@Anomie I have seen this query running for 10 minutes on enwiki master:

| 1040540374 | wikiadmin       | 10.64.16.77:43252  | enwiki             | Query       |      648 | Sending data                                                          | SELECT /* MigrateActors::migrateToTemp www-data@mwmain... */  rev_id,rev_user,rev_user_text,CASE WHEN rev_user = 0 OR rev_user IS NULL THEN (SELECT  actor_id  FROM `actor`    WHERE (rev_user_text = actor_name)  ) ELSE (SELECT  actor_id  FROM `actor`    WHERE (rev_user = actor_user)  ) END AS `actor_id`,rev_timestamp AS `revactor_timestamp`,rev_page AS `revactor_page`  FROM `revision` LEFT JOIN `revision_actor_temp` ON ((rev_id=revactor_rev))   WHERE revactor_rev IS NULL  ORDER BY rev_id LIMIT 2000

Is it expected to have such a long-running query on the master?

Ugh. The script errored out on enwiki somewhere in the middle of the archive table, after having processed all of revision. After being restarted it's running through revision again looking for rows that are still missing an actor (and finding almost none, of course).

It looks like it just finished with that as I was looking at this. If I have to restart enwiki again, I'll make a local copy of the script with the revision bit commented out.
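
The query quoted above has no lower bound on rev_id, so a restarted run has to walk the already-migrated part of revision again. One common way to make such a batch loop resumable is keyset pagination, where each batch starts after the last rev_id seen. The sketch below illustrates that technique on an in-memory SQLite database; it is a swapped-in illustration, not how MigrateActors works, and `missing_batches`, `demo_db`, and the `start_after` parameter are hypothetical names. Only the table and column names come from the quoted query.

```python
import sqlite3


def missing_batches(conn, batch_size, start_after=0):
    """Yield batches of rev_ids that have no revision_actor_temp row yet.

    Each query only looks at rev_id values above the last one returned,
    so a restart can pass the last processed id as start_after.
    """
    last_id = start_after
    while True:
        rows = conn.execute(
            "SELECT rev_id FROM revision "
            "LEFT JOIN revision_actor_temp ON rev_id = revactor_rev "
            "WHERE revactor_rev IS NULL AND rev_id > ? "
            "ORDER BY rev_id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        batch = [row[0] for row in rows]
        yield batch
        last_id = batch[-1]


def demo_db():
    """Build a tiny stand-in database: revisions 1-5, with 2 and 4 migrated."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE revision (rev_id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE revision_actor_temp (revactor_rev INTEGER PRIMARY KEY)"
    )
    conn.executemany(
        "INSERT INTO revision VALUES (?)", [(i,) for i in range(1, 6)]
    )
    conn.executemany(
        "INSERT INTO revision_actor_temp VALUES (?)", [(2,), (4,)]
    )
    return conn
```

Calling `missing_batches(demo_db(), 2, start_after=3)` skips everything up to rev_id 3, which is the behaviour a restart would want once the earlier range is known to be done.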

Marostegui added a comment. Edited Jan 19 2019, 5:27 PM

@Anomie can you let us know when the script finishes everywhere, so we can start assuming that any new lag on codfw isn't because of the migration?
Thank you!

So far wikitech finished on the 14th, s8 finished on the 16th, and s7 finished yesterday. The other sections are still running.

s2 is finished. s1, s3, s4, s5, and s6 are still running.

A few s3 wikis might run into T188327#4892827, but since it's supposed to be all small wikis I hope they'll complete quickly despite the inefficient query.

Thanks for the heads up

s3 and s6 are done now.

Jdforrester-WMF updated the task description.
Anomie updated the task description. Jan 28 2019, 4:52 PM

s5 is done. s1 and s4 are still in progress.

Mentioned in SAL (#wikimedia-operations) [2019-01-30T18:03:19Z] <jynus> reducing innodb consistency options for db2048 T188327

Anomie updated the task description. Jan 30 2019, 6:10 PM

s4 finished late Monday (UTC). s1 is still running.

Anomie updated the task description. Jan 30 2019, 11:03 PM

s1 is done. Next step is to run some double-checks on the vslow replicas.

Mentioned in SAL (#wikimedia-operations) [2019-01-31T10:22:43Z] <jynus> resetting to defaults innodb consistency options for db2048 T188327

Anomie updated the task description. Feb 4 2019, 9:08 PM

Checks passed. The only rows without an actor are a few log_search rows where target_author_id refers to a user_id that doesn't exist.

Spot checking some rows where xx_user_text doesn't match the user name for xx_user suggests that over the years we've probably had some bugs in user renaming where rows got skipped or where people could edit while being renamed, maybe some incomplete manual renames from before MediaWiki-extensions-Renameuser existed, and really old weirdness like that described in T106941.
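
A spot check of this kind could look roughly like the sketch below, which lists revision rows whose stored user text differs from the current name of the referenced user. It is only an illustration on an in-memory SQLite database: the schema is heavily simplified, the sample data is invented, and `mismatched_names` and `spot_check_db` are hypothetical helper names, not the checks that were actually run.

```python
import sqlite3


def mismatched_names(conn):
    """Return revision rows whose rev_user_text differs from the user's name.

    Rows with rev_user = 0 (anonymous edits) have no matching user row
    and are therefore left out by the inner join.
    """
    return conn.execute(
        'SELECT rev_id, rev_user, rev_user_text, user_name '
        'FROM revision JOIN "user" ON user_id = rev_user '
        'WHERE rev_user_text <> user_name '
        'ORDER BY rev_id'
    ).fetchall()


def spot_check_db():
    """Build a tiny stand-in database with one stale user name."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE "user" (user_id INTEGER PRIMARY KEY, user_name TEXT)'
    )
    conn.execute(
        "CREATE TABLE revision ("
        "rev_id INTEGER PRIMARY KEY, rev_user INTEGER, rev_user_text TEXT)"
    )
    conn.executemany(
        'INSERT INTO "user" VALUES (?, ?)', [(1, "Alice"), (2, "Bob")]
    )
    conn.executemany(
        "INSERT INTO revision VALUES (?, ?, ?)",
        [
            (10, 1, "Alice"),      # consistent row
            (11, 2, "Robert"),     # stale name, e.g. skipped during a rename
            (12, 0, "192.0.2.1"),  # anonymous edit, no user row
        ],
    )
    return conn
```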

One example of weirdness, found at random:

Anomie updated the task description. Feb 6 2019, 8:29 PM

@Anomie Are consumers of the Toolforge replicas safely able to use the new actor storage? Looks like the columns are all there, just wanted to know if the data has been backfilled.

Anomie added a comment. Edited Apr 3 2019, 4:45 PM

Yes, the data is backfilled. I suppose I may as well send out the email (step 5.1 on the checklist) this afternoon, rather than waiting until tomorrow.

Edit: https://lists.wikimedia.org/pipermail/cloud/2019-April/000621.html

Change 501000 had a related patch set uploaded (by Anomie; owner: Anomie):
[operations/mediawiki-config@master] Set actor migration to read-new on Beta Cluster

https://gerrit.wikimedia.org/r/501000

Change 501000 merged by jenkins-bot:
[operations/mediawiki-config@master] Set actor migration to read-new on Beta Cluster

https://gerrit.wikimedia.org/r/501000

Anomie updated the task description. Apr 4 2019, 7:18 PM

Change 501594 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Default $wgActorTableSchemaMigrationStage to READ_NEW

https://gerrit.wikimedia.org/r/501594

Change 501595 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Default $wgActorTableSchemaMigrationStage to SCHEMA_COMPAT_NEW

https://gerrit.wikimedia.org/r/501595

Change 501631 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/extensions/AbuseFilter@master] Actually create user in AbuseFilterConsequencesTest

https://gerrit.wikimedia.org/r/501631

Change 501631 merged by jenkins-bot:
[mediawiki/extensions/AbuseFilter@master] Actually create user in AbuseFilterConsequencesTest

https://gerrit.wikimedia.org/r/501631

Change 502226 had a related patch set uploaded (by Anomie; owner: Anomie):
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-both/read-new on test wikis & mediawikiwiki

https://gerrit.wikimedia.org/r/502226

Change 502226 merged by jenkins-bot:
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-both/read-new on test wikis & mediawikiwiki

https://gerrit.wikimedia.org/r/502226

Mentioned in SAL (#wikimedia-operations) [2019-04-08T14:17:34Z] <anomie@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-both/read-new on test wikis and mediawikiwiki (T188327) (duration: 00m 59s)
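
The group-by-group rollout (test wikis, then group 0, group 1, and finally all wikis) amounts to a default value plus overrides for the wikis already switched. Here is a rough sketch of that idea only; the dictionary layout, the group membership, and the `resolve_stage` function are hypothetical and do not reproduce the actual wmf-config format.

```python
def resolve_stage(wiki, groups, overrides, default):
    """Pick the migration stage for a wiki.

    A per-wiki override wins, then the first matching group override,
    and otherwise the default applies.
    """
    if wiki in overrides:
        return overrides[wiki]
    for group in groups.get(wiki, []):
        if group in overrides:
            return overrides[group]
    return default


# Hypothetical state after switching only the test wikis and mediawikiwiki.
GROUPS = {"testwiki": ["testwikis"], "test2wiki": ["testwikis"]}
OVERRIDES = {
    "testwikis": "write-both/read-new",
    "mediawikiwiki": "write-both/read-new",
}
DEFAULT = "write-both/read-old"
```

With these values, `resolve_stage("enwiki", GROUPS, OVERRIDES, DEFAULT)` still yields the old read path, so a problem found on the test wikis never reaches the large wikis.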

Anomie updated the task description. Apr 8 2019, 2:44 PM
Anomie added a comment. Apr 8 2019, 2:46 PM
  • 5.2. Make sure all deployed extensions are updated.

CI passes on Ic483d0fd (all fixes there and in Iab2fc959 were bugs in tests rather than the code itself), and some quick greps don't turn up anything obvious.

This is a heads-up/reminder about T220080, posted here redundantly so that everybody is aware of it. It has no specific impact on this ticket beyond the obvious; I'm just being verbose to prevent concurrent maintenance tasks.

Change 501594 merged by jenkins-bot:
[mediawiki/core@master] Default $wgActorTableSchemaMigrationStage to READ_NEW

https://gerrit.wikimedia.org/r/501594

Change 502794 had a related patch set uploaded (by Anomie; owner: Anomie):
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-both/read-new on group 0

https://gerrit.wikimedia.org/r/502794

Change 502794 merged by jenkins-bot:
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-both/read-new on group 0

https://gerrit.wikimedia.org/r/502794

Mentioned in SAL (#wikimedia-operations) [2019-04-10T13:42:33Z] <anomie@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-both/read-new on group0 (T188327) (duration: 01m 00s)

Change 504011 had a related patch set uploaded (by Anomie; owner: Anomie):
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-both/read-new on group 1

https://gerrit.wikimedia.org/r/504011

Change 504011 merged by jenkins-bot:
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-both/read-new on group 1

https://gerrit.wikimedia.org/r/504011

Change 501595 merged by jenkins-bot:
[mediawiki/core@master] Default $wgActorTableSchemaMigrationStage to SCHEMA_COMPAT_NEW

https://gerrit.wikimedia.org/r/501595

Change 507614 had a related patch set uploaded (by Jforrester; owner: Anomie):
[mediawiki/core@REL1_33] Default $wgActorTableSchemaMigrationStage to READ_NEW

https://gerrit.wikimedia.org/r/507614

Change 507615 had a related patch set uploaded (by Jforrester; owner: Anomie):
[mediawiki/core@REL1_33] Default $wgActorTableSchemaMigrationStage to SCHEMA_COMPAT_NEW

https://gerrit.wikimedia.org/r/507615

Change 507614 merged by jenkins-bot:
[mediawiki/core@REL1_33] Default $wgActorTableSchemaMigrationStage to READ_NEW

https://gerrit.wikimedia.org/r/507614

Change 507615 merged by jenkins-bot:
[mediawiki/core@REL1_33] Default $wgActorTableSchemaMigrationStage to SCHEMA_COMPAT_NEW

https://gerrit.wikimedia.org/r/507615

Change 509844 had a related patch set uploaded (by Anomie; owner: Anomie):
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-both/read-new on remaining wikis

https://gerrit.wikimedia.org/r/509844

Change 509844 merged by jenkins-bot:
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-both/read-new on remaining wikis

https://gerrit.wikimedia.org/r/509844

Mentioned in SAL (#wikimedia-operations) [2019-05-13T13:36:37Z] <anomie@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-both/read-new on all wikis (T188327) (duration: 00m 50s)

Change 509883 had a related patch set uploaded (by Anomie; owner: Anomie):
[operations/mediawiki-config@master] Set actor migration to write-new on Beta Cluster

https://gerrit.wikimedia.org/r/509883

Change 509883 merged by jenkins-bot:
[operations/mediawiki-config@master] Set actor migration to write-new on Beta Cluster

https://gerrit.wikimedia.org/r/509883

Change 510503 had a related patch set uploaded (by Anomie; owner: Anomie):
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-new/read-new on test wikis & mediawikiwiki

https://gerrit.wikimedia.org/r/510503

Change 510503 merged by jenkins-bot:
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-new/read-new on test wikis & mediawikiwiki

https://gerrit.wikimedia.org/r/510503

Mentioned in SAL (#wikimedia-operations) [2019-05-15T13:21:15Z] <anomie@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-new/read-new on testwikis and mediawikiwiki (T188327) (duration: 00m 57s)

Anomie updated the task description.
Anomie updated the task description. Wed, May 15, 6:55 PM

Change 511444 had a related patch set uploaded (by Anomie; owner: Anomie):
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-new/read-new on group 0

https://gerrit.wikimedia.org/r/511444

Change 511444 merged by jenkins-bot:
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-new/read-new on group 0

https://gerrit.wikimedia.org/r/511444

Mentioned in SAL (#wikimedia-operations) [2019-05-20T15:13:08Z] <anomie@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-new/read-new on group 0 (T188327) (duration: 00m 55s)

Anomie updated the task description.