
Deploy refactored actor storage
Open, Needs Triage, Public

Description

The high-level checklist:

  • 1. Merge the first patch for T167246: Refactor "user" & "user_text" fields into "actor" reference table (adding new schemas and code)
  • 1.1. Check deployed extensions for needed updates.
  • 2. Perform schema change (T188299)
  • Interrupt: Remove read-both (gerrit:461440)
  • 3. Turn the feature flag to "write both, read old" (see the configuration sketch after this checklist). See if stuff breaks.
  • 4. Run the maintenance script(s) to migrate all the old stuff to new stuff.
    • s1
    • s2
    • s3
    • s4
    • s5
    • s6
    • s7
    • s8
    • wikitech
  • 5. Turn the feature flag to "write both, read new". See if stuff breaks.
    • 5.1. Announce the pending change to wikitech-l@ and cloud@, and give time for people to update.
    • 5.2. Make sure all deployed extensions are updated.
  • 6. Turn the feature flag to "new only".
  • 7. Remove old schemas and code
    • 7.1. Update WMCS replicas to no longer reference old schemas.
    • 7.2. Write and merge patches to remove $wgActorTableSchemaMigrationStage, supporting code, and old schemas.
    • 7.3. Submit Schema-change task for WMF production.
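
For reference, the flag flips in steps 3, 5, and 6 are done by changing $wgActorTableSchemaMigrationStage in operations/mediawiki-config. A minimal sketch of the three settings, assuming the SCHEMA_COMPAT_* bitfield constants used by the 1.33-era actor migration (the constant names and placement are illustrative, not copied from the actual wmf-config patches):

// Step 3: write both schemas, read the old one.
$wgActorTableSchemaMigrationStage = SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_OLD;

// Step 5, after migrateActors.php has backfilled the actor tables:
// keep writing both, but read from the new schema.
$wgActorTableSchemaMigrationStage = SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_NEW;

// Step 6: new schema only.
$wgActorTableSchemaMigrationStage = SCHEMA_COMPAT_NEW;

In production each flip is rolled out gradually (group 0, then group 1, then all wikis), as the config changes in the timeline below show.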

For cleaning up revision_actor_temp, see T215466. This and T166733 both block T161671, which in turn is the first step of that task.

Event Timeline

There are a very large number of changes, so older changes are hidden.

Change 474704 had a related patch set uploaded (by Anomie; owner: Anomie):
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-both/read-old on group 0

https://gerrit.wikimedia.org/r/474704

Change 474704 merged by jenkins-bot:
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-both/read-old on group 0

https://gerrit.wikimedia.org/r/474704

Mentioned in SAL (#wikimedia-operations) [2018-11-19T15:08:29Z] <anomie@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-both/read-old on group 0 (T188327) (duration: 00m 47s)

Change 475767 had a related patch set uploaded (by Anomie; owner: Anomie):
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-both/read-old on group 1

https://gerrit.wikimedia.org/r/475767

Change 475767 merged by jenkins-bot:
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-both/read-old on group 1

https://gerrit.wikimedia.org/r/475767

Mentioned in SAL (#wikimedia-operations) [2018-11-26T15:24:49Z] <anomie@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-both/read-old on group 1 (T188327) (duration: 00m 46s)

CannotCreateActorException from line 2540 of /srv/mediawiki/php-1.33.0-wmf.6/includes/user/User.php: Cannot create an actor for a usable name that is not an existing user

Happened when deploying 1.33.0-wmf.6 to all wikis. It seems it affected only Wiktionary. Filed as T210732.

Change 479454 had a related patch set uploaded (by Anomie; owner: Anomie):
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-both/read-old on all wikis

https://gerrit.wikimedia.org/r/479454

Change 479454 merged by jenkins-bot:
[operations/mediawiki-config@master] Set ActorTableSchemaMigrationStage => write-both/read-old on all wikis

https://gerrit.wikimedia.org/r/479454

Mentioned in SAL (#wikimedia-operations) [2018-12-13T15:18:40Z] <anomie@deploy1001> Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-both/read-old on all wikis (T188327) (duration: 00m 45s)

Anomie updated the task description. Dec 13 2018, 3:40 PM
Anomie removed a project: Patch-For-Review.
hashar removed a subscriber: hashar. Dec 14 2018, 9:27 AM

Mentioned in SAL (#wikimedia-operations) [2018-12-17T16:07:32Z] <anomie@mwmaint1002> Running migrateActors.php on test wikis and mediawikiwiki for T188327

I wound up aborting that soon after when it turned out the queries were very slow.

Change 480121 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] Fix slow queries in migrateActors.php

https://gerrit.wikimedia.org/r/480121

Change 480121 merged by jenkins-bot:
[mediawiki/core@master] Fix slow queries in migrateActors.php

https://gerrit.wikimedia.org/r/480121

Change 483478 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@wmf/1.33.0-wmf.12] Fix slow queries in migrateActors.php

https://gerrit.wikimedia.org/r/483478

Change 483478 merged by jenkins-bot:
[mediawiki/core@wmf/1.33.0-wmf.12] Fix slow queries in migrateActors.php

https://gerrit.wikimedia.org/r/483478

Mentioned in SAL (#wikimedia-operations) [2019-01-10T18:52:08Z] <anomie@mwmaint1002> Running migrateActors.php on test wikis and mediawikiwiki for T188327. This may cause lag in codfw.

Change 483488 had a related patch set uploaded (by Paladox; owner: Anomie):
[mediawiki/core@REL1_32] Fix slow queries in migrateActors.php

https://gerrit.wikimedia.org/r/483488

Change 483488 merged by jenkins-bot:
[mediawiki/core@REL1_32] Fix slow queries in migrateActors.php

https://gerrit.wikimedia.org/r/483488

Mentioned in SAL (#wikimedia-operations) [2019-01-11T14:51:17Z] <anomie@mwmaint1002> Running migrateActors.php on test wikis and mediawikiwiki for T188327. This may cause lag in codfw.

Mentioned in SAL (#wikimedia-operations) [2019-01-14T14:41:14Z] <anomie@mwmaint1002> Running migrateActors.php on remaining section 3 wikis for T188327. This may cause lag in codfw.

Mentioned in SAL (#wikimedia-operations) [2019-01-14T14:41:53Z] <anomie@mwmaint1002> Running migrateActors.php on section 1 wikis for T188327. This may cause lag in codfw.

Mentioned in SAL (#wikimedia-operations) [2019-01-14T14:42:01Z] <anomie@mwmaint1002> Running migrateActors.php on section 2 wikis for T188327. This may cause lag in codfw.

Mentioned in SAL (#wikimedia-operations) [2019-01-14T14:42:07Z] <anomie@mwmaint1002> Running migrateActors.php on section 4 wikis for T188327. This may cause lag in codfw.

Mentioned in SAL (#wikimedia-operations) [2019-01-14T14:42:10Z] <anomie@mwmaint1002> Running migrateActors.php on section 5 wikis for T188327. This may cause lag in codfw.

Mentioned in SAL (#wikimedia-operations) [2019-01-14T14:42:12Z] <anomie@mwmaint1002> Running migrateActors.php on section 6 wikis for T188327. This may cause lag in codfw.

Mentioned in SAL (#wikimedia-operations) [2019-01-14T14:42:14Z] <anomie@mwmaint1002> Running migrateActors.php on section 7 wikis for T188327. This may cause lag in codfw.

Mentioned in SAL (#wikimedia-operations) [2019-01-14T14:42:17Z] <anomie@mwmaint1002> Running migrateActors.php on section 8 wikis for T188327. This may cause lag in codfw.

Mentioned in SAL (#wikimedia-operations) [2019-01-14T14:42:19Z] <anomie@mwmaint1002> Running migrateActors.php on wikitech for T188327. This may cause lag in codfw.

Weirdnesses encountered so far:

  • On enwiki, apparently some import of pre-2006 revisions created several rows with IP addresses in rev_user_text and non-zero (but non-existent) user IDs in rev_user. I manually zeroed those rev_user values.
  • A few instances where rev_user had a non-zero ID that didn't exist in user: some on enwiki like the above bullet, others that seem like weird one-offs. I'll likely fix these manually too, either by assigning the correct user ID if the name exists or by prefixing with unknown> if it doesn't.

When I get to the point of running a check script to make sure all rows have actors, I may also see how many there are where the actor name doesn't match xx_user_text.
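
A minimal sketch of what such a check could look like, using MediaWiki's maintenance framework and the same LEFT JOIN against revision_actor_temp that migrateActors.php uses (the script name and structure are hypothetical, not the actual check script):

<?php
// checkActorMigration.php (hypothetical): count revision rows still missing an actor row.
require_once __DIR__ . '/Maintenance.php';

class CheckActorMigration extends Maintenance {
	public function __construct() {
		parent::__construct();
		$this->addDescription( 'Count revision rows that still lack a revision_actor_temp row' );
	}

	public function execute() {
		// Use a vslow replica; on a big wiki this scan is expensive and
		// would really need to be batched by rev_id.
		$dbr = $this->getDB( DB_REPLICA, 'vslow' );
		$missing = $dbr->selectRowCount(
			[ 'revision', 'revision_actor_temp' ],
			'*',
			[ 'revactor_rev' => null ],
			__METHOD__,
			[],
			[ 'revision_actor_temp' => [ 'LEFT JOIN', 'revactor_rev = rev_id' ] ]
		);
		$this->output( "revision rows without an actor: $missing\n" );
	}
}

$maintClass = CheckActorMigration::class;
require_once RUN_MAINTENANCE_IF_MAIN;

Comparing actor_name against xx_user_text would be a similar join, just selecting rows where the two columns differ.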

Mentioned in SAL (#wikimedia-operations) [2019-01-16T09:29:34Z] <marostegui> Stop s3 actor-migration script in order to allow s3 to catch up and to avoid lag during the failover - T188327 T213858

@Anomie you can restart the s3 migration script.

Mentioned in SAL (#wikimedia-operations) [2019-01-17T12:19:01Z] <jynus> killing migrateActors.php --wiki=ptwiki on mwmaint, was using outdated db config T188327

@Anomie due to T213859 and T213748 we had to depool and power off a bunch of hosts (https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/484987/) and we noticed that your script for s2 was failing on db1074 (one of the hosts that was depooled).

php -ddisplay_errors=On /srv/mediawiki/multiversion/MWScript.php migrateActors.php --wiki=ptwiki --batch-size 2000

We killed the script.

So I don't need this any more, but people really need to think of a way to push fresh configuration to MediaWiki maintenance tasks, or establish some policy for long-running scripts, to avoid things like https://logstash.wikimedia.org/goto/523d9a64fb0821e25c2e84ca93502c1d

I suspect that trying to actually push fresh configuration would be extremely complex and fragile; simpler would be to have a general policy that maintenance scripts should be robust against being killed and restarted, and people should be prepared to check that the script actually ran to completion on each wiki. The killing of the ptwiki run is no problem for me; I'll just restart it after the rest of s2 finishes. I have to do the same for plwiki too, as that run died at 10:10 UTC when db1103:3312 was stopped.
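
In practice, "robust against being killed and restarted" mostly means the usual batching pattern: select only rows that still need migrating, walk the table in primary-key order in small batches, and wait for replication between batches, so a rerun after a kill simply skips the work that's already done. A rough sketch of that shape (a generic illustration, not the actual migrateActors.php code; the table and column names follow the revision/revision_actor_temp query quoted later in this task):

// Sketch of a restart-safe batch loop.
$dbw = wfGetDB( DB_MASTER );
$batchSize = 1000;
$lastId = 0;
do {
	$res = $dbw->select(
		[ 'revision', 'revision_actor_temp' ],
		[ 'rev_id', 'rev_user', 'rev_user_text' ],
		[
			'rev_id > ' . (int)$lastId,
			// Only rows with no actor row yet, so restarting is harmless.
			'revactor_rev' => null,
		],
		__METHOD__,
		[ 'ORDER BY' => 'rev_id', 'LIMIT' => $batchSize ],
		[ 'revision_actor_temp' => [ 'LEFT JOIN', 'revactor_rev = rev_id' ] ]
	);
	foreach ( $res as $row ) {
		// ... look up or create the actor row, insert the revision_actor_temp row ...
		$lastId = (int)$row->rev_id;
	}
	// Keep replica lag bounded between batches.
	wfWaitForSlaves();
} while ( $res->numRows() === $batchSize );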

If the DB server actually went down, migrateActors.php on that wiki would die of uncaught exception right away as has happened in several cases already. But that's obviously not what happened with ptwiki (or enwiki or dewiki) here. I see that migrateActors.php on metawiki did die during this incident though:

[11d25f6f6e1f7954847fd683] [no req]   Wikimedia\Rdbms\DBQueryError from line 1506 of /srv/mediawiki/php-1.33.0-wmf.13/includes/libs/rdbms/database/Database.php: A connection error occurred.
Query: SHOW GLOBAL VARIABLES LIKE 'gtid_%'
Function: Wikimedia\Rdbms\DatabaseMysqlBase::getReplicaPos
Error: 2006 MySQL server has gone away (10.64.0.91)

Backtrace:
#0 /srv/mediawiki/php-1.33.0-wmf.13/includes/libs/rdbms/database/Database.php(1476): Wikimedia\Rdbms\Database->makeQueryException(string, integer, string, string)
#1 /srv/mediawiki/php-1.33.0-wmf.13/includes/libs/rdbms/database/Database.php(1236): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
#2 /srv/mediawiki/php-1.33.0-wmf.13/includes/libs/rdbms/database/DatabaseMysqlBase.php(1074): Wikimedia\Rdbms\Database->query(string, string)
#3 /srv/mediawiki/php-1.33.0-wmf.13/includes/libs/rdbms/database/DatabaseMysqlBase.php(976): Wikimedia\Rdbms\DatabaseMysqlBase->getServerGTIDs(string)
#4 /srv/mediawiki/php-1.33.0-wmf.13/includes/libs/rdbms/database/DatabaseMysqlBase.php(910): Wikimedia\Rdbms\DatabaseMysqlBase->getReplicaPos()
#5 /srv/mediawiki/php-1.33.0-wmf.13/includes/libs/rdbms/loadbalancer/LoadBalancer.php(647): Wikimedia\Rdbms\DatabaseMysqlBase->masterPosWait(Wikimedia\Rdbms\MySQLMasterPos, integer)
#6 /srv/mediawiki/php-1.33.0-wmf.13/includes/libs/rdbms/loadbalancer/LoadBalancer.php(542): Wikimedia\Rdbms\LoadBalancer->doWait(integer, boolean, integer)
#7 /srv/mediawiki/php-1.33.0-wmf.13/includes/libs/rdbms/lbfactory/LBFactory.php(429): Wikimedia\Rdbms\LoadBalancer->waitForAll(Wikimedia\Rdbms\MySQLMasterPos, integer)
#8 /srv/mediawiki/php-1.33.0-wmf.13/includes/GlobalFunctions.php(2829): Wikimedia\Rdbms\LBFactory->waitForReplication(array)
#9 /srv/mediawiki/php-1.33.0-wmf.13/maintenance/migrateActors.php(297): wfWaitForSlaves()
#10 /srv/mediawiki/php-1.33.0-wmf.13/maintenance/migrateActors.php(91): MigrateActors->migrate(string, array, string, string, string)
#11 /srv/mediawiki/php-1.33.0-wmf.13/maintenance/Maintenance.php(1698): MigrateActors->doDBUpdates()
#12 /srv/mediawiki/php-1.33.0-wmf.13/maintenance/doMaintenance.php(94): LoggedUpdateMaintenance->execute()
#13 /srv/mediawiki/php-1.33.0-wmf.13/maintenance/migrateActors.php(587): include(string)
#14 /srv/mediawiki/multiversion/MWScript.php(100): include(string)
#15 {main}

which is associated with an extra message in DBConnection.log on mwlog1001 (second line):

2019-01-17 10:57:47 [11d25f6f6e1f7954847fd683] mwmaint1002 metawiki 1.33.0-wmf.13 DBConnection ERROR: Error connecting to 10.64.0.91: Can't connect to MySQL server on '10.64.0.91' (111) {"db_server":"10.64.0.91","db_name":"metawiki","db_user":"wikiadmin","method":"Wikimedia\\Rdbms\\DatabaseMysqlBase::open","error":"Can't connect to MySQL server on '10.64.0.91' (111)"}
2019-01-17 10:57:47 [11d25f6f6e1f7954847fd683] mwmaint1002 metawiki 1.33.0-wmf.13 DBConnection ERROR: Wikimedia\Rdbms\Database::query: lost connection to 10.64.0.91 permanently {"dbserver":"10.64.0.91"}

Was the problem intermittent in some way, so the connection kept dropping but each time (except that once on db1079) the reconnect would succeed and stay up for a short time (a second or two could be enough) before dropping again? Other than that I have no ideas.

Both db1103 and db1079 were powered off.
Maybe during the MySQL shutdown the script's connection kept dropping until it got "connection refused", which takes a few minutes?

From the linked Kibana board, though, the problem seems to have gone on for over an hour. Were the other hosts involved (db1074, db1080, db1081, db1082) shut down during the incident?

Yes, they all were. It wasn't an incident; it was a scheduled (emergency, short-notice) maintenance.

By "incident" I'm talking about the thousands of "Error connecting to {db_server}: {error}" messages.

Unless I'm missing something, the logic in MediaWiki's Database class is that when it finds the connection was dropped when trying to run a query (and there wasn't an open transaction or the like that would mean lost state), it'll try reconnecting once and re-run the query. If the reconnect and re-run succeeds it'll ignore the dropped connection, otherwise it'll throw an exception like metawiki did there.

If the hosts were all shut down, I have no idea why MediaWiki would have kept trying to reconnect.
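
For illustration only, the behaviour described above amounts to something like the following (simplified; the real logic lives in Wikimedia\Rdbms\Database and also considers transaction state, session variables and error classification, and the function and exception names here are made up):

function queryWithRetry( $conn, $sql ) {
	try {
		return $conn->runQuery( $sql );      // hypothetical low-level query call
	} catch ( ConnectionDroppedError $e ) {  // hypothetical "connection lost" error
		if ( $conn->hasOpenTransaction() ) {
			throw $e;                        // lost state, not safe to retry
		}
		$conn->reconnect();                  // one reconnect attempt...
		return $conn->runQuery( $sql );      // ...and one re-run; a second failure
		                                     // propagates, as happened on metawiki
	}
}

So a steady stream of connection errors without fatals would mean each reconnect attempt was succeeding, which is what makes the shut-down hosts puzzling.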

By "incident" I'm talking about the thousands of "Error connecting to {db_server}: {error}" messages.

Sorry for the misunderstanding :)

Unless I'm missing something, the logic in MediaWiki's Database class is that when it finds the connection was dropped when trying to run a query (and there wasn't an open transaction or the like that would mean lost state), it'll try reconnecting once and re-run the query. If the reconnect and re-run succeeds it'll ignore the dropped connection, otherwise it'll throw an exception like metawiki did there.

If the hosts were all shut down, I have no idea why MediaWiki would have kept trying to reconnect.

Yeah, no idea here. They were depooled and shut down not long after the deployment was done, in case that makes any difference. But once MW had detected the host was down, I have no idea why it kept trying; maybe it's somehow related to T180918?

Again, just to be clear: my only point is that servers are going to be down, sometimes in a scheduled way and sometimes in an unscheduled one. There is no way we can avoid that.

I know your scripts are "clean" and "idempotent", but others' aren't. This is not the right ticket to discuss it, but just a way of saying: sorry, but as long as connection handling is on your side, the ball is in your court to decide how to handle it :-), be it a "request killer", maintenance "best practices", or some witty coding solution. There are also ways to do it more smoothly, by handling connections as a service and being able to drop servers (e.g. through a proxy), but that brings other problems too. I do not have a perfect solution; I just wanted to start asking the question :-) (and to justify why the errors happen and why they could not be avoided; sorry about the disruption again).

BTW, mwmaint1002 DB connection errors continue on dewiki, ruwiki and shwiki, at a low but steady pace.

Probably because I depooled db1113:3315 and db1113:3316.
db1113:3316 is now up, so that should have stopped for ruwiki.
db1113:3315 will remain down for a couple more hours, so that affects dewiki and shwiki.

From IRC (#mediawiki-core) yesterday:

[16:09:43] <anomie> AaronSchulz: Do you know of any code path in the Database or LoadBalancer that would ignore exceptions about failed connections? Maybe in waitForReplication(), trying to reconnect every time that's called? See T188327#4887781 and later comments.
[21:50:38] <AaronSchulz> anomie: LoadMonitor::getServerStates should log such errors rarely due to caching. safeWaitForMasterPos() seems like it would fatal and probably is missing a !$masterConn check. LB::waitForReplication() does look plausible. I guess if https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/394430/26/includes/libs/rdbms/loadbalancer/LoadBalancer.php was merged, the log spam would be a different message instead with less connection attempts.

@Anomie I have seen this query running for 10 minutes on enwiki master:

| 1040540374 | wikiadmin       | 10.64.16.77:43252  | enwiki             | Query       |      648 | Sending data                                                          | SELECT /* MigrateActors::migrateToTemp www-data@mwmain... */  rev_id,rev_user,rev_user_text,CASE WHEN rev_user = 0 OR rev_user IS NULL THEN (SELECT  actor_id  FROM `actor`    WHERE (rev_user_text = actor_name)  ) ELSE (SELECT  actor_id  FROM `actor`    WHERE (rev_user = actor_user)  ) END AS `actor_id`,rev_timestamp AS `revactor_timestamp`,rev_page AS `revactor_page`  FROM `revision` LEFT JOIN `revision_actor_temp` ON ((rev_id=revactor_rev))   WHERE revactor_rev IS NULL  ORDER BY rev_id LIMIT 2000

Is it expected to have such a long query running on the master?

Ugh. The script errored out on enwiki somewhere in the middle of the archive table, after having processed all of revision. After being restarted it's running through revision again looking for rows that are still missing an actor (and finding almost none, of course).

It looks like it just finished with that as I was looking at this. If I have to restart enwiki again, I'll make a local copy of the script with the revision bit commented out.

Marostegui added a comment (edited). Jan 19 2019, 5:27 PM

@Anomie can you let us know when the scripts finish everywhere, so we can start assuming that any new lag on codfw isn't because of the migration?
Thank you!

@Anomie can you let us know when the scripts finish everywhere, so we can start assuming that any new lag on codfw isn't because of the migration?
Thank you!

So far wikitech finished on the 14th, s8 finished on the 16th, and s7 finished yesterday. The other sections are still running.

s2 is finished. s1, s3, s4, s5, and s6 are still running.

A few s3 wikis might run into T188327#4892827, but since s3 is supposed to be all small wikis, I hope they'll complete quickly despite the inefficient query.

Thanks for the heads up

s3 and s6 are done now.

Jdforrester-WMF updated the task description.
Anomie updated the task description. Mon, Jan 28, 4:52 PM

s5 is done. s1 and s4 are still in progress.

Mentioned in SAL (#wikimedia-operations) [2019-01-30T18:03:19Z] <jynus> reducing innodb consistency options for db2048 T188327

Anomie updated the task description. Wed, Jan 30, 6:10 PM

s4 finished late Monday (UTC). s1 is still running.

Anomie updated the task description. Wed, Jan 30, 11:03 PM

s1 is done. Next step is to run some double-checks on the vslow replicas.

Mentioned in SAL (#wikimedia-operations) [2019-01-31T10:22:43Z] <jynus> resetting to defaults innodb consistency options for db2048 T188327

Anomie updated the task description. Mon, Feb 4, 9:08 PM

Checks passed. The only rows without an actor are a few log_search rows where target_author_id refers to a user_id that doesn't exist.

Spot-checking some rows where xx_user_text doesn't match the user name for xx_user, it seems that over the years we've probably had some bugs in user renaming where rows got skipped or where people could edit while being renamed, maybe some incomplete manual renames from before MediaWiki-extensions-Renameuser existed, and really old weirdness like that described in T106941.

One example of weirdness, found at random:

Anomie updated the task description. Wed, Feb 6, 8:29 PM