
Fatal exception of type "Wikimedia\Rdbms\DBQueryError" for Oldversion and History pages
Closed, Resolved · Public · BUG REPORT

Description

Visit https://translatewiki.net/w/i.php?oldid=9506988 or https://translatewiki.net/w/i.php?title=MediaWiki:Fileimporter-filerevisions/fr&action=history

The page returns an HTTP 500 status code with the following error:

Fatal exception of type "Wikimedia\Rdbms\DBQueryError"

In the first case, the page layout does not even load.

Event Timeline

Database migrations are failing and are possibly related to this issue

Wikimedia\Rdbms\DBQueryError from line 1768 of /srv/mediawiki/tags/2022-05-19_07:08:59/includes/libs/rdbms/database/Database.php: Error 1146: Table 'translatewiki_net.bw_revision_actor_temp' doesn't exist
Function: MediaWiki\Extension\Translate\Statistics\TranslatorActivityQuery::inAllLanguages
Query: SELECT  actor_rev_user.actor_name AS `rev_user_text`,substring_index(page_title, '/', -1) as lang,MAX(rev_timestamp) as lastedit,count(page_id) as count  FROM `bw_page` JOIN `bw_revision` ON ((page_id=rev_page)) JOIN `bw_revision_actor_temp` `temp_rev_user` ON ((temp_rev_user.revactor_rev = rev_id)) JOIN `bw_actor` `actor_rev_user` ON ((actor_rev_user.actor_id = temp_rev_user.revactor_actor))   WHERE (page_title LIKE '%/%' ESCAPE '`' ) AND page_namespace IN (*)   GROUP BY lang,actor_rev_user.actor_name ORDER BY NULL

~20 minutes of downtime (though some pages were possibly still accessible)

[18:49:42] wm-bb> [telegram] <abijeetpatro> There was a migration in canary that removed a table but the production (non canary) instance was still accessing the table and hence flooded the logs.
[18:50:27] wm-bb> [telegram] <abijeetpatro> Releasing the canary to production fixed the issue.
[18:50:28] wm-bb> [telegram] <abijeetpatro>
[18:50:30] wm-bb> [telegram] <abijeetpatro> Migration in question: 793845: Start clean up of revision_actor_temp table | https://gerrit.wikimedia.org/r/c/mediawiki/core/+/793845

If I am reading https://gerrit.wikimedia.org/r/c/mediawiki/core/+/793845/16/docs/config-schema.yaml correctly, the migration value was set to 48, which seems to correspond to 0x30:

define( 'SCHEMA_COMPAT_WRITE_TEMP', 0x10 );
define( 'SCHEMA_COMPAT_READ_TEMP', 0x20 );
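
For reference, 48 is exactly the bitwise OR of those two flags. A minimal check using the constants above (my own illustration, not code from the patch):

// 48 === 0x30 === SCHEMA_COMPAT_WRITE_TEMP | SCHEMA_COMPAT_READ_TEMP
var_dump( 48 === ( SCHEMA_COMPAT_WRITE_TEMP | SCHEMA_COMPAT_READ_TEMP ) ); // bool(true)
// At this stage both reads and writes target the revision_actor_temp table,
// so any code still running on it breaks the moment the table is dropped.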

So there was no soft migration for those following the defaults? @Ladsgroup

It should be OK now.

There was a long-running (~3 hours) migration script that finished and then dropped the bw_revision_actor_temp table. We have a two-step staged deployment process at translatewiki.net where the code is first deployed to a canary environment, tested, and then released to production.

Once this table was dropped, the code running in production was still accessing it. The fix was to release the latest code to production. This took me a while to figure out.

So there was no soft migration for those following the defaults? @Ladsgroup

I'm sorry, but I don't think so. Specifically, I announced this change a couple of weeks ago, and basically the only reason it was kept for five years was that the migration in WMF production wasn't done yet. Sorry for the inconvenience.

@Ladsgroup To express my concerns more clearly: how can we prevent this from happening again? This was one of the longest outages for translatewiki.net in recent years. You did announce the change, but mainly from the perspective of code that uses these tables. That was not the problem here; the problem was the deployment of the change itself, which was not highlighted as risky. Per my understanding of https://www.mediawiki.org/wiki/MediaWiki_database_policy, I thought all database schema changes have a grace period, but it seems that doesn't cover our case: we always run update.php before deploying new code.

I do understand your problem, but that part of the policy was followed, and the migration stayed optional for years (and several stable releases). We have to pull the plug eventually. It gets really complicated when a wiki runs on unstable code (master) and runs update.php. The only other place that does this besides translatewiki is the beta cluster, and it goes down or becomes read-only quite often (which is natural). Such issues can and will happen to translatewiki as we migrate more of the schema and the data (templatelinks being the next one, coming soon).

To avoid future issues, you have two directions:

  • Be more like a third-party MediaWiki installation:
    • Switch to stable releases and prepare for migrations beforehand.
    • OR Run update.php with a newer version and avoid switching the whole system to the new version until update.php is done.
  • Be more like a Wikimedia wiki:
    • Have a way for devs to let you know about major changes, and make it clear that devs need to do that.
    • OR be more attentive to announcements about data migrations, as they might affect you in ways devs are not aware of.

OR Run update.php with a newer version and avoid switching the whole system to the new version until update.php is done.

That's exactly what we do. The problem was that the deployed code, which was a week old, was still reading the old table that the update.php script dropped.

This hasn't happened before, so I am not sure what was special about this patch. I believe that for earlier migrations the defaults in MediaWiki were gradually updated to shift reads and writes to the new schema before dropping the old tables/columns.
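
To illustrate what such a gradual migration looks like (a sketch only: the WRITE_NEW/READ_NEW flag names exist in MediaWiki's includes/Defines.php alongside the TEMP flags quoted above, but the values used here are assumptions, not taken from this task):

define( 'SCHEMA_COMPAT_WRITE_NEW', 0x100 ); // value assumed for illustration
define( 'SCHEMA_COMPAT_READ_NEW', 0x200 );  // value assumed for illustration

$gradualStages = [
	// write both storages, still read the temp table: old deployments keep working
	SCHEMA_COMPAT_WRITE_TEMP | SCHEMA_COMPAT_WRITE_NEW | SCHEMA_COMPAT_READ_TEMP,
	// write both, read the new storage: the temp table is now only a fallback
	SCHEMA_COMPAT_WRITE_TEMP | SCHEMA_COMPAT_WRITE_NEW | SCHEMA_COMPAT_READ_NEW,
	// new storage only: the temp table can finally be dropped
	SCHEMA_COMPAT_WRITE_NEW | SCHEMA_COMPAT_READ_NEW,
];
// A deployment still sitting at the 0x30 stage discussed above breaks as soon
// as the last step's table drop lands, which is what happened here.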

I'm a bit confused then. If the part that made the outage long was that update.php was stuck running MigrateRevisionActorTemp, then it would not have gotten to dropping the revision_actor_temp table (the non-backwards-compatible part) yet. Was the outage because update.php was finished but the user-facing code was still on the old version?

The most special thing about this change was that it was a major data migration; just running it on enwiki took five months. In other cases there would be some issues, but nothing major, since data migrations finish quickly on translatewiki.

Was the outage because update.php was finished but the user-facing code was still on the old version?

That's correct. We have a staging area where we update code and run update.php before we deploy it to users.

A solution would be to switch prod traffic right after update.php is done.
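
For example, a minimal sketch of that ordering (the paths and the symlink-flip mechanism are assumptions loosely based on the /srv/mediawiki/tags/... layout visible in the stack trace, not translatewiki.net's actual deployment tooling):

<?php
// deploy.php -- hypothetical helper: run update.php against the new release
// first, then atomically point user-facing traffic at it, so the code that is
// live never lags behind a schema that update.php has already migrated.
$newRelease = '/srv/mediawiki/tags/2022-05-19_07:08:59'; // example path from the stack trace
$prodLink   = '/srv/mediawiki/current';                  // assumed symlink served to users

passthru( 'php ' . escapeshellarg( "$newRelease/maintenance/update.php" ) . ' --quick', $status );
if ( $status !== 0 ) {
	fwrite( STDERR, "update.php failed; keeping the current release in place\n" );
	exit( 1 );
}

// Flip the symlink atomically: create a temporary link and rename() it over
// the old one, so there is no window in which $prodLink does not exist.
$tmpLink = $prodLink . '.new';
@unlink( $tmpLink );
symlink( $newRelease, $tmpLink );
rename( $tmpLink, $prodLink );

echo "Production now serves $newRelease\n";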

Nikerabbit claimed this task.
Nikerabbit moved this task from Backlog to Incident follow-ups on the translatewiki.net board.