Fix CheckUser database schema drifts in production
Open, Medium, Public

Description

The CheckUser tables (cu_changes and cu_log) have several drifts in production. One of these (involving 'cuc_private') caused T321041, because the field was wrongly assumed to exist on all production wikis when no task had ever rolled it out everywhere. The drift tracker was run and the results are visible at https://drift-tracker.toolforge.org/report/checkuser/.

Drifts that are currently listed at toolforge (an illustrative sketch of the corresponding ALTER statements follows the list):

  • 'cu_changes.cuc_private' exists only on db1104 - ideally it should be on all wikis
  • 'cuc_actor_ip_time' index on 'cu_changes' exists only on db1160 and db1104 - should be on all wikis
  • 'cu_changes.cuc_timestamp' has a mismatched type - should be 'mwtimestamp' (which is BINARY(14) in this case, according to the MediaWiki schema change docs)
  • 'cu_log.cul_timestamp' has a mismatched type - should be 'mwtimestamp' (same as the point above for its raw DB type)
  • 'cuc_user_time' index on 'cu_changes' needs to be removed from all wikis, as it's not present in the current tables.json schema for CheckUser
  • 'cu_changes.cuc_agent' needs to exist on all wikis (I'm very surprised this hasn't caused a prod error)
  • 'cu_log.cul_range_end' needs to exist on all wikis (I'm surprised this hasn't caused a prod error)
  • 'cul_actor_time' index on table 'cu_log' needs to be on all wikis
  • Columns 'cuc_user' and 'cuc_user_text', and index 'cuc_user_ip_time', need to be removed from 'cu_changes'
  • Column 'cuc_only_for_read_old' is missing from 'cu_changes'
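
As a rough sketch only (the column types and index column lists below are assumptions for illustration, not copied from CheckUser's tables.json; the real changes would go through the normal schema-change process), the drifts above correspond to statements of roughly this shape:

-- Illustrative sketch; types and index definitions are assumed, not authoritative.
-- Add the missing columns:
ALTER TABLE cu_changes ADD COLUMN cuc_private MEDIUMBLOB NULL;
ALTER TABLE cu_changes ADD COLUMN cuc_agent VARBINARY(255) NOT NULL DEFAULT '';
ALTER TABLE cu_log ADD COLUMN cul_range_end VARBINARY(255) NOT NULL DEFAULT '';

-- Normalise the timestamp columns to mwtimestamp (BINARY(14) on MySQL/MariaDB):
ALTER TABLE cu_changes MODIFY cuc_timestamp BINARY(14) NOT NULL;
ALTER TABLE cu_log MODIFY cul_timestamp BINARY(14) NOT NULL;

-- Add the missing indexes and drop the obsolete one:
ALTER TABLE cu_changes ADD INDEX cuc_actor_ip_time (cuc_actor, cuc_ip, cuc_timestamp);
ALTER TABLE cu_changes DROP INDEX cuc_user_time;
ALTER TABLE cu_log ADD INDEX cul_actor_time (cul_actor, cul_timestamp);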

Event Timeline

I suggest creating a subticket for each one; see the FR ticket T313253.

It would be nice to know the type of cul_timestamp on enwiki.
cuc_agent and cul_range_end are the last columns on the production tables and both are part of the initial schema, so I would not expect them to be missing. Is it possible the schema drift checker is getting something wrong here?

It is quite possible that, due to connection issues, the replica responded with an empty result and the script interpreted that as "nothing here" and marked everything as missing. I have seen that happen before. I need to fix it :D

Thanks for creating the subtasks.

@Ladsgroup would it be possible to run the drift checker again? From my understanding all drifts should now be fixed, but it would be good to check before this is closed.

I'm seeing this:

"cu_changes cuc_timestamp field-type-mismatch": {
    "s7": [
        "db1158:arwiki",
        "db1158:cawiki",
        "db1158:eswiki",
        "db1158:fawiki",
        "db1158:frwiktionary",
        "db1158:hewiki",
        "db1158:huwiki",
        "db1158:kowiki",
        "db1158:metawiki",
        "db1158:rowiki",
        "db1158:ukwiki",
        "db1158:viwiki"
    ]
},

and this on basically every wiki:

"cu_log cul_timestamp field-type-mismatch": {
    "s3": [
        "db1157:aawiki",
        "db1112:aawiki",
        "db1123:aawiki",
        "db1166:aawiki",
        "db1175:aawiki",
        "db1179:aawiki",
        "db1189:aawiki",
        "db1198:aawiki",

Hmm.

Could you get the type of the timestamp fields that it says are mismatched?
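
For reference, a minimal way to check the live definitions on a given replica, assuming direct SQL access to that wiki's database (plain MySQL/MariaDB statements, nothing specific to the drift tracker):

-- Check the live column types on the host in question:
SHOW COLUMNS FROM cu_changes LIKE 'cuc_timestamp';
SHOW COLUMNS FROM cu_log LIKE 'cul_timestamp';

-- Or dump the full table definition:
SHOW CREATE TABLE cu_log;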

db1158 is listed in the schema-change task at T310011#7991330

The type of cul_timestamp is also unclear to me; from the code it has been the correct type from the beginning (see T321063#8326828 above).

db1158 was repooled exactly one hour later, meaning the depool failed to drain. I don't know whether that happened before or after the config reload, but it can happen from time to time (if the calling script doesn't have the reload call, etc.). Redoing it should be fine and easy though.

Regarding cul_timestamp, this is what I'm getting on aawiki:

`cul_timestamp` varbinary(14) NOT NULL DEFAULT '',

The schema change missing on db1158 was run before the config reload was deployed; that's why. Anyway, I restarted it on that host only, so that will get done.

In eqiad it's this https://drift-tracker.toolforge.org/report/checkuser/

It doesn't have cuc_only_for_read_old in it though. Weird.

Might I suggest then adding cuc_only_for_read_old on testcommonswiki, moving back to write new on group0 and seeing if any other wikis present the error?

Unfortunately not. If the drift is more complicated (most dangerously, the column existing on the master but not on the replicas), it can break replication, leading to a full set of wikis going read-only for an extended period of time. We have had incidents like that before.
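
As a toy illustration of that failure mode (hypothetical table, not the real CheckUser schema): with row-based replication, a write touching a column that exists on the master but not on a replica cannot be applied on that replica, and replication stops.

-- Hypothetical example table; master and replica start out identical:
CREATE TABLE drift_example (
  id INT UNSIGNED NOT NULL PRIMARY KEY,
  ts BINARY(14) NOT NULL
);

-- The new column is applied on the master only:
ALTER TABLE drift_example ADD COLUMN only_for_read_old TINYINT NOT NULL DEFAULT 0;

-- The next write produces a row image with three columns; a replica that still
-- has the two-column table cannot apply it, replication halts, and the whole
-- section ends up read-only until it is repaired.
INSERT INTO drift_example (id, ts, only_for_read_old) VALUES (1, '20221101000000', 1);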

Thanks for the explanation. I'm guessing this is not easy to fix, especially as the drift tracker isn't saying where the issues are.

Yeah, I think there might be a bug in the analyzer which is worth investigating and fixing regardless. The code is here: https://github.com/Ladsgroup/db-analyzor-tools/blob/master/db_drift_checker.py

I'll try to debug it ASAP. If you feel adventurous, you can take a look too.

I've had a look, but I think I'd have to debug it while it's running to find the issue.

I think I found the issue. Fixed it and running it again.

Updated the report. I confirm it's only testcommonswiki (and some other schema changes are missing only there too).

I don't think there is anything left on the DBAs' side at the moment. Please create a subticket for each drift and we will take care of it.