Page MenuHomePhabricator

Fix CheckUser database schema drifts in production
Closed, ResolvedPublic

Description

The CheckUser tables (cu_changes and cu_log) have several drifts in production. One of these (due to 'cuc_private') caused T321041 due to the flawed understanding that it existed on all production wikis as there existed no task for rolling out the field to all wikis. The drift tracker was run and is visible on https://drift-tracker.toolforge.org/report/checkuser/.

Drifts that are currently listed at toolforge:

  • 'cu_changes.cuc_private' only db1104 - ideally needs to be on all wikis
  • 'cuc_actor_ip_time' index on 'cu_changes' only db1160 and db1104 - should be on all wikis
  • 'cu_changes.cuc_timestamp' has mismatched type - should be type 'mwtimestamp' (which is BINARY(14) in this case according to mediawiki schema change docs)
  • 'cu_log.cul_timestamp' has mismatched type - should be 'mwtimestamp' (same as point above for it's raw DB type)
  • 'cuc_user_time' index on 'cu_changes' needs to be removed from all wikis as it's not present in the current tables.json schema for CheckUser
  • 'cu_changes.cuc_agent' needs to exist on all wikis (I'm very surprised this hasn't caused a prod error)
  • 'cu_log.cul_range_end' needs to exist on all wikis (I'm surprised this hasn't caused a prod error)
  • 'cul_actor_time' index on table 'cu_log' needs to be on all wikis
  • Columns 'cuc_user' and 'cuc_user_text', and index 'cuc_user_ip_time' need to be removed from 'cu_changes'
  • Missing column 'cuc_only_for_read_old' from 'cu_changes'

Related Objects

Event Timeline

I suggest creating a subticket for each one, see the FR ticket T313253

The type of cul_timestamp would be nice to known from enwiki
cuc_agent and cul_range_end are the last columns on the production tables and both are part of the initial schema, where I would not expect them as missing. It is possible the schema drift checker makes something wrong here?

cuc_agent and cul_range_end are the last columns on the production tables and both are part of the initial schema, where I would not expect them as missing. It is possible the schema drift checker makes something wrong here?

It is quite possible that for connection issues, the replica has responded with empty response and the script interpreted that as "nothing here" and marked everything as missing. I have seen it happened before. I need to fix it :D

Thanks for creating the subtasks.

@Ladsgroup would it be possible to run the drift checker again? From my understanding all drifts should now be fixed, but it would be good to check before this is closed.

I'm seeing this:

"cu_changes cuc_timestamp field-type-mismatch": {
    "s7": [
        "db1158:arwiki",
        "db1158:cawiki",
        "db1158:eswiki",
        "db1158:fawiki",
        "db1158:frwiktionary",
        "db1158:hewiki",
        "db1158:huwiki",
        "db1158:kowiki",
        "db1158:metawiki",
        "db1158:rowiki",
        "db1158:ukwiki",
        "db1158:viwiki"
    ]
},

and this on basically every wiki:

"cu_log cul_timestamp field-type-mismatch": {
    "s3": [
        "db1157:aawiki",
        "db1112:aawiki",
        "db1123:aawiki",
        "db1166:aawiki",
        "db1175:aawiki",
        "db1179:aawiki",
        "db1189:aawiki",
        "db1198:aawiki",

Hmm.

Could you get the type of the timestamp fields which it says are a mismatch?

db1158 is listed in the schema-change task at T310011#7991330

The type of cul_timestamp is also unclear to me, from the code it was the correct type from begin (see T321063#8326828 above)

db1158 was repooled exactly one hour later meaning the depool failed to drain. I don't know if it's after the reload config or before but it can happen from time to time (if the calling script doesn't have the reload call, etc.). Redoing it should be fine and easy though.

Regarding the cul_timestamp, that's what I'm getting at aawiki:

`cul_timestamp` varbinary(14) NOT NULL DEFAULT '',

the schema change missing on db1158 was done before the reload config being deployed. That's why. Anyway, I restarted it on that host only. That'll be done.

In eqiad it's this https://drift-tracker.toolforge.org/report/checkuser/

It doesn't have cuc_only_for_read_old in it though. Weird.

Might I suggest then adding cuc_only_for_read_old on testcommonswiki, moving back to write new on group0 and seeing if any other wikis present the error?

Unfortunately not, because if the drift is more complicated, and the most dangerously, the column existing on master but not on replicas, it can break replication leading to a full set of wikis going read-only for extended period of time. We have had incidents like that before.

Thanks for the explanation. I'm guessing this is not easy to fix, especially as the drift tracker isn't saying where the issues are.

Yeah, I think there might be a bug in the analyzor which is worth investigating and fixing regardless. The code is here: https://github.com/Ladsgroup/db-analyzor-tools/blob/master/db_drift_checker.py

I'll try to debug it ASAP. If you feel adventurous, you can take a look too.

I've had a look, but I think I'd have to debug it when it's running to find the issue.

I think I found the issue. Fixed it and running it again.

Updated the report. I confirm it's only testcommonswiki (and some other schema changes are only missing there too)

I don't think there is anything left from DBAs at the moment. Please create a subticket for each drift and we will take care of it.

The new run:
(beside T356631: Column cu_log.cul_timestamp has an incorrect type and incorrectly has a default value set)

"cu_changes cuc_actiontext field-mismatch-prod-extra": {
    "s1": [
        "db1163:enwiki"
    ],
    "s4": [
        "db1160:commonswiki",
        "db1160:testcommonswiki"
    ],
    "s8": [
        "db1193:wikidatawiki"
    ]
},
"cu_changes cuc_agent_id field-mismatch-codebase-extra": {
    "s8": [
        "db1178:wikidatawiki"
    ]
},
"cu_changes cuc_only_for_read_old field-mismatch-prod-extra": {
    "s1": [
        "db1163:enwiki"
    ],
    "s4": [
        "db1160:commonswiki",
        "db1160:testcommonswiki"
    ],
    "s8": [
        "db1193:wikidatawiki"
    ]
},
"cu_changes cuc_private field-mismatch-prod-extra": {
    "s1": [
        "db1163:enwiki"
    ],
    "s4": [
        "db1160:commonswiki",
        "db1160:testcommonswiki"
    ],
    "s8": [
        "db1193:wikidatawiki"
    ]
},
"cu_log_event cule_agent_id field-mismatch-codebase-extra": {
    "s8": [
        "db1178:wikidatawiki"
    ]
},
"cu_private_event cupe_agent_id field-mismatch-codebase-extra": {
    "s8": [
        "db1178:wikidatawiki"
    ]
}

Checked eqiad only for now.

Fixed eqiad ones (except T370903 on masters). Now running the script on codfw.

Ladsgroup claimed this task.
Ladsgroup edited projects, added DBA; removed Data-Persistence (work done).
Ladsgroup moved this task from Triage to Done on the DBA board.

Thanks for finishing up work on this!

@Ladsgroup, out of interest I was checking the wiki DBs to see the changes and I've found another mismatch on db2219 for testcommonswiki:

MariaDB [testcommonswiki]> show create table cu_changes\G
*************************** 1. row ***************************
       Table: cu_changes
Create Table: CREATE TABLE `cu_changes` (
  ...
  `cuc_timestamp` binary(14) NOT NULL DEFAULT '\0\0\0\0\0\0\0\0\0\0\0\0\0\0',
  ...
) ...
1 row in set (0.000 sec)

It seems that there is a default where there is not supposed to be one. I don't think it's particularly a big concern, but wanted to flag it.

I think we don't check for default value for now. That's why. We probably need to start a round of "less important" drifts in a year or so once we are done with more important ones in all extensions.