⚓ T321063 Fix CheckUser database schema drifts in production

Status	Assigned	Task
Open	None	T313251 Collect and fix schema drifts of wmf-deployed extension tables in production
Open	None	T321063 Fix CheckUser database schema drifts in production
Resolved	• Marostegui	T321123 Drop old index cuc_user_time on cu_changes table for wmf wikis
Resolved	• Marostegui	T321127 Add index cuc_actor_ip_time to cu_changes on wmf wikis
Resolved	• Marostegui	T321126 Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis
Resolved	• Marostegui	T321130 Add column cuc_private to cu_changes on wmf wikis
Resolved	Papaul	T322988 db2173 HW errors
Resolved	• Marostegui	T343174 Add missing column cuc_only_for_read_old to testcommonswiki
Resolved	• Marostegui	T343175 Remove old fields 'cuc_user' and 'cuc_user_text' as well as index 'cuc_user_ip_time' from a few production wikis
Open	None	T356631 Column cu_log.cul_timestamp has an incorrect type and incorrectly has a default value set
Resolved	• Marostegui	T356634 Column cu_log.cul_type incorrectly has a default value set on s1

Dreamy_Jazz created this task.Oct 18 2022, 11:35 AM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptOct 18 2022, 11:35 AM

Dreamy_Jazz updated the task description. (Show Details)Oct 18 2022, 11:40 AM

Dreamy_Jazz updated the task description. (Show Details)Oct 18 2022, 11:50 AM

Dreamy_Jazz triaged this task as Medium priority.Oct 18 2022, 11:53 AM

Dreamy_Jazz mentioned this in T321041: CheckUser: Wikimedia\Rdbms\DBQueryError: Error 1054: Unknown column 'cuc_private' in 'field list'.

I suggest creating a subticket for each one, see the FR ticket T313253

Lucas_Werkmeister_WMDE added a parent task: T313251: Collect and fix schema drifts of wmf-deployed extension tables in production.Oct 18 2022, 2:23 PM

Lucas_Werkmeister_WMDE mentioned this in T313251: Collect and fix schema drifts of wmf-deployed extension tables in production.

cu_changes.cuc_private was added in ede4937ffbd887530c01785a95be349236990ecb
index cuc_actor_ip_time is also part of the schema change (ef2c3c3df5205a0b0b62a6dca1efc5d57b931825), but it seems missed on column addition in T303603, while it is showing up in the script
cu_changes.cuc_timestamp should be clean now, the schema update was in progress - T310011
cu_log.cul_timestamp was binary(14) not null from the begin (6cb7f881835019e5d5d2e3c16546e440e0ab61a0), not sure what the difference is
index cuc_user_time is just an old name, was renamed in 09a2c75156faf03fd1d12f7ba34366b1c91613ee
cu_changes.cuc_agent exists in the intial schema (13d27f3f3c06bbadcf8b367de0c6e5dd09904006) and it showing up in the table description under T321041#8324226
cu_log.cul_range_end exists in the inital schema (6cb7f881835019e5d5d2e3c16546e440e0ab61a0) and is used unconditional in the code
index cul_actor_time is part of the schema change for the column (48ea3bfb4771fd0e353abfc22b7264900dcf5791), but I cannot find the task where the column is added to prod

The type of cul_timestamp would be nice to known from enwiki
cuc_agent and cul_range_end are the last columns on the production tables and both are part of the initial schema, where I would not expect them as missing. It is possible the schema drift checker makes something wrong here?

Umherirrender mentioned this in T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis.Oct 18 2022, 7:19 PM

Umherirrender added a subtask: T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis.

Umherirrender added a subtask: T321127: Add index cuc_actor_ip_time to cu_changes on wmf wikis.Oct 18 2022, 7:28 PM

Umherirrender added a subtask: T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis.

Umherirrender mentioned this in T321130: Add column cuc_private to cu_changes on wmf wikis.Oct 18 2022, 7:39 PM

Umherirrender added a subtask: T321130: Add column cuc_private to cu_changes on wmf wikis.

In T321063#8326828, @Umherirrender wrote:

cuc_agent and cul_range_end are the last columns on the production tables and both are part of the initial schema, where I would not expect them as missing. It is possible the schema drift checker makes something wrong here?

It is quite possible that for connection issues, the replica has responded with empty response and the script interpreted that as "nothing here" and marked everything as missing. I have seen it happened before. I need to fix it :D

TheresNoTime subscribed.Oct 18 2022, 8:52 PM

Thanks for creating the subtasks.

Legoktm awarded a token.Oct 19 2022, 4:18 AM

• Marostegui closed subtask T321127: Add index cuc_actor_ip_time to cu_changes on wmf wikis as Resolved.Oct 27 2022, 6:03 AM

• Marostegui closed subtask T321123: Drop old index cuc_user_time on cu_changes table for wmf wikis as Resolved.Nov 11 2022, 9:02 AM

Dreamy_Jazz updated the task description. (Show Details)Nov 14 2022, 11:21 AM

• Marostegui closed subtask T321130: Add column cuc_private to cu_changes on wmf wikis as Resolved.Nov 23 2022, 7:59 AM

• Marostegui closed subtask T321126: Add column 'cul_actor' and index cul_actor_time to cu_log on wmf wikis as Resolved.Nov 30 2022, 6:50 AM

Dreamy_Jazz updated the task description. (Show Details)Dec 1 2022, 11:11 PM

@Ladsgroup would it be possible to run the drift checker again? From my understanding all drifts should now be fixed, but it would be good to check before this is closed.

Started it, it'll take a bit

Thanks!

I'm seeing this:

"cu_changes cuc_timestamp field-type-mismatch": {
    "s7": [
        "db1158:arwiki",
        "db1158:cawiki",
        "db1158:eswiki",
        "db1158:fawiki",
        "db1158:frwiktionary",
        "db1158:hewiki",
        "db1158:huwiki",
        "db1158:kowiki",
        "db1158:metawiki",
        "db1158:rowiki",
        "db1158:ukwiki",
        "db1158:viwiki"
    ]
},

and this on basically every wiki:

"cu_log cul_timestamp field-type-mismatch": {
    "s3": [
        "db1157:aawiki",
        "db1112:aawiki",
        "db1123:aawiki",
        "db1166:aawiki",
        "db1175:aawiki",
        "db1179:aawiki",
        "db1189:aawiki",
        "db1198:aawiki",

Hmm.

Could you get the type of the timestamp fields which it says are a mismatch?

db1158 is listed in the schema-change task at T310011#7991330

The type of cul_timestamp is also unclear to me, from the code it was the correct type from begin (see T321063#8326828 above)

db1158 was repooled exactly one hour later meaning the depool failed to drain. I don't know if it's after the reload config or before but it can happen from time to time (if the calling script doesn't have the reload call, etc.). Redoing it should be fine and easy though.

Regarding the cul_timestamp, that's what I'm getting at aawiki:

`cul_timestamp` varbinary(14) NOT NULL DEFAULT '',

the schema change missing on db1158 was done before the reload config being deployed. That's why. Anyway, I restarted it on that host only. That'll be done.

Umherirrender unsubscribed.Jul 12 2023, 8:28 PM

@Ladsgroup, can the schema drift code be run again? Would be useful to see if any other wikis are missing cuc_only_for_read_old.

Dreamy_Jazz added a project: Anti-Harassment.Jul 27 2023, 3:51 PM

Dreamy_Jazz moved this task from Untriaged to CheckUser on the Anti-Harassment board.Jul 27 2023, 3:56 PM

Started it.

In eqiad it's this https://drift-tracker.toolforge.org/report/checkuser/

It doesn't have cuc_only_for_read_old in it though. Weird.

Might I suggest then adding cuc_only_for_read_old on testcommonswiki, moving back to write new on group0 and seeing if any other wikis present the error?

Unfortunately not, because if the drift is more complicated, and the most dangerously, the column existing on master but not on replicas, it can break replication leading to a full set of wikis going read-only for extended period of time. We have had incidents like that before.

Thanks for the explanation. I'm guessing this is not easy to fix, especially as the drift tracker isn't saying where the issues are.

Yeah, I think there might be a bug in the analyzor which is worth investigating and fixing regardless. The code is here: https://github.com/Ladsgroup/db-analyzor-tools/blob/master/db_drift_checker.py

I'll try to debug it ASAP. If you feel adventurous, you can take a look too.

I've had a look, but I think I'd have to debug it when it's running to find the issue.