
Fix AUTO_INCREMENT values for the content_models table
Closed, Declined (Public)

Description

Apparently there are multiple replicas across multiple shards where the AUTO_INCREMENT value is wrongly at 1. I am not sure what causes this or whether it is easy to fix.

frwiki

wikiadmin2023@10.64.32.55(frwiki)> SELECT `AUTO_INCREMENT` FROM  INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'frwiki' AND   TABLE_NAME   = 'content_models';
+----------------+
| AUTO_INCREMENT |
+----------------+
|              1 |
+----------------+
1 row in set (0.001 sec)

wikiadmin2023@10.64.32.55(frwiki)>
wikiadmin2023@10.64.135.12(frwiki)> SELECT `AUTO_INCREMENT` FROM  INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'frwiki' AND   TABLE_NAME   = 'content_models';
+----------------+
| AUTO_INCREMENT |
+----------------+
|             11 |
+----------------+
1 row in set (0.000 sec)

wikiadmin2023@10.64.135.12(frwiki)>

dewiki

wikiadmin2023@10.64.0.48(dewiki)> SELECT `AUTO_INCREMENT` FROM  INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'dewiki' AND   TABLE_NAME   = 'content_models';
+----------------+
| AUTO_INCREMENT |
+----------------+
|              9 |
+----------------+
1 row in set (0.001 sec)

wikiadmin2023@10.64.0.48(dewiki)> exit
wikiadmin2023@10.64.135.11(dewiki)> SELECT `AUTO_INCREMENT` FROM  INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'dewiki' AND   TABLE_NAME   = 'content_models';
+----------------+
| AUTO_INCREMENT |
+----------------+
|              1 |
+----------------+
1 row in set (0.001 sec)

wikiadmin2023@10.64.135.11(dewiki)>

Event Timeline

Marostegui subscribed.

I just quickly checked the first example you gave and:

cumin2024@db1231.eqiad.wmnet[frwiki]> select * from content_models;
+----------+------------------------+
| model_id | model_name             |
+----------+------------------------+
|        9 | MassMessageListContent |
|        2 | Scribunto              |
|        6 | css                    |
|        3 | flow-board             |
|        5 | javascript             |
|        7 | json                   |
|       10 | sanitized-css          |
|        8 | text                   |
|        1 | wikitext               |
+----------+------------------------+
9 rows in set (0.002 sec)

At the very least we could set AUTO_INCREMENT = 11 with an ALTER TABLE. Why did this happen? I don't know, but it may be something very old that got out of sync for some reason (maybe a bad delete, maybe something done without logging to the binlog...).
We would need a report on where this is a problem and on which replicas, and then fix those individually.
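For the frwiki case above, such a fix could be sketched like this (11 being MAX(model_id) + 1 from the row dump above; the value would need to be computed per table):

```sql
-- Sketch only: bump the counter past the highest existing id.
-- 11 = MAX(model_id) + 1 from the frwiki dump above; compute per table.
ALTER TABLE content_models AUTO_INCREMENT = 11;
```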

I have been worried that something might have tried to update/edit content_id, which could lead to fun issues, but I didn't find anything funny in the code. I will try to fix it globally on Monday.

I remember this and I'm sure I have investigated it before. The problem is that if you run a script to check for auto_increment differences, a great many tables across the cluster show the master with the correct value and other replicas with 1 as the auto_increment value. Here is a very small list (literally the start of the script's output) comparing the masters of eqiad and codfw:

aawiki category eqiad: 1 codfw 61
aawikibooks wbc_entity_usage eqiad: 1 codfw 14
aawiktionary category eqiad: 1 codfw 3
aawiktionary wbc_entity_usage eqiad: 1 codfw 9
advisorswiki category eqiad: 1 codfw 14
advisorswiki externallinks eqiad: 1 codfw 92
advisorywiki category eqiad: 1 codfw 9
advisorywiki externallinks eqiad: 1 codfw 129
advisorywiki sites eqiad: 1 codfw 889
afwiktionary securepoll_voters eqiad: 1 codfw 2
akwiktionary category eqiad: 1 codfw 4
altwiki slot_roles eqiad: 2 codfw 1
amwikimedia sites eqiad: 1 codfw 2
amwikiquote category eqiad: 1 codfw 36
angwikisource externallinks eqiad: 1 codfw 4
angwikisource wbc_entity_usage eqiad: 1 codfw 29
apiportalwiki category eqiad: 7 codfw 1
apiportalwiki sites eqiad: 955 codfw 1
apiportalwiki wbc_entity_usage eqiad: 2 codfw 1
apiportalwiki slot_roles eqiad: 2 codfw 1
arbcom_dewiki sites eqiad: 1 codfw 925
arbcom_enwiki abuse_filter eqiad: 1 codfw 2
arbcom_enwiki abuse_filter_history eqiad: 1 codfw 6
arbcom_enwiki abuse_filter_log eqiad: 1 codfw 6
arbcom_enwiki sites eqiad: 1 codfw 916
arbcom_fiwiki category eqiad: 1 codfw 40
arbcom_fiwiki sites eqiad: 1 codfw 912
arbcom_nlwiki sites eqiad: 1 codfw 898
arbcom_ruwiki sites eqiad: 956 codfw 1
arbcom_ruwiki slot_roles eqiad: 2 codfw 1
arwiki ores_model eqiad: 1 codfw 4
arwiki cur eqiad: 1 codfw 14149
arwiki slot_roles eqiad: 1 codfw 2
arwikimedia sites eqiad: 1 codfw 889
arwikinews abuse_filter eqiad: 1 codfw 12

(and this is just checking between masters of eqiad and codfw)
I think my conclusion was that it's not real data drift: the counter gets automatically bumped to the correct value once something tries to insert (obviously, I'm not going to try inserting a row into a random replica to test this theory). I think the most likely root cause is that after a restart of MariaDB, all auto_increment values are reported as one until an insert happens. I'm sure I read that somewhere but can't find it. So, again, it's not real data drift; it just means inserts sometimes require multiple tries (as happened in T399372#11006520). Is that great? No, not really. But given how widespread it is, I think we should either just accept the retries or get upstream to fix this. No matter what kind of ALTER we run, we will end up in the same state right after a restart (or after some other action that resets the auto_increment to one).
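If that theory holds, the benign post-restart case can be told apart from real drift on any single host, because the value MariaDB would actually hand out on first insert is recomputed from the data. A sketch of such a check (the schema and table names are just the examples from this ticket):

```sql
-- Compare the reported counter with the next value implied by the data.
-- On a freshly restarted replica the first column may read 1 while the
-- second still matches the master; only a mismatch in the second column
-- would indicate real drift.
SELECT t.AUTO_INCREMENT AS reported_counter,
       (SELECT IFNULL(MAX(model_id), 0) + 1
          FROM content_models) AS next_from_data
FROM INFORMATION_SCHEMA.TABLES t
WHERE t.TABLE_SCHEMA = 'frwiki'
  AND t.TABLE_NAME = 'content_models';
```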

Thanks for the investigation about the masters @Ladsgroup - I think that changes everything.
If the masters are okay, then this is expected on the replicas, as the auto_increment values aren't replicated (the row is, of course, but not the counter). This changes in master-master setups: https://mariadb.com/docs/server/ha-and-performance/standard-replication/replication-and-binary-log-system-variables
In any case, if the masters show the correct auto_increment, then it is expected for replicas (with no direct writes) to show 1.

I am going to close this, but if there are more things to be discussed, please feel free to reopen.

I feel like we're missing something. In T399372, T401641, and T408488, we encountered a situation where the insert would not work no matter how many on-wiki retries we made, and we had to get a deployer to run the insert manually. We traced the root cause of that to this auto-increment issue.

The line of code in question was the following, which of course uses the master database, since it's a write action going through $dbw.

		$dbw->newInsertQueryBuilder()
			->insertInto( $this->table )
			->ignore()
			->row( $this->getFieldsToStore( $name ) )
			->caller( __METHOD__ )->execute();
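One mechanism that would be consistent with the retries described (an assumption on my part, not verified against InnoDB internals): if the counter really is stuck at 1 while the low ids are occupied, each INSERT IGNORE collides on the primary key, is silently dropped, and still consumes one counter value, so the insert only lands once the counter passes MAX(model_id):

```sql
-- Hypothetical replay: rows with ids 1..10 exist but the counter reads 1.
-- Each attempt is silently ignored (duplicate key) yet burns one counter
-- value, so roughly ten retries would be needed before one succeeds.
INSERT IGNORE INTO content_models (model_name) VALUES ('new-model');
```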

In this ticket, you mention replicas getting out of sync with the master, which feels possibly unrelated. It makes me wonder whether this ticket is NOT the root cause of T399372, T401641, and T408488. If it's not, I guess I should open a new ticket so we can start a new investigation/discussion?

If the issues are on the master, then it is something different from this ticket (which is about replicas), and Amir mentions everything looks good on the masters. I'd suggest a different ticket then, so we can track that.