Page MenuHomePhabricator

Convert primary key integers and references thereto from int to bigint (unsigned)
Open, LowPublic

Description

Some wikis (e.g. Inclupedia) may wish to import Wikipedia's logging, page, recentchange, revision, and user data, using the same primary key values for that imported data as Wikipedia, while also having their own data in those fields, starting at, e.g., log_id, page_id, rc_id, rev_id, and user_id 1 quadrillion for future-proofing purposes. This will require a change of these primary keys to bigint unsigned, since int unsigned maxes out at about 2.7 billion.

These keys are going to hit their maxes eventually anyway on enwiki, although it might take a few decades more. Going to bigint unsigned will use up more storage, though.

Another issue with this is extensive core use of intval() on these primary keys. One solution is to create a wfBigintval() global function that removes any decimal as intval does and removes any non-numeric characters. However, then it will return a string rather than an int; hopefully no one does any mathematical operations on these, or it will be necessary to come up with alternative ways of doing those.

I can change old_id to bigint unsigned as well, if you think that would be helpful. Should we just migrate all primary keys to bigint unsigned? See also bug 60962 (T62962), which this will at least partly fix. See also [[mw:Manual:Primary key storage in other fields]].


Version: 1.23.0
Severity: enhancement

Event Timeline

bzimport raised the priority of this task from to Lowest.Nov 22 2014, 2:54 AM
bzimport added a project: Wikimedia-Rdbms.
bzimport set Reference to bz61111.
bzimport added a subscriber: Unknown Object (MLST).

The use case I have in mind is that the forthcoming rewritten [[mw:Extension:MirrorTools]] will do some schema changes that ALTER the logging, page, recentchanges, revision, and user tables with AUTO_INCREMENT = 1000000000000000; (i.e. 1 quadrillion). That way, we won't have collisions when the primary keys for imported log events, pages, recent changes, revisions, and users from Wikipedia reach high numbers.

The alternative would be to set an auto increment of, say, 3 billion. That would lead to eventual collisions, though.

Oh yeah, I just realized -- 3 billion won't work either, because that's above the intval max of ~2.1 billion. One way or another, this intval stuff has to change.

Note that PHP's 'int' type is dependent on the underlying architecture's C-language 'int' type.

This means that 'int's and intval() are 32-bit on 32-bit architectures, but are 64-bit on most 64-bit architectures. (Windows may be an exception to this as Windows is weird. If you're using a Windows server, you may wish to consider a nice Linux virtual machine.)

(In reply to comment #3)

Note that PHP's 'int' type is dependent on the underlying architecture's
C-language 'int' type.

This means that 'int's and intval() are 32-bit on 32-bit architectures, but
are
64-bit on most 64-bit architectures. (Windows may be an exception to this as
Windows is weird. If you're using a Windows server, you may wish to consider
a
nice Linux virtual machine.)

Good point. I'm running 32-bit Ubuntu Linux; it looks like I'll need to install the 64-bit flavor. That seems like a better solution than making such extensive changes to the core, which might introduce some undesired effects.

The biggest two of those primary keys are (The other's are at least an order of magnitude smaller):

MariaDB [enwiki_p]> select max(rc_id) from recentchanges;
+------------+

max(rc_id)

+------------+

636232414

+------------+
1 row in set (0.04 sec)

MariaDB [enwiki_p]> select max(rev_id) from revision;
+-------------+

max(rev_id)

+-------------+

594743381

+-------------+

So in 13 years, we've managed to get up to an rc_id of 636,232,414 or about 30% of possible values (Since rc_id appears to be unsigned for some reason). It doesn't seem like this is something we should really be worrying about. It also doesn't seem like fixing this will really become any harder in the future than it is now. Thus this seems like a solution in search of a problem. (Although perhaps making the default schema have these fields be unsigned might be a good idea just for cleanliness sake)

Yeah, if there weren't a particular use case I had in mind, I would totally say, Leave it as is for now. There are a few different options for Inclupedia, that I can think of. (1) discard the enwiki rev_ids of imported revisions. (2) keep those rev_ids in the mirrorbot table, in case they're needed later. (3) do what I'm doing now, which is import them with the enwiki rev_ids, and use higher rev_ids for Inclupedia-only revisions. (4) add a few fields to the revision table, or add new tables, similar to what you see at [[mw:Extension:MirrorTools#mirror-logging.sql]].

Option 3 seemed the simplest, especially since options 1 and 2 would still have required another field to be added indicating that the revisions were imported (it's pertinent information when people are browsing through). If I go with option 3, then an auto-increment starting point has to be selected, e.g. 3 billion. It might take decades, but there will eventually be a collision. These are wikis whose life spans are projected to be possibly decades or centuries for all we know. For peace of mind, I'd prefer to not set up the equivalent of a Y2K issue.

The next decision is, (a) change the schema for everyone, or (b) change it for just Inclupedia. I don't really mind changing it just for Inclupedia. Part of the point of filing the bug was to figure out which direction people wanted to go in. We can WONTFIX and then I'll just add SQL files to MirrorTools for the necessary changes. This is especially true in light of the fact that intval's behavior can be changed by going to a 64-bit system.

Since I was going to make a schema change anyway, I figured I'd ask, Hey, want to just change it in the core?

See also [[m:Solution in search of a problem]]

Thanks for the feedback; I'm WONTFIXing this. That saves me some work, as it's easier to make a schema change by MediaWiki extension than by MediaWiki core.

For what it’s worth, on enwiki the rc_id has now crossed the halfway point (assuming it stays signed):

MariaDB [enwiki_p]> SELECT MAX(rc_id) FROM recentchanges; SELECT MAX(rev_id) FROM revision;
+------------+
| MAX(rc_id) |
+------------+
| 1128813734 | # 52.6% of 2³¹ - 1
+------------+
1 row in set (0.00 sec)

+-------------+
| MAX(rev_id) |
+-------------+
|   882667340 | # 20.6% of 2³² - 1
+-------------+
1 row in set (0.00 sec)

And after just 6½ years, the Wikidata rev_id is about to catch up to the enwiki one:

MariaDB [wikidatawiki_p]> SELECT MAX(rc_id) FROM recentchanges; SELECT MAX(rev_id) FROM revision;
+------------+
| MAX(rc_id) |
+------------+
|  891589879 | # 41.5% of 2³¹ - 1
+------------+
1 row in set (0.00 sec)

+-------------+
| MAX(rev_id) |
+-------------+
|   855257322 | # 20.0% of 2³² - 1
+-------------+
1 row in set (0.01 sec)

(The rc_id is less close, but will presumably also overtake enwiki before either of them come close to the size limit.) This still isn’t a problem now, but I reckon on Wikidata both of these IDs will exceed signed int range in less than 10 years. (Cracking the unsigned int of the revision table will still take a while longer.)

Krinkle renamed this task from Change log_id, page_id, rc_id, rev_id, and user_id to bigint unsigned to Convert primary key integers and references thereto from int to bigint (unsigned).Feb 10 2019, 6:31 PM
Krinkle added a project: DBA.
Krinkle moved this task from Untriaged to Schema changes on the Wikimedia-Rdbms board.
Krinkle added a subscriber: Krinkle.

Re-opening so as to let the DBAs triage this. One question I wasn't able to answer quickly is: What units and signed-ness do all our current integer fields use in core's default schema, and in WMF production?

In core, most integer fields that point to primary keys are int unsigned, e.g. rc_oldid, rc_logid, page_latest, actor_user, etc. The only signed fields are or fields that don't relate to a primary key and are meant to support negative values so that seems fine.

What I'm less unsure about is the primary keys themselves. Some them have AUTO INCREMENT and also specify int unsigned (such as ipc_rev_id), but we we also have primary key integers that don't specify whether they're signed or not. What is the default there? Are MySQL/MariaDB smart enough to assume unsigned there? Or, given they do allow manual insertion, perhaps they have to support negatives unless explicitly marked as unsigned?

Looking at the DESCRIBE output in production, it seems to not specify signed-ness for the ones that didn't specify it in the schema, so at least it's not entirely obvious what it expands to.

(enwiki)> DESCRIBE recentchanges
| Field         | Type                | Null | Key | Default | Extra          |
| rc_id         | int(8)              | NO   | PRI | NULL    | auto_increment |


(enwiki)> DESCRIBE ip_changes;
| Field             | Type             | Null | Key | Default        | Extra |
| ipc_rev_id        | int(10) unsigned | NO   | PRI | 0              |       |

I also see wildly varying values for the size of the integer fields. I assume these are not hardcoded at that size but some kind of dynamic/automatic "growing feature" of MariaDB? That is, the schema doesn't limit the integer size and presumably MariaDB takes the shortest size that will fit the values as an optimisation for resource usage, and auto grows as needed?

Marostegui added subscribers: Reedy, Marostegui.

Re-opening so as to let the DBAs triage this. One question I wasn't able to answer quickly is: What units and signed-ness do all our current integer fields use in core's default schema, and in WMF production?

We converted a bunch to unsigned not long ago at T89737: Make several mediawiki table fields unsigned ints on wmf databases (CC @Reedy)

In core, most integer fields that point to primary keys are int unsigned, e.g. rc_oldid, rc_logid, page_latest, actor_user, etc. The only signed fields are or fields that don't relate to a primary key and are meant to support negative values so that seems fine.

What I'm less unsure about is the primary keys themselves. Some them have AUTO INCREMENT and also specify int unsigned (such as ipc_rev_id), but we we also have primary key integers that don't specify whether they're signed or not. What is the default there? Are MySQL/MariaDB smart enough to assume unsigned there? Or, given they do allow manual insertion, perhaps they have to support negatives unless explicitly marked as unsigned?

If not specified otherwise, it is signed by default as far as I remember. (Also: https://mariadb.com/kb/en/library/auto_increment/)

Looking at the DESCRIBE output in production, it seems to not specify signed-ness for the ones that didn't specify it in the schema, so at least it's not entirely obvious what it expands to.

(enwiki)> DESCRIBE recentchanges
| Field         | Type                | Null | Key | Default | Extra          |
| rc_id         | int(8)              | NO   | PRI | NULL    | auto_increment |


(enwiki)> DESCRIBE ip_changes;
| Field             | Type             | Null | Key | Default        | Extra |
| ipc_rev_id        | int(10) unsigned | NO   | PRI | 0              |       |

I also see wildly varying values for the size of the integer fields. I assume these are not hardcoded at that size but some kind of dynamic/automatic "growing feature" of MariaDB? That is, the schema doesn't limit the integer size and presumably MariaDB takes the shortest size that will fit the values as an optimisation for resource usage, and auto grows as needed?

The value is just a visual thing - which is basically how many characters to display when using the client but storage related it doesn't mean anything. The column will not restrict any value until it reaches the maximum value for an int column (32 bit). Normally it is good not to specify any value just to avoid those misunderstandings.

1enwiki:
2
3root@db1114[sys]> select * FROM schema_auto_increment_columns LIMIT 4 \G
4*************************** 1. row ***************************
5 table_schema: enwiki
6 table_name: recentchanges
7 column_name: rc_id
8 data_type: int
9 column_type: int(8)
10 is_signed: 1
11 is_unsigned: 0
12 max_value: 2147483647
13 auto_increment: 1137690279
14auto_increment_ratio: 0.5298
15*************************** 2. row ***************************
16 table_schema: enwiki
17 table_name: cu_changes
18 column_name: cuc_id
19 data_type: int
20 column_type: int(11)
21 is_signed: 1
22 is_unsigned: 0
23 max_value: 2147483647
24 auto_increment: 854445063
25auto_increment_ratio: 0.3979
26*************************** 3. row ***************************
27 table_schema: enwiki
28 table_name: text
29 column_name: old_id
30 data_type: int
31 column_type: int(8) unsigned
32 is_signed: 0
33 is_unsigned: 1
34 max_value: 4294967295
35 auto_increment: 898750098
36auto_increment_ratio: 0.2093
37*************************** 4. row ***************************
38 table_schema: enwiki
39 table_name: revision
40 column_name: rev_id
41 data_type: int
42 column_type: int(8) unsigned
43 is_signed: 0
44 is_unsigned: 1
45 max_value: 4294967295
46 auto_increment: 887709732
47auto_increment_ratio: 0.2067
484 rows in set, 36 warnings (0.04 sec)
49
50wikidatawiki:
51
52root@db1071.eqiad.wmnet[sys]> select * FROM schema_auto_increment_columns LIMIT 4 \G
53*************************** 1. row ***************************
54 table_schema: wikidatawiki
55 table_name: recentchanges
56 column_name: rc_id
57 data_type: int
58 column_type: int(11)
59 is_signed: 1
60 is_unsigned: 0
61 max_value: 2147483647
62 auto_increment: 919219099
63auto_increment_ratio: 0.4280
64*************************** 2. row ***************************
65 table_schema: wikidatawiki
66 table_name: cu_changes
67 column_name: cuc_id
68 data_type: int
69 column_type: int(11)
70 is_signed: 1
71 is_unsigned: 0
72 max_value: 2147483647
73 auto_increment: 899023427
74auto_increment_ratio: 0.4186
75*************************** 3. row ***************************
76 table_schema: wikidatawiki
77 table_name: text
78 column_name: old_id
79 data_type: int
80 column_type: int(10) unsigned
81 is_signed: 0
82 is_unsigned: 1
83 max_value: 4294967295
84 auto_increment: 889703558
85auto_increment_ratio: 0.2072
86*************************** 4. row ***************************
87 table_schema: wikidatawiki
88 table_name: revision
89 column_name: rev_id
90 data_type: int
91 column_type: int(10) unsigned
92 is_signed: 0
93 is_unsigned: 1
94 max_value: 4294967295
95 auto_increment: 882728203
96auto_increment_ratio: 0.2055
974 rows in set, 52 warnings (0.37 sec)
was run in march. We probably should have monitoring on this on all hosts. Recentchanges are the first to find the issues. Converting them to unsigned T62962 would double the limit.

Update of these numbers:

enwiki 

root@db1089.eqiad.wmnet[sys]> select * FROM schema_auto_increment_columns LIMIT 4 \G
*************************** 1. row ***************************
        table_schema: enwiki
          table_name: recentchanges
         column_name: rc_id
           data_type: int
         column_type: int(8)
           is_signed: 1
         is_unsigned: 0
           max_value: 2147483647
      auto_increment: 1220497142
auto_increment_ratio: 0.5683
*************************** 2. row ***************************
        table_schema: enwiki
          table_name: cu_changes
         column_name: cuc_id
           data_type: int
         column_type: int(11)
           is_signed: 1
         is_unsigned: 0
           max_value: 2147483647
      auto_increment: 908617190
auto_increment_ratio: 0.4231
*************************** 3. row ***************************
        table_schema: enwiki
          table_name: text
         column_name: old_id
           data_type: int
         column_type: int(8) unsigned
           is_signed: 0
         is_unsigned: 1
           max_value: 4294967295
      auto_increment: 946292759
auto_increment_ratio: 0.2203
*************************** 4. row ***************************
        table_schema: enwiki
          table_name: revision
         column_name: rev_id
           data_type: int
         column_type: int(8) unsigned
           is_signed: 0
         is_unsigned: 1
           max_value: 4294967295
      auto_increment: 934644421
auto_increment_ratio: 0.2176
4 rows in set, 32 warnings (0.09 sec)
wikidatawiki
*************************** 1. row ***************************
        table_schema: wikidatawiki
          table_name: recentchanges
         column_name: rc_id
           data_type: int
         column_type: int(11)
           is_signed: 1
         is_unsigned: 0
           max_value: 2147483647
      auto_increment: 1131254488
auto_increment_ratio: 0.5268
*************************** 2. row ***************************
        table_schema: wikidatawiki
          table_name: cu_changes
         column_name: cuc_id
           data_type: int
         column_type: int(11)
           is_signed: 1
         is_unsigned: 0
           max_value: 2147483647
      auto_increment: 1114807877
auto_increment_ratio: 0.5191
*************************** 3. row ***************************
        table_schema: wikidatawiki
          table_name: text
         column_name: old_id
           data_type: int
         column_type: int(10) unsigned
           is_signed: 0
         is_unsigned: 1
           max_value: 4294967295
      auto_increment: 1104265812
auto_increment_ratio: 0.2571
*************************** 4. row ***************************
        table_schema: wikidatawiki
          table_name: revision
         column_name: rev_id
           data_type: int
         column_type: int(10) unsigned
           is_signed: 0
         is_unsigned: 1
           max_value: 4294967295
      auto_increment: 1093332597
auto_increment_ratio: 0.2546
4 rows in set, 52 warnings (0.07 sec)

Change 577202 had a related patch set uploaded (by Addshore; owner: Addshore):
[analytics/wmde/scripts@master] Track wikibase repo table auto increment usage above 25%

https://gerrit.wikimedia.org/r/577202

Another table to keep an eye on is Wikidata’s wb_changes. Its change_id is thankfully unsigned, but it’s in the same ballpark as old_id and rev_id, currently at 1274061080 (23.7% of 2^32 - 1).

Edit: actually, it turns out we have a signed reference to change_id in a different table… https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/628470/1/repo/sql/mysql/changes_dispatch.sql#8

LSobanski raised the priority of this task from Lowest to Low.Apr 26 2021, 10:09 AM

Change 577202 abandoned by Addshore:

[analytics/wmde/scripts@master] Track wikibase repo table auto increment usage above 25%

Reason:

We do this as a nice % of ids available elsewhere now, thanks Amir!

https://gerrit.wikimedia.org/r/577202