Page MenuHomePhabricator

ips_site_page is too short to store some (full) page titles
Closed, ResolvedPublic

Description

Today I discovered that our ips_site_page field is to small to store all page titles out there (eg. https://www.wikidata.org/wiki/Q6703647 is affected by that). The field is defined as varbinary(255) NOT NULL which would be enough for the unqualified page titles (page_title is varbinary(255) everywhere), but it's not large enough to hold the page titles plus their namespaces (that's how we store the titles, unlike the page table which has a separate int field for the namespace).

To make that field fit for all pages on every Wikimedia wiki we should go for a length of at very least 300, as the longest page title I could find has 255 chars and the longest non-talk namespace has a length of 39 (The longest talk namespace has a length of 67!), also we need to store the colon. I'd suggest to go for a bit more than that, just to play it safe.

Event Timeline

hoo raised the priority of this task from to Needs Triage.
hoo updated the task description. (Show Details)
hoo added subscribers: hoo, daniel, aude and 3 others.
hoo set Security to None.
hoo triaged this task as High priority.May 17 2015, 7:59 PM

31 items are potentially affected:

mysql:wikiadmin@db2052.codfw.wmnet [wikidatawiki]> SELECT COUNT(*) FROM wb_items_per_site WHERE LENGTH(ips_site_page) = 255;
+----------+
| COUNT(*) |
+----------+
|       31 |
+----------+

Once this bug is fixed, I'll run rebuildItemsPerSite (which I wanted to run anyway, but it failed because of this).

I feat that MySQL doesn't fully index varchar fields wider than 255 characters (or even bytes?). And I'm unsure if it's smart enough to fall back to prefix search and post-processing. I seem to recall that this doesn't work. Can you compare the output of EXPLAIN for the relevant queries with and without this change?

This shouldn't be an issue, I think indexes can be up to 767bytes without any issues. For completeness, I've tested this:

Production:

EXPLAIN SELECT * FROM wb_items_per_site WHERE ips_site_id = 'dewiki' AND ips_site_page = 'Berlin'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: wb_items_per_site
         type: const
possible_keys: wb_ips_item_site_page,wb_ips_site_page
          key: wb_ips_item_site_page
      key_len: 291
          ref: const,const
         rows: 1
        Extra:

Locally (after running ALTER TABLE wb_items_per_site CHANGE ips_site_page ips_site_page varbinary(400) NOT NULL;):

EXPLAIN SELECT * FROM wb_items_per_site WHERE ips_site_id = 'dewiki' AND ips_site_page = 'Berlin'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: wb_items_per_site
         type: const
possible_keys: wb_ips_item_site_page,wb_ips_site_page
          key: wb_ips_item_site_page
      key_len: 436
          ref: const,const
         rows: 1
        Extra:

Ok then, thanks for checking!

Smaller fields are still better for performance, so I vote for a conservative 300, unless 400 is demonstrably needed.

Correct, 767 bytes (bytes, not characters) is the limit for a single InnoDB index with our current DB config, which is enough, and in any case with a config change[1], a policy change[2], and table rebuilds[3] that limit could be raised to 3072 bytes.

[1] https://mariadb.com/kb/en/mariadb/xtradbinnodb-server-system-variables/#innodb_large_prefix
[2] We still claim MySQL 5.0.2 support for Mediawiki.
[3] Needs InnoDB tables with barracuda file format.

Today I discovered that our ips_site_page field is to small to store all page titles out there (eg. https://www.wikidata.org/wiki/Q6703647 is affected by that). The field is defined as varbinary(255) NOT NULL which would be enough for the unqualified page titles (page_title is varbinary(255) everywhere), but it's not large enough to hold the page titles plus their namespaces (that's how we store the titles, unlike the page table which has a separate int field for the namespace).

Is it a good idea for the ips_site_page field to store the namespace name as a string? Why is it doing that?

To make that field fit for all pages on every Wikimedia wiki we should go for a length of at very least 300, as the longest page title I could find has 255 chars and the longest non-talk namespace has a length of 39 (The longest talk namespace has a length of 67!), also we need to store the colon. I'd suggest to go for a bit more than that, just to play it safe.

page.page_title is 255 bytes today, but I'm not sure it's a great idea to make other fields dependent on its length. Matching its length might be more reasonable. While I understand smaller fields are better for performance, I'd like if page.page_title and similar fields followed the model now being used with the *_comment fields (cf. T6715). That is, have a higher limit on the database side and enforce the limit at the application level.

From my reading of the numbers here, 300 bytes wouldn't be sufficient in the current most pathological case (255 bytes + 67 bytes).

MZMcBride renamed this task from ips_site_page is to short to store some (fulll) page titles to ips_site_page is too short to store some (full) page titles.May 18 2015, 2:13 AM

Today I discovered that our ips_site_page field is to small to store all page titles out there (eg. https://www.wikidata.org/wiki/Q6703647 is affected by that). The field is defined as varbinary(255) NOT NULL which would be enough for the unqualified page titles (page_title is varbinary(255) everywhere), but it's not large enough to hold the page titles plus their namespaces (that's how we store the titles, unlike the page table which has a separate int field for the namespace).

Is it a good idea for the ips_site_page field to store the namespace name as a string? Why is it doing that?

For the same reason the iwlink table uses only varchar(255) for the iwl_title field: because we don't know the namespace IDs of the target wiki. We don't even know the target side is a wiki at all. 255 is a completely arbitrary limit in this context, picked in the hope that it's efficient, and sufficient in most cases.

The problem is that in contrast to iwlinks, ips_site_page is actually used to address items, not just for reverse link lookups. So incomplete information there is more annyoing. WE could make sure we at least cover all codes in which the target is a MediaWiki.

Is it a good idea for the ips_site_page field to store the namespace name as a string? Why is it doing that?

For the same reason the iwlink table uses only varchar(255) for the iwl_title field: because we don't know the namespace IDs of the target wiki. We don't even know the target side is a wiki at all. 255 is a completely arbitrary limit in this context, picked in the hope that it's efficient, and sufficient in most cases.

Huh, right. Briefly scanning through tables.sql, it looks like langlinks.ll_title is affected as well.

I'm actually more closely reminded of externallinks.el_to, but that's a blob with a weird custom index.

Change 228756 had a related patch set uploaded (by Hoo man):
Resize wb_items_per_site.ips_site_page to VARCHAR(310)

https://gerrit.wikimedia.org/r/228756

Change 228756 merged by jenkins-bot:
Resize wb_items_per_site.ips_site_page to VARCHAR(310)

https://gerrit.wikimedia.org/r/228756

hoo claimed this task.
hoo removed a project: Patch-For-Review.

By @hoo request, I have applied the schema change also to testwikidatawiki. Please, in the future, for us silly DBAs :-), we need affected wikis very explicitly.