
Increase size of categorylinks.cl_collation column
Open, Medium, Public

Description

categorylinks.cl_collation is currently a varbinary(32), but if we want to add a version number to it (T146341), it will need to be bigger, for example, to accommodate "uca-bn@collation=traditional 57.1" or "uca-de-AT@collation=phonebook 57.1", which are 33 and 34 characters respectively. Let's change it to varbinary(50).
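For reference, the schema change would be roughly the following (a sketch, not copied from the actual patch; /*_*/ is MediaWiki's table-prefix placeholder):

```sql
-- Sketch of the proposed change; the exact column definition
-- in the Gerrit patch may differ.
ALTER TABLE /*_*/categorylinks
  MODIFY cl_collation varbinary(50) NOT NULL DEFAULT '';
```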

Event Timeline

Change 339109 had a related patch set uploaded (by Kaldari):
Increase size of categorylinks.cl_collation column

https://gerrit.wikimedia.org/r/339109

kaldari triaged this task as Medium priority. Feb 22 2017, 2:12 AM
kaldari edited projects, added DBA, Community-Tech-Sprint; removed Community-Tech.

If there are issues with changing the field size, we could potentially just hash the value if it doesn't fit.

I think this is way overkill. We only support a couple dozen collations. Even if we supported multiple collations for every human language, there would still be at most a couple thousand different ones.

We do not need 32 bytes to store this information (note that this is 256**32 = 115792089237316195423570985008687907853269984665640564039457584007913129639936 possible values), much less 50 bytes. We just need to store them a little bit smarter. I liked the truncation, but if people don't like it for some reason, we can maybe just hash the names (fun fact, md5() just happens to return 32 hex characters), or maybe assign numbers to them.
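To illustrate the sizes involved (MySQL syntax; in MediaWiki the hashing would happen in PHP, this is only a demonstration):

```sql
-- MD5 output is always exactly 32 hex characters, so any collation
-- name, however long, fits the existing varbinary(32) column.
SELECT MD5('uca-de-AT@collation=phonebook 57.1') AS hashed,
       CHAR_LENGTH(MD5('uca-de-AT@collation=phonebook 57.1')) AS len;  -- len = 32
```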

Just hash it, please, as I already suggested on the original patch. MD5 is perfectly fine for this use case. Security does not matter here (no user input, only configuration) and collisions are non-existent for such a small set of values.

> Security does not matter here (no user input, only configuration) and collisions are non-existent for such a small set of values.

This is definitely true, but I'd prefer we use truncated SHA-256 for anything new; that way, in later auditing, you can just skip over it and know it's safe without looking at the details.
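A sketch of that variant (MySQL syntax, illustration only):

```sql
-- SHA-256 truncated to 32 hex characters also fits varbinary(32)
-- while keeping 128 bits of the digest.
SELECT LEFT(SHA2('uca-de-AT@collation=phonebook 57.1', 256), 32) AS truncated;
```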

Hashing it seems like a bad idea to me. We would be obfuscating the data on 100% of wikis in order to accommodate the 0% of wikis that actually need long collation strings. What about doing Thiemo's other idea: Just using the first number of the ICU version? It seems unlikely that we would want to regenerate collations just for minor ICU changes. Using only the first number in the version lets us support all possible collations up to ICU version 99 ("uca-de-AT@collation=phonebook 99" is 32 characters). If we throw out the space, we could support all versions up to 999.
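The length arithmetic, spelled out (MySQL, illustration only):

```sql
SELECT CHAR_LENGTH('uca-de-AT@collation=phonebook 99');   -- 32: just fits varbinary(32)
SELECT CHAR_LENGTH('uca-de-AT@collation=phonebook999');   -- 32: space dropped, works up to ICU 999
```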

We could just hash only in the case that the key is too long.
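A sketch of that fallback, assuming a 32-byte limit (shown in MySQL for brevity; in MediaWiki the check would live in PHP):

```sql
SET @name = 'uca-de-AT@collation=phonebook 57.1';
-- Store short names as-is; collapse over-long ones to a 32-char MD5.
SELECT IF(CHAR_LENGTH(@name) <= 32, @name, MD5(@name)) AS cl_collation_value;
```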

I think ICU minor versions could cause the sort order to change, but I'm not sure. If they do, we would need to record them.

> We could just hash only in the case that the key is too long.

It's kind of an awkward solution, but I would be willing to go along with it. I'm curious though what's wrong with just changing to varbinary(50)? That seems like the most straightforward solution. I still haven't heard any downsides to it.

> I'm curious though what's wrong with just changing to varbinary(50)? That seems like the most straightforward solution. I still haven't heard any downsides to it.

Perhaps it would make very little difference, and perhaps I'm cargo-culting a bit, but it seems like a step in the wrong direction and just wasteful. Elsewhere we're making steps to get rid of long varchar fields which only really accept a small set of values – e.g. in Brion's proposed revision table refactoring (https://www.mediawiki.org/wiki/User:Brion_VIBBER/Compacting_the_revision_table_round_2), rev_content_model and rev_content_format (now varchar(32) and varchar(64)) become integer keys.

There are worse problems on the *links tables; that is why I am OK with a poor man's patch until I send a complete refactoring your way.

"Poor man's" may come across as too negative. What I mean is that I prefer something fixed now, even if it is not perfect, over something perfect that may never be deployed. We can deploy this now (it has my support) while working in the background on a better design. I do not like to block patches that make things better overall. I know I can apply such a patch soon-ish; a full refactoring may take years.

> Perhaps it would make very little difference, and perhaps I'm cargo-culting a bit, but it seems like a step in the wrong direction and just wasteful.

Varbinary doesn't use any padding (unlike binary), so I don't think it would be wasteful (unless you just mean a waste of time). The practical effect on the production databases will be nil, as I don't expect any production wikis to use long collation strings. It just means that in the very unlikely event that some wiki needs to use "uca-de-AT@collation=phonebook" and we're appending version numbers, it won't fail.

I guess there's always the question of what happens when someone makes an even longer collation name.

(Consider me neutral on the issue)

> I guess there's always the question of what happens when someone makes an even longer collation name.

If anyone makes an even longer collation name I will personally hunt them down and force them to calculate sha-256 hashes for us, on paper, without a calculator.

One other issue with using hashes (although I'm not sure if it would actually matter) is that in most cases we would be switching from 6 bytes of data in the cl_collation column to 32 bytes of data. Although that doesn't sound like much, we're talking about hundreds of millions of records. I wonder whether that would affect the performance of the tables at all.
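(For scale, assuming on the order of 500 million rows, a figure assumed here rather than taken from this task: 26 extra bytes per row works out to roughly 26 B × 5×10^8 ≈ 13 GB of additional raw column data, before any index overhead.)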

@thiemowmde: We seem to be at a stalemate, which is blocking the merge of two other features. What are your thoughts on my comments at T158724#3048124 and T158724#3054267?

> What are your thoughts on my comments at T158724#3048124 and T158724#3054267?

As I said: just use a hash function. The use case we are discussing here is pretty much exactly what hash functions are made for.

> Hashing it seems like a bad idea to me. We would be obfuscating the data […]

I may be missing something, but as far as I can tell nobody will ever look at the actual content of this column. Am I right that the only code that ever reads this column is a few cl_collation != … comparisons? So just store a hash.

> What about doing Thiemo's other idea: Just using the first number of the ICU version?

This doesn't solve the issue. The strings will be shorter, but can still exceed the limit, cutting off relevant information.

> […] we would be switching from 6 bytes of data in the cl_collation column to 32 bytes […] we're talking about hundreds of millions of records.

If this is a relevant issue you must change the column type to integer. You can create a lookup table for all known collations, or simply use CRC32. Really. What is the chance of a collision, and what impact would that have? That some unnecessary updates are done? When considering that the purpose of the discussed change (add the version number) is to trigger more updates (not only when the collation changes but also when its version changes), I don't believe that some more (very unlikely) updates will be a problem.
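For illustration (MySQL; CRC32() returns an unsigned 32-bit integer, so the result fits an INT UNSIGNED column):

```sql
-- Any collation name maps to a 4-byte integer key.
SELECT CRC32('uca-de-AT@collation=phonebook 57.1') AS coll_key;
```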

> If this is a relevant issue you must change the column type to integer. You can create a lookup table for all known collations, or simply use CRC32. Really. What is the chance of a collision, and what impact would that have? That some unnecessary updates are done? When considering that the purpose of the discussed change (add the version number) is to trigger more updates (not only when the collation changes but also when its version changes), I don't believe that some more (very unlikely) updates will be a problem.

I'm not worried about collisions. I'm worried about bloating an already huge table with more data for no practical reason.

> I may be missing something, but as far as I can tell nobody will ever look at the actual content of this column.

I was looking at it all the time while working on https://gerrit.wikimedia.org/r/#/c/272419/ and testing https://gerrit.wikimedia.org/r/#/c/327762/. Both of those tasks would have been more difficult if we were storing a hash rather than the actual collation.

> I'm not worried about collisions. I'm worried about bloating an already huge table with more data for no practical reason.

From a table-bloat perspective, keeping the length the same and hashing only when the value exceeds the limit would be the least bloating option of all the choices.

> I'm worried about bloating an already huge table […]

I wonder why you are phrasing this as a response to what I wrote? Personally I agree that increasing the length of the column is a bad idea. Just use a hash that fits into the existing column. Note that the column is binary. SHA-256 is a perfect fit. But as I already said multiple times you can use whatever hash function you want, even truncated hashes.

> I was looking at it all the time while working […] and testing […]

So you are a dev and can handle hashes. Fine.

[ Though people might disagree on approaches, could everyone please be respectful and assume people mean well? Thanks a lot! :) ]

If we are doing things "right", hashes are a bad idea for primary keys (foreign keys of this table). They can have collisions, and they are too large: for non-crypto purposes, and for an id that is only going to be internal and never exposed to the user, auto_increments can be as small as 1 byte and are much more efficient. They also do not carry the risk of people calculating the hash and violating the foreign key, as people have suggested (which worries me). Also, if the name changes, an arbitrary key is way more efficient: if someone decides to change the "Thalassian" collation to the "high elven" denomination, you only need a single-row change.

I keep hearing people suggest hashes, when in most cases that is a bad idea. There are cases where auto-increments are a bad idea too, but this is not one of them. Hashes are (in general) a bad idea to insert into a B+tree, due to their inherently random nature, while monotonically increasing values are much faster to import and alter in InnoDB. I have presented these results at several MySQL conferences, using the MediaWiki schema as an example: https://www.slideshare.net/jynus/query-optimization-from-0-to-10-and-up-to-57/110

Of course, the largest bloat in this table is the title-related fields, which I ask for your help to completely redo, the same way the revision table is being redone.
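A sketch of the lookup-table design mentioned above (all table and column names here are hypothetical, not from any actual patch):

```sql
-- One row per known collation; the name is stored exactly once.
CREATE TABLE /*_*/collation (
  coll_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  coll_name VARBINARY(255) NOT NULL,
  UNIQUE KEY coll_name (coll_name)
);

-- categorylinks would then hold a small, monotonically assigned key
-- (e.g. cl_collation_id INT UNSIGNED NOT NULL), and renaming a
-- collation becomes a single-row UPDATE on this table.
```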

I think there is too much fuss being made over this bug. As a temporary solution until some larger table refactoring takes place (which I assume we want to do all at once), both the increase-the-field-size solution and the hash-if-it-doesn't-fit solution are totally fine. I don't think the drawbacks of either solution really matter.

> As a temporary solution until some larger table refactoring takes place (which I assume we want to do all at once), both the increase-the-field-size solution and the hash-if-it-doesn't-fit solution are totally fine. I don't think the drawbacks of either solution really matter.

+1. It is the same idea I expressed here: T158724#3050000, and I only changed my opinion because some people disagreed. Better to refactor on a separate ticket.

1978Gage2001 moved this task from Triage to In progress on the DBA board.

> I wonder why you are phrasing this as a response to what I wrote? Personally I agree that increasing the length of the column is a bad idea.

I think we are misunderstanding each other. I was saying that using a hash would bloat the table as we would be storing 32 bytes per row rather than 6-16 bytes (in practice). Changing the column to varbinary(50) would have no effect on the size of the table or keys (assuming that none of our wikis have any plans to switch to collations with extra long names).

While I agree that a hash seems like a good solution in theory, in this case it has two downsides: increasing the size of the data that is being stored in practice, and making it slightly harder to do development and debugging around collation. I still haven't heard any downside to changing the column to varbinary(50) other than two theoretical issues which aren't realistic concerns, IMO: that new collations could be invented with names longer than 50 characters, and table bloat from the use of collation names longer than 32 characters.

Removing task assignee due to inactivity, as this open task has been assigned to the same person for more than two years (see the emails sent to the task assignee on Oct 27 and Nov 23). Please assign this task to yourself again if you still realistically [plan to] work on this task - it would be welcome.
(See https://www.mediawiki.org/wiki/Bug_management/Assignee_cleanup for tips on how to best manage your individual work in Phabricator.)

LSobanski subscribed.

Removing the DBA tag and subscribing myself instead. Once there are specific actions for DBA please re-add us and/or @mention me.