Page MenuHomePhabricator

Evaluate how to best add a column for full entity ID to wb_terms without affecting wikidata.org users
Closed, ResolvedPublic

Description

This is a part of schema change of wb_terms table (custom table of Wikibase extension).

We would like to know if it would be possible to add empty (all NULLs) column to wb_terms without needing to make Wikidata DB read-only.
https://gerrit.wikimedia.org/r/#/c/341322/1/repo/sql/AddTermsFullEntityId.sql is a draft of the change we want to do.
There will be no code writing to and reading from the new column until the process of adding the column is complete.

Background:
Current schema change plan (as suggested by Jaime)

  1. Create a new column with new indexes
  2. Populate new column at the same time than the current one + populate old values with maintenance
  3. Deploy the new code using only the new column
  4. At some point in the future, drop the old column

Related Objects

StatusSubtypeAssignedTask
Declineddchen
OpenNone
OpenNone
DuplicateNone
OpenFeatureNone
OpenFeatureNone
DuplicateNone
ResolvedNone
ResolvedNone
ResolvedNone
DuplicateNone
InvalidLydia_Pintscher
OpenNone
OpenNone
StalledNone
OpenNone
ResolvedAddshore
Resolvedthiemowmde
ResolvedAddshore
DeclinedNone
OpenNone
Resolvedhoo
ResolvedLydia_Pintscher
ResolvedNone
DeclinedNone
InvalidLydia_Pintscher
ResolvedLadsgroup
ResolvedAddshore
ResolvedLadsgroup
DeclinedNone
ResolvedLydia_Pintscher

Event Timeline

Hi!

The wb_terms table is a quite big table (as you already probably know) it is around 230G on disk. That means the ALTER table will take quite a while.
The good news is that adding a column can be done in-place so we can probably do it without causing delay on the slaves (although we'd depool them). However, we can do also the ALTER on the stand by master in the standby DC and once we have switched over the other DC in April, we can get it done in the primary master too. I do not think we have to set up wikidata in read-only, as long as the code is compatible with the column being added one by one on all the servers and you do not write to it until it is 100% completed.

Normally, it is useful for us, DBAs, if you can follow the template to request an ALTER table, so we can have all the information needed at once, speeding up the process for everyone.
This is it: https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change

Thanks!

@Marostegui I think they do not yet want it done yet, but an ok from us/review. But they should probably clarify that. "feasibility" is an ambiguous term.

WMDE-leszek renamed this task from Evaluate feasibility of adding a column for full entity ID to wb_terms to Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users.Mar 6 2017, 4:32 PM
WMDE-leszek updated the task description. (Show Details)

Evaluate if it is feasible to add such an "empty" column without making Wikidata readonly.

we can probably do it

How certain are you? In my experience, the biggest blocker on production is not the size, but how busy the table is. That would create metadata locking issues if it is very SELECT-heavy. The difference between suffering one or the other are hours vs. weeks of deployment (as it happened with the last page table schema). This need deeper examination.

Evaluate if it is feasible to add such an "empty" column without making Wikidata readonly.

we can probably do it

How certain are you? In my experience, the biggest blocker on production is not the size, but how busy the table is. That would create metadata locking issues if it is very SELECT-heavy. The difference between suffering one or the other are hours vs. weeks of deployment (as it happened with the last page table schema). This need deeper examination.

If we depool the slaves we should be fine, shouldn't we? And if we use the DC switchover to alter the masters we'd also get rid of that issue?

If we depool the slaves we should be fine, shouldn't we? And if we use the DC switchover to alter the masters we'd also get rid of that issue?

Hey, don't tell me, tell @WMDE-leszek, and see if he is ok with that schedule. :-)

Thanks @Marostegui for infromation so far.
And yes, @jcrespo is right. We first want to get an OK from you if this is a right way for us to go. I've changed task title slightly. Sorry for not being clear.

So yes, we wouldn't have a code writing/reading to a new column until adding a column is completed.
What we would like to know is if it is possible to add an empty to column to a big table wb_terms without it being noticed by users of wikidata.org.
Adding an empty column seems to be the most simple first step we can do.
Depending on the answer to this, we will plan further steps.
E.g. if we need to first add it on stand-by master, and then update primary master during DC switch over, we can than possibly not only add the column but also populate it? But if we can just add empty column with no "bad" effect on wikidata performance, it would be easier to just add it first.

Thanks for pointing me to https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change. We're going to follow it when we request a schema change "for real".

Depending on the answer to this, we will plan further steps.

I think you should add the full plan here ASAP, even if it is not 100% clear or decided, otherwise we may be adding steps to the process and make it unnecessarily long. E.g. if you plan to add an index later, it will be much easier to know it in advance. Adding an index later may take as much time as adding a column, and we will not have another dc swithchover soon.

Oh, sorry, I missed few comments while typing mine :)

It seems to me @Marostegui and @jcrespo that it would better to wait until DC switch over, we can plan with that.

We're completely fine with waiting until April.

I think you should add the full plan here ASAP, even if it is not 100% clear or decided, otherwise we may be adding steps to the process and make it unnecessarily long.

You're 100% right.

General plan for the whole schema change is:

  1. Add term_entity_id_s VARBINARY(32) to wb_terms. All values NULL
  2. Populate term_entity_id_s based on existing data in wb_terms
  3. Create index on term_entity_s
  4. Create index on term_language, term_entity_id_s, term_type, term_search_key(16)
  5. Drop old column term_entity_id from wb_terms.
  6. Rename term_entity_id_s to term_entity_id

That's quite a lot given size of wb_terms. Is this all something that is doable during DC switchover?
Should we plan our work so we are ready to do all those change on April's switchover?

Ok, now I have some comments against that method, logistically, I am at a meeting, let me finish it and I will have some time to properly explain myself (nothing against the spirit of the changes, I would just do it in a different way, if code can handle it).

So the comments:

  • do not defer the creation of the indexes- those are extra alter tables and do no make things easier in any way- just create the indexes from the start- assuming they will be used.
  • Renaming columns is a big no- specially to an already existent name. Always add a new column and stop using the previous one at some point. We then can drop the old one with a lower precedence. Think not only as a bad practice in general with code- some applications that are not mediawiki, like labs, may break in bad ways.

So my suggestion would be:

  1. Create a new column with new indexes
  2. Populate new column at the same time than the current one (ideally with code, but if you do not want to touch code, it can even be made with a trigger) + populate old values with maintenance
  3. Deploy the new code using only the new column
  4. At some point in the future, drop the old column (it can be made all null and it will save space already) -low priority

If you are changing an int for a string in a trivial way, that mysql would be compatible with (1234 -> '1234'), it could even be done in a single blow, but I assume it is not the case, probably something more on the lines of 1234 -> 'Q1234', am I right?

This simplifies the process, and less schema changes is not only less work for us- it is also you getting less blocked by us.

This is not an absolute thing- I am open to criticism, as you may know more details that I don't. It is ok to keep ids like term_entity_id_s, reusing identifiers is a bit frowned upon.

  1. Populate new column at the same time than the current one (ideally with code, but if you do not want to touch code, it can even be made with a trigger) + populate old values with maintenance

If possible, I would avoid using triggers on such big (and potentially concurrent) table and I would try to stick to the code to populate the column if this is not a lot of extra work.
The reason for it is just based on past experiences with triggers on big big tables (and this is quite a big table) and the load they might (or might not, depending on the usage) put on the server.

Thank you @jcrespo and @Marostegui for your comments. I am happy I asked and you made me sketch the whole big plan.

As you might have realized , I am really not a DB expert. Therefore some not smart things in the plan :)

I really had no good reason for renaming the column, neither for creating indexes later. It just seemed "easier" that way. I agree with all you said @jcrespo, thanks for providing clear arguments. Our code will also be simpler the way you suggested, I think.

I am going to bring your plan below to the whole developer team. For now I don't see any reason not to do it that way.

  1. Create a new column with new indexes
  2. Populate new column at the same time than the current one (ideally with code, but if you do not want to touch code, it can even be made with a trigger) + populate old values with maintenance
  3. Deploy the new code using only the new column
  4. At some point in the future, drop the old column (it can be made all null and it will save space already) -low priority

Regarding int to string change: Yes, it will be changes like 123 -> 'Q123', and 456 -> 'P456'.
Is it done with maint script or with trigger, it does not matter much from our perspective. We'll do it the way you guys says is the best for this. Writing a maint PHP script for doing this is clearly something we're capable of :)

Thank you @jcrespo and @Marostegui for your comments. I am happy I asked and you made me sketch the whole big plan.

As you might have realized , I am really not a DB expert. Therefore some not smart things in the plan :)

That's what we are here for: to help, and what you did was exactly what we normally want, get involved in the early stages so we can provide our point of view and give some more context to the db-land! :-)

I really had no good reason for renaming the column, neither for creating indexes later. It just seemed "easier" that way. I agree with all you said @jcrespo, thanks for providing clear arguments. Our code will also be simpler the way you suggested, I think.

I am going to bring your plan below to the whole developer team. For now I don't see any reason not to do it that way.

  1. Create a new column with new indexes
  2. Populate new column at the same time than the current one (ideally with code, but if you do not want to touch code, it can even be made with a trigger) + populate old values with maintenance
  3. Deploy the new code using only the new column
  4. At some point in the future, drop the old column (it can be made all null and it will save space already) -low priority

Regarding int to string change: Yes, it will be changes like 123 -> 'Q123', and 456 -> 'P456'.
Is it done with maint script or with trigger, it does not matter much from our perspective. We'll do it the way you guys says is the best for this. Writing a maint PHP script for doing this is clearly something we're capable of :)

I would go for the option to populate from the code if, as you said, you guys don't mind writing some extra code :-)

WMDE-leszek renamed this task from Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users to Evaluate how to best add a column for full entity ID to wb_terms without affecting wikidata.org users.Mar 7 2017, 10:24 AM
Lydia_Pintscher claimed this task.

Further work in T162539.