Evaluate how to best add a column for full entity ID to wb_terms without affecting wikidata.org users
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	WMDE-leszek
	Mar 6 2017, 3:20 PM

Description

This is a part of schema change of wb_terms table (custom table of Wikibase extension).

We would like to know if it would be possible to add empty (all NULLs) column to wb_terms without needing to make Wikidata DB read-only.
https://gerrit.wikimedia.org/r/#/c/341322/1/repo/sql/AddTermsFullEntityId.sql is a draft of the change we want to do.
There will be no code writing to and reading from the new column until the process of adding the column is complete.

Background:
Current schema change plan (as suggested by Jaime)

Create a new column with new indexes
Populate new column at the same time than the current one + populate old values with maintenance
Deploy the new code using only the new column
At some point in the future, drop the old column

Related Objects
Search...

Status	Subtype	Assigned	Task
Declined		dchen	T118706 Conduct heuristic evaluation of image upload and insert flow in VisualEditor
Open		None	T115858 Design improvements for mw.ForeignStructuredUpload.BookletLayout
Open		None	T115865 Insert image in content immediately after it's uploaded, skipping the "General settings" step
Duplicate		None	T115864 Figure out if the description of the image can be used as the caption on-wiki
Open	Feature	None	T53032 When inserting an image, set its caption by default to be the Commons image description
Open	Feature	None	T39534 Wikimedia Commons should support searching by color
Duplicate		None	T39535 Wikimedia Commons should support filtering by color
Resolved		None	T19503 Provide metadata support on Wikimedia Commons
Resolved		None	T51662 VisualEditor: Use Multimedia/Wikidata's proposed rich structured meta-data in the image insertion dialog
Resolved		None	T68108 [Epic] Store media information for files on Wikimedia Commons as structured data
Duplicate		None	T66288 basic support for structured data on mediawiki files
Invalid		Lydia_Pintscher	T76012 make use of new entity type for multimedia / structured data of media files
Open		None	T109579 [Epic] Give more sister projects access to Wikidata
Open		None	T187900 There is no way to reference a specific quote on Wikiquote
Stalled		None	T71753 [Story] Wikibase / Wikidata support on Wikiquote
Open		None	T67626 [Epic] Support for queries on-wiki (automated list generation)
Resolved		Addshore	T76019 [Story] Support new types of Entities in Wikibase Client
Resolved		thiemowmde	T135650 [Task] Migrate PropertySuggester away from assuming all entities are numeric
Resolved		Addshore	T75496 [Epic] Support new types of Entities in Wikibase Repository
Declined		None	T58711 [Task] Update the wb_terms table so it does not have a numeric entity id
Open		None	T30599 Deadlock tracking bug (tracking)
Resolved		hoo	T111535 Wikibase\Repo\Store\SQL\EntityPerPageTable::{closure} creating high number of deadlocks
Resolved		Lydia_Pintscher	T146637 Wikidata 2016 Q4 goals
Resolved		None	T86530 Replace wb_terms table with more specialized mechanisms for terms (tracking)
Declined		None	T51982 Add missing wb_entity_per_page entries on LinksUpdate
Invalid		Lydia_Pintscher	T70176 EntityPerPageTable class should be usable from the client
Resolved		Ladsgroup	T67333 Wikibase\EntityPerPageTable::getItemsWithoutSitelinks slow query with large LIMIT offset
Resolved		Addshore	T114902 Remove numeric entity IDs from database schema
Resolved		Ladsgroup	T95685 Drop wb_entity_per_page table
Declined		None	T114903 Migrate wb_terms to using prefixed entity IDs instead of numeric IDs
Resolved		Lydia_Pintscher	T159718 Evaluate how to best add a column for full entity ID to wb_terms without affecting wikidata.org users

Event Timeline

WMDE-leszek created this task.Mar 6 2017, 3:20 PM

WMDE-leszek moved this task from Proposed to Monitoring on the Wikidata-Former-Sprint-Board board.

• jcrespo moved this task from Triage to Pending comment on the DBA board.Mar 6 2017, 3:22 PM

Hi!

The wb_terms table is a quite big table (as you already probably know) it is around 230G on disk. That means the ALTER table will take quite a while.
The good news is that adding a column can be done in-place so we can probably do it without causing delay on the slaves (although we'd depool them). However, we can do also the ALTER on the stand by master in the standby DC and once we have switched over the other DC in April, we can get it done in the primary master too. I do not think we have to set up wikidata in read-only, as long as the code is compatible with the column being added one by one on all the servers and you do not write to it until it is 100% completed.

Normally, it is useful for us, DBAs, if you can follow the template to request an ALTER table, so we can have all the information needed at once, speeding up the process for everyone.
This is it: https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change

Thanks!

@Marostegui I think they do not yet want it done yet, but an ok from us/review. But they should probably clarify that. "feasibility" is an ambiguous term.

WMDE-leszek renamed this task from Evaluate feasibility of adding a column for full entity ID to wb_terms to Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users.Mar 6 2017, 4:32 PM

WMDE-leszek updated the task description. (Show Details)

Evaluate if it is feasible to add such an "empty" column without making Wikidata readonly.

we can probably do it

How certain are you? In my experience, the biggest blocker on production is not the size, but how busy the table is. That would create metadata locking issues if it is very SELECT-heavy. The difference between suffering one or the other are hours vs. weeks of deployment (as it happened with the last page table schema). This need deeper examination.

In T159718#3076201, @jcrespo wrote:

Evaluate if it is feasible to add such an "empty" column without making Wikidata readonly.

we can probably do it

How certain are you? In my experience, the biggest blocker on production is not the size, but how busy the table is. That would create metadata locking issues if it is very SELECT-heavy. The difference between suffering one or the other are hours vs. weeks of deployment (as it happened with the last page table schema). This need deeper examination.

If we depool the slaves we should be fine, shouldn't we? And if we use the DC switchover to alter the masters we'd also get rid of that issue?

If we depool the slaves we should be fine, shouldn't we? And if we use the DC switchover to alter the masters we'd also get rid of that issue?

Hey, don't tell me, tell @WMDE-leszek, and see if he is ok with that schedule. :-)

Thanks @Marostegui for infromation so far.
And yes, @jcrespo is right. We first want to get an OK from you if this is a right way for us to go. I've changed task title slightly. Sorry for not being clear.

So yes, we wouldn't have a code writing/reading to a new column until adding a column is completed.
What we would like to know is if it is possible to add an empty to column to a big table wb_terms without it being noticed by users of wikidata.org.
Adding an empty column seems to be the most simple first step we can do.
Depending on the answer to this, we will plan further steps.
E.g. if we need to first add it on stand-by master, and then update primary master during DC switch over, we can than possibly not only add the column but also populate it? But if we can just add empty column with no "bad" effect on wikidata performance, it would be easier to just add it first.

Thanks for pointing me to https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change. We're going to follow it when we request a schema change "for real".

Depending on the answer to this, we will plan further steps.

I think you should add the full plan here ASAP, even if it is not 100% clear or decided, otherwise we may be adding steps to the process and make it unnecessarily long. E.g. if you plan to add an index later, it will be much easier to know it in advance. Adding an index later may take as much time as adding a column, and we will not have another dc swithchover soon.

daniel mentioned this in T159703: [Tracking] Support for Forms and Senses.Mar 6 2017, 4:57 PM

Oh, sorry, I missed few comments while typing mine :)

It seems to me @Marostegui and @jcrespo that it would better to wait until DC switch over, we can plan with that.

We're completely fine with waiting until April.

I think you should add the full plan here ASAP, even if it is not 100% clear or decided, otherwise we may be adding steps to the process and make it unnecessarily long.

You're 100% right.

General plan for the whole schema change is:

Add term_entity_id_s VARBINARY(32) to wb_terms. All values NULL
Populate term_entity_id_s based on existing data in wb_terms
Create index on term_entity_s
Create index on term_language, term_entity_id_s, term_type, term_search_key(16)
Drop old column term_entity_id from wb_terms.
Rename term_entity_id_s to term_entity_id

That's quite a lot given size of wb_terms. Is this all something that is doable during DC switchover?
Should we plan our work so we are ready to do all those change on April's switchover?

WMDE-leszek updated the task description. (Show Details)Mar 6 2017, 5:01 PM

WMDE-leszek added a project: Schema-change.Mar 6 2017, 5:04 PM

Ok, now I have some comments against that method, logistically, I am at a meeting, let me finish it and I will have some time to properly explain myself (nothing against the spirit of the changes, I would just do it in a different way, if code can handle it).

So the comments:

do not defer the creation of the indexes- those are extra alter tables and do no make things easier in any way- just create the indexes from the start- assuming they will be used.
Renaming columns is a big no- specially to an already existent name. Always add a new column and stop using the previous one at some point. We then can drop the old one with a lower precedence. Think not only as a bad practice in general with code- some applications that are not mediawiki, like labs, may break in bad ways.

So my suggestion would be:

Create a new column with new indexes
Populate new column at the same time than the current one (ideally with code, but if you do not want to touch code, it can even be made with a trigger) + populate old values with maintenance
Deploy the new code using only the new column
At some point in the future, drop the old column (it can be made all null and it will save space already) -low priority

If you are changing an int for a string in a trivial way, that mysql would be compatible with (1234 -> '1234'), it could even be done in a single blow, but I assume it is not the case, probably something more on the lines of 1234 -> 'Q1234', am I right?

This simplifies the process, and less schema changes is not only less work for us- it is also you getting less blocked by us.

This is not an absolute thing- I am open to criticism, as you may know more details that I don't. It is ok to keep ids like term_entity_id_s, reusing identifiers is a bit frowned upon.

In T159718#3077513, @jcrespo wrote:

Populate new column at the same time than the current one (ideally with code, but if you do not want to touch code, it can even be made with a trigger) + populate old values with maintenance

If possible, I would avoid using triggers on such big (and potentially concurrent) table and I would try to stick to the code to populate the column if this is not a lot of extra work.
The reason for it is just based on past experiences with triggers on big big tables (and this is quite a big table) and the load they might (or might not, depending on the usage) put on the server.

Thank you @jcrespo and @Marostegui for your comments. I am happy I asked and you made me sketch the whole big plan.

As you might have realized , I am really not a DB expert. Therefore some not smart things in the plan :)

I really had no good reason for renaming the column, neither for creating indexes later. It just seemed "easier" that way. I agree with all you said @jcrespo, thanks for providing clear arguments. Our code will also be simpler the way you suggested, I think.

I am going to bring your plan below to the whole developer team. For now I don't see any reason not to do it that way.

Create a new column with new indexes

Populate new column at the same time than the current one (ideally with code, but if you do not want to touch code, it can even be made with a trigger) + populate old values with maintenance

Deploy the new code using only the new column

At some point in the future, drop the old column (it can be made all null and it will save space already) -low priority

Regarding int to string change: Yes, it will be changes like 123 -> 'Q123', and 456 -> 'P456'.
Is it done with maint script or with trigger, it does not matter much from our perspective. We'll do it the way you guys says is the best for this. Writing a maint PHP script for doing this is clearly something we're capable of :)

In T159718#3079126, @WMDE-leszek wrote:

Thank you @jcrespo and @Marostegui for your comments. I am happy I asked and you made me sketch the whole big plan.

As you might have realized , I am really not a DB expert. Therefore some not smart things in the plan :)

That's what we are here for: to help, and what you did was exactly what we normally want, get involved in the early stages so we can provide our point of view and give some more context to the db-land! :-)

I really had no good reason for renaming the column, neither for creating indexes later. It just seemed "easier" that way. I agree with all you said @jcrespo, thanks for providing clear arguments. Our code will also be simpler the way you suggested, I think.

I am going to bring your plan below to the whole developer team. For now I don't see any reason not to do it that way.

Create a new column with new indexes

Populate new column at the same time than the current one (ideally with code, but if you do not want to touch code, it can even be made with a trigger) + populate old values with maintenance

Deploy the new code using only the new column

At some point in the future, drop the old column (it can be made all null and it will save space already) -low priority

Regarding int to string change: Yes, it will be changes like 123 -> 'Q123', and 456 -> 'P456'.
Is it done with maint script or with trigger, it does not matter much from our perspective. We'll do it the way you guys says is the best for this. Writing a maint PHP script for doing this is clearly something we're capable of :)

I would go for the option to populate from the code if, as you said, you guys don't mind writing some extra code :-)

WMDE-leszek updated the task description. (Show Details)Mar 7 2017, 8:08 AM

WMDE-leszek updated the task description. (Show Details)Mar 7 2017, 8:12 AM

WMDE-leszek renamed this task from Evaluate if it is possbile to add an empty column for full entity ID to wb_terms without affecting wikidata.org users to Evaluate how to best add a column for full entity ID to wb_terms without affecting wikidata.org users.Mar 7 2017, 10:24 AM

WMDE-leszek mentioned this in T159851: Add a column for full entity ID to wb_terms table.Mar 7 2017, 5:09 PM

Reedy moved this task from Unsorted to Add / Create on the Schema-change board.Apr 26 2017, 3:35 PM

• Marostegui moved this task from Pending comment to In progress on the DBA board.Apr 27 2017, 8:54 AM

• Marostegui moved this task from In progress to Pending comment on the DBA board.

Further work in T162539.

Evaluate how to best add a column for full entity ID to wb_terms without affecting wikidata.org usersClosed, ResolvedPublicActions

Description

Related ObjectsSearch...

Event Timeline

Evaluate how to best add a column for full entity ID to wb_terms without affecting wikidata.org users
Closed, ResolvedPublic
Actions

Related Objects
Search...