
Overview of wb_terms redesign
Open, Needs Triage · Public

Description

Summary

In May-June 2019, the Wikidata development team will drop the wb_terms table from the database in favor of a new, optimized schema. Over the years, this table has become too big, causing various issues. Because the table is used by plenty of external tools, we are setting up a process to make sure that the change can be carried out together with developers and maintainers, without causing issues or broken tools.

On this task, you can find a description of the changes and the process, and you can ask for more details or for help in the comments. On the wb_terms - Tool Builders Migration board you will find all the details about the migration, how to update your tool, and you can add your own tasks.

Context: what is the problem with wb_terms?

wb_terms is an important table in Wikidata’s internal structure. It supports vital features in Wikidata such as performing search on entities and rendering the right labels of items and properties that get linked in statements. It has also been used by several other extensions, like PropertySuggester that uses the search feature mentioned before, and by many tool builders to query entity terms (labels, descriptions and aliases) for their needs.

In April 2018, the table took up 900 GB; as of March 2019 it takes up 946 GB (a 46 GB increase, 30 GB of which happened in the last 3 months), and we expect the rate at which the table grows to increase with wider adoption of Wikidata.
Currently, our master database host has less free space than the size of the table, which puts it in a risky situation: any attempt to alter this table, or any other table in the database, becomes dangerous due to the shortage of disk space required for such operations.
wb_terms takes up a lot of disk space because its design leads to over 70% duplication, and because it is heavily denormalized and carries 7 indexes, which alone take about 590 GB.

In order to tackle these potential risks, we are working on a solution:

  • redesign the table by normalizing it to avoid the high degree of duplication (see new schema diagram, new schema DDL)
  • provide a clean abstraction on code level to access the new schema functionality
  • migrate data gradually from the old wb_terms table into the new schema while we make sure we do not interrupt any usage of the table functionality
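The normalized layout described above can be sketched in miniature. The wbt_* table and column names below follow the published new-schema DDL, but this is an illustrative SQLite model, not the production DDL:

```python
import sqlite3

# Minimal sketch of the normalized term store (SQLite stand-in, not the
# production MySQL DDL). Table/column names follow the new-schema DDL.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE wbt_text         (wbx_id INTEGER PRIMARY KEY, wbx_text TEXT UNIQUE);
CREATE TABLE wbt_text_in_lang (wbxl_id INTEGER PRIMARY KEY, wbxl_language TEXT, wbxl_text_id INTEGER);
CREATE TABLE wbt_type         (wby_id INTEGER PRIMARY KEY, wby_name TEXT UNIQUE);
CREATE TABLE wbt_term_in_lang (wbtl_id INTEGER PRIMARY KEY, wbtl_type_id INTEGER, wbtl_text_in_lang_id INTEGER);
CREATE TABLE wbt_item_terms   (wbit_id INTEGER PRIMARY KEY, wbit_item_id INTEGER, wbit_term_in_lang_id INTEGER);
""")
# The string "painting" is stored exactly once in wbt_text and then shared
# by every item that uses it as an English label; this sharing is where the
# deduplication savings over the old wb_terms table come from.
db.execute("INSERT INTO wbt_text VALUES (1, 'painting')")
db.execute("INSERT INTO wbt_text_in_lang VALUES (1, 'en', 1)")
db.execute("INSERT INTO wbt_type VALUES (1, 'label')")
db.execute("INSERT INTO wbt_term_in_lang VALUES (1, 1, 1)")
# Two different items (a real one and a hypothetical Q999) point at the
# same term_in_lang row instead of duplicating the text.
db.executemany("INSERT INTO wbt_item_terms VALUES (?, ?, 1)",
               [(1, 3305213), (2, 999)])

rows = db.execute("""
    SELECT wbit_item_id, wby_name, wbxl_language, wbx_text
    FROM wbt_item_terms
    JOIN wbt_term_in_lang ON wbtl_id = wbit_term_in_lang_id
    JOIN wbt_type         ON wby_id  = wbtl_type_id
    JOIN wbt_text_in_lang ON wbxl_id = wbtl_text_in_lang_id
    JOIN wbt_text         ON wbx_id  = wbxl_text_id
    ORDER BY wbit_item_id
""").fetchall()
print(rows)  # both items resolve to the shared ('label', 'en', 'painting') term
```

The point of the chain wbt_item_terms → wbt_term_in_lang → wbt_text_in_lang → wbt_text is that each distinct string, language pairing, and term type is stored once and shared between entities.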

Overview of the new schema:

How are existing tools building on top of Wikidata affected?

Everyone querying wb_terms directly (on Cloud Services Wiki Replicas, through a database connection) will be affected. Those tools will need an update to their code to continue working. Tool builders and maintainers will need to adapt their code to the new schema before the migration starts, and switch to the new code when the migration starts.
➡️ You can find more details about how to prepare your code in this task.

Other tools, for example those getting data through the API, pywikibot, gadgets, etc. will not be affected and will work as before with no action required from the maintainers.

Tool builders are encouraged to navigate to the dedicated workboard created for their tools' migration, where we will organize detailed information and help out answering questions and fixing issues to the best of our capacity. Feel free to add your own tasks to this board and to collaborate with each other.

Timeline

  • April 24th: Breaking change announcement
  • 29th of May: Test environment for tool builders will be ready
  • 24th of June: Property Terms migration starts
  • 26th of June: Read property terms from new schema on Wikidata
  • 3rd of July: Item terms migration begins
  • 10th of July: Read item terms from one of the two schemas

➡️ You can find more details about the migration plan in this task.

Should unforeseen circumstances lead to a change in this timeline, this task will be updated accordingly and an announcement will be made.

I’m running one or several tools using wb_terms, what should I do?

You may first want to check whether this alternative, possibly easier solution suits your needs.

  • You need to adapt the code of your tool(s) to the new schema. Here's the documentation that will help you:
  • The migration will take place on May 29th, but don't wait until the last minute to start working on your code :) A test system will be ready starting on May 15th so you can try your code against the new database structure.
  • If you need help or advice, we offer you several options:
    • You can ask for details or help in a comment under this task or create a task with the tag wb_terms - Tool Builders Migration. Wikidata developers will answer you as soon as possible.
    • Two sessions of the Technical Advice IRC Meeting (TAIM) will be dedicated to Wikidata's wb_terms redesign: you can join and discuss your issues with Wikidata developers. These meetings will take place on May 15th and May 22nd, at 15:00 UTC, on the #wikimedia-tech channel.
    • A dedicated session will be organized at the Wikimedia hackathon in Prague. It will take place on Friday, May 17th at 15:00 at the Wikidata table (more details)

Is this going to affect other Wikibase instances outside of Wikidata/Cloud Services?

Not yet. The new schema and migration will be deployed on Wikidata first. There will be a separate announcement and release for that to follow with more details.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Apr 24 2019, 12:16 PM
Lea_Lacroix_WMDE updated the task description.

Pinging a few people who may need to be aware of this: @Magnus @Pintoch @Multichill @matej_suchanek @Pasleim @Envlh @Ricordisamoa @MusikAnimal
Feel free to ping more people if you think they should be aware of this change!

bd808 updated the task description. · Apr 24 2019, 4:32 PM
Pintoch removed a subscriber: Pintoch. · Apr 24 2019, 4:35 PM

Thanks for the ping! I am not using wb_terms in any of my projects. Good luck with the migration, it looks epic.

Posted notice to cloud-announce and cloud mailing lists.

alaa_wmde updated the task description. · Apr 24 2019, 6:13 PM

Posted notice to cloud-announce and cloud mailing lists.

thanks @bd808

Instead of migrating to a different schema for the database terms storage, some tools could also use the API instead, e. g. action=wbgetentities&props=labels|descriptions|aliases (get all terms, or limit to some languages) or action=wbformatentities (format HTML for the user – includes support for lexemes, senses, and forms, which aren’t even available in wb_terms, and also handles language fallbacks out of the box). This should be more robust, possibly easier to work with, and hopefully not significantly less performant for most tools (though some might still have to use the database).
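As a runnable illustration of the API alternative described above, the request for a wbgetentities call can be built and inspected without touching the database. The parameter names are the real Action API ones; the entity ids and the helper function name are just examples:

```python
from urllib.parse import urlencode

# Sketch: build the wbgetentities request URL for inspection, without
# making a network call. Parameter names are the real MediaWiki Action
# API ones; wbgetentities_url and the ids are illustrative.
def wbgetentities_url(ids, languages=None):
    params = {
        "action": "wbgetentities",
        "ids": "|".join(ids),                    # e.g. Q42, P31
        "props": "labels|descriptions|aliases",  # the three term types
        "format": "json",
    }
    if languages:                                # omit to get all languages
        params["languages"] = "|".join(languages)
    return "https://www.wikidata.org/w/api.php?" + urlencode(params)

url = wbgetentities_url(["Q42", "P31"], languages=["en", "de"])
print(url)
```

Unlike a direct join on wb_terms, this keeps working regardless of how the underlying storage schema changes.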

alaa_wmde updated the task description. · Apr 25 2019, 5:38 PM
Envlh removed a subscriber: Envlh. · Apr 28 2019, 7:44 PM

Thanks for the ping! I don't use this table either, but it seems a huge and well prepared work. Good luck :-)

This breaks a couple of queries for me and at least one tool. Looking at the new schema: what happened to term_full_entity_id? Without the full entity id, joining against the page-related tables becomes impossibly slow. This query is already quite slow, but without the full entity id it will just break:

SELECT term_text AS painter, COUNT(term_text) AS paintercount FROM page
JOIN pagelinks AS colllink ON page_id=colllink.pl_from AND
colllink.pl_from_namespace=0 AND
colllink.pl_namespace=0 AND colllink.pl_title='Q18600731'
LEFT JOIN pagelinks AS creatorlink ON page_id=creatorlink.pl_from AND
creatorlink.pl_from_namespace=0 AND
creatorlink.pl_namespace=120 AND
creatorlink.pl_title='P170'
JOIN wb_terms ON page_title=term_full_entity_id AND
term_entity_type='item' AND
term_language='en' AND
term_type='description' AND term_text LIKE 'painting by %'
WHERE
page_namespace=0 AND
page_is_redirect=0 AND
creatorlink.pl_from IS NULL
GROUP BY term_text
ORDER BY COUNT(term_text) DESC, painter
LIMIT 300;

@Multichill I think this SPARQL query is more or less equivalent? (Except for the filtering out of “onbekend”, which I added as a bonus.)

SELECT ?itemDescription (COUNT(?item) AS ?count) WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "Search";
                    wikibase:endpoint "www.wikidata.org";
                    mwapi:srsearch "haswbstatement:P195=Q18600731 \"painting by\"";
                    mwapi:srlimit "max".
    ?title wikibase:apiOutput mwapi:title.
    # ?item wikibase:apiOutputItem mwapi:item. # causes an exception for some reason?
  }
  BIND(IRI(CONCAT(STR(wd:), ?title)) AS ?item) # workaround
  MINUS { ?item wdt:P170 ?creator. }
  MINUS { ?item schema:description "painting by onbekend"@en. } # “unknown”
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
GROUP BY ?itemDescription
ORDER BY DESC(?count)
LIMIT 300

I’ve dumped the current results at P8467 for your convenience; the query currently takes around 45 seconds. (It would probably be even faster if WikibaseCirrusSearch had a keyword for searching in descriptions, but currently it only indexes labels specially.)

Still, the fact that it’s no longer possible to join the page table on the terms might be important for other things. I’m not sure how much that’s currently accounted for.

That's assuming that this query is stand-alone. It's combining the results with another SQL database (the mix'n'match one).
I'm quite unhappy about you guys changing tables again and dropping the full entity id. Last time you did that, several things broke and the team completely ignored T114904. This is very demotivating.

alaa_wmde added a comment. · Edited · May 3 2019, 2:59 PM

@Multichill

We understand this requires some work from tool builders. We unfortunately were forced into this situation as the current table design has reached its performance limits.

In the new schema design, we tried to follow, as much as possible, best practices of database schema design for maximum performance and minimal space usage. This led to different decisions, including two that are relevant to your concerns about the full entity id:

  1. After normalizing the schema to avoid duplication as much as possible, we decided to link normalized terms (the wbt_term_in_lang table) to entities on a table-per-entity basis (hence wbt_item_terms and wbt_property_terms; later, other entity types will have their own tables).

    The reason behind this decision is that different entity types can differ by orders of magnitude in their count (and their terms count). For example, properties currently constitute about 0.001% of total entities compared to items in Wikidata. If we put all of them in only one table, this might again lead to duplication (due to the column that would then be needed to identify the entity type) as well as limit possible optimizations we might want to apply to certain entity types should they again reach some limits.
  2. It followed that keeping full entity ids in each of these tables would be redundant, and would only cost us more space, which can build up to a large amount and can easily be saved.

If full entity ids are desired as an outcome of a query (in case they need to be joined with other tables/sub-queries), you can still build up the full id like:

SELECT CONCAT( 'Q', wbt_item_terms.wbit_item_id ), ...
FROM wbt_item_terms
...
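As a runnable sketch of that advice (in SQLite, which uses the || operator where MySQL has CONCAT; the page table here is a tiny mock standing in for the real one on the replicas):

```python
import sqlite3

# Sketch of rebuilding the full entity id from the numeric id and using it
# as a join key. SQLite stand-in: || replaces MySQL's CONCAT, and both
# tables are minimal mocks of the real ones.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE wbt_item_terms (wbit_id INTEGER PRIMARY KEY, wbit_item_id INTEGER);
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_namespace INTEGER, page_title TEXT);
""")
db.execute("INSERT INTO wbt_item_terms VALUES (1, 18600731)")
db.execute("INSERT INTO page VALUES (1, 0, 'Q18600731')")

# Mirrors: JOIN page ON page_title = CONCAT('Q', wbit_item_id)
rows = db.execute("""
    SELECT page_title
    FROM wbt_item_terms
    JOIN page ON page_title = 'Q' || wbit_item_id
    WHERE page_namespace = 0
""").fetchall()
print(rows)  # [('Q18600731',)]
```

Note that computing the join key on the fly like this defeats any index on the term side, which is exactly the performance concern raised in the follow-up below.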

We strongly advise against binding directly to the new schema, or to any other database table, whenever there are alternatives through the Wikidata APIs. This will increase the stability of your tools, as the Wikidata APIs are designed for public use while the db schema is not.

SELECT CONCAT( 'Q', wbt_item_terms.wbit_item_id ), ...
FROM wbt_item_terms
...

That completely kills any performance that's left. If you look at my example query, you'll notice you're not even addressing my concern, because I'm using it in a join.

@Multichill

I'm sorry I didn't address your exact query directly. To address this issue concretely, I created another task in the backlog, T222602 (and subscribed you to it).
We will check that solution on the test node and get back with some results.

eranroz added a subscriber: eranroz. · May 9 2019, 3:35 PM

@Multichill I think a solution to your problems can be provided on the Wiki Replicas (or at a similar level): the replicas don't need to have the same structure as production, and additional tables or indexes can be added there. I don't think internal production needs should cater to tool needs. That doesn't mean tool needs should not be served; on the contrary, better query methods should be provided, but these are different problems that should not be confused with one another. Better APIs with a stable interface should be made available, I 100% agree with that, but the database cannot be an interface that is guaranteed to be stable.

Note that this lack of stability is documented at https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Stability_of_the_mediawiki_database_schema. While the database schema at HEAD is documented, there is close to no documentation of the database schema as actually deployed.

I would prefer to work together to define and maintain a stable and flexible API for power users (including tools) rather than trying to keep up with all the changes every time there is an internal-only optimization. wb_terms is close to 1 TB of data at the moment; no matter how it is done, querying that is not going to be fast.

Here's an update regarding the dates for the test environment and for the migration to the wb_terms replacement.

Due to various complications that the developers in the Wikidata team have been working to solve over the last few weeks, we unfortunately have to push back the date when a test environment for tool builders will be ready (it was supposed to be ready today), as well as the subsequent dates for starting the migration of wb_terms data into the new schema in production.

The new dates are:

  • 29th of May: Test environment for tool builders will be ready
  • 12th of June: Property Terms migration starts
  • 19th of June: Read property terms from new schema on Wikidata
  • 26th of June: Item terms migration begins
  • 3rd of July: Read item terms from one of the two schemas (as explained in this task)

As a reminder, if you want to talk with the developers, ask questions, and get help updating your tools, you can join the MediaWiki meeting on IRC, today at 15:00 UTC on #wikimedia-tech.

Update: the test system is now in place: https://wikidata-termstore.wmflabs.org/index.php/Main_Page
To access its database, you need to be granted access. Feel free to add a comment to this ticket if you need it.

Hello all,

This is an update regarding the dates of migration.

When we originally planned the last migration schedule for production, we must have overlooked the planned offsite of the SRE (Site Reliability Engineering) team, which overlaps with the first week of it (specifically, the start of property terms migration on the 12th of June). We unfortunately have to push the dates for starting the migration of wb_terms data into the new schema in production another week into the future from the current dates.

The new dates are:

  • 19th of June: Property Terms migration starts
  • 26th of June: Read property terms from new schema on Wikidata
  • 3rd of July: Item terms migration begins
  • 10th of July: Read item terms from one of the two schemas

We apologize and thank you for your understanding!

Update: we are about to start migrating property terms this week. The original planned date was the 19th of June. Due to a WMF US staff holiday, there will be no deployments today. Therefore, we are going to switch the migration on (the write-both stage) for property terms one day later than planned, on Thursday the 20th of June. All other migration dates remain the same for the moment.

Thanks for the update @Lea_Lacroix_WMDE - I remember I talked to @alaa_wmde about waiting until we had the new db primary master in place for wikidata on some tasks. I have only been able to find this T219145#5088395 but I reckon we spoke about it somewhere else too.
By reading T221765 I guess this first migration will be only the 1% of all the items, and the rest are still marked as TBD, right?
Once we have swapped the current db master, we can proceed with the rest of the migration, is that still the plan?

Thanks!

Hey @Marostegui

By reading T221765 I guess this first migration will be only the 1% of all the items, and the rest are still marked as TBD, right?
Once we have swapped the current db master, we can proceed with the rest of the migration, is that still the plan?

Yes and yes :)

We will migrate only the first 1% of items. That will require ~17 GB of extra disk space, which we agreed was acceptable before having the new primary db master, as we discussed on T220480.
For the remaining 99%, we will wait until we have the new master, as that will require an extra ~175 GB, which was not acceptable in the current situation (as discussed in T219145).

Is that still good to go from your side?

Hey @alaa_wmde!

Thanks for confirming, I just wanted to make sure the planning didn't change, as there have been many migration subtasks and it was hard to keep up with all the plans :-)
I will try to get the db master change scheduled for July.

Hey @alaa_wmde!
Thanks for confirming, I just wanted to make sure the planning didn't change, as there have been many migration subtasks and it was hard to keep up with all the plans :-)
I will try to get the db master change scheduled for July.

Yes, it was quite an effort to keep up with things all over the place :D We've got to improve on that, probably by breaking changes down into much smaller iterations.

It would be really cool if the db master were changed that soon; even August would be great. The sooner that happens, the sooner we finally drop wb_terms :)

WMDE-leszek updated the task description. · Jun 20 2019, 3:56 PM


Hi all! Apologies for another change of plans. So far we have started migrating property terms only on test.wikidata.org today, as planned. Due to issues with the current deployment train, the code created in the last two weeks has not yet been deployed on Wikidata. As the current train includes parts critical for the migration, we cannot start the migration on production Wikidata servers yet.
We will start the migration of property terms on Wikidata on Monday, June 24th. The migration on test.wikidata.org has been running without issues so far, so we are not expecting inconveniences.

Other migration dates are not expected to be affected by this delay and remain as recently announced by @Lea_Lacroix_WMDE

Update regarding migration plans of terms on wikidata production

Hello all,

We have successfully put Wikidata production in the stage where property terms are stored in and written to both the old store (wb_terms table) and the new replacement store. Retrieving property terms is still being done using the old store.

The previously announced dates are no longer effective. No changes to tools are needed yet.

We will run the migration in several stages, and there will be separate announcements about those stages, informing you how to deal with each one in case it affects your work.

We will announce the dates for when the tools have to switch to reading from the new store, for property terms and for item terms separately, at least 6 weeks before we do the switch in production to give tool builders enough time to update and test their tools.