We are dropping the wb_terms table, you might ask why we are doing it and why it’s such an important endeavour.
What is the wb_terms table?
wb_terms is the biggest table of Wikidata. It’s bigger than all other tables of Wikidata, combined, and multiplied by two. It’s larger than 1TB and more than 2.5B rows and it was created more than seven years ago to act as a caching layer for terms of Items and Properties and also as the search backend.
Terms (labels, descriptions, and aliases) are the bridge between the machine-readable parts of an entity (Items and Properties in this case) and humans. For example when you render an Item in English, the original content might have been something like “P31:Q5” which is not understandable by humans, so we need to render it to “Instance of: human” that are labels of P31 and Q5 in English respectively. We don’t load all of the Items and Properties used in an Item every time the Item page is rendered because you only need the labels in English. It wouldn’t scale (especially with the current size of Wikidata) so we store all terms in a table called wb_terms and query that table instead. Originally, wb_terms table was also used as the search backend until ElasticSearch has been introduced instead.
As of January 2020, the wb_terms table was getting around 300,000 read queries per minute and up to one million reads per minute in spikes. Keep in mind client wikis (like German Wikipedia) use this table quite a lot as well to show the information in their infoboxes for example, which adds to the load on the table, on top of Wikidata’s “own” usage.
Tables this large cause numerous problems. First of all, the wb_terms table has become unpredictable and it’s extremely risky given the amount of reads on this table, the query plans might change suddenly causing large scale issues. A major outage on Wikipedia a couple of years ago happened when some changes on the tables changed cardinality of a field passing a tipping point for MySQL’s query execution planner and then MySQL started to use a very different and very slow way of running queries.
Secondly, the table has been unmaintainable for years now. Making changes to the table structure (e.g. running ALTER TABLE commands) has been impossible, as the database machines would run out of disk space during such maintenance changes. Improving the table structure, by for example dropping a column, is not just risky, but it’s outright impossible.
That brings us to the third issue. This table is occupying so much space in our database infrastructure that we are running out of space altogether. That means Wikidata would not be able to store any new data anymore. Buying new hardware is not a solution for several reasons. The first is that reading from disk is slow, so for databases to work at a good speed, we need to load its tables in memory. There’s a limit in size of memory in the database hosts that we can’t surpass. The hardware upgrade would also need to happen on not just one database server but it would need to be done for all of around twenty, as we have one master node per primary datacenter (currently two) and each replicates to around nine hosts. You also need to count backups that we need to keep historically for a very long time. Another problem with buying new hardware is that the network and I/O impose their own limits on our databases and we can’t run away from those limits. For example we can’t have fast replication in Wikidata or replication to lots of nodes if we continue to use wb_terms because of the limitations in network I/O.
What are we doing about it?
We are replacing this table with a set of six tables (names of all of them starting with the wbt_ prefix) that are normalized, and in total are ten times smaller than wb_terms. You can find more information on the queries, the design and other notes on its documentation page.
One reason for going with the normalization approach is the sheer quantity of duplicated text. For example, millions of Items have “Wikimedia template” as their English description. A similar text has been repeated millions of times for Items of disambiguation pages, categories, and so on. The analysis we did on a test database showed us each text on average has been repeated 17 times in the wb_terms table.
How are we doing it?
The migration happens on two channels: Migration of writes and migration of reads.
Migration of writes is being done by a maintenance script that gradually rebuilds the term store for the given set of Items, i.e. Item’s terms stored in wb_terms are “converted” to the new structure in wbt_ prefixed tables. This migration has started from Q1 and it’s going up. Given the sheer size of the data stored in Wikidata, it took us several months and finally finished two weeks ago.
Writing for the new term store is not simple given that the relations between tables are not simple anymore. For example, we need to make sure to delete orphan rows in the wbt_text (the table that stores mapping of raw strings to IDs) not just because of storage reasons but for privacy reasons as well. We don’t want texts disclosing private information to linger in our database forever and get replicated to the cloud. The job to fix that problem caused deadlocks on a large scale and after fixing it, another bug showed up and by deleting too much (looser conditions than it should have) caused holes in our data and deleted data that was actually needed.
The migration of reads is more complicated than it looks. Given that wb_terms table has been around for eight years now, we were able to find more than sixteen use cases that read from the wb_terms table. We have started to migrate the code responsible for those use cases, and are making sure they would work fine. The system also needs to work in a hybrid mode, so that it reads some data from the old table and some from the new ones, as we do not want to switch to the new term storage, but rather do it gradually
Since reading from a set of tables is slower than a large table on average, we need to make sure we don’t cause a performance regression. Our solution has been to add layers of caching on the “hot” data. For example, the English label of “Q5” (“human”) is being used when rendering every Item of a human. Having this in memory cache (APCu) or central cache (memcached) would reduce the need to load this value for most cases. Adding an APCu cache for one of the read use cases with a TTL of 10 seconds managed to cut down our reads for that use case from around 10,000/minute to 5,000/minute. Since March 18th 2020, we are not reading the wb_terms table anymore.
What will happen?
Soon we will drop the wb_terms table. This will likely happen in a week or two and we already stopped writing to it so it’s growing outdated. If you want to follow our progress, check out our epic task: T208425: [EPIC] Kill the wb_terms table
Thank you to all the tool builders who are adjusting their tools to accommodate this change. You help us keep up with the growth of Wikidata.
If you see any issues, or bugs or you have any questions. Feel free to reach out to us.