Page MenuHomePhabricator

Investigate optimzing wb_terms
Closed, ResolvedPublic

Description

wb_terms with 1.4B rows (with average of 140 bytes per row) is one of the biggest tables on wikidata.org that needs clean up in several areas:

  • Drop term_entity_id (and maybe term_entity_type) in favor of term_full_entity_id. We just switched to reading from term_full_entity_id everywhere.
  • Normalize term_entity_type (item=0, property=1, etc.) if we are not dropping it. It's better to be a settings file
  • Normalize term_type (label=0, etc.). It's better to be a settings file
  • Normalize term_lang (en=0, en-gb =1, etc.) in either another table or a settings file

How we are going to normalize (where to store the mapping) and migration plan needs to be determined.

Event Timeline

thiemowmde triaged this task as Medium priority.Mar 2 2018, 2:14 PM
thiemowmde subscribed.
  • The term_entity_type is not needed any more in Wikibase code, as far as I'm aware of. Third parties can extract the entity type from the first letter of term_full_entity_id, if needed.
  • The mapping for term_type should live in code, in my opinion. E.g. as constants in the TermIndex interface.
  • I would not "normalize" term_lang further. It's a language code, it's short, and the allowed values potentially change constantly. If anything, this should be a primary key to the sites table. But the primary key there is also a string, and it's even longer, so there would be no benefit with regards of size.

@Ladsgroup @Lucas_Werkmeister_WMDE @Marostegui Let's talk here about provinding a temporary wikidata replica so wikidata team can play with changes to it.

I suggest db2083 - 512GB + SSDs which is pretty much the same we have in eqiad (all the hosts but the master in eqiad have those HW specs)

Adding DBA to support the setup of this test db.

Change 423835 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-codfw.php: Depool db2083

https://gerrit.wikimedia.org/r/423835

Change 423836 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] db2083.yaml: Disable notifications

https://gerrit.wikimedia.org/r/423836

Change 423836 merged by Marostegui:
[operations/puppet@production] db2083.yaml: Disable notifications

https://gerrit.wikimedia.org/r/423836

Change 423835 merged by jenkins-bot:
[operations/mediawiki-config@master] db-codfw.php: Depool db2083

https://gerrit.wikimedia.org/r/423835

Mentioned in SAL (#wikimedia-operations) [2018-04-04T05:59:54Z] <marostegui@tin> Synchronized wmf-config/db-eqiad.php: Depool db2083 - T188279 (duration: 01m 17s)

db2083 is now depooled and with alert notifications disabled.
Assuming @Ladsgroup will use the wikiadmin user to perform all the needed ALTERs, I would like to change its password to use a temporary and a different one from production, so he needs to specifically use a different password, that will fail if it is connecting to a different non-test host.
We can either change the password, or the user, or both.

Does that sound reasonable to everyone?

We can either change the password, or the user, or both.

+1

So I have created the following user: test_user that has all privileges from terbium.
I have already left the password on @Ladsgroup's home in terbium and he's tested that he can connect finely.
Lastly, this host is downtimed, so if replication breaks due to the tests, we will not get disturbed.
Once we are done with the tests, we will need to rebuild the host.

Mentioned in SAL (#wikimedia-operations) [2018-04-04T07:36:03Z] <marostegui@tin> Synchronized wmf-config/db-codfw.php: Depool db2083 - T188279 (duration: 01m 17s)

I'd say let's call this done.