
Choose DB/Cluster for WikimediaEditorTasks tables
Closed, Resolved · Public

Description

We're requesting DBA review of the DB/cluster to use for the tables created by the new WikimediaEditorTasks extension.

The tables are described in the schema here: https://www.mediawiki.org/wiki/Extension:WikimediaEditorTasks/Schema
The .sql files creating them can be viewed here: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/WikimediaEditorTasks/+/master/sql/

On the product side, the estimate is that Suggested Edits, the feature making use of this extension, will have at most 10,000 users (and most likely substantially fewer) during the initial evaluation period.

The only large table will be wikimedia_editor_tasks_entity_description_exists, which will in theory hold as many rows as there are sitelinks from Wikidata entities (i.e., as many rows as the wb_items_per_site table, or ~67.75 million).

For the initial rollout, edits made through the app feature this supports will be exclusively on Wikidata, so I'm proposing initially enabling the extension only on wikidatawiki. However, as with ReadingLists, the functionality the extension enables applies globally across the Wikipedias rather than on a per-site basis.

Proposal: Since we're dealing with users on a global (cross-project) basis, like with ReadingLists, mimic the DB configuration of ReadingLists:

'wgWikimediaEditorTasksCluster' => [
	'default' => 'extension1',
],
'wgWikimediaEditorTasksDatabase' => [
	'default' => 'wikishared',
],

Event Timeline

Change 496211 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/mediawiki-config@master] WikimediaEditorTasks: Add config to InitializeSettings.php

https://gerrit.wikimedia.org/r/496211

extension1 (x1) looks good to me as a location for those.
Can you also post some queries you'll be running against those tables?

@Marostegui Sure:

wikimedia_editor_tasks_entity_description_exists

This table will have two queries run against it. Unfortunately, both use a filesort to return random results (via ORDER BY RAND()).

  1. SELECT wetede_entity_id FROM wikimedia_editor_tasks_entity_description_exists WHERE wetede_language = $lang AND wetede_description_exists = 0 AND wetede_rand > $rand ORDER BY RAND() LIMIT $limit;
  • EXPLAIN:
+------+-------------+--------------------------------------------------+-------+---------------+-------------+---------+------+------+---------------------------------------------------------------------+
| id   | select_type | table                                            | type  | possible_keys | key         | key_len | ref  | rows | Extra                                                               |
+------+-------------+--------------------------------------------------+-------+---------------+-------------+---------+------+------+---------------------------------------------------------------------+
|    1 | SIMPLE      | wikimedia_editor_tasks_entity_description_exists | range | wetede_rand   | wetede_rand | 4       | NULL |    1 | Using index condition; Using where; Using temporary; Using filesort |
+------+-------------+--------------------------------------------------+-------+---------------+-------------+---------+------+------+---------------------------------------------------------------------+
  2. SELECT source.wetede_entity_id FROM wikimedia_editor_tasks_entity_description_exists source INNER JOIN wikimedia_editor_tasks_entity_description_exists target ON (source.wetede_entity_id = target.wetede_entity_id AND source.wetede_language = $sourceLang AND target.wetede_language = $targetLang) WHERE (source.wetede_description_exists = 1 AND target.wetede_description_exists = 0 AND source.wetede_rand > $rand) ORDER BY RAND() LIMIT $limit;
  • EXPLAIN:
+------+-------------+--------+--------+---------------------+---------+---------+--------------------------------------------+------+----------------------------------------------+
| id   | select_type | table  | type   | possible_keys       | key     | key_len | ref                                        | rows | Extra                                        |
+------+-------------+--------+--------+---------------------+---------+---------+--------------------------------------------+------+----------------------------------------------+
|    1 | SIMPLE      | target | ALL    | PRIMARY             | NULL    | NULL    | NULL                                       |    1 | Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | source | eq_ref | PRIMARY,wetede_rand | PRIMARY | 261     | wikidatawiki.target.wetede_entity_id,const |    1 | Using where                                  |
+------+-------------+--------+--------+---------------------+---------+---------+--------------------------------------------+------+----------------------------------------------+
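The random-result behavior above relies on the precomputed wetede_rand column: rather than sorting the whole table by RAND(), the query first narrows the candidate set with wetede_rand > $rand, so only the (much smaller) remainder hits the filesort. A minimal sketch of the technique in Python with SQLite, using simplified table and column names (this is an illustration of the pattern, not the extension's actual code):

```python
import random
import sqlite3

# Each row stores a precomputed random float; queries pick rows above a
# random threshold, so results vary per call without sorting the full table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entity_description_exists (
        entity_id TEXT,
        language TEXT,
        description_exists INTEGER,
        rand REAL
    )
""")
rows = [(f"Q{i}", "en", i % 2, random.random()) for i in range(1000)]
conn.executemany(
    "INSERT INTO entity_description_exists VALUES (?, ?, ?, ?)", rows
)

threshold = random.random()
picked = conn.execute(
    """SELECT entity_id FROM entity_description_exists
       WHERE language = ? AND description_exists = 0 AND rand > ?
       ORDER BY RANDOM() LIMIT 10""",
    ("en", threshold),
).fetchall()
print(len(picked))  # at most 10 entities missing a description
```

With an index on the random column (as in the real schema), the range condition does the narrowing and the filesort only touches the surviving rows.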

wikimedia_editor_tasks_counts

For this table, the most common operation will be incrementing the count stored in a particular row:
UPDATE wikimedia_editor_tasks_counts SET wetc_count = wetc_count + 1 WHERE wetc_user = $centralId AND wetc_key_id = $keyId AND wetc_lang = $lang;

  • EXPLAIN:
+------+-------------+-------------------------------+-------+---------------+---------+---------+------+------+-------------+
| id   | select_type | table                         | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+------+-------------+-------------------------------+-------+---------------+---------+---------+------+------+-------------+
|    1 | SIMPLE      | wikimedia_editor_tasks_counts | range | PRIMARY       | PRIMARY | 265     | NULL |    1 | Using where |
+------+-------------+-------------------------------+-------+---------------+---------+---------+------+------+-------------+
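For illustration, here is a minimal sketch of that single-row counter increment in Python with SQLite standing in for MariaDB (abbreviated, hypothetical table and column names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE counts (
        user INTEGER,
        key_id INTEGER,
        lang TEXT,
        cnt INTEGER,
        PRIMARY KEY (user, key_id, lang)
    )
""")
conn.execute("INSERT INTO counts VALUES (42, 1, 'en', 0)")

def increment(conn, central_id, key_id, lang):
    # The WHERE clause matches the full primary key, so the UPDATE
    # touches (and locks) exactly one row.
    conn.execute(
        "UPDATE counts SET cnt = cnt + 1 "
        "WHERE user = ? AND key_id = ? AND lang = ?",
        (central_id, key_id, lang),
    )

for _ in range(3):
    increment(conn, 42, 1, "en")
print(conn.execute(
    "SELECT cnt FROM counts WHERE user = 42 AND key_id = 1 AND lang = 'en'"
).fetchone()[0])  # 3
```

The single-row-lock property is exactly what the EXPLAIN above confirms (range scan on the PRIMARY key, one row).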

It will also be a common query to get all of the data for a (central) user:
SELECT wet_key, wetc_lang, wetc_count FROM wikimedia_editor_tasks_counts LEFT JOIN wikimedia_editor_tasks_keys ON wet_id = wetc_key_id WHERE wetc_user = $centralId;

  • EXPLAIN:
+------+-------------+-------------------------------+--------+---------------+---------+---------+--------------------------------------------------------+------+-------+
| id   | select_type | table                         | type   | possible_keys | key     | key_len | ref                                                    | rows | Extra |
+------+-------------+-------------------------------+--------+---------------+---------+---------+--------------------------------------------------------+------+-------+
|    1 | SIMPLE      | wikimedia_editor_tasks_counts | ref    | PRIMARY       | PRIMARY | 4       | const                                                  |    1 |       |
|    1 | SIMPLE      | wikimedia_editor_tasks_keys   | eq_ref | PRIMARY       | PRIMARY | 4       | wikidatawiki.wikimedia_editor_tasks_counts.wetc_key_id |    1 |       |
+------+-------------+-------------------------------+--------+---------------+---------+---------+--------------------------------------------------------+------+-------+

wikimedia_editor_tasks_targets_passed

The most common query on this table will be to get all currently active entries for a user:
SELECT wettp_key, wettp_count FROM wikimedia_editor_tasks_targets_passed LEFT JOIN wikimedia_editor_tasks_keys ON wet_id = wettp_key_id WHERE wettp_user = $centralId AND wettp_effective_time <= $timestamp;

  • EXPLAIN:
+------+-------------+---------------------------------------+--------+---------------+---------+---------+-----------------------------------------------------------------+------+-------------+
| id   | select_type | table                                 | type   | possible_keys | key     | key_len | ref                                                             | rows | Extra       |
+------+-------------+---------------------------------------+--------+---------------+---------+---------+-----------------------------------------------------------------+------+-------------+
|    1 | SIMPLE      | wikimedia_editor_tasks_targets_passed | ref    | PRIMARY       | PRIMARY | 4       | const                                                           |    1 | Using where |
|    1 | SIMPLE      | wikimedia_editor_tasks_keys           | eq_ref | PRIMARY       | PRIMARY | 4       | wikidatawiki.wikimedia_editor_tasks_targets_passed.wettp_key_id |    1 |             |
+------+-------------+---------------------------------------+--------+---------------+---------+---------+-----------------------------------------------------------------+------+-------------+

wikimedia_editor_tasks_keys

This is just a simple ID lookup table, accessed through MediaWiki's NameTableStore abstraction or used in joins with other queries. It will contain only one row in the period following the initial launch.
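For readers unfamiliar with the NameTableStore pattern: it maps a string key to a stable integer id, inserting on first use. A minimal sketch in Python with SQLite, reusing the wet_id/wet_key column names from the schema (the key name is hypothetical, and this illustrates the pattern, not MediaWiki's implementation):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wet_keys ("
    "wet_id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "wet_key TEXT UNIQUE)"
)

def acquire_id(conn, key):
    # Return the existing id for this key, inserting a row on first use.
    row = conn.execute(
        "SELECT wet_id FROM wet_keys WHERE wet_key = ?", (key,)
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute("INSERT INTO wet_keys (wet_key) VALUES (?)", (key,))
    return cur.lastrowid

a = acquire_id(conn, "example_counter_key")  # hypothetical key name
b = acquire_id(conn, "example_counter_key")
print(a == b)  # True: repeated lookups return the same id
```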

We'll need to check the performance of SELECT wetede_entity_id FROM wikimedia_editor_tasks_entity_description_exists WHERE wetede_language = $lang AND wetede_description_exists = 0 AND wetede_rand > $rand ORDER BY RAND() LIMIT $limit; once it is in production, as sometimes, when there is data, the optimizer can become a bit unpredictable.
Which wikis will this eventually be enabled on?
Is it a good idea to start with wikidatawiki, given that it is one of the biggest? Can we start with a smaller one until we are sure this won't misbehave? I assume this can be disabled if needed, right?

UPDATE wikimedia_editor_tasks_counts SET wetc_count = wetc_count + 1 WHERE wetc_user = $centralId AND wetc_key_id = $keyId AND wetc_lang = $lang;

This is the problematic one I mentioned at T218087#5042864, and it will pile up if executed once per edit. Talk to the Wikidata team to understand why and what they are doing about it. In the best case, if this is synchronous with the edit, it will make Wikidata edits fail; in the worst case, if it is not indexed properly and blocks more than one row, it will make the whole wiki fail.

Change 496211 merged by jenkins-bot:
[operations/mediawiki-config@master] WikimediaEditorTasks: Add config to InitializeSettings.php

https://gerrit.wikimedia.org/r/496211

UPDATE wikimedia_editor_tasks_counts SET wetc_count = wetc_count + 1 WHERE wetc_user = $centralId AND wetc_key_id = $keyId AND wetc_lang = $lang;

This is the problematic one I mentioned at T218087#5042864, and it will pile up if executed once per edit.

Actually, the query will run only for edits made by a logged-in user in one of the Wikipedia apps (according to the user agent) via wbsetdescription (which edits only a single page at a time). The extension will check each edit against these criteria on page save complete, but only a tiny fraction of edits will actually meet them and cause the query above to run.

the query will run only for edits made by one of the Wikipedia apps

Then that is cool. But how is the "list of interesting pages" table updated, then? Is it per edit, in batches, in a job queue, via a cron every so often?

Thanks, @Mholloway. I will take a deeper look at the selects/indexes, query plans, and structure, as well as at that CR, but all my biggest fears seem gone.

We'll need to check the performance of SELECT wetede_entity_id FROM wikimedia_editor_tasks_entity_description_exists WHERE wetede_language = $lang AND wetede_description_exists = 0 AND wetede_rand > $rand ORDER BY RAND() LIMIT $limit; once it is in production, as sometimes, when there is data, the optimizer can become a bit unpredictable.
Which wikis will this eventually be enabled on?
Is it a good idea to start with wikidatawiki, given that it is one of the biggest? Can we start with a smaller one until we are sure this won't misbehave? I assume this can be disabled if needed, right?

Right now, the plan is for the extension only to be enabled on wikidatawiki. The feature it supports is (currently) highly specific to Wikidata, so I don't have a choice there. That said, it should be a small subset of users who cause that query to run, namely a subset of users of the Wikipedia app for Android.

Our plan in the case of trouble is to disable the extension in wmf-config.

I don't have a choice there

Allow me to suggest starting the production deployment with testwikidatawiki: it is a production wiki, but it will be much, much safer as a first deployment.

@jcrespo I can certainly enable there to do initial testing, e.g., of the populateEntityDescriptionExistsTable script.

@Dbrant is testwikidatawiki (https://test.wikidata.org) at all useful to you for client-side feature testing, or are you essentially blocked on the APIs being available on Wikidata proper?

How about this: I enable on both testwikidatawiki and wikidatawiki, do a test run of populateEntityDescriptionExistsTable and some other sanity checks on testwikidatawiki, then assuming all is well there, run populateEntityDescriptionExistsTable on wikidatawiki and open it to the app team for testing?

To be clear, we don't expect any traffic to any of this (aside from testing by the Android team) until the Android app release with the Suggested Edits feature rolls out, after they have had the chance to test it to their satisfaction on Wikidata.

@Tgr mentioned that, for production, the DB table creation SQL needs to be run manually by either a DBA or the deployer. How should we go about it in this case?

DBAs do not create the tables; that must typically be done by a deployer: https://wikitech.wikimedia.org/wiki/Schema_changes#What_is_not_a_schema_change
As these tables will be on wikishared (x1), which is filtered, please put up a puppet patch to add them to the private tables list: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/manifests/realm.pp#166
We can merge the puppet patch for you once it is up.

How about this: I enable on both testwikidatawiki and wikidatawiki, do a test run of populateEntityDescriptionExistsTable and some other sanity checks on testwikidatawiki, then assuming all is well there, run populateEntityDescriptionExistsTable on wikidatawiki and open it to the app team for testing?

If possible, I would narrow everything down first to just testwikidatawiki.

DBAs do not create the tables; that must typically be done by a deployer: https://wikitech.wikimedia.org/wiki/Schema_changes#What_is_not_a_schema_change
As these tables will be on wikishared (x1), which is filtered, please put up a puppet patch to add them to the private tables list: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/manifests/realm.pp#166
We can merge the puppet patch for you once it is up.

Thank you for the pointers. I'll have a look at this first thing tomorrow.

How about this: I enable on both testwikidatawiki and wikidatawiki, do a test run of populateEntityDescriptionExistsTable and some other sanity checks on testwikidatawiki, then assuming all is well there, run populateEntityDescriptionExistsTable on wikidatawiki and open it to the app team for testing?

If possible, I would narrow everything down first to just testwikidatawiki.

I agree in principle with moving cautiously, but there's not much I can do on testwikidatawiki beyond performing a test run of populateEntityDescriptionExistsTable (which won't be representative of wikidatawiki, due to the much smaller amount of data on testwikidatawiki) and running through a basic workflow or two. What else needs to happen on testwikidatawiki before we can move on to wikidatawiki and allow the app team to test?

If possible, I would narrow everything down first to just testwikidatawiki.

I agree in principle with moving cautiously, but there's not much I can do on testwikidatawiki beyond performing a test run of populateEntityDescriptionExistsTable (which won't be representative of wikidatawiki, due to the much smaller amount of data on testwikidatawiki) and running through a basic workflow or two. What else needs to happen on testwikidatawiki before we can move on to wikidatawiki and allow the app team to test?

Basically, the idea of having testwikidatawiki is to test there as much as possible and catch any unexpected or non-obvious errors there, to avoid testing in production.
I get that some issues might only arise once we have lots of data, but, as you said, if you can run basic workflows on testwikidatawiki, that is also good. The idea is to use it as much as possible before moving on to the big one; at the very least, move on only once everything that is testable has been tested on the test wiki.

Thanks, @Marostegui, that makes sense. In that case it would be best to get the extension enabled in testwikidatawiki ASAP. I've pinged Greg to see if we can do it after the train rolls to group0 today.

Change 499191 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/puppet@production] Add WikimediaEditorTasks tables to the private tables list

https://gerrit.wikimedia.org/r/499191

Change 499227 had a related patch set uploaded (by Mholloway; owner: Michael Holloway):
[operations/mediawiki-config@master] Enable WikimediaEditorTasks on testwikidatawiki

https://gerrit.wikimedia.org/r/499227

Change 499191 merged by Jcrespo:
[operations/puppet@production] Add WikimediaEditorTasks tables to the private tables list

https://gerrit.wikimedia.org/r/499191

Change 499227 merged by jenkins-bot:
[operations/mediawiki-config@master] Enable WikimediaEditorTasks on testwikidatawiki

https://gerrit.wikimedia.org/r/499227

The tables are created and the extension is live on testwikidatawiki. However, @Tgr and I have been discussing how wikimedia_editor_tasks_entity_description_exists, unlike the others, really only makes sense as a wiki-local table. (Indeed, there is currently a bug in maintenance/populateEntityDescriptionExistsTable.php, which assumes that it's located in the same DB/cluster as wb_items_per_site.)

@jcrespo @Marostegui Would you be OK with us creating and using a new wikimedia_editor_tasks_entity_description_exists table local to testwikidatawiki (and, later, wikidatawiki), and dropping the recently created one in x1? (Of course, that would also resolve the issue of needing to clear data from it before enabling the extension on wikidatawiki.)

The tables are created and the extension is live on testwikidatawiki. However, @Tgr and I have been discussing how wikimedia_editor_tasks_entity_description_exists, unlike the others, really only makes sense as a wiki-local table. (Indeed, there is currently a bug in maintenance/populateEntityDescriptionExistsTable.php, which assumes that it's located in the same DB/cluster as wb_items_per_site.)

Indeed, wb_items_per_site is a table which lives locally, not on x1.

@jcrespo @Marostegui Would you be OK with us creating and using a new wikimedia_editor_tasks_entity_description_exists table local to testwikidatawiki (and, later, wikidatawiki), and dropping the recently created one in x1? (Of course, that would also resolve the issue of needing to clear data from it before enabling the extension on wikidatawiki.)

You mean that table needs to live locally rather than on a shared db? (x1)?

@jcrespo @Marostegui Would you be OK with us creating and using a new wikimedia_editor_tasks_entity_description_exists table local to testwikidatawiki (and, later, wikidatawiki), and dropping the recently created one in x1? (Of course, that would also resolve the issue of needing to clear data from it before enabling the extension on wikidatawiki.)

You mean that table needs to live locally rather than on a shared db? (x1)?

It would make more sense conceptually for that one to be wiki-local. I don't think it's strictly necessary to move, though; I just wanted to ask your opinion of the possible change.

@jcrespo @Marostegui Would you be OK with us creating and using a new wikimedia_editor_tasks_entity_description_exists table local to testwikidatawiki (and, later, wikidatawiki), and dropping the recently created one in x1? (Of course, that would also resolve the issue of needing to clear data from it before enabling the extension on wikidatawiki.)

You mean that table needs to live locally rather than on a shared db? (x1)?

It would make more sense conceptually for that one to be wiki-local. I don't think it's strictly necessary to move, though; I just wanted to ask your opinion of the possible change.

I am fine with that.
Please place a ticket (different one, so it doesn't get mixed up) once that other table can be dropped from x1 so we can drop it for you.

Great, thanks @Marostegui! I've created the table on testwikidatawiki and will file a new ticket for removal of that table from x1 soon.

I would like to review the table structure.

Mholloway claimed this task.

I think it's safe to call it resolved. Thanks again, @Marostegui and @jcrespo!

I would like to review the table structure.

Reopening.

I would like to review the table structure.

Thanks!

@Mholloway was this deployed yesterday?
There were errors in Logstash: https://logstash.wikimedia.org/goto/802d9f0328ff43d4bc0795553e216b50

message	       	MediaWiki\Storage\NameTableStore::loadTable	10.64.32.11	1146	Table 'testwikidatawiki.wikimedia_editor_tasks_keys' doesn't exist (10.64.32.11)	SELECT  wet_id AS `id`,wet_key AS `name`  FROM `wikimedia_editor_tasks_keys`     ORDER BY id 

db_name	       	testwikidatawiki

host	       	mw1342

Does wikimedia_editor_tasks_entity_description_exists being moved to the local DB mean that it should also be removed from the $private_tables list in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/manifests/realm.pp?

Does wikimedia_editor_tasks_entity_description_exists being moved to the local DB mean that it should also be removed from the $private_tables list in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/manifests/realm.pp?

Does it contain public data that can be exposed on Labs hosts without leaking anything? Probably worth asking Security about it cc @Reedy @Bawolff

There's nothing private in that table. It only contains data on whether a description exists in the languages of the Wikipedia pages linked to Wikidata items.

Then yeah, I guess we have to remove it from the filters, copy it to the sanitarium hosts, restart them without the filters, and let them replicate to labs.
@Reedy @Bawolff can you confirm this is good to go from your side?

That table looks fine to be made public

@Mholloway the one on x1 can be deleted then?

root@db1069:/srv/sqldata# mysql wikishared -e "select count(*) from wikimedia_editor_tasks_entity_description_exists"
+----------+
| count(*) |
+----------+
|        0 |
+----------+

There's nothing private in the table, but probably nothing worth taking up resources for replication, either (functionally it's more or less a materialized view of some of the Wikidata item tables). And maybe at some point there will be some kind of locking mechanism (task X is being held by user Y)? I'd just make it private for now, seems like less hassle. (For which AIUI it would still have to be deleted from $private_tables, and added to [[https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/production/modules/role/files/mariadb/filtered_tables.txt|filtered_tables.txt]] instead.)

Having a table under $private_tables means it will not be replicated to labs (the replication filters prevent that).
Having a table added to filtered_tables.txt with its columns (or some of them) set to F (filtered) means that those columns will be redacted (via triggers).
Having it neither in $private_tables nor in filtered_tables.txt (or with all its columns set to K) means the table will be replicated to labs; however, it is not exposed until a view is created, which doesn't happen by default and requires a ticket to WMCS.
Views can be based on anything, or expose only certain columns under certain conditions.

My proposal if this table is meant to be public:

  • Remove it from private_tables
  • Add it to filtered_tables.txt with all the columns "K" for now.
  • Do not create the view until we are sure it is needed (that needs a ticket to WMCS) and discuss what needs to be exposed once the view is requested.

Removing the table from private_tables requires us to copy the table back to sanitarium and labs. It's the same if we add the table fully filtered to filtered_tables.txt and later decide we want to expose some columns: we'd need to copy the table again and then filter it.

If at some point we feel some columns need to be redacted, that can "easily" be done by changing the column in filtered_tables.txt and running the redact_sanitarium script.

Krinkle subscribed.

(Does not affect the wikimedia/rdbms PHP library.)

Change 500894 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/puppet@production] mariadb: Move wikimedia_editor_tasks_entity_description_exists

https://gerrit.wikimedia.org/r/500894

Change 500899 had a related patch set uploaded (by Marostegui; owner: Marostegui):
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Depool s8 sanitarium masters

https://gerrit.wikimedia.org/r/500899

Change 500899 merged by jenkins-bot:
[operations/mediawiki-config@master] db-eqiad,db-codfw.php: Depool s8 sanitarium masters

https://gerrit.wikimedia.org/r/500899

Mentioned in SAL (#wikimedia-operations) [2019-04-03T08:18:35Z] <marostegui> Stop replication on db2082 and db1087 (s8 sanitarium masters) T218302

Change 500894 merged by Marostegui:
[operations/puppet@production] mariadb: Move wikimedia_editor_tasks_entity_description_exists

https://gerrit.wikimedia.org/r/500894

Mentioned in SAL (#wikimedia-operations) [2019-04-03T08:23:02Z] <marostegui> Restart mysql on sanitarium hosts db1124 db1125 db2094 db2095 - T218302

Having a table under $private_tables means it will not be replicated to labs (the replication filters prevent that).
Having a table added to filtered_tables.txt with its columns (or some of them) set to F (filtered) means that those columns will be redacted (via triggers).
Having it neither in $private_tables nor in filtered_tables.txt (or with all its columns set to K) means the table will be replicated to labs; however, it is not exposed until a view is created, which doesn't happen by default and requires a ticket to WMCS.
Views can be based on anything, or expose only certain columns under certain conditions.

My proposal if this table is meant to be public:

  • Remove it from private_tables
  • Add it to filtered_tables.txt with all the columns "K" for now.
  • Do not create the view until we are sure it is needed (that needs a ticket to WMCS) and discuss what needs to be exposed once the view is requested.

This is all done. The table is now replicating to labs (reminder: it is not accessible there, as there is no view associated with it).