Page MenuHomePhabricator

Enable usage of translatable bundle status on Wikimedia sites
Closed, ResolvedPublic8 Estimated Story Points

Description

We have added a new table to track the status of translatable pages. This is working on Translatewiki.net but the table is still not created or used on Wikimedia sites.

To enable this feature,

  1. Create the table
  2. Run the syncTranslatableBundleStatus script to populate the table with the current status of translatable pages.

Creating new tables on Wikimedia sites: https://wikitech.wikimedia.org/wiki/Creating_new_tables

Table information

Information regarding table creation can be found in this Google doc and you can review all the options that we considered on the task here T302729: Spike: Explore options for improving page listing database query for Special:PageTranslation.

Table structure

# Copied from https://github.com/wikimedia/mediawiki-extensions-Translate/blob/master/sql/mysql/translate_translatable_bundles.sql
CREATE TABLE /*_*/translate_translatable_bundles (
  ttb_page_id INT UNSIGNED NOT NULL,
  ttb_type SMALLINT UNSIGNED NOT NULL,
  ttb_status SMALLINT UNSIGNED NOT NULL,
  ttb_sortkey VARBINARY(255) NOT NULL,
  UNIQUE INDEX ttb_type_sortkey_status (
    ttb_type, ttb_sortkey, ttb_status
  ),
  PRIMARY KEY(ttb_page_id)
) /*$wgDBTableOptions*/;

This table will contain the status for the latest revision of a translatable bundle.

Here’s a short description of each column:

  1. ttb_sortkey : A sort key created using Collation::getSortKey() with the page namespace and the page id.
  2. ttb_page_id : The page Id
  3. ttb_status : Different possible status for translatable bundles:
    1. tp:proposed - 1
    2. tp:active - 2
    3. tp:outdated - 3
    4. tp:broken - 4
  4. ttb_type: Type of translatable bundle. 1 for Translatable pages, 2 for Message bundles.

More information

Should this table be replicated to wiki replicas (does it not contain private data)?
Yes, we need to create the table on all the wikis where the Translate extension is enabled. It contains data about translatable pages and message bundles. It does not contain private data.

Will you be doing cross-joins with the wiki metadata?
We will be doing SQL JOIN with the page table and potentially other page related tables.

Size of the table (number of rows expected).
This table would store 1 row per translatable pages / message bundles. On Meta-Wiki, we currently have 10,000 translatable pages. I'd say we would be looking at around 50,000 rows at max.

Expected growth per year (number of rows).
I expect around 1,500-2,000 records at max.

Expected amount of queries, both writes and reads (per minute, per hour...per day, any of those are ok)
Approximate numbers:

  • reads - 7000 per day
  • writes - 350 per day

See: https://phabricator.wikimedia.org/T321929#9199246 to understand how we came up with these numbers

Examples of queries that will be using the table

# All translatable pages
SELECT * FROM translate_translatable_bundles WHERE type = 'tp'; 

# All translatable pages with pagination
SELECT * FROM translate_translatable_bundles WHERE type = 'tp' AND sortkey > 'x' LIMIT 200;

# All outdated translated pages
SELECT * FROM translate_translatable_bundles WHERE type = 'tp' AND status = 'outdated';

# All outdated, active, proposed etc translatable pages sorted, and with limit
SELECT * FROM translate_translatable_bundles WHERE type = 'tp' AND status IN (...) AND sortkey > 'x' LIMIT 200;

# Translatable pages with certain statuses, and with paging in a given namespace
SELECT * FROM translate_translatable_bundles WHERE type = 'tp' AND status IN (...) AND sortkey > 'x' LEFT JOIN page ON (page_id = page_id AND page_namespace = N) LIMIT 200; 

# Translatable pages with a status and paging in a given namespace and where title matches a value 
SELECT * FROM translate_translatable_bundles WHERE type = 'tp' AND status IN (...) AND sortkey > 'x' LEFT JOIN page ON (page_id = page_id AND page_namespace = N AND page_title LIKE '%name%') LIMIT 200; 

# Translatable pages with paging in a given namespace and where title matches a value 
SELECT * FROM translate_translatable_bundles WHERE type = 'tp' AND sortkey > 'x' LEFT JOIN page ON (page_id = page_id AND page_namespace = N AND page_title LIKE '%name%') LIMIT 200; 

# Operating on a single page.
INSERT/REPLACE (type, status, sortkey) = VALUES(...) where page_id = N;

The release plan for the feature (are there specific wikis you'd like to test first etc).

We'd like to do the release in 3 phases:

  1. Testwiki - Deploy and test on Testwiki, observe for few days.
  2. MediaWiki - Deploy and test on MediaWiki, observe for few days.
  3. Rollout to all other wikis using Translate extension. See list here: https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/d7b7b75151b5980db8020c12963c64f68f76e3b8/wmf-config/InitialiseSettings.php#7089

Deployment notes

  • Create the table on the respective wiki
  • Run the extensions/Translate/scripts/syncTranslatableBundleStatus script to populate the table with data based on existing status of translatable pages.

Current status

We've started creating the tables on various wikis. You can track the progress here: https://phabricator.wikimedia.org/T321929#9333589

Event Timeline

abi_ triaged this task as Medium priority.Oct 28 2022, 6:08 PM
Nikerabbit set the point value for this task to 4.Jun 6 2023, 10:30 AM

We enabled this table on translatewiki.net in Oct of 2022. I ran the script that syncs status from the revtag table to the translatable_bundles table. Here's the output:

... Syncing translatable bundle status ...

Fetching translatable bundles and their statues

No "Missing" translatable bundle statuses found!
No "Incorrect" translatable bundle statuses found!
Extra translatable bundles statuses:
  * [Id: 7710109] <Title not available>: Proposed

Synchronizing...

No "Missing" bundle statuses
No "Incorrect" bundle statuses
Removing "extra" bundle statuses
Removed "extra" bundle statuses

...Done syncing translatable status...

The page with id: 77710109 is not present in the page table, but exists in the revtag table:

+---------+---------+-------------+----------+
| rt_type | rt_page | rt_revision | rt_value |
+---------+---------+-------------+----------+
| tp:tag  | 7710109 |    11195242 | NULL     |
+---------+---------+-------------+----------+

My hypothesis is that the page was broken and then deleted from the system.

Currently the translatable_bundles is only used to track the status of translatable page, and no decision making is done based on the data in this table. The script to sync the status of translatable_bundles from the revtag table runs after database update completes. The code checks if the table translatable_bundles exists, and then begins the sync process.

abi_ updated the task description. (Show Details)

Change 959999 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/extensions/WikimediaMaintenance@master] createExtensionTable.php: Add translate_translatable_bundles

https://gerrit.wikimedia.org/r/959999

I was trying to figure out how many queries we be making to this new table. It should be less than the number of queries currently being made to the translate_metadata table. To find a closer figure, I needed to see how many queries are being run on that table.

I used performance_schema.events_statements_summary_by_digest table to get a more accurate number.

More information about the table can be found here: https://mariadb.com/kb/en/performance-schema-events_statements_summary_by_digest-table/

wikiadmin2023@10.192.16.104(metawiki)> SELECT schema_name, digest_text, first_seen, last_seen, count_star FROM performance_schema.events_statements_summary_by_digest  WHERE digest_text LIKE '%translate_metadata%' AND schema_name = 'metawiki' \G;
*************************** 1. row ***************************
schema_name: metawiki
digest_text: SELECT `tmd_group` , `tmd_key` , `tmd_value` FROM `translate_metadata` WHERE `tmd_group` IN (...) 
 first_seen: 2023-08-22 08:03:11
  last_seen: 2023-09-07 10:56:45
 count_star: 61854
*************************** 2. row ***************************
schema_name: metawiki
digest_text: SELECT `tmd_group` , `tmd_key` , `tmd_value` FROM `translate_metadata` WHERE `tmd_group` = ? 
 first_seen: 2023-08-22 08:07:54
  last_seen: 2023-09-26 12:38:01
 count_star: 178744
*************************** 3. row ***************************
schema_name: metawiki
digest_text: REPLACE INTO `translate_metadata` ( `tmd_group` , `tmd_key` , `tmd_value` ) VALUES (...) 
 first_seen: 2023-08-22 08:45:08
  last_seen: 2023-09-26 09:37:37
 count_star: 6489
*************************** 4. row ***************************
schema_name: metawiki
digest_text: DELETE FROM `translate_metadata` WHERE `tmd_group` = ? AND `tmd_key` = ? 
 first_seen: 2023-08-22 08:45:08
  last_seen: 2023-09-26 09:37:37
 count_star: 5722
*************************** 5. row ***************************
schema_name: metawiki
digest_text: SELECT `b` . `tmd_group` AS `group` , `b` . `tmd_value` AS `langs` FROM `translate_metadata` `a` JOIN `translate_metadata` `b` ON ( ( `a` . `tmd_group` = `b` . `tmd_group` ) AND `a` . `tmd_key` = ? AND `a` . `tmd_value` = ? AND `b` . `tmd_key` = ? ) 
 first_seen: 2023-08-22 10:44:40
  last_seen: 2023-09-26 12:35:39
 count_star: 1012
*************************** 6. row ***************************
schema_name: metawiki
digest_text: SELECT `tmd_group` AS `value` FROM `translate_metadata` WHERE `tmd_key` = ? 
 first_seen: 2023-08-22 11:52:30
  last_seen: 2023-09-26 11:01:31
 count_star: 266
*************************** 7. row ***************************
schema_name: metawiki
digest_text: SELECT `tmd_group` , `tmd_key` , `tmd_value` FROM `translate_metadata` WHERE `tmd_group` IN (...) AND `tmd_key` IN (...) 
 first_seen: 2023-08-22 16:48:24
  last_seen: 2023-09-25 08:42:58
 count_star: 19
7 rows in set (0.078 sec)

Based on the above data over a period of ~36 days (2023-08-22 to 2023-09-26) I see the following stats:

  • reads - 241,895 (Addition of all the SELECT statements)
  • writes - 12,206 (Addition of all REPLACE and DELETE statements)
abi_ added a project: DBA.

Thanks for the detailed information. I will review this ASAP. can you paste the schema and indexes in the task description as well?

Thanks for the detailed information. I will review this ASAP. can you paste the schema and indexes in the task description as well?

Thank you! Added the schema and index information.

Notes:

  • The column type doesn't exist in the schema but exists in the queries given, did you remove that?
  • The ttb_status can be an int instead of string, you'd put the types as constants in the code, e.g. 1 would be outdated, 2 would be another, etc. Take a look at linter extension for good example of lint categories being stored via int.
    • It doesn't matter much in this case as the table is tiny but that's a good design to keep in mind. And future proof in case the table grows drastically in some wikis.
  • if you're querying based on where condition of ttb_status (or ttb_type but I don't know what ttb_type is here), you need index on it
  • If you're joining with page table, you should add an index on ttb_page_id
    • Not related to schema: In join, put the condition on namespace or page title or whatever that's not ttb_page = page_id in WHERE condition, that's more readable and cleaner. Query planner knows what direction to do best.
  • Also it seems the prefix is not really decided? some places it's tb_ some places it's ttb_ (I prefer ttb_ to avoid clashes).

@Ladsgroup I'm sorry, I think I gave you an outdated schema design. What an incredible waste of your time.

The column type doesn't exist in the schema but exists in the queries given, did you remove that?

Sorry, I've updated the schema. I had incorrectly copy pasted from the linked Google doc. The table has

The ttb_status can be an int instead of string, you'd put the types as constants in the code, e.g. 1 would be outdated, 2 would be another, etc. Take a look at linter extension for good example of lint categories being stored via int.

Again, poor copy pasting. ttb_status is a SMALLINT column.

if you're querying based on where condition of ttb_status (or ttb_type but I don't know what ttb_type is here), you need index on it

We have a unique index on ttb_type, ttb_sortkey, ttb_status. Most of the queries we run will have atleast the ttb_type, but almost always the ttb_sortkey too.

If you're joining with page table, you should add an index on ttb_page_id

Ack. We currently don't have a query that performs a JOIN with the page table, but in the future we will. I'd like to add this index at that time since this table is already present on tranaslatewiki.net

Also it seems the prefix is not really decided? some places it's tb_ some places it's ttb_ (I prefer ttb_ to avoid clashes).

Bad copy paste again. We're using ttb_ as the prefix everywhere.

The table is already enabled and being written to (but not read from) on translatewiki.net.

My apologies. I forgot to respond. Yes. It looks good to me.

@abi_ remember that DBAs do not create tables, those can be done via the release process (or by anyone with deployment rights). As it's been reviewed by @Ladsgroup this should be good to go

Change 972099 had a related patch set uploaded (by Abijeet Patro; author: Abijeet Patro):

[mediawiki/extensions/Translate@master] translatable_bundles.sql: Create table only if it doesn't exist

https://gerrit.wikimedia.org/r/972099

List of wikis where the Translate extension is enabled are:

  • advisorswiki - 21-11-2023 - 0 records in table.
  • amwikimedia - 21-11-2023 - 0 records in table.
  • azwikimedia - 21-11-2023 - 0 records in table.
  • bewikimedia - 21-11-2023 - 0 records in table.
  • betawikiversity - 21-11-2023 - 47 records in table.
  • brwikimedia - 21-11-2023 - 27 records in table.
  • bdwikimedia - 21-11-2023 - 29 records in table.
  • cawikimedia - 21-11-2023 - 38 records in table.
  • collabwiki - 29-11-2023 - 112 records in table.
  • commonswiki - 04-12-2023 - 2172 records in table.
  • testcommonswiki - Wiki inactive
  • foundationwiki - 30-11-2023 - 337 records in table.
  • frwiktionary - 27-11-2023 - 102 records in table.
  • gewikimedia - 23-11-2023 - 47 records in table.
  • grwikimedia - 23-11-2023 - 0 records in table.
  • hiwikimedia - 23-11-2023 - 0 records in table.
  • idwikimedia - 23-11-2023 - 24 records in table.
  • incubatorwiki - 29-11-2023 - 45 records in table.
  • legalteamwiki - 29-11-2023 - 18 records in table.
  • maiwikimedia - 23-11-2023 - 1 record in table.
  • mediawikiwiki - 06-12-2023 - 6477 records in table.
  • metawiki - 11-12-2023 - 10868 records in table.
  • nowikimedia - 23-11-2023 - 54 records in table.
  • otrs_wikiwiki - 29-11-2023 - 59 records in table.
  • outreachwiki - 30-11-2023 - 179 records in table.
  • plwikimedia - 23-11-2023 - 6 records in table.
  • ptwikisource - 29-11-2023 - 8 records in table.
  • punjabiwikimedia - 27-11-2023 - 0 records in table.
  • ruwikimedia - 27-11-2023 - 51 records in table.
  • sourceswiki - 30-11-2023 - 22 records in table.
  • specieswiki - 29-11-2023 - 97 records in table.
  • sewikimedia - 27-11-2023 - 103 records in table.
  • testwiki - 16-11-2023 - 90 records in table.
  • testwikidatawiki - 16-11-2023 - 35 records in table.
  • uawikimedia - 27-11-2023 - 24 records in table.
  • vewikimedia - 27-11-2023 - 1 record in table.
  • wbwikimedia - 27-11-2023 - 1 record in table.
  • wikidatawiki - 04-12-2023 - 908 records in table.
  • wikifunctionswiki - 20-11-2023 - 73 records in table.
  • wikimania2012wiki - 20-11-2023 - 30 records in table.
  • wikimania2013wiki - 20-11-2023 - 24 records in table.
  • wikimania2014wiki - 20-11-2023 - 18 records in table.
  • wikimania2015wiki - 20-11-2023 - 73 records in table.
  • wikimania2016wiki - 20-11-2023 - 57 records in table.
  • wikimania2017wiki - 20-11-2023 - 81 records in table.
  • wikimania2018wiki - 20-11-2023 - 21 records in table
  • wikimaniawiki - 20-11-2023 - 207 records in table.

Source: https://3v4l.org/Y4Ykl#v8.2.9

We will start with test wikis, and then smaller wikis until the script is run on all the wikis.

Scripts to run:

# Create the table
mwscript sql.php --wiki=wikidbname /srv/mediawiki/php-{mw-version}/extensions/Translate/sql/mysql/translate_translatable_bundles.sql
# Populate the table with current data
cd /srv/mediawiki/php-1.xx-wmf.xx/
mwscript  extensions/Translate/scripts/syncTranslatableBundleStatus.php --wiki {wikiname}

I've created the tables on testwiki, testwikidatawiki. Will monitor for any issues and will subsequently run on other wikis

abi_ changed the point value for this task from 4 to 8.Nov 20 2023, 8:32 AM

I've created the tables on wikimania wikis, and on wikifunctions wiki.

Nikerabbit changed the task status from Open to In Progress.Nov 23 2023, 8:48 AM

Pending wikis:

  • outreachwiki - Planned for tomorrow
  • sourceswiki - Planned for tomorrow
  • foundationwiki - Planned for tomorrow
  • wikidatawiki - Planned for early next week
  • commonswiki - Planned for early next week
  • mediawikiwiki - Planned for mid next week
  • metawiki - Early in the week of Dec, 11th 2023.

I ran a script to verify that the table was created on all wikis where the Translate extension is enabled. In addition I verified that the stats in the sites where the table exists is in sync with the latest status of translatable pages.

Change 972099 abandoned by Nikerabbit:

[mediawiki/extensions/Translate@master] translatable_bundles.sql: Create table only if it doesn't exist

Reason:

Not needed

https://gerrit.wikimedia.org/r/972099

Change 959999 merged by jenkins-bot:

[mediawiki/extensions/WikimediaMaintenance@master] createExtensionTable.php: Add translate_translatable_bundles

https://gerrit.wikimedia.org/r/959999