Page MenuHomePhabricator

Spike: Explore options for improving page listing database query for Special:PageTranslation
Closed, ResolvedPublic4 Estimated Story Points

Description

This is split off of the parent task: T141961: Improve Special:PageTranslation list.

The goals is to improve performance by exploring what options are available in the database layer, in order to decide what can be done in the front-end.

Special:PageTranslation query

Background

Purpose: provide a list of translatable pages grouped by status with some metadata.

Queried data:
Page table: page_id, page_namespace, page_title, page_latest
revtag table: rt_type, rt_revision
translate_metadata table: various fields

Query plan:
Part 1:
page + revtag, join on page_id=rt_page
Note: Returns two rows per translatable page to fetch MAX(rt_revision) for rt_type in (tp:mark, tp:tag).

Part 2:
translate_metadata: select by message group id

Derived data:
Page status by comparing page_latest and rt_revision (for each of the two possible types)

Challenges

We list all translatable pages, but to do so we fetch 2xcount(translatable pages) rows. We would like to add pagination, but we cannot do it effectively because grouping is done client side based on derived data.

Quick win: Can we return only one row per page?
Ideal solution: Can we move data derivation and grouping to SQL side?
Do we need to remove grouping? Is any kind of efficient filtering possible?

As a side note, translate_metadata table is queried separately. Ideally it would be joined too, but there is no field to join upon. Though in theory group id can be formed as "page-{Prefixed Title}".

Event Timeline

Nikerabbit triaged this task as Medium priority.Mar 1 2022, 9:31 AM
Nikerabbit updated the task description. (Show Details)
Nikerabbit set the point value for this task to 4.

Abijeet and I talked a bit about this. Today I manged to look at it a bit in more depth, My suggestion is something along the lines of making a query per type, e.g. one for discouraged, one for broken, and union the result in the output page. On top of that It should have some WAN cache (even for ten minutes). Then easily you can implement pagination and filtering of groups based on that.

Tacsipacsi renamed this task from Spike: Explore options for improving page listing database query for Special:TranslatablePages to Spike: Explore options for improving page listing database query for Special:PageTranslation.Jul 4 2022, 4:24 PM
Tacsipacsi subscribed.

For translatable pages, we track up-to two tags in the rev_tag table to identify their state,

  1. tp:mark - To track that the page can be marked for translation.
  2. tp:tag - To track that the page has been marked for translation.

To determine whether a page is ready to be marked for translation, is ready to be ready for translation, is outdated or broken, we need to fetch all the tags associated with the page. Hence its' not possible to efficiently have queries per type

Quick win: Can we return only one row per page?
Ideal solution: Can we move data derivation and grouping to SQL side?

I was able to build a query that returns a single row per translatable page to which we should also be able to add pagination:

SELECT 
    page_id,
    page_namespace,
    page_title,
    page_latest,
    GROUP_CONCAT(rt_type, ':', rt_revision
        SEPARATOR '|') AS rt_type_revsion
FROM
    (SELECT 
        page_id,
            page_namespace,
            page_title,
            page_latest,
            rt_type,
            MAX(rt_revision) AS rt_revision
    FROM
        `page`, `revtag`
    WHERE
        (page_id = rt_page)
            AND rt_type IN ('tp:mark' , 'tp:tag')
    GROUP BY page_id , page_namespace , page_title , page_latest , rt_type) t
GROUP BY page_id , page_namespace , page_title , page_latest
ORDER BY page_namespace , page_title;

Output from translatewiki:

+---------+----------------+------------------------------------------------+-------------+----------------------------------+
| page_id | page_namespace | page_title                                     | page_latest | rt_type_revsion                  |
+---------+----------------+------------------------------------------------+-------------+----------------------------------+
| 4346316 |              0 | Group_descriptions                             |    10856120 | tp:mark:10856120|tp:tag:10856120 |
| 7352583 |              0 | Help_us                                        |    10739608 | tp:mark:10739608|tp:tag:10739608 |
| 2936726 |              0 | Localisation_guidelines                        |    10749918 | tp:tag:10570974                  |
| 1367141 |              0 | Technology                                     |    10642872 | tp:tag:10642872|tp:mark:5208960  |
| 7430687 |              2 | Abijeet_Patro/TP_Test_10Aug2022                |    10856652 | tp:tag:10856652|tp:mark:10856652 |
| 7372067 |              2 | Abijeet_Patro/TestMyStuff                      |    10770947 | tp:tag:10770947                  |
| 7112613 |              2 | Abijeet_Patro/Testing_T36                      |    10689327 | tp:mark:10689327|tp:tag:10689327 |
| 7116135 |              2 | Amire80/глоссарий-standby                      |    10568213 | tp:mark:10568213|tp:tag:10568213 |
|   98142 |              2 | FuzzyBot                                       |    10467760 | tp:tag:10467760|tp:mark:10467760 |
| 5004548 |              2 | MacFan4000                                     |     9640122 | tp:tag:7766997                   |
| 7016079 |              2 | Nike/Sandbox_Test                              |    10671497 | tp:tag:10671497|tp:mark:10263573 |
| 5670385 |              2 | Nike/test                                      |     8142173 | tp:tag:8142173|tp:mark:8142173   |
| 4994999 |              2 | Plagiat/Sandbox                                |     7187003 | tp:tag:7187003                   |
| 2164516 |              2 | Raymond/WMDE-Wikipedia_for_World_Heritage      |     3109254 | tp:tag:3109254|tp:mark:3109254   |
| 2690705 |              2 | Raymond/Wiki_Loves_Monuments/infographic       |     3877406 | tp:tag:3877406|tp:mark:3873598   |
| 3551359 |              2 | Suriyaa_Kudo                                   |    10206557 | tp:tag:9725263                   |
| 7233963 |              2 | Wangombe                                       |    10690019 | tp:tag:10690019                  |
| 2918426 |              2 | Zviad                                          |     4501884 | tp:tag:4175686                   |
| 1449701 |              4 | About                                          |    10682298 | tp:mark:10635598|tp:tag:10682298 |
|  131510 |              4 | General_disclaimer                             |    10416737 | tp:tag:10416737|tp:mark:10197308 |
|  136816 |              4 | Privacy_policy                                 |     7652256 | tp:mark:5454075|tp:tag:7652256   |
|  867609 |              8 | Tpt-notsuitable/fr                             |    10672165 | tp:tag:9022184                   |
| 5141223 |              8 | Tpt-notsuitable/tr                             |    10672114 | tp:tag:9525871                   |
| 7025754 |             10 | Data                                           |    10522584 | tp:tag:10244770                  |
| 2674052 |             10 | Optional                                       |    10582474 | tp:tag:10582474                  |
|  657166 |             10 | ProcessNavigation                              |     6001833 | tp:tag:5993786                   |
| 4929072 |             10 | Welcome/i18n                                   |    10602933 | tp:tag:10602933|tp:mark:10392514 |
| 7278792 |             14 | FAQ                                            |    10642983 | tp:tag:10642983                  |
| 1464962 |             14 | Help                                           |    10602748 | tp:mark:10602748|tp:tag:10602748 |
| 1465236 |             14 | Projects                                       |    10639569 | tp:tag:10639569                  |
|  491091 |             14 | Supported_projects                             |    10742206 | tp:tag:10742206                  |
| 7353858 |             14 | Supported_projects/ab                          |    10741911 | tp:tag:10741856                  |
| 7353859 |             14 | Supported_projects/af                          |    10741910 | tp:tag:10741857                  |
| 7353860 |             14 | Supported_projects/aln                         |    10741912 | tp:tag:10741858                  |
| 7353861 |             14 | Supported_projects/anp                         |    10741913 | tp:tag:10741859                  |
| 7353862 |             14 | Supported_projects/ar                          |    10741897 | tp:tag:10741860                  |
| 7353863 |             14 | Supported_projects/arc                         |    10741922 | tp:tag:10741861                  |
| 7353864 |             14 | Supported_projects/arz                         |    10741923 | tp:tag:10741864                  |
| 7353865 |             14 | Supported_projects/as                          |    10741900 | tp:tag:10741865                  |
| 7353867 |             14 | Supported_projects/ast                         |    10741927 | tp:tag:10741869                  |
| 7353866 |             14 | Supported_projects/az                          |    10741909 | tp:tag:10741868                  |
| 7353868 |             14 | Supported_projects/azb                         |    10741930 | tp:tag:10741872                  |
| 7353869 |             14 | Supported_projects/ba                          |    10741931 | tp:tag:10741873                  |
| 7353870 |             14 | Supported_projects/ban                         |    10741934 | tp:tag:10741874                  |
| 7353871 |             14 | Supported_projects/bcc                         |    10741939 | tp:tag:10741875                  |
| 7353872 |             14 | Supported_projects/bcl                         |    10741878 | tp:tag:10741878                  |
| 7353873 |             14 | Supported_projects/be                          |    10741902 | tp:tag:10741879                  |
| 7353874 |             14 | Supported_projects/be-tarask                   |    10741941 | tp:tag:10741880                  |
| 7353875 |             14 | Supported_projects/bg                          |    10741901 | tp:tag:10741881                  |
| 7353876 |             14 | Supported_projects/bn                          |    10741898 | tp:tag:10741882                  |
| 7353877 |             14 | Supported_projects/br                          |    10741907 | tp:tag:10741883                  |
| 7353878 |             14 | Supported_projects/bs                          |    10741908 | tp:tag:10741884                  |
| 7353879 |             14 | Supported_projects/ca                          |    10741906 | tp:tag:10741887                  |
| 7353880 |             14 | Supported_projects/ce                          |    10741942 | tp:tag:10741888                  |
| 7353881 |             14 | Supported_projects/ckb                         |    10741943 | tp:tag:10741889                  |
| 7353882 |             14 | Supported_projects/cs                          |    10741905 | tp:tag:10741890                  |
| 7353883 |             14 | Supported_projects/da                          |    10741904 | tp:tag:10741891                  |
| 7353884 |             14 | Supported_projects/de                          |    10741903 | tp:tag:10741892                  |
| 7353885 |             14 | Supported_projects/diq                         |    10741896 | tp:tag:10741895                  |
|  788926 |             14 | Translatewiki.net                              |     9895134 | tp:tag:9895134                   |
| 4364747 |           1102 | Advanced                                       |    10602727 | tp:tag:10602727                  |
| 4364710 |           1102 | Getting_started                                |     6259703 | tp:tag:6259702                   |
|   44189 |           1102 | How_to_start                                   |    10746519 | tp:mark:10744808|tp:tag:10746519 |
|  494220 |           1102 | Index                                          |    10798592 | tp:tag:10798592|tp:mark:10577114 |
|  657164 |           1102 | Intro                                          |    10682299 | tp:mark:10682299|tp:tag:10682299 |
| 4364709 |           1102 | Joining_the_community                          |    10602728 | tp:tag:10602728                  |
| 5234271 |           1102 | Lib.reviews/FAQ                                |    10639556 | tp:mark:7555037|tp:tag:10639556  |
| 5234279 |           1102 | Lib.reviews/Terms_of_use                       |    10639639 | tp:tag:10639639|tp:mark:7540419  |
| 5644508 |           1102 | LibreMesh                                      |     9792185 | tp:tag:8150055                   |
|  491092 |           1102 | MediaWiki                                      |    10795515 | tp:tag:9716185                   |
| 7098657 |           1102 | MediaWiki/Basic_glossary                       |    10732630 | tp:tag:10732630|tp:mark:10732630 |
| 7112450 |           1102 | MediaWiki/Basic_glossary:_Preface              |    10602756 | tp:mark:10568154|tp:tag:10602756 |
| 7113317 |           1102 | MediaWiki/Basic_glossary:_Tips_for_maintainers |    10602760 | tp:mark:10568208|tp:tag:10602760 |
| 7113316 |           1102 | MediaWiki/Basic_glossary:_Tips_for_translators |    10602761 | tp:mark:10568164|tp:tag:10602761 |
|  142613 |           1102 | MediaWiki/Statistics_in_time                   |    10603118 | tp:tag:10603118|tp:mark:5993578  |
|  762445 |           1102 | New_project                                    |    10582759 | tp:mark:10582759|tp:tag:10582759 |
|  164664 |           1102 | Offline                                        |    10123073 | tp:mark:10123073|tp:tag:10123073 |
| 1042977 |           1102 | OpenStreetMap                                  |    10251030 | tp:tag:7021918                   |
|   22497 |           1102 | Process                                        |    10640419 | tp:tag:10640419|tp:mark:10640419 |
| 4364711 |           1102 | Proofreading_the_messages                      |     6259062 | tp:tag:6259062                   |
|   44184 |           1102 | Statistics                                     |    10798673 | tp:mark:10580582|tp:tag:10798673 |
| 4364708 |           1102 | Translating_the_messages                       |     6259030 | tp:tag:6259030                   |
| 1530504 |           1198 | Project_list/1/bn                              |     5017645 | tp:tag:2195094                   |
+---------+----------------+------------------------------------------------+-------------+----------------------------------+

Running explain on the current and the proposed query:

Current query

EXPLAIN SELECT 
    page_id,
    page_namespace,
    page_title,
    page_latest,
    rt_type,
    MAX(rt_revision) AS rt_revision
FROM
    `bw_page`,
    `bw_revtag`
WHERE
    (page_id = rt_page)
        AND rt_type IN ('tp:mark' , 'tp:tag')
GROUP BY page_id , page_namespace , page_title , page_latest, rt_type
ORDER BY page_namespace , page_title;

Output:

+------+-------------+-----------+--------+---------------+---------+---------+-------------------------------------+------+-----------------------------------------------------------+
| id   | select_type | table     | type   | possible_keys | key     | key_len | ref                                 | rows | Extra                                                     |
+------+-------------+-----------+--------+---------------+---------+---------+-------------------------------------+------+-----------------------------------------------------------+
|    1 | SIMPLE      | bw_revtag | range  | PRIMARY       | PRIMARY | 62      | NULL                                | 1338 | Using where; Using index; Using temporary; Using filesort |
|    1 | SIMPLE      | bw_page   | eq_ref | PRIMARY       | PRIMARY | 4       | translatewiki_net.bw_revtag.rt_page | 1    | Using where                                               |
+------+-------------+-----------+--------+---------------+---------+---------+-------------------------------------+------+-----------------------------------------------------------+

Proposed query

EXPLAIN SELECT 
    page_id,
    page_namespace,
    page_title,
    page_latest,
    GROUP_CONCAT(rt_type, ':', rt_revision
        SEPARATOR '|') AS rt_type_revsion
FROM
    (SELECT 
        page_id,
            page_namespace,
            page_title,
            page_latest,
            rt_type,
            MAX(rt_revision) AS rt_revision
    FROM
        `bw_page`, `bw_revtag`
    WHERE
        (page_id = rt_page)
            AND rt_type IN ('tp:mark' , 'tp:tag')
    GROUP BY page_id , page_namespace , page_title , page_latest , rt_type) t
GROUP BY page_id , page_namespace , page_title , page_latest
ORDER BY page_namespace , page_title;

Output:

+------+-------------+------------+--------+---------------+---------+---------+-------------------------------------+------+-----------------------------------------------------------+
| id   | select_type | table      | type   | possible_keys | key     | key_len | ref                                 | rows | Extra                                                     |
+------+-------------+------------+--------+---------------+---------+---------+-------------------------------------+------+-----------------------------------------------------------+
|    1 | PRIMARY     | <derived2> | ALL    | NULL          | NULL    | NULL    | NULL                                | 1338 | Using temporary; Using filesort                           |
|    2 | DERIVED     | bw_revtag  | range  | PRIMARY       | PRIMARY | 62      | NULL                                | 1338 | Using where; Using index; Using temporary; Using filesort |
|    2 | DERIVED     | bw_page    | eq_ref | PRIMARY       | PRIMARY | 4       | translatewiki_net.bw_revtag.rt_page | 1    | Using where                                               |
+------+-------------+------------+--------+---------------+---------+---------+-------------------------------------+------+-----------------------------------------------------------+

We've been exploring various options to improve the database query in a Google doc here but I've added the contents of the document below as well.

Currently we are contemplating going with either Option 5 or Option 6. Option 5 will be less work, but might not be as performant or as extensible as Option 6.

Happy to hear other thoughts and ideas.

Classification of translatable pages

Translatable pages are tagged in the revtag table. That’s what is used to identify if a page is translatable, and what its current classification is. The various states that a translatable page can have are:

  1. Active- Pages marked for translation. Ideal state. Have the tp:tag and tp:mark tag and both of them point to the latest page revision.
  2. Proposed - Pages that can be marked for translation. They have the tp:tag but not the tp:mark tag.
  3. Outdated- Pages that have been updated since the last time they were marked for translation. They have both the tp:mark and the tp:tag tags, but the revision for tp:mark will not point to the latest revision. Latest version should be marked for translation.
  4. Broken- Pages that have the tp:tag tag, but are missing the tp:mark tag and don’t have the tp:tag pointing to the latest revision. This might happen if a page that previously had the tp:mark tag has recently been updated to remove the <translate> tags.

Current logic

Following is the current logic in PHP that decides how to classify at translatable page

IF page does not have tp:mark
    IF latest revision of the page has tp:tag
        CLASSIFY: Page is proposed for translation
ELSE IF latest revision of page has tp:tag
    IF latest revision of the page has tp:mark tag
        CLASSIFY: Page is active
    ELSE 
        CLASSIFY: Page is outdated
ELSE
    CLASSIFY: Page is broken

Current structure

Following is the structure of the revtag table:

CREATE TABLE `revtag` (
  `rt_type` varbinary(60) NOT NULL,
  `rt_page` int(11) NOT NULL,
  `rt_revision` int(11) NOT NULL,
  `rt_value` blob DEFAULT NULL,
  PRIMARY KEY (`rt_type`,`rt_page`,`rt_revision`),
  KEY `rt_revision_type` (`rt_revision`,`rt_type`)
) DEFAULT CHARSET=binary;

Brief description of the columns:

  1. rt_type : Tags for translatable bundles. For example, these track if a translatable page is tagged and marked for translation. Eg: tp:tag, tp:mark, fuzzy.
  2. rt_page : The Id of the page to which this tag belongs
  3. rt_revision : The revision Id of the page. For example: The latest version of a translatable page may have the tp:tag tag, but not the tp:mark tag.
  4. rt_value : Used if we have additional data to store for a tag. For example: we store the revision Id of the source text that was translated with the tp:transver tag in order to display diffs for outdated messages.

Query

This is the query currently being run to fetch the translatable pages.

SELECT 
    page_id,
    page_namespace,
    page_title,
    page_latest,
    rt_type,
    MAX(rt_revision) AS rt_revision
FROM
    `bw_page`,
    `bw_revtag`
WHERE
    (page_id = rt_page)
        AND rt_type IN ('tp:mark' , 'tp:tag')
GROUP BY page_id , page_namespace , page_title , page_latest, rt_type
ORDER BY page_namespace , page_title;

Explain analysis of the existing query:

image.png (92×1 px, 17 KB)

Two tables are queried: page and revtag. The revtag table contains the necessary tagging information to identify if a page is a translatable page, and its current state / classification.

Metadata for translatable pages

After fetching the translatable pages, there are certain values that are needed from the translate_metadata table. These include:

  1. Discouraged - Whether translation for the page is discouraged
  2. Version - What syntax version does the page use?
  3. Transclusion - Whether the page supports transclusion

Query:

SELECT
    tmd_group, tmd_key, tmd_value
FROM
    `translate_metadata`
WHERE
    tmd_group IN (..group_id..)

Potential solutions

1/ Use GROUP_CONCAT to combine multiple rows into one

The GROUP_CONCAT aggregate function concatenates strings from multiples rows into a single row split by a separator

The existing query that we have can be updated to use GROUP_CONCAT as shown in the query below:

SELECT 
    page_id,
    page_namespace,
    page_title,
    page_latest,
    GROUP_CONCAT(rt_type, ':', rt_revision
        SEPARATOR '|') AS rt_type_revsion
FROM
    (SELECT 
        page_id,
            page_namespace,
            page_title,
            page_latest,
            rt_type,
            MAX(rt_revision) AS rt_revision
    FROM
        `page`, `revtag`
    WHERE
        (page_id = rt_page)
            AND rt_type IN ('tp:mark' , 'tp:tag')
    GROUP BY page_id , page_namespace , page_title , page_latest , rt_type) t
GROUP BY page_id , page_namespace , page_title , page_latest
ORDER BY page_namespace , page_title;

Sample output of the query:

+---------+----------------+------------------------------------------------+-------------+----------------------------------+
| page_id | page_namespace | page_title                                     | page_latest | rt_type_revsion                  |
+---------+----------------+------------------------------------------------+-------------+----------------------------------+
| 4346316 |              0 | Group_descriptions                             |    10856120 | tp:mark:10856120|tp:tag:10856120 |
| 7352583 |              0 | Help_us                                        |    10739608 | tp:mark:10739608|tp:tag:10739608 |
| 2936726 |              0 | Localisation_guidelines                        |    10749918 | tp:tag:10570974                  |
| 1367141 |              0 | Technology                                     |    10642872 | tp:tag:10642872|tp:mark:5208960  |
| 7430687 |              2 | Abijeet_Patro/TP_Test_10Aug2022                |    10856652 | tp:tag:10856652|tp:mark:10856652 |
| 7372067 |              2 | Abijeet_Patro/TestMyStuff                      |    10770947 | tp:tag:10770947                  |
| 7112613 |              2 | Abijeet_Patro/Testing_T36                      |    10689327 | tp:mark:10689327|tp:tag:10689327 |
| 7116135 |              2 | Amire80/глоссарий-standby                       |    10568213 | tp:mark:10568213|tp:tag:10568213 |
|   98142 |              2 | FuzzyBot                                       |    10467760 | tp:tag:10467760|tp:mark:10467760 |
| 5004548 |              2 | MacFan4000                                     |     9640122 | tp:tag:7766997                   |
| 7016079 |              2 | Nike/Sandbox_Test                              |    10671497 | tp:tag:10671497|tp:mark:10263573 |
| 5670385 |              2 | Nike/test                                      |     8142173 | tp:tag:8142173|tp:mark:8142173   |
| 4994999 |              2 | Plagiat/Sandbox                                |     7187003 | tp:tag:7187003                   |
| 2164516 |              2 | Raymond/WMDE-Wikipedia_for_World_Heritage      |     3109254 | tp:tag:3109254|tp:mark:3109254   |
| 2690705 |              2 | Raymond/Wiki_Loves_Monuments/infographic       |     3877406 | tp:tag:3877406|tp:mark:3873598   |
| 3551359 |              2 | Suriyaa_Kudo                                   |    10206557 | tp:tag:9725263                   |
| 7233963 |              2 | Wangombe                                       |    10690019 | tp:tag:10690019                  |
| 2918426 |              2 | Zviad                                          |     4501884 | tp:tag:4175686                   |
| 1449701 |              4 | About                                          |    10682298 | tp:mark:10635598|tp:tag:10682298 |
+---------+----------------+------------------------------------------------+-------------+----------------------------------+

The rt_type and rt_revision values from multiple rows are merged together in the rt_type_revision.

Here’s an output of the EXPLAIN analysis of the query:

image.png (113×1 px, 22 KB)

How does it help?

Our existing queries fetch upto 2 rows for each translatable page from the database. This makes pagination difficult. Returning a single row removes that limitation.

Benefits

  1. By returning a single row, pagination can be implemented.
  2. This is a small change, and should be easy to implement.

Limitations

  1. The classification of pages still has to be done via the code which in turn limits pagination capabilities.
  2. The query does not improve the existing query in any way, and is potentially slower than what’s already in place.

2/ Use smaller queries

Pages in translation

The following query will allow us to identify the pages that are in translation irrespective of their current classification. It fetches all pages that have a tp:tag tag

SELECT
    DISTINCT page_id,
    page_namespace,
    page_title,
    page_latest
FROM
    `page`,
    `revtag`
WHERE
    (page_id = rt_page)
    AND rt_type IN ('tp:tag')
ORDER BY page_namespace , page_title;

Active pages

First identify pages that have the tp:tag for the latest revision.

SELECT
    page_id
FROM
    `page`,
    `revtag`
WHERE
    (page_id = rt_page)
        AND rt_type IN ('tp:tag')
        AND rt_revision = page_latest;

Then check which of those pages have the tp:mark tag for the latest revisions

SELECT
    page_id,
    page_namespace,
    page_title,
    rt_revision
FROM
    `page`,
    `revtag`
WHERE    
    rt_type IN ('tp:mark')
    AND page_id IN (2, 15, 186)
    AND rt_revision IN (15553, 592, 1870)
GROUP BY page_id;

Outdated pages

Identifying outdated pages with this approach using database query is difficult since we are looking for records that are missing. For example the following query also adds the broken pages and is not scalable when querying a large number of pages:

SELECT
    page_id,
    page_namespace,
    page_title,
    rt_revision
FROM
    `page`,
    `revtag`
WHERE    
    rt_type IN ('tp:mark')
    AND page_id = 2 AND rt_revision != 15553 OR
    page_id = 592 AND rt_revision != 592 OR
    page_id = 186 AND rt_revision != 1870    
GROUP BY page_id;

Proposed pages

We do not add any the tp:mark tag to pages that have the translate tag but are not marked for translation. Identifying proposed pages with this approach using database query is difficult since we are looking for records that are missing.

Broken pages

The following query can be used to identify pages that are broken:

SELECT
    page_id,
    page_namespace,
    page_title,
    page_latest,
    rt_type,
    MAX(rt_revision) AS rt_revision
FROM
    `page`,
    `revtag`
WHERE
    (page_id = rt_page)
        AND rt_type IN ('tp:tag')
GROUP BY page_id , page_namespace , page_title
HAVING page_latest != rt_revision
ORDER BY page_namespace , page_title
LIMIT 0, 10;

No other query is necessary.

Benefits

  1. By returning a single row, we can now implement pagination.
  2. Moderate amount of changes
  3. Approach runs multiple queries to fetch different types of translatable pages which will lend well to the redesign that Pau wanted

Drawbacks

  1. No easy way to fetch outdated pages
  2. No way to fetch proposed pages which is a deal-breaker.
  3. Having different queries for different types of translatable pages has a large maintainability and mental cost

3/ Always provide a tp:mark tag

Even if the page is not marked for translation, we add a tp:mark tag in the revtag table with rt_revision as 0. This will help identify pages that are proposed for translation.

Pages in translation

This query will be the same as in Option 2/ Pages in translation.

Active pages

The following query will help identify pages that are active:

SELECT
    page_id,
    page_namespace,
    page_title,
    page_latest,
    rt_type,
    MAX(rt_revision) AS rt_revision
FROM
    `page`,
    `revtag`
WHERE
    (page_id = rt_page)
        AND rt_type IN ('tp:mark')
GROUP BY page_id , page_namespace , page_title
HAVING page_latest = rt_revision
ORDER BY page_namespace , page_title
LIMIT 0, 20;

Outdated pages

Identifying outdated pages with this approach in a single database query is difficult since we are looking for records that are missing.

See Outdated pages section under Option 2/

Proposed pages

Since we are now adding the tp:mark to proposed pages, with rt_revision as 0, we can identify those:

SELECT
    page_namespace,
    page_title,
    page_id
FROM
    `page`,
    `revtag`
WHERE
    (page_id = rt_page)     
        AND rt_type IN ('tp:mark')
        AND rt_revision = 0
ORDER BY page_namespace, page_title
LIMIT 0 , 20;

Broken pages

See the Broken pages section under Option 2/

Benefits

  1. By returning a single row, we can now implement pagination.
  2. Large amount of changes
  3. Approach runs multiple queries to fetch different types of translatable pages which will lend well to the redesign that Pau proposed

Drawbacks

  1. No easy way to fetch outdated pages
  2. Having different queries for different types of translatable pages has a large maintainability and mental cost
  3. No major performance improvements

4/ Separate table to store flags associated with translatable bundles

This approach proposes adding a separate table to track the status of the translatable bundles. This could be done using a BIT field or multiple rows in a database table. Sample structure:

CREATE TABLE IF NOT EXISTS `pagetranslation`.`translate_tb_pages` (
  `tb_page_id` INT(11) UNSIGNED NOT NULL,
  `tb_revision_id` INT(11) UNSIGNED NOT NULL,
  `tb_page_tag` VARBINARY(100) NOT NULL,
  `tb_page_tag_value` TINYINT(3) UNSIGNED NOT NULL DEFAULT 0,
  PRIMARY KEY (`tp_page_id`, `tp_revision_id`, `tp_page_tag`)
)

The above structure stores a tag, irrespective of whether the page has it or not.

For translatable pages, it would store both the tags - tp:tag, and tp:mark for each revision of a translatable page. Examples:

  1. When a page is created with the <translate> tag in it, we would add tp:tag (true) and tp:mark (false) for the page and its revision in this table. (Proposed)
  2. When a page is marked for translation, we would update the tp:mark tag to be true. (Active)
  3. When a page is updated, and the <translate> tag is kept, we would add the tp:tag as true, and tp:mark tag as false for that revision. (Outdated)
  4. When a page is updated, and the <translate> tag is removed, we would add the tp:tag, tp:mark tag both as false. (Broken)

Benefits

  1. We can add pagination
  2. Same query can be used to fetch translatable bundles of different types. The only difference would be the parameters passed to the query
  3. Can also be used for message bundles.
  4. We are able to extract the appropriate classification for translatable pages via records in the table. This allows for it to be used for other purposes in the future.

Drawbacks

  1. For every revision of a translatable bundle we will be tracking multiple tags. In large wiki’s this might cause a lot of rows to be created.
  2. Large change.
  3. Duplicating data between here and the revtag table.

Variation

4.1/ Use BIT flags for storing the flags

We could also store the flags using BITs. This would reduce the number of records per revision to 1.

The drawback would be that the data in the table would be hard to read without code.

4.2/ Reuse or replace the revtag table

The table is quite similar to the revtag table but has the following differences:

  1. We insert a record into this table for every tag irrespective of whether it's true or not.
  2. tb_page_tag_value - This column is built to store a TINYINT instead of a BLOB in case of rt_value.
  3. The revtag table contains some additional tags that will not be added in this table. For example: fuzzy,tp:transvertag.

Since the differences are not major we could look at reusing the revtag table, or migrate records from the revtag table here.

4.3/ Track only latest revision

Instead of tracking the tags for every revision, we could only track the status of the latest revision of the translatable bundle. For each translatable page, this would add only 2 records or a certain fixed number of records if we decide to mergerevtag and the new table.

This should be enough for our use case to list pages on Special:PageTranslation.

5/ Track current status in revtag table

This approach would track the current status of the latest revision of every translatable page. This could be a value that would be stored in the revtag table itself.

It could be done by introducing a new tag in the revtag table: tp:status that could have the following values (present in the rt_value column):

  1. proposed
  2. active
  3. outdated
  4. broken

Every translatable page should have a single record in the revtag table with rt_revision as`tp:status`. The revision ID and the status value would have to be updated every time the page is modified.

The query:

EXPLAIN SELECT
    page_id,
    page_namespace,
    page_title,
    page_latest    
FROM
    `page`,
    `revtag`
WHERE
    (page_id = rt_page)
        AND rt_revision = 'tp:status'
        AND rt_value = 'outdated'
ORDER BY page_namespace, page_title;

Output:

image.png (121×1 px, 118 KB)

Benefits

  1. Moderate amount of change. No new tables are needed. We would only need to add code to modify the tp:status when the user performs different actions.
  2. Pagination can be added since there is a single record in the revtag table for the latest revision of the page
  3. Query to fetch different types of translatable pages would be similar.
  4. Similar approach can be used for message bundles.

Limitations

  1. We would only be storing the latest status of the page. Currently this is not a drawback for the translatable page listing but might not be flexible for other requirements.
  2. The pagination and order by would still have to use the page table, thereby requiring the use of OFFSET. As the page table grows this might make things slower.
  3. Will require a maintenance script to populate the value for existing translatable pages.

Variation

5.1/ Use BIT flags for storing the flags

We could also store the tags: tp:mark and tp:tag using BITs and decisions on the type of translatable page could then be made on the basis of that.

The rt_value column is not a BIT column, so based on our requirement we would have to calculate the BIT field value in PHP and then query based on that.

6/ Table to store current status for translatable bundles

This approach explores the addition of a new table to help with making sorting and pagination fast using a sort column.

CREATE TABLE `translate_translatable_bundles` (
    `tb_sortkey` VARBINARY(260) NOT NULL,
    `tb_page_id` INT(11) UNSIGNED NOT NULL,
    `tb_status` VARBINARY(60) NOT NULL,
    PRIMARY KEY (`tb_sortkey` , `tb_page_id`)
);

This table will contain the status for the latest revision of a translatable bundle.

Here’s a short description of each column:

  1. tb_sortkey : A sort key created using Collation::getSortKey() with the page namespace and the page id.
  2. tb_page_id : The page Id
  3. tb_status : Different possible status for translatable bundles:
    1. tp:active
    2. tp:proposed
    3. tp:outdated
    4. tp:broken

There will be other statuses for message bundles.

Here’s the output of the EXPLAIN statement:

image.png (167×1 px, 76 KB)

The query will allow us to select the page_id(s) we need first from the translate_translatable_bundles table , and then do a JOIN with the page table to fetch further information about each page such as the title, namespace, and the latest version of the page.

Here’s a query to fetch outdated translatable pages:

EXPLAIN SELECT
    page_id,
    page_namespace,
    page_title,
    page_latest    
FROM
    `translate_translatable_bundles`,
    `page`
WHERE
    (page_id = tb_page_id)
        AND tb_status = 'tp:outdated'
ORDER BY tb_sortkey;

The EXPLAIN output:

image.png (115×1 px, 108 KB)

Benefits

  1. Easy and efficient pagination and sorting
  2. Since we are only storing one record for each translatable page, the solution is quite scalable.
  3. Query to fetch different types of translatable pages would be similar.
  4. Similar approach can be used for message bundle listing.
  5. It should be possible to store the various metadata required for listing translatable pages as columns in this table.

Limitations

  1. We would only be storing the latest status of the page. Currently this is not a drawback for the translatable page listing but might not be flexible for other requirements.
  2. Moderately large change.
  3. Will require a maintenance script to populate the value for existing translatable pages.

When I was just reading the problem statement, it felt to me bitwise computation is the best way to go. you can do that in the query as well, don't need to compute it server-side (https://www.w3resource.com/mysql/bit-functions/bitwise-and.php). I don't know about its performance but it should be fine at this scale.

I have looked at the data in revtag, what stands out to me is that it basically has two types of tags, tags that take value (like tp:transver) and the ones that don't (like tp:tag). This makes everything complicated. My suggestion is to migrate tp:tag to a new tagname with value that is bitwise of values you want to keep. (in some places tp:mark takes value as well, is that intentional? I don't know how to accommodate that). That way tags in revtag will become consistent and the code could probably simplified (we can probably make rt_value a non-nullable column later).

A bit counter-intuitive, I don't recommend normalizing rt_type, at least not now. the strings are short, the overhead a new table is a bit much and total number of rows in meta is 3M, not much. I would have used numeric constants in the code if I were to design it but now changing it is not worth the cost.

If I'm missing something super obvious, I'm sorry and let's hop on call and talk about it in more depth.

My suggestion is to migrate tp:tag to a new tagname with value that is bitwise of values you want to keep.

I don't quite understand how this would work. We need to keep track of two revisions, of which either could be missing and either could be more recent than the other one. I guess we could store these revisions in the rt_value, but that wouldn't be bitwise nor intuitive to me. It seems simpler to store the calculated page status (ready, active, outdated, broken) any time the page is updated, and only fetch the specific revisions from the existing tags for one page when needed.

I missed the fact that the tags can be on different revision ids, my bad. Yeah, let's not store rev_id in rt_value. One part of problem is that you try to get a datum about the page out of data from revisions. The most obvious solution is to calculate and store the value per page and not per revision (and update it in every edit), something along the lines of page_last_touched field. Do you need to keep the data for previous revisions? It can be simply something like pagetag instead, you can also take advantage of page_props table (you can store value there):

select * from page_props;
+---------+---------------------------+----------+------------+
| pp_page | pp_propname               | pp_value | pp_sortkey |
+---------+---------------------------+----------+------------+
|       7 | tp:status | 1        |          0 |
|       8 | tp:status | 2        |          0 |

Would that be useful?

Seems indeed that we are converging on a solution where we store tp:status somewhere. In order to understand whether we need a new table or not: Is there way to get efficient paging when filtering the page table with a join on another table? Also when pages are sorted (either by namespace + title or with a unicode collation)?

The best way to sort them would be based on page_id, because page_props still holds the page_id so it can sort it in itself.

Also noting with exception of some really large wikis with 100M+ pages (commons, wikidata), you don't really have trouble in most cases. Just make sure you set the pp_propname on small set of pages (only the ones that have translation tags?) and once that's there, MariaDB can make a small temp table and handle things from there.

The best way to sort them would be based on page_id, because page_props still holds the page_id so it can sort it in itself.

We display translatable pages to user's sorted by namespace + title, hence would like to sort on the basis of that.

Okay, we have a couple of options to tackle this:

  • Revisit the ordering, How important is that to order them alphabetically? If It's about Special:Translate, I personally prefer page id rather than alphabetical order. It's more meaningful. Unless you mean something else.
    • This is your decision, not mine :D take it as just a suggestion.
  • If it's used for "tagging", you can simply query page based on whatever criteria you have, then feed the page ids to the query of page_props table (or any other page you want to use for tagging). Or simply turn it into a subquery. And for the case of "filtering", you can order them based on page id instead?
  • Let it be, if you have small enough set of rows, database would be able to do it on the fly using temp tables.

Its for https://meta.wikimedia.org/wiki/Special:PageTranslation. Changing the order to use page id might be something that might require a wider discussion with the community.

If it's used for "tagging", you can simply query page based on whatever criteria you have, then feed the page ids to the query of page_props table (or any other page you want to use for tagging). Or simply turn it into a subquery. And for the case of "filtering", you can order them based on page id instead?

For a large wiki, do you think its safe (performance-wise) to first query all the records on the page table, order them by namespace + title, and then join the page_props table to filter the ones that we are interested in?

Its for https://meta.wikimedia.org/wiki/Special:PageTranslation. Changing the order to use page id might be something that might require a wider discussion with the community.

Yeah makes sense but keep it in mind that most of paging of pages in mw is based on page id (e.g. Whatlinkshere, etc.) and more importantly it's more persistent, a page can move and break the assumption.

If it's used for "tagging", you can simply query page based on whatever criteria you have, then feed the page ids to the query of page_props table (or any other page you want to use for tagging). Or simply turn it into a subquery. And for the case of "filtering", you can order them based on page id instead?

For a large wiki, do you think its safe (performance-wise) to first query all the records on the page table, order them by namespace + title, and then join the page_props table to filter the ones that we are interested in?

Not all records of page table, that'd be way too taxing. But for example, if number of translatable pages are small in comparison to the whole size of the wiki, mariadb would do the join backwards, first query all rows of page_props table that have that prop (e.g. all translateable pages) and then order the result based on looking all of that it sorts and limits results based on page values. It's not very nice but for a rather small wiki, it'd work.

Not all records of page table, that'd be way too taxing. But for example, if number of translatable pages are small in comparison to the whole size of the wiki, mariadb would do the join backwards, first query all rows of page_props table that have that prop (e.g. all translateable pages) and then order the result based on looking all of that it sorts and limits results based on page values. It's not very nice but for a rather small wiki, it'd work.

OK I think I understand,

  1. Query page_props to fetch page_id for all records that we are interested in
  2. Run an WHERE IN query with those page_id on the page table and perform any filtering, ordering and limiting.

The largest Wikimedia site right now is meta that has around 10,000 pages. I'd say that we might grow to around 40,000 records in a few years. We'd have to query through that many records on the pages table.

Yeah but it's better to be done in mysql/mariadb instead of the appserver, to avoid sending 10k rows over the wire.

Also noting with exception of some really large wikis with 100M+ pages (commons, wikidata)

Please note that these two wikis do have the Translate extension installed, although they’re not the largest users of it (Wikidata has 700-800 translatable pages, on Commons ~1800 pages use the extension—although there are a lot of pages that still use legacy translation methods and should eventually be migrated—, while Meta and mediawiki.org have both well over 5000 translatable pages), so if MariaDB joins on these starting from page_props, they should probably be okay.

I think there might be an easier way, store page ns and page title in pp_sortkey. It would make the queries blazing fast.

I think there might be an easier way, store page ns and page title in pp_sortkey. It would make the queries blazing fast.

I think that would be ideal, but that does not seem to be possible. The sortkey value is hardcoded: https://gerrit.wikimedia.org/g/mediawiki/core/+/0b767de7514eda72996a092982083650ba7fd928/includes/deferred/LinksUpdate/PagePropsTable.php#127

How hard would it be to patch to add support for setting sortkey value optionally? I don't think it would be too complicated.

It would involve changes to public methods of ParserOutput which are part of stable interface policy. That is somewhat out of our comfort zone.

Also, I'm not sure how easy it will be to "floatify" the sortkey values. It seems the sortkey field wasn't designed with this use case in mind.

The code contains

	 * @note In the future, we may allow the sortkey to be specified explicitly
	 *       in ParserOutput::setProperty.

so changing public methods is probably something @daniel had in mind when committing b1d6e78f7444. Also, I don’t know the ParserOutput code, but it should definitely continue to support not providing an explicit sort key, so I hope no backward-incompatible changes need to be made.

It would involve changes to public methods of ParserOutput which are part of stable interface policy. That is somewhat out of our comfort zone.

Also, I'm not sure how easy it will be to "floatify" the sortkey values. It seems the sortkey field wasn't designed with this use case in mind.

Yeah makes sense. I guess for now we are back to querying page_props and then sorting it in page instead. For mid- to long-term we probably should make the sortkey field more usable. It's barely used :/

It looks to me that we have converged to a possible solution: use the page_props table to store the status of the translatable page with an option to further optimize using the sortkey field.

We will file the following tasks and work on them incrementally:

  • Create translatable page status page property and set it when translatable page is updated
  • Create a script to backfill the property for pages not recently updated
    • Ensure the script is run automatically on 3rd party installs
  • Run backfilling script in production manually
  • Update Special:PageTranslation to query pages using the page_props table

Then back to the parent task for implementing pagination and other UI changes.

Potential issues with this approach:

  • This table is essentially managed the parser and updated after parsing via the ParserOutput object. Some actions like marking the page for translation may not generate a new revision (but is the page still parsed?), so we need to check that we can correctly update the page property value in such cases too.

This table is essentially managed the parser and updated after parsing via the ParserOutput object. Some actions like marking the page for translation may not generate a new revision (but is the page still parsed?), so we need to check that we can correctly update the page property value in such cases too.

I did not find a way to do this. I tried updating the page property using a manually created Parser object but that did not work either.

$parser = MediaWikiServices::getInstance()->getParser();
$parserOptions = ParserOptions::newFromAnon();
// Dummy content
$parserOutput = $parser->parse( 'test', $pageTitle, $parserOptions );

$currentPropValue = $parserOutput->getPageProperty( 'tp_status' );

I spoke to Scott Ananian from the Content Transform Team and they said the following:

Page Props are tied to the ParserOutput which is the result of parsing the page, and so will be updated only/exactly when the page is parsed.
You might be able to hack things by doing a null-edit or purge of the page to update the page props, but then it would get overwritten the next time the page is edited & reparsed.

Another potential issue I see is that when tagging a page during parsing we would also have to identify if the content has actually changed before updating the status of the page.

The use of Multi Content Revision was suggested. Daniel Kinzler said that if it's user-curated editable data (that is, changing it is an "edit" and we want to be able to see the history of it), then MCR would be a good choice. In that case, you could also easily expose the current values via page_props.

It might be easier to simply create a separate table.

Something to consider, not saying we should do it. Factor out page_props table writes into a dedicated stable class (in core) that parser would call that. Then do your updates in the extension by directly calling that class instead of going through parser. page props update should not couple to parser output. They are different concepts interacting with different parts of mediawiki domains (parsing/output vs page). That way you can also set sortkey the way you like (the challenge of floatifying text is going to stay, so not 100% sure on that). For read, you can query the table (via the class in core) instead of relying on PO and its cached value.

If that's too much work, creating a table is fine with me too but please make sure it's documented that this is tech debt and should be eventually paid back.

Something to consider, not saying we should do it. Factor out page_props table writes into a dedicated stable class (in core) that parser would call that. Then do your updates in the extension by directly calling that class instead of going through parser. page props update should not couple to parser output. They are different concepts interacting with different parts of mediawiki domains (parsing/output vs page). That way you can also set sortkey the way you like (the challenge of floatifying text is going to stay, so not 100% sure on that). For read, you can query the table (via the class in core) instead of relying on PO and its cached value.

In addition to that, we'd have to backport those changes to older versions of MediaWiki.

If that's too much work, creating a table is fine with me too but please make sure it's documented that this is tech debt and should be eventually paid back.

We discussed and this is what we are planning to do for now. I'll either create a task, or document the tech debt in the class documentation itself.

Marking this as resolved since we've chosen Option 6) from the comment here and started working on implementing the solution.