
page_restrictions field incomplete in current and historical dumps
Open, Stalled, Low, Public

Description

The page_restrictions field of mediawiki_wikitext_current apparently does not contain all restrictions. For example, Template:Track gauge has been protected since 2013, and api.php correctly says

"protection": [
    {
        "type": "edit",
        "level": "templateeditor",
        "expiry": "infinity"
    },
    {
        "type": "move",
        "level": "templateeditor",
        "expiry": "infinity"
    }
],

but in the Data Lake that protection does not seem to exist:

hive (default)> select page_title, page_restrictions from wmf.mediawiki_wikitext_current where snapshot = '2020-02' and page_namespace = 10 and page_title = 'Template:Track gauge' and wiki_db = 'enwiki';
page_title	page_restrictions
Template:Track gauge	[]
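
The mismatch above can be checked mechanically. The following is a minimal sketch, assuming the "protection" list has already been fetched from api.php and the page_restrictions array has already been read from the Hive table. The function name missing_restrictions and the "type=level" string form for dump entries are hypothetical, chosen only for illustration; the task does not show what a non-empty array looks like.

```python
def missing_restrictions(api_protection, dump_restrictions):
    """Return API protection entries that have no counterpart in the dump field.

    api_protection: list of dicts as found under "protection" in the api.php reply.
    dump_restrictions: list of strings from the dump's page_restrictions field;
    the "type=level" form is an assumption made for this sketch.
    """
    present = set(dump_restrictions)
    return [
        entry for entry in api_protection
        if f"{entry['type']}={entry['level']}" not in present
    ]


# Values copied from the task description.
api_protection = [
    {"type": "edit", "level": "templateeditor", "expiry": "infinity"},
    {"type": "move", "level": "templateeditor", "expiry": "infinity"},
]
dump_restrictions = []  # what the Hive query returned

missing = missing_restrictions(api_protection, dump_restrictions)
for entry in missing:
    print(entry["type"], entry["level"])
```

With the values from this task, both the edit and the move protection are reported as missing.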

Event Timeline

Tgr created this task. Apr 29 2020, 1:38 PM
Restricted Application added subscribers: Liuxinyu970226, Aklapper. Apr 29 2020, 1:39 PM

I checked the raw data, and 'page_restrictions' is empty for the given example (as it is in the historical dumps). @ArielGlenn, I'm moving this to your realm!

JAllemandou renamed this task from "page_restrictions field incomplete in Data Lake mediawiki_wikitext_current table" to "page_restrictions field incomplete in current and historical dumps". Apr 29 2020, 2:48 PM
JAllemandou added a project: Dumps-Generation.
JAllemandou moved this task from Incoming to Radar on the Analytics board.
wikiadmin@10.64.32.76(enwiki)> select * from page_restrictions where pr_page = 13856248;
+----------+---------+----------------+------------+---------+-----------+--------+
| pr_page  | pr_type | pr_level       | pr_cascade | pr_user | pr_expiry | pr_id  |
+----------+---------+----------------+------------+---------+-----------+--------+
| 13856248 | edit    | templateeditor |          0 |    NULL | infinity  | 503137 |
| 13856248 | move    | templateeditor |          0 |    NULL | infinity  | 503138 |
+----------+---------+----------------+------------+---------+-----------+--------+
2 rows in set (0.01 sec)

wikiadmin@10.64.32.76(enwiki)> select * from page where page_id = 13856248;
+----------+----------------+-------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
| page_id  | page_namespace | page_title  | page_restrictions | page_is_redirect | page_is_new | page_random    | page_touched   | page_links_updated | page_latest | page_len | page_content_model | page_lang |
+----------+----------------+-------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
| 13856248 |             10 | Track_gauge |                   |                0 |           0 | 0.836163647542 | 20200423081623 | 20200423082101     |   747834753 |      313 | wikitext           | NULL      |
+----------+----------------+-------------+-------------------+------------------+-------------+----------------+----------------+--------------------+-------------+----------+--------------------+-----------+
1 row in set (0.00 sec)

Now to see why they don't show up in the dumps.

I've been looking at the enwiki dump files for part 18 (containing this page), and there are no entries for page restrictions in either the 2019-02 dumps or the 2018-02 dumps. So this bug seems to have been around for a while. I'm checking older dumps, but it will take some time to download and decompress them.

This has been broken for a long time. I need to check a little more of the history, but I can verify that exports use the contents of the obsolete page_restrictions field from the page table instead of the corresponding entries in the page_restrictions table. Proof of this: the enwiki page 'Arthur Schopenhauer', with page id 700, has the following row entry:

wikiadmin@10.64.32.76(enwiki)> select * from page where page_id = 700;
+---------+----------------+---------------------+-------------------+------------------+-------------+--------------------+----------------+--------------------+-------------+----------+--------------------+-----------+
| page_id | page_namespace | page_title          | page_restrictions | page_is_redirect | page_is_new | page_random        | page_touched   | page_links_updated | page_latest | page_len | page_content_model | page_lang |
+---------+----------------+---------------------+-------------------+------------------+-------------+--------------------+----------------+--------------------+-------------+----------+--------------------+-----------+
|     700 |              0 | Arthur_Schopenhauer | move=:edit=       |                0 |           0 | 0.8153719695428131 | 20200430073836 | 20200426170144     |   952937594 |   155341 | wikitext           | NULL      |
+---------+----------------+---------------------+-------------------+------------------+-------------+--------------------+----------------+--------------------+-------------+----------+--------------------+-----------+

and no entries in the page_restrictions table. But the contents of the XML element in the pages-articles bz2 file are "move=:edit=".
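
For anyone who needs to interpret the legacy value that ends up in the dumps, here is a minimal sketch of a parser. It assumes the field holds colon-separated "action=level" pairs, which is inferred only from the "move=:edit=" example above; other historical forms of the field, if any, are not handled. The function name parse_legacy_restrictions is hypothetical.

```python
def parse_legacy_restrictions(field):
    """Parse a legacy page.page_restrictions value into {action: level}.

    Assumes colon-separated "action=level" pairs, as in the "move=:edit="
    value above. An empty level is kept as an empty string.
    """
    result = {}
    for part in field.split(":"):
        if not part:
            continue  # skip empty segments, e.g. when the field itself is empty
        action, _, level = part.partition("=")
        result[action] = level
    return result


print(parse_legacy_restrictions("move=:edit="))
```

For the row above this yields a "move" and an "edit" key, each with an empty level.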

I believe it has been this way since 2008, and specifically since this commit: https://phabricator.wikimedia.org/rMW7c130df66fa7c38927109174ee4e99c0906177a9

The page_restrictions table is already dumped separately and is available for download. I would suggest that we remove references to the page_restrictions field from WikiExporter.php and XmlDumpWriter.php, replace that information with nothing, and never output that information in the metadata or page content dumps.

Change 593668 had a related patch set uploaded (by ArielGlenn; owner: ArielGlenn):
[mediawiki/core@master] Do not use or output the contents of the page_restrictions field in dumps

https://gerrit.wikimedia.org/r/593668

ArielGlenn moved this task from Backlog to Active on the Dumps-Generation board. May 1 2020, 10:41 AM
daniel added a subscriber: daniel.

@ArielGlenn since you are currently part of the Clinic Duty team, I'm moving this to "doing", rather than "external code review" :)

daniel triaged this task as Low priority. May 1 2020, 12:08 PM

Tagging as low: nobody noticed for a long time, so I assume the impact of this problem is low. That doesn't mean we shouldn't fix it, but we could consider just waiting for T218446.

Tgr added a comment. May 1 2020, 1:13 PM

> The page_restrictions table is already dumped separately and is available for download. I would suggest that we remove references to the page_restrictions field from WikiExporter.php and XmlDumpWriter.php, replace that information with nothing, and never output that information in the metadata or page content dumps.

That table does not contain old restrictions (those put in place before the table was created), though. Of course, T218446 ("Remove use of legacy page.page_restrictions field") will eventually fix that.
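
Until that happens, a consumer who wants the full picture would have to combine both sources. The following is a rough sketch of such a merge, not a description of what MediaWiki itself does. It assumes the legacy field has already been parsed into an {action: level} mapping, that an empty legacy level means "no restriction", and that rows from the page_restrictions table override legacy values; none of these assumptions is confirmed in this task. The function name combine_restrictions and the "source" key are hypothetical.

```python
def combine_restrictions(legacy, table_rows):
    """Merge legacy field values with page_restrictions table rows.

    legacy: {action: level} already parsed from page.page_restrictions.
    table_rows: list of dicts with pr_type, pr_level and pr_expiry keys.
    Table rows override legacy values (an assumption of this sketch).
    """
    combined = {}
    for action, level in legacy.items():
        if level:  # assumption: an empty level means "no restriction"
            combined[action] = {"level": level, "expiry": None, "source": "page"}
    for row in table_rows:
        combined[row["pr_type"]] = {
            "level": row["pr_level"],
            "expiry": row["pr_expiry"],
            "source": "page_restrictions",
        }
    return combined


# Rows for page 13856248 as shown earlier in this task.
rows = [
    {"pr_type": "edit", "pr_level": "templateeditor", "pr_expiry": "infinity"},
    {"pr_type": "move", "pr_level": "templateeditor", "pr_expiry": "infinity"},
]
print(combine_restrictions({}, rows))
print(combine_restrictions({"move": "", "edit": ""}, []))
```

Under these assumptions, Template:Track gauge gets both of its table-based protections, while the page 700 example, which has only empty legacy levels and no table rows, ends up with no restrictions.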

Also, the page_restrictions table is not included in the Analytics data lake.

See also T35334 for another bug on removing that field from core.

I think we could stall the issue for now and ask the Data Lake folks to process the contents of the page_restrictions table, which will eventually have all the data.

I created T251749 to add the page_restrictions table to the tables we sqoop.

Making the command decision to block this task on T218446 / T35334.

Naike changed the task status from Open to Stalled. May 22 2020, 7:07 AM
Aklapper edited projects, added Analytics-Radar; removed Analytics. Jun 10 2020, 6:33 AM