
`mwxml2sql` fails to process `enwikinews-20140605-pages-meta-current.xml.bz2` when it encounters `<ns>90</ns>`
Open, Low, Public

Description

0) Summary

I tried to build a mirror of enwikinews using mwxml2sql. This failed whenever mwxml2sql encountered a page from namespace 90 (Thread).

I tried again using maintenance/importDump.php. This worked better. However, it appears that importDump.php ignores namespace 90, because no such pages are later found in the enwikinews.page database table.

1) Dataset

enwikinews-20140605-pages-meta-current.xml.bz2

2) Error messages

WHINE: (155323) no end page tag

When I divide the XML data dump into smaller files of, say, 1000 pages each, I find many more such errors.
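The report does not say how the dump was divided. A minimal sketch of one way to do it is below, assuming (as in Wikimedia dumps) that `<page>` and `</page>` each sit on a line of their own. The names `iter_pages` and `chunk_pages` are hypothetical and do not belong to mwxml2sql or MediaWiki.

```python
from typing import Iterable, Iterator, List


def iter_pages(lines: Iterable[str]) -> Iterator[str]:
    """Yield each <page>...</page> block as a single string.

    Assumption: the opening and closing page tags are alone on their lines.
    Lines outside any page (header, siteinfo, footer) are ignored.
    """
    buf: List[str] = []
    in_page = False
    for line in lines:
        stripped = line.strip()
        if stripped == "<page>":
            in_page = True
            buf = []
        if in_page:
            buf.append(line)
            if stripped == "</page>":
                in_page = False
                yield "".join(buf)


def chunk_pages(lines: Iterable[str], pages_per_chunk: int = 1000) -> Iterator[List[str]]:
    """Group page blocks into lists of at most pages_per_chunk pages."""
    chunk: List[str] = []
    for page in iter_pages(lines):
        chunk.append(page)
        if len(chunk) == pages_per_chunk:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter, chunk
        yield chunk
```

To run this on the compressed dump, the lines can come from `bz2.open(path, "rt", encoding="utf-8")`. Each chunk would still need the `<mediawiki>` header, the `<siteinfo>` block, and the closing `</mediawiki>` tag wrapped around it before a tool that expects a full dump could read it.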

3) Pages that cause errors

<page>
<title>Thread:Comments:Chip and PIN 'not fit for purpose', says Cambridge researcher/Those in positions of power shirking responsibility and lying?</title>
<ns>90</ns>
<id>155323</id>
<DiscussionThreading>
<ThreadSubject>Those in positions of power shirking responsibility and lying?</ThreadSubject>
<ThreadPage>Comments:Chip and PIN 'not fit for purpose', says Cambridge researcher</ThreadPage>
<ThreadID>92</ThreadID>
<ThreadAuthor>70.31.58.181</ThreadAuthor>
<ThreadEditStatus>has-reply</ThreadEditStatus>
<ThreadType>normal</ThreadType>
<ThreadSignature>[[Special:Contributions/70.31.58.181|70.31.58.181]] ([[User talk:70.31.58.181|talk]])</ThreadSignature>
</DiscussionThreading>
<revision>
<id>958267</id>
<timestamp>2010-02-15T04:04:56Z</timestamp>
<contributor>
<ip>70.31.58.181</ip>
</contributor>
<comment>New thread: Those in positions of power shirking responsibility and lying?</comment>
<text xml:space="preserve">&quot;All the banks are lying. They are maliciously and wilfully deceiving the customer [...] The system is not fit for purpose.&quot; I'm so surprised that I've apparently transcended a serious remark and instead am being sarcastic. Incidentally, only part of that sentence was sarcastic.</text>
<sha1>rjidk12i4hv2mxia3a8qq620rlc7lok</sha1>
<model>wikitext</model>
<format>text/x-wiki</format>
</revision>
</page>

4) Namespace of pages that cause errors

<namespace key="90" case="first-letter">Thread</namespace>
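To see how many pages of each namespace a dump contains, and so how many to expect in the `page` table after an import, the `<ns>` lines can be tallied. This is only a sketch: `count_pages_by_namespace` is a hypothetical helper, and it assumes each page carries exactly one `<ns>` element on a line of its own.

```python
import re
from collections import Counter
from typing import Iterable

# Matches a line that consists solely of an <ns> element, e.g. "    <ns>90</ns>".
NS_LINE = re.compile(r"\s*<ns>(-?\d+)</ns>\s*")


def count_pages_by_namespace(lines: Iterable[str]) -> Counter:
    """Return a Counter mapping namespace id to the number of <ns> lines seen."""
    counts: Counter = Counter()
    for line in lines:
        match = NS_LINE.fullmatch(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

A non-zero count for key 90 in the dump, set against an empty result for `page_namespace=90` in the database, would confirm that the Thread pages were dropped during import.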

5) Use of importDump.php

Apparently importDump.php ignores namespace 90.

mysql> select page_id,page_namespace,page_title from enwikinews.page where page_id=155323;
Empty set (0.00 sec)
mysql> select page_id,page_namespace,page_title from enwikinews.page where page_namespace=90;
Empty set (0.00 sec)
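Until the tools handle namespace 90, one possible workaround is to remove those pages from the dump before conversion. The sketch below is illustrative only and is not the implementation of any of the related Gerrit changes. `drop_namespaces` is a hypothetical name, and the code assumes `<page>`, `</page>`, and `<ns>` each occupy a line of their own.

```python
import re
from typing import Iterable, Iterator, List, Set

NS_LINE = re.compile(r"<ns>(-?\d+)</ns>")


def drop_namespaces(lines: Iterable[str], skip: Set[int]) -> Iterator[str]:
    """Copy a dump line by line, omitting pages whose <ns> is in skip."""
    buf: List[str] = []
    in_page = False
    skip_page = False
    for line in lines:
        stripped = line.strip()
        if stripped == "<page>":
            in_page, skip_page, buf = True, False, []
        if not in_page:
            yield line  # header, siteinfo, footer pass through unchanged
            continue
        buf.append(line)
        match = NS_LINE.fullmatch(stripped)
        if match and int(match.group(1)) in skip:
            skip_page = True
        if stripped == "</page>":
            in_page = False
            if not skip_page:
                yield from buf  # emit the buffered page only if it is kept
```

Calling `drop_namespaces(lines, {90})` would strip the Thread pages. Buffering one page at a time keeps memory use bounded even for large dumps.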

Sincerely Yours,
Kent


Version: unspecified
Severity: major
OS: Linux
Platform: PC

Details

Reference
bz66661
Related Gerrit Patches:
operations/dumps (ariel): Extend maximum allowed MediaWiki version to 1.26
operations/dumps/import-tools (master): Add option to skip specified namespaces
operations/dumps (ariel): Skip LiquidThread namespaces

Event Timeline

bzimport raised the priority of this task to Needs Triage. (Nov 22 2014, 3:13 AM)
bzimport set Reference to bz66661.
Aklapper triaged this task as Low priority. (Mar 23 2015, 5:43 PM)
Aklapper added a subscriber: Aklapper.
wpmirrordev renamed this task from `mwxml2sql' fails to process `enwikinews-20140605-pages-meta-current.xml.bz2' when it encounters `<ns>90</ns>' to `mwxml2sql` fails to process `enwikinews-20140605-pages-meta-current.xml.bz2` when it encounters `<ns>90</ns>`. (Jul 8 2015, 11:35 PM)
wpmirrordev updated the task description.
wpmirrordev set Security to None.
jayvdb added a subscriber: jayvdb. (Oct 3 2015, 2:28 AM)

Change 113103 had a related patch set uploaded (by John Vandenberg):
Support MediaWiki version 1.23

https://gerrit.wikimedia.org/r/113103

Change 243365 had a related patch set uploaded (by John Vandenberg):
Skip LiquidThread namespaces

https://gerrit.wikimedia.org/r/243365

Change 350255 had a related patch set uploaded (by ArielGlenn; owner: Wpmirrordev):
[operations/dumps/import-tools@master] Add option to skip specified namespaces

https://gerrit.wikimedia.org/r/350255

Change 350255 merged by ArielGlenn:
[operations/dumps/import-tools@master] Add option to skip specified namespaces

https://gerrit.wikimedia.org/r/350255

Change 171976 had a related patch set uploaded (by Nemo bis; owner: Wpmirrordev):
[operations/dumps@ariel] Extend maximum allowed MediaWiki version to 1.26

https://gerrit.wikimedia.org/r/171976

I believe the compatibility with 1.26 mentioned above is covered by this changeset: https://gerrit.wikimedia.org/r/#/c/347625/