
Importing XML dumps should validate that the target wiki has the same namespaces as the pages being imported
Open, LowestPublic

Description

When one imports an XML dump from another wiki and it contains pages in custom namespaces that don't exist on the target wiki, those pages end up imported into the main namespace (with the original namespace prefix in the title, but without actually being in that namespace, since it doesn't exist). That can be confusing, and if many pages are imported with this problem, fixing them afterwards is a pain.

It would be good to add some validation when importing the dump, based on the list of namespaces already present in the header of the dump: if a page in the dump is in a namespace not present on the target wiki, abort the import.

Things we should consider:

  • Adding a checkbox in Special:Import, and an option in importDump.php, to ignore namespace validation: when checked, the import should proceed with such pages and generate a warning at the end. Otherwise, abort the import when a page in a nonexistent namespace is found.
  • Don't validate the list of namespaces in the header of the dump; validate only as each page is imported. For example, the original wiki may have custom namespaces not present on the target wiki while the dump only contains pages in known namespaces (e.g. namespace 0). In this case the import shouldn't abort "early".
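The per-page check described above could be prototyped outside MediaWiki as a pre-flight script. The following is only a rough sketch, not the importer's actual behaviour: `validate_pages`, `local_name` and the `ignore_missing` flag are hypothetical names, the set of target namespace IDs is assumed to be obtained elsewhere, and each `<page>` is assumed to carry an `<ns>` element as in recent export schema versions.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{schema-uri}' prefix so any export schema version matches."""
    return tag.rsplit("}", 1)[-1]


def validate_pages(stream, target_ns_ids, ignore_missing=False):
    """Check each <page> against the target wiki's namespace IDs.

    Only pages are inspected; the namespace list in the dump header is
    deliberately not validated. Returns (accepted_titles, warnings).
    Raises ValueError on the first offending page unless ignore_missing
    is set, in which case the page is accepted and a warning is collected.
    """
    accepted, warnings = [], []
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, ns = None, None
        for child in elem:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "ns":
                ns = int(child.text)
        if ns in target_ns_ids:
            accepted.append(title)
        elif ignore_missing:
            # A page without <ns> has ns=None and also lands here.
            warnings.append(
                f"{title!r}: namespace {ns} does not exist on the target wiki"
            )
            accepted.append(title)
        else:
            raise ValueError(f"aborting: {title!r} is in missing namespace {ns}")
        elem.clear()  # release the page's children once it has been checked
    return accepted, warnings


# Tiny hand-written dump used for illustration only.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <siteinfo>
    <namespaces>
      <namespace key="0" case="first-letter" />
      <namespace key="3000" case="first-letter">DOO</namespace>
    </namespaces>
  </siteinfo>
  <page><title>Main Page</title><ns>0</ns></page>
  <page><title>DOO:title</title><ns>3000</ns></page>
</mediawiki>"""
```

Calling `validate_pages(io.BytesIO(SAMPLE.encode("utf-8")), {0})` aborts on "DOO:title", while passing `ignore_missing=True` accepts both pages and returns one warning to print at the end.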

A nice addition would be to allow mapping namespaces from the dump to other namespaces on the target wiki (bug 41969)


Version: 1.23.0
Severity: enhancement
See Also:
https://bugzilla.wikimedia.org/show_bug.cgi?id=41969

Details

Reference
bz62111

Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 3:07 AM
bzimport set Reference to bz62111.
bzimport added a subscriber: Unknown Object (MLST).
TTO added a subscriber: TTO.

Need to check what current behaviour is here. Probably not the same as in March 2014.

Restricted Application added a subscriber: Aklapper. Sep 23 2015, 1:04 PM

I just tried it again on MediaWiki 1.25 (trying to import an API: page from mediawiki.org) and the page was imported without a warning.

Ciencia_Al_Poder set Security to None.
Ciencia_Al_Poder removed a subscriber: wikibugs-l-list.
TTO lowered the priority of this task from Medium to Lowest. Sep 24 2015, 10:15 AM

Seems low priority. If you're importing a few pages, you can force them into a specific namespace using the UI on Special:Import or appropriate API or maintenance script parameters. If importing a dump, you would be better off either massaging the dump itself, creating the custom namespaces on your wiki (at least temporarily), or implementing T43969.

jayvdb added a subscriber: jayvdb. Aug 25 2016, 1:14 AM

I don't think T43969 is enough. Sometimes there are many namespaces missing on the target wiki, and they should all be discarded or dumped into one 'junk' namespace.

Creating the 'original' namespaces temporarily means the logs are problematic when the 'original' namespace is deleted.

What if the user were able to select a 'junk' namespace, where any source pages without a target namespace would be put?

The 'junk' namespace could default to 0 to reproduce the existing behaviour.
On wikis with an "Interwiki" namespace, that would be an appropriate default.

The user could also choose not to import these pages, which under the covers could be a mapping to namespace -1.

This approach would allow all pages to be easily imported without pre-parsing the XML to create a namespace mapping.
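As a rough illustration of the 'junk' namespace idea, here is a minimal sketch of the fallback rule. The names `resolve_namespace`, `plan_import` and `SKIP` are hypothetical, pages are simplified to (title, namespace ID) pairs, and nothing here reflects how MediaWiki's importer is actually structured.

```python
# Sentinel taken from the suggestion above: mapping to -1 means "do not import".
SKIP = -1


def resolve_namespace(source_ns, target_ns_ids, junk_ns=0):
    """Return the namespace a page should land in on the target wiki.

    Pages whose namespace exists on the target keep it; everything else
    falls back to junk_ns. The default of 0 reproduces the existing
    behaviour of landing in the main namespace.
    """
    if source_ns in target_ns_ids:
        return source_ns
    return junk_ns


def plan_import(pages, target_ns_ids, junk_ns=0):
    """Map (title, source_ns) pairs to (title, destination_ns) pairs.

    Pages that resolve to SKIP are dropped from the plan.
    """
    plan = []
    for title, source_ns in pages:
        dest = resolve_namespace(source_ns, target_ns_ids, junk_ns)
        if dest == SKIP:
            continue
        plan.append((title, dest))
    return plan
```

With pages `[("Main Page", 0), ("DOO:title", 3000)]` and a target that only has namespace 0, the default plan keeps both pages in namespace 0, while `junk_ns=SKIP` leaves only "Main Page". No namespace mapping has to be built from the XML beforehand, which is the point of the proposal.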

Krauss added a subscriber: Krauss. Feb 16 2017, 6:55 PM

Hi, I have a big problem: the lack of any alert or notice (even with --debug) about *namespaces* causes a crash in the import process. maintenance/importDump.php must show a message saying "hey, you need to define namespaces in LocalSettings.php".

TTO added a comment. Feb 16 2017, 11:02 PM

Hi, I have a big problem: the lack of any alert or notice (even with --debug) about *namespaces* causes a crash in the import process.

What are the details of the crash? Which dump is causing the problem - if possible could you upload it here, if it is small enough?

Thanks @TTO, that was a quick comment. Perhaps it is best to split the problem in two: 1) the "warnings functionality for the import process" (as in T113472); 2) the crash.

The most important is 1, which would avoid the crash. You can reproduce the "no warnings" problem with any importDump run that uses another namespace, or, as in my case, a MediaWiki without this specific namespace declaration:

define("NS_DOO", 3000);
define("NS_DOO_TALK", 3001);
$wgExtraNamespaces[NS_DOO] = "doo";
$wgExtraNamespaces[NS_DOO_TALK] = "doo_talk";

When I add it to LocalSettings.php the import works fine, but before that neither --dry-run nor a normal import (even with --debug) shows any warning about those "DOO:title" pages, i.e. XML with <namespace key="3000">.
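Until importDump.php reports this itself, a small stand-alone script could produce the kind of message asked for here. This is a sketch under assumptions, not an existing MediaWiki tool: `missing_namespaces` and `format_warnings` are hypothetical names, the target wiki's namespace IDs are supplied by hand, and the namespace names are read from the `<namespace key="…">` entries in the dump header.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{schema-uri}' prefix so any export schema version matches."""
    return tag.rsplit("}", 1)[-1]


def missing_namespaces(stream, target_ns_ids):
    """Return {ns_id: header_name} for namespaces used by pages in the dump
    that are not defined on the target wiki."""
    header_names = {}
    used = set()
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "namespace":
            # Header entry; the main namespace has no text, hence the "".
            header_names[int(elem.get("key"))] = elem.text or ""
        elif name == "ns":
            # Namespace ID of an actual page in the dump.
            used.add(int(elem.text))
    return {
        ns: header_names.get(ns, "")
        for ns in sorted(used)
        if ns not in target_ns_ids
    }


def format_warnings(missing):
    """Turn the result of missing_namespaces() into readable messages."""
    return [
        f"Namespace {ns} ({name or 'unnamed'}) is used in the dump but is not "
        "defined on this wiki; declare it in LocalSettings.php before importing."
        for ns, name in missing.items()
    ]


# Tiny hand-written dump used for illustration only.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <siteinfo>
    <namespaces>
      <namespace key="0" case="first-letter" />
      <namespace key="3000" case="first-letter">DOO</namespace>
    </namespaces>
  </siteinfo>
  <page><title>Main Page</title><ns>0</ns></page>
  <page><title>DOO:title</title><ns>3000</ns></page>
</mediawiki>"""
```

Run against the sample with a target that only has namespace 0, `missing_namespaces` reports namespace 3000 with the name "DOO", which is the hint that was missing from both --dry-run and the normal import.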