
Include more namespaces in Wiktionary HTML dumps
Open, Needs Triage, Public

Description

Thanks for providing HTML dumps. On the English Wiktionary we've already started using them to generate lists of "wanted" entries. This works very well, but some links are not taken into account because there are currently no dumps for the namespaces Appendix (id: 100), Thesaurus (id: 110), Citations (id: 114), and Reconstruction (id: 118).

Would it be possible to include these in the next run?

Event Timeline

jberkel renamed this task from "Include more namespaces in Wiktionary HTML dump" to "Include more namespaces in Wiktionary HTML dumps". Mar 11 2022, 10:26 PM

Apparently the *-NS0-*.json.tar.gz files really do contain only namespace 0 pages, even on wikis where $wgContentNamespaces contains namespaces other than 0, unlike the more flexible *-pages-articles-*.xml* dumps.

$ curl https://dumps.wikimedia.org/other/enterprise_html/runs/20220301/idwikisource-NS0-20220301-ENTERPRISE-HTML.json.tar.gz | tar xzOf - | jq -r ".namespace.identifier" | LANG=C sort -u
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 39.2M  100 39.2M    0     0  4333k      0  0:00:09  0:00:09 --:--:-- 4591k
0
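
The same check can also be run on an archive that has already been downloaded, which gives per-namespace page counts instead of just the distinct ids. This is only a minimal sketch in Python: it assumes each archive member is newline-delimited JSON with one page object per line and a `namespace.identifier` field (which is what the jq filter above relies on), and the function name `namespace_counts` is made up for illustration.

```python
import json
import tarfile
from collections import Counter


def namespace_counts(fileobj):
    """Count pages per namespace id in a *.json.tar.gz dump stream.

    Assumption: every regular file in the archive is newline-delimited
    JSON, one page object per line, with page["namespace"]["identifier"].
    """
    counts = Counter()
    # "r|gz" reads the archive as a forward-only stream, so this also
    # works when the data arrives through a pipe rather than a file.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            extracted = archive.extractfile(member)
            for line in extracted:
                line = line.strip()
                if not line:
                    continue  # skip blank lines between records
                page = json.loads(line)
                counts[page["namespace"]["identifier"]] += 1
    return counts
```

Calling it on a binary file handle for idwikisource-NS0-20220301-ENTERPRISE-HTML.json.tar.gz should return a counter with the single key 0 if the observation above holds.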

Not sure what logic was followed to decide the contents of the JSON dumps, but it's clear they don't mirror the XML dumps, so I can't tell whether that's a bug. Either way there are two separate tasks here (+1):

  • make sure relevant content namespaces are reflected in the configuration of each wiki, ideally by adding them to $wgContentNamespaces (this is a community decision);
  • respect this configuration and place said content somewhere;
  • if that place is not the existing file, adapt clients to find out where that is (if they are separate files, clients need to also retrieve the aforementioned configuration, to find out they need to look for those additional files).
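
For the client side of the last point, the list of content namespaces is already exposed by the MediaWiki API (`action=query&meta=siteinfo&siprop=namespaces`), where content namespaces carry a `content` flag. A rough sketch of how a client could turn that response into a list of files to look for follows. The per-namespace file name pattern is purely an assumption extrapolated from the existing `-NS0-` name (no such files exist today), and `content_namespace_ids` and `candidate_dump_names` are hypothetical helpers, not part of any existing tool.

```python
def content_namespace_ids(siteinfo):
    """Return the sorted ids of content namespaces from a siteinfo response.

    `siteinfo` is the decoded JSON of
    action=query&meta=siteinfo&siprop=namespaces.
    """
    ids = []
    for ns in siteinfo["query"]["namespaces"].values():
        flag = ns.get("content", False)
        # formatversion=1 marks the flag with an empty string when set and
        # omits it otherwise; formatversion=2 uses a real boolean.
        if flag is not False:
            ids.append(ns["id"])
    return sorted(ids)


def candidate_dump_names(wiki, date, namespace_ids):
    """Build the file names a client would look for, one per namespace.

    Assumption: additional namespaces would follow the same naming scheme
    as the existing NS0 file. This is a guess, not a documented format.
    """
    return [
        f"{wiki}-NS{ns}-{date}-ENTERPRISE-HTML.json.tar.gz"
        for ns in namespace_ids
    ]
```

If the extra namespaces ended up in the existing file instead, only the first helper would be needed, to know which `namespace.identifier` values to expect.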

This would also be really useful for the German Wiktionary, where the Flexion namespace contains a lot of grammatical information. That namespace is already included in the regular pages-articles XML dump, but it is not easily usable there because the templates are not expanded.

On the English Wiktionary we now use HTML dumps to generate our stats. Some of our content is not in the mainspace and is therefore not reflected in the statistics. There are also problems generating information related to proto-languages, since these live in the Reconstruction: namespace.