Include more namespaces in Wiktionary HTML dumps
Open, Needs Triage · Public

Assigned To

None

Authored By

	jberkel
	Mar 11 2022, 10:10 PM

Description

Thanks for providing HTML dumps. On the English Wiktionary we've already started using them to generate lists of "wanted" entries. This works very well, but some links are not taken into account because there are currently no dumps available for the namespaces Appendix (id: 100), Thesaurus (id: 110), Reconstruction (id: 118), and Citations (id: 114).

Would it be possible to include these in the next run?

Related Objects

Mentioned Here: T318371: Provide dumps of Template (10) and Module (828) namespaces Wikimedia Enterprise HTML dumps

Event Timeline

jberkel created this task. · Mar 11 2022, 10:10 PM

Restricted Application added a subscriber: Aklapper. · Mar 11 2022, 10:10 PM

jberkel renamed this task from Include more namespaces in Wiktionary HTML dump to Include more namespaces in Wiktionary HTML dumps. · Mar 11 2022, 10:26 PM

Apparently the *-NS0-*.json.tar.gz files really do contain only namespace 0 pages, even on wikis where $wgContentNamespaces contains namespaces other than 0, unlike the more flexible *-pages-articles-*.xml* dumps.

$ curl https://dumps.wikimedia.org/other/enterprise_html/runs/20220301/idwikisource-NS0-20220301-ENTERPRISE-HTML.json.tar.gz | tar xzOf - | jq -r ".namespace.identifier" | LANG=C sort -u
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 39.2M  100 39.2M    0     0  4333k      0  0:00:09  0:00:09 --:--:-- 4591k
0
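
For anyone who would rather run the same check from a script, here is a minimal Python sketch of the pipeline above. It assumes the archive has already been downloaded to a local path and that each tar member holds one JSON object per line; the `namespace.identifier` field is taken from the jq filter above, while the function name `namespace_ids` is made up for illustration.

```python
import json
import tarfile


def namespace_ids(archive_path):
    """Return the sorted, de-duplicated namespace identifiers in a dump archive.

    Assumption: every regular file in the archive is newline-delimited JSON and
    every record carries a "namespace" object with an "identifier" field.
    """
    found = set()
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            handle = archive.extractfile(member)
            for raw_line in handle:
                raw_line = raw_line.strip()
                if not raw_line:
                    continue  # tolerate blank lines between records
                record = json.loads(raw_line)
                found.add(record["namespace"]["identifier"])
    return sorted(found)
```

Calling `namespace_ids` on the downloaded idwikisource file should list only `0` if the observation above holds; any other identifier in the result would mean the NS0 file is not limited to namespace 0.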

Not sure what logic was followed to decide the contents of the JSON dumps, but it's clear they don't mirror the XML dumps, so I can't tell whether that's a bug. Either way there are two separate tasks here (+1):

make sure relevant content namespaces are reflected in the configuration of each wiki, ideally by adding them to $wgContentNamespaces (this is a community decision);
respect this configuration and place said content somewhere;
if that place is not the existing file, adapt clients to find out where that is (if they are separate files, clients need to also retrieve the aforementioned configuration, to find out they need to look for those additional files).
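
To illustrate the last point, here is a rough Python sketch of what a client might do if the extra namespaces ended up in separate files. It is only a guess at one possible layout: it assumes one file per namespace, named after the existing `idwikisource-NS0-20220301-ENTERPRISE-HTML.json.tar.gz` pattern, and it takes the list of content namespaces as an input (fetching it from the wiki's configuration is left out). The function name `expected_dump_files` is hypothetical.

```python
def expected_dump_files(wiki, run_date, content_namespaces):
    """Build the file names a client would look for, one per content namespace.

    Assumption: additional namespaces follow the same naming pattern as the
    existing NS0 file; this is not confirmed anywhere in this task.
    """
    return [
        f"{wiki}-NS{ns}-{run_date}-ENTERPRISE-HTML.json.tar.gz"
        for ns in sorted(set(content_namespaces))
    ]
```

For the English Wiktionary, passing `[0, 100, 110, 114, 118]` would give the client five file names to try, which shows why it needs the namespace configuration before it can find everything.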

jberkel added a project: Dumps-Generation. · Mar 18 2022, 8:37 AM

ArielGlenn moved this task from Backlog to Other teams on the Dumps-Generation board. · Mar 18 2022, 9:50 AM

Erutuon subscribed. · Apr 28 2022, 8:55 PM

This would also be really useful for the German Wiktionary, where the Flexion namespace contains a lot of grammatical information. The namespace is also included in the normal articles XML dump (though it is not easily usable there, because the templates are not expanded).

On the English Wiktionary we now use HTML dumps to generate our stats. Some of our content is not in the mainspace and is therefore not reflected in the statistics. There are also problems generating information related to proto-languages, which live in the Reconstruction: namespace.

Related to T318371

JArguello-WMF moved this task from Incoming to Engineering Backlog (DevOps, Maintenance, Tech debt) on the Wikimedia Enterprise board. · Jul 24 2023, 5:57 PM

Mitar subscribed. · Aug 18 2023, 8:35 AM
