
Support dublin core sub-fields in html-metadata/citoid
Open, Medium, Public

Description

There is a resolved task on enabling Dublin Core, T76224: Read dublin core embedded metadata in lib/scrape.js, but it seems to fail on several Dublin Core fields. In particular, it fails on dc.date.issued in articles from NRK.no, such as "Michelet rakk å fullføre siste bind om krigsseilerne". Inspecting index.js#L227, the problem seems to be

var property = nameAttr.substring(nameAttr.lastIndexOf('.') + 1).toLowerCase();

that is, a simple assumption that the last fragment of the name is the valid field name. This case shows that the assumption is wrong. The last fragment is a subfield that qualifies the primary field; here it says that the date is the date the article was issued. As a result, property is set to issued when it should have been set to date.

A simple fix would be to use only the primary field and ignore the subfield. Correct code would find the first separator ('.') that follows the primary field name (that is, the first one after the dc. prefix), then strip that separator and everything after it. That would leave the primary field and drop the subfield.
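The simple fix could be sketched roughly as follows. This is only an illustration, not the code in html-metadata: the helper name primaryDublinCoreField is hypothetical, and the handling of the dc and dcterms prefixes is an assumption about how the names are written in the page.

```javascript
// Hypothetical helper, not part of html-metadata.
// Returns the primary Dublin Core field from a meta name such as
// "dc.date.issued", ignoring any subfield that follows it.
function primaryDublinCoreField(nameAttr) {
    // Lower-case the name, split on '.', and drop empty fragments.
    const parts = nameAttr.toLowerCase().split('.').filter(Boolean);

    // Assumption: a leading "dc" or "dcterms" fragment is a namespace
    // prefix, not a field name, so it is skipped.
    if (parts.length > 1 && (parts[0] === 'dc' || parts[0] === 'dcterms')) {
        parts.shift();
    }

    // The first remaining fragment is the primary field; later fragments
    // (the subfields) are discarded. Returns undefined for an empty name.
    return parts[0];
}

console.log(primaryDublinCoreField('dc.date.issued'));
```

With this approach both dc.date and dc.date.issued resolve to the same property, so the qualifier is lost but the value is still picked up.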

In some cases the subfield changes the interpretation sufficiently that the primary field should be renamed, but I'm not quite sure how common this is. Fixing this would be more complex, as it requires some kind of lookup table to translate the fields.
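The lookup-table variant might look something like the sketch below. Everything in it is an assumption made for illustration: the function name resolveDublinCoreField, the table name QUALIFIED_RENAMES, and the two table entries are hypothetical and do not describe what html-metadata or citoid actually do. Names without an entry fall back to the primary field.

```javascript
// Hypothetical table: qualified field ("primary.subfield") -> property name.
// The entries are illustrative only; which renames are needed is an open
// question in this task.
const QUALIFIED_RENAMES = new Map([
    ['title.alternative', 'alternative'],
    ['description.abstract', 'abstract']
]);

// Hypothetical helper, not part of html-metadata.
function resolveDublinCoreField(nameAttr) {
    const parts = nameAttr.toLowerCase().split('.').filter(Boolean);

    // Assumption: a leading "dc" or "dcterms" fragment is a namespace prefix.
    if (parts.length > 1 && (parts[0] === 'dc' || parts[0] === 'dcterms')) {
        parts.shift();
    }
    if (parts.length === 0) {
        return undefined;
    }

    // Look up "primary.subfield"; if there is no rename for it,
    // fall back to the primary field alone.
    const qualified = parts.slice(0, 2).join('.');
    return QUALIFIED_RENAMES.has(qualified) ?
        QUALIFIED_RENAMES.get(qualified) :
        parts[0];
}

console.log(resolveDublinCoreField('dc.date.issued'));
console.log(resolveDublinCoreField('dc.description.abstract'));
```

Keeping the table separate from the parsing means new renames can be added without touching the fallback behaviour.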

See also T128461: Add support for qualified dublin core and Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

There is also a thread at w:no:Wikipedia:Torget#Automatisk referanseformattering NRK (permalink to first revision)

Event Timeline

Restricted Application added a subscriber: Aklapper.
jeblad updated the task description.
jeblad renamed this task from Enable Dublin Core meta fields during scrape by citoid to Failing Dublin Core meta fields during scrape by citoid due to subfields. Apr 15 2018, 8:38 PM
jeblad updated the task description.
Mvolz renamed this task from Failing Dublin Core meta fields during scrape by citoid due to subfields to Support dublin core sub-fields in html-metadata/citoid. Apr 19 2018, 9:03 AM
Mvolz triaged this task as Medium priority.