Maniphest T192240

Support dublin core sub-fields in html-metadata/citoid
Open, Medium, Public

Assigned To

None

Authored By

	jeblad
	Apr 15 2018, 7:33 PM

Description

There is a resolved task on enabling Dublin Core, T76224: Read dublin core embedded metadata in lib/scrape.js, but parsing still fails on several Dublin Core fields. In particular, it fails on dc.date.issued in articles from NRK.no, such as "Michelet rakk å fullføre siste bind om krigsseilerne". Inspecting index.js#L227, the problem seems to be

var property = nameAttr.substring(nameAttr.lastIndexOf('.') + 1).toLowerCase();

This simply assumes that the last fragment is the valid field name, which this case shows to be wrong. The last fragment is a subfield that specifies what the field is about; in this case it says the date is the one on which the article was issued. As a result, property is set to issued when it should have been set to date.

A simple fix would be to use only the primary field and ignore the subfield. Correct code would find the first period (the one after the dc prefix), then strip any later period and the text that follows it. That would leave the primary field and drop the subfield.
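A minimal sketch of that simple fix, for illustration only. The helper names primaryDublinCoreField and lastFragment are hypothetical and not part of html-metadata, and the sketch assumes the name attribute always starts with a namespace prefix such as dc followed by a period.

```javascript
// Hypothetical helper, not part of html-metadata.
function primaryDublinCoreField(nameAttr) {
    // "dc.date.issued" becomes ["dc", "date", "issued"].
    var parts = nameAttr.toLowerCase().split('.');
    // Assumption: the first segment is the namespace prefix ("dc").
    // If there is no prefix at all, fall back to the only segment.
    return parts.length > 1 ? parts[1] : parts[0];
}

// The existing expression quoted above, wrapped for comparison.
function lastFragment(nameAttr) {
    return nameAttr.substring(nameAttr.lastIndexOf('.') + 1).toLowerCase();
}

console.log(lastFragment('dc.date.issued'));           // issued
console.log(primaryDublinCoreField('dc.date.issued')); // date
console.log(primaryDublinCoreField('DC.title'));       // title
```

Names without a subfield, such as DC.title, give the same result with both approaches, so only qualified names would change.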

In some cases the subfield changes the interpretation sufficiently that the primary field should be renamed, but I'm not quite sure how common this is. Fixing this would be more complex, as it requires some kind of lookup table to translate the fields.
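The lookup-table idea could look roughly like the sketch below. The table SUBFIELD_RENAMES, its entries, and the function resolveDublinCoreField are all hypothetical; the entries are illustrative and not a vetted mapping of Dublin Core qualifiers. As before, it assumes the first segment is the namespace prefix.

```javascript
// Hypothetical table: keys are "field.subfield", values are the
// property name to report. Entries are illustrative only.
var SUBFIELD_RENAMES = {
    'date.issued': 'date',
    'date.modified': 'modified'
};

// Hypothetical helper, not part of html-metadata.
function resolveDublinCoreField(nameAttr) {
    var parts = nameAttr.toLowerCase().split('.');
    // Assumption: the first segment is the namespace prefix ("dc").
    var field = parts.length > 1 ? parts[1] : parts[0];
    // Build "field.subfield" from the segments after the prefix.
    var qualified = parts.slice(1, 3).join('.');
    if (Object.prototype.hasOwnProperty.call(SUBFIELD_RENAMES, qualified)) {
        return SUBFIELD_RENAMES[qualified];
    }
    // Unknown or missing subfield: fall back to the primary field.
    return field;
}

console.log(resolveDublinCoreField('dc.date.issued'));   // date
console.log(resolveDublinCoreField('dc.date.modified')); // modified
console.log(resolveDublinCoreField('dc.title'));         // title
```

Falling back to the primary field for unlisted subfields means the table only needs entries for the cases where the subfield really changes the interpretation.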

There is also a thread at w:no:Wikipedia:Torget#Automatisk referanseformattering NRK (permalink to first revision)

Related Objects

Mentioned Here: T128461: Add support for qualified dublin core and Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
T76224: Read dublin core embedded metadata in lib/scrape.js

Event Timeline

jeblad created this task. Apr 15 2018, 7:33 PM

Restricted Application added a project: VisualEditor. Apr 15 2018, 7:33 PM

Restricted Application added a subscriber: Aklapper.

jeblad updated the task description. Apr 15 2018, 7:36 PM

jeblad updated the task description.

jeblad renamed this task from Enable Dublin Core meta fields during scrape by citoid to Failing Dublin Core meta fields during scrape by citoid due to subfields. Apr 15 2018, 8:38 PM

jeblad updated the task description.

jeblad updated the task description. Apr 15 2018, 8:40 PM

Restricted Application added a subscriber: Danmichaelo. Apr 15 2018, 8:40 PM

jeblad updated the task description. Apr 15 2018, 8:42 PM

Mvolz renamed this task from Failing Dublin Core meta fields during scrape by citoid due to subfields to Support dublin core sub-fields in html-metadata/citoid. Apr 19 2018, 9:03 AM

Mvolz triaged this task as Medium priority.

Mvolz moved this task from Backlog to Service: Scraper & Validation on the Citoid board. Jul 2 2018, 12:14 PM

Deskana removed a project: VisualEditor. Aug 27 2018, 1:01 PM

Mvolz moved this task from Service: Scraper & Validation to Service on the Citoid board. Dec 11 2018, 11:29 AM
