Add option for Munger to sanitize data to strictly match RDF standards
Closed, Declined · Public

Description

Some URLs in RDF data from Wikidata contain Unicode that stricter parsers reject. Blazegraph accepts these IRIs, but other tools are less permissive. Munger should therefore have an option to ensure that the IRIs it generates conform to the standards, in particular that they are in Unicode Normalization Form KC (NFKC). Example error:

[line: 818877, col: 10] Bad IRI: <http://docs.sevsovet.com.ua/index.php?option=com_k2&view=item&id=312:№-3657-от-160812г-о-реализации-положений-закона-украины-об-основах-государственной-языковой-политики-в-городе-севастополе> Code: 47/NOT_NFKC in QUERY: The IRI is not in Unicode Normal Form KC.
[line: 818877, col: 10] Bad IRI: <http://docs.sevsovet.com.ua/index.php?option=com_k2&view=item&id=312:№-3657-от-160812г-о-реализации-положений-закона-украины-об-основах-государственной-языковой-политики-в-городе-севастополе> Code: 56/COMPATIBILITY_CHARACTER in QUERY: Bad character
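
For illustration, a minimal Python sketch of the check and the normalization requested here. It is not Munger's code: the function names `is_nfkc` and `sanitize_iri` and the shortened `example.org` IRI are hypothetical, and only the standard `unicodedata` module is used. The `№` character (U+2116) from the error above is a compatibility character, which is why the IRI fails the NFKC check.

```python
import unicodedata


def is_nfkc(iri: str) -> bool:
    """Return True if the IRI is already in Unicode Normalization Form KC."""
    return unicodedata.normalize("NFKC", iri) == iri


def sanitize_iri(iri: str) -> str:
    """Return the IRI converted to NFKC (hypothetical helper, not Munger's API)."""
    return unicodedata.normalize("NFKC", iri)


# Shortened, hypothetical IRI containing the same compatibility character.
bad = "http://example.org/index.php?id=312:\u2116-3657"
print(is_nfkc(bad))        # False: U+2116 has a compatibility decomposition
print(sanitize_iri(bad))   # http://example.org/index.php?id=312:No-3657
```

Note that normalizing changes the IRI string, so the result may no longer identify the same resource as the original URL. A real option might instead skip or percent-encode such values; which behaviour is appropriate is left open here.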

Event Timeline

Restricted Application added a subscriber: Aklapper.

We may also want it to sanitize non-standard dates and anything else that third-party tools may reject, so that the generated RDF is acceptable to any tool.

Smalyshev renamed this task from "Add option for Munger to sanitize IRIs" to "Add option for Munger to sanitize data to strictly match RDF standards". · Jun 14 2019, 5:21 PM