latest all rdf dump: bad IRI scheme
Closed, Resolved · Public · BUG REPORT

Description

During a test run to create an HDT file, rdf2hdt rejected the dump with a bad IRI scheme error caused by an invalid 2F character (hex for "/").

Steps to Reproduce:

docker run -v "$(pwd)":/wikidata rdfhdt/hdt-cpp:v1.3.3 rdf2hdt -p -i wikidata/latest-all.nt.gz wikidata/latest-all.hdt

Actual Results:

error: wikidata/latest-all.nt.gz:604276348:139: bad IRI scheme char `2F'
Catch exception load: Error parsing input.
ERROR: Error parsing input.

Expected Results:

No parsing errors.

Event Timeline

In the ShEx CG, the following fix was suggested:

sed -i -E 's/(<.*)}(.*>)/\1\2/' <dump_file>
sed -i -E 's/(<.*)\\n(.*>)/\1\2/' <dump_file>
sed -i -E 's/(<.*)\|(.*>)/\1\2/' <dump_file>
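
Before rewriting a dump in place with the sed commands above, it may help to list the offending IRIs first. The following is only a rough sketch, not part of the suggested fix: the names bad_iris and scan are hypothetical, the regular expressions approximate the N-Triples IRIREF rule (no control characters, spaces, or < > " { } | ^ ` and no backslash outside a \uXXXX or \UXXXXXXXX escape), and the scheme check assumes every IRI in the dump is absolute.

```python
import re

# String literals may legitimately contain '<', '}' or '|', so they are
# blanked out before looking for IRIs. This is an approximation of the grammar.
LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')
IRIREF = re.compile(r'<([^>]*)>')
# Numeric escapes that N-Triples allows inside an IRI.
UCHAR = re.compile(r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}')
# Characters that must not appear raw inside <...>.
FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')
# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ) followed by ":"
SCHEME = re.compile(r'[A-Za-z][A-Za-z0-9+.\-]*:')


def bad_iris(line):
    """Return the IRIs on one N-Triples line that look invalid."""
    without_literals = LITERAL.sub('""', line)
    bad = []
    for match in IRIREF.finditer(without_literals):
        iri = match.group(1)
        has_forbidden = FORBIDDEN.search(UCHAR.sub('', iri)) is not None
        has_scheme = SCHEME.match(iri) is not None
        if has_forbidden or not has_scheme:
            bad.append(iri)
    return bad


def scan(lines):
    """Yield (line_number, iri) for every suspicious IRI in an iterable of lines."""
    for lineno, line in enumerate(lines, start=1):
        for iri in bad_iris(line):
            yield lineno, iri
```

For a compressed dump, scan could be fed the file object returned by gzip.open(path, 'rt', encoding='utf-8'), so the dump is streamed line by line rather than loaded into memory.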

@hoo I think I saw another ticket relating to something odd going on in the dumps recently?
Is this also related?

I'm not sure yet… I'm currently trying to see what exactly is wrong here (extracting the relevant parts on stat1007), but yes, this could also be a shard concatenation problem.

These bad entries were added in October 2013 (fixed now), probably due to a broken input validator.

Invalid data like this is impossible to add now and this was (AFAICT, I've naively grep-ed the whole dump) the only instance of this, thus I think we can close this.

Lydia_Pintscher claimed this task.
Lydia_Pintscher subscribed.

Closing this based on Marius' comment.