This month's job converting XML history dumps to Parquet on Hadoop failed because of a format issue.
Investigation has shown that some revisions have an empty XML element as contributor.id: <id />, whereas they had <id>0</id> in the previous dump version (tested on zhwikisource and frwikisource).
Quantification:
- 2019-05 zhwikisource empty ids: 10311
- 2019-04 zhwikisource empty ids: 0
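For anyone wanting to reproduce the counts on another wiki, here is a minimal sketch of how such a count could be obtained. It is not the script used for the numbers above; the function names are hypothetical, and it assumes the dump has already been decompressed or is opened through a file-like object.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def count_empty_contributor_ids(source):
    """Count <contributor> elements whose <id> child has no text.

    `source` is a file name or a binary file-like object (hypothetical
    helper, streaming so that large dumps do not need to fit in memory).
    """
    empty = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "contributor":
            for child in elem:
                if local_name(child.tag) == "id" and not (child.text or "").strip():
                    empty += 1
        elif name == "page":
            # Release the finished page's subtree to keep memory bounded.
            elem.clear()
    return empty


# Tiny in-memory stand-in for a dump: one empty id, one populated id.
sample = (
    b"<mediawiki><page>"
    b"<revision><contributor><username /><id /></contributor></revision>"
    b"<revision><contributor><username>a</username><id>5</id></contributor></revision>"
    b"</page></mediawiki>"
)
empty_ids = count_empty_contributor_ids(io.BytesIO(sample))
```

On a real dump the same function could be given an opened `bz2` or `gzip` file object instead of the in-memory sample.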
Example of a revision with an empty id from 2019-05 zhwikisource:
<revision> <id>209484</id> <timestamp>2001-01-15T00:00:00Z</timestamp> <contributor> <username /> <id /> </contributor> <model>wikitext</model> <format>text/x-wiki</format> <text xml:space="preserve">{{copy|http://open-lit.com}}</text> <sha1>91nyffwsl71ubxm0dji7u1bljr5d8fl</sha1> </revision>
Same revision in 2019-04 zhwikisource:
<revision> <id>209484</id> <timestamp>2001-01-15T00:00:00Z</timestamp> <contributor> <username /> <id>0</id> </contributor> <model>wikitext</model> <format>text/x-wiki</format> <text xml:space="preserve">{{copy|http://open-lit.com}}</text> <sha1>91nyffwsl71ubxm0dji7u1bljr5d8fl</sha1> </revision>
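Until the dump format is settled, a consumer could read both forms the same way. The sketch below is only an illustration of that idea, not the actual converter code: `contributor_id` is a hypothetical helper, and mapping an empty <id /> to 0 is an assumption based solely on what the 2019-04 dump contained for the same revision.

```python
import xml.etree.ElementTree as ET


def contributor_id(revision):
    """Return the contributor id of a <revision> element as an int.

    Assumption: an empty <id /> is treated as 0, matching the value the
    previous dump carried. Returns None when there is no <id> element.
    """
    id_elem = revision.find("contributor/id")
    if id_elem is None:
        return None
    text = (id_elem.text or "").strip()
    return int(text) if text else 0


# Shortened versions of the two revisions shown above.
rev_2019_05 = ET.fromstring(
    "<revision><id>209484</id><contributor><username /><id /></contributor></revision>"
)
rev_2019_04 = ET.fromstring(
    "<revision><id>209484</id><contributor><username /><id>0</id></contributor></revision>"
)
id_new = contributor_id(rev_2019_05)
id_old = contributor_id(rev_2019_04)
```

With this reading, both dump versions yield the same contributor id for revision 209484.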