Unexpected end of file for stub-meta-history.xml
Closed, Invalid · Public

Description

I have used a small piece of software for about 10 years to analyze article histories based on the stub-meta-history.xml dump files. Among other statistics, I use this tool to prepare this regularly updated series of stats on huwiki:
https://hu.wikipedia.org/wiki/Wikip%C3%A9dia:Szerkeszt%C5%91k_list%C3%A1ja_szerkeszt%C3%A9ssz%C3%A1m_szerint
The software is neither freely licensed nor open source, and it was written by a Wikipedia editor-developer who has long been inactive. Therefore I cannot check or change the code. I used the software without any problems until the previous dump (in February). When I tried to generate the new statistics in February, I received an error. I hoped that this was only a temporary problem that would be solved by the next dump, but I have the same problem with the dump from March:

  • huwiki-20190301-stub-meta-history.xml.gz
  • huwiki-20190301-user_groups.sql.gz

The error message:

Unexpected end of file has occurred.  The following elements are not closed: mediawiki. Line 41, position 1.
XmlExpection
 at System.Xml.XmlTextReaderImpl.Throw(Exception e)
 at System.Xml.XmlTextReaderImpl.Throw(String res, String arg)
 at System.Xml.XmlTextReaderImpl.ThrowUnclosedElements()
 at System.Xml.XmlTextReaderImpl.ParseElementContent()
 at System.Xml.XmlTextReaderImpl.Read()
 at qcz.Dump.StubMetaHistory.StubMetaHistoryDumpReader.<GetEnumerator>d__1.MoveNext()
 at qczWikiStat.MainForm.bw_DoWork(Object sender, DoWorkEventArgs e)
 at System.ComponentModel.BackgroundWorker.WorkerThreadStart(Object argument)"

Did you make any changes to the dump structure in recent months? Is this something that can be solved on the dump side?

Event Timeline

Samat created this task. · Mar 16 2019, 4:50 PM
Restricted Application added a subscriber: Aklapper. · Mar 16 2019, 4:50 PM

@Samat: Please share the exact size (in bytes) of the downloaded file (and the extracted file, in case you extract manually).

Aklapper changed the task status from Open to Stalled. · Mar 16 2019, 5:51 PM
Aklapper renamed this task from Unexpexted end of file for stub-meta-history.xml to Unexpected end of file for stub-meta-history.xml. · Mar 16 2019, 7:34 PM
Aklapper closed this task as Invalid. · Mar 16 2019, 7:39 PM

I cannot reproduce "The following elements are not closed: mediawiki.", so I guess your download is simply incomplete:

$:acko\> wget https://dumps.wikimedia.org/huwiki/20190301/huwiki-20190301-stub-meta-history.xml.gz
$:acko\> gzip -d huwiki-20190301-stub-meta-history.xml.gz 
$:acko\> ls -al huwiki*
-rw-rw-r--. 1 acko acko 9435661537 Mar  1 22:26 huwiki-20190301-stub-meta-history.xml
-rw-rw-r--. 1 acko acko 1434668795 Mar  1 22:26 huwiki-20190301-stub-meta-history.xml.gz
$:acko\> more huwiki-20190301-stub-meta-history.xml | tail -n 2
  </page>
</mediawiki>
Samat added a comment. · Edited · Mar 16 2019, 7:44 PM

@Aklapper the size of huwiki-20190301-stub-meta-history.xml.gz file is 1 434 668 795 bytes. (The tool uses the .gz file, so I do not need to extract it before use.)
I had the same problem with huwiki-20190201-stub-meta-history.xml.gz; that file is 1 427 856 252 bytes.
Are these sizes incorrect?

Samat added a comment. · Edited · Mar 16 2019, 7:53 PM

I can manually extract the .gz file, and on my machine huwiki-20190301-stub-meta-history.xml is 9 435 661 537 bytes. But the tool accepts only .gz files...

What is the next step?

@Samat: I posted the same numbers in T218484#5029846 (1 434 668 795 bytes and 9 435 661 537 bytes), so your download appears to be complete.

When I look at the very last line of huwiki-20190301-stub-meta-history.xml, though, it shows the </mediawiki> closing tag (is that the same for you?), contrary to your error, which says "The following elements are not closed: mediawiki." That is why I do not think the mistake is in the dump file itself.

Samat added a comment. · Edited · Mar 16 2019, 11:56 PM

I checked, and I have the </mediawiki> at the end as well.
I downloaded the same file for srwiki (srwiki-20190301-stub-meta-history.xml.gz), and I have the same problem with it, though according to the error message it occurs on line 37 rather than line 41.

I checked an older file again (huwiki-20181001-stub-meta-history.xml.gz), and this works perfectly. If I load huwiki-20190301-stub-meta-history.xml.gz, it does not work.
I checked, and line 41 is where the first <page> tag appears in the XML, right after the </siteinfo> tag. I compared the two extracted XML files above (from 20181001 and from 20190301), but I don't see any difference (except that <generator>MediaWiki 1.32.0-wmf.23</generator> changed to <generator>MediaWiki 1.33.0-wmf.19</generator>).

I checked the srwiki-20190301-stub-meta-history.xml file, and line 37 is the same there: the first <page> tag after </siteinfo>.

Maybe that helps.

Beginning in February, the gzipped stubs, page logs, and abstracts files are generated as three concatenated gzip streams: one for the XML header (siteinfo metadata), one for the page or log content, and one for the closing mediawiki tag. Since a concatenation of gzip streams is itself a valid gzip file, this should not affect robust tools that comply with the specification. Perhaps your tool exits after the end of the first stream?
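
For anyone who wants to see the effect on a small scale, here is a minimal Python sketch, not the dump tooling itself. It builds a tiny three-stream gzip blob that mimics the layout described above (the XML snippets and the helper name split_gzip_members are made up for illustration) and decompresses each stream separately. A reader that stops after the first stream would see only the header part, which would be consistent with the reported error pointing at the line where the first <page> tag should start.

```python
import gzip
import zlib


def split_gzip_members(data: bytes) -> list:
    """Decompress each gzip stream (member) in `data` separately.

    Hypothetical helper for illustration; it holds everything in memory,
    so it is only meant for small inputs, not for a full dump file.
    """
    payloads = []
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        payloads.append(decomp.decompress(data) + decomp.flush())
        # Whatever follows the end of this stream belongs to the next one.
        data = decomp.unused_data
    return payloads


# Made-up miniature of the three-part layout: header, content, closing tag.
header = b"<mediawiki>\n  <siteinfo/>\n"
body = b"  <page/>\n"
footer = b"</mediawiki>\n"
blob = gzip.compress(header) + gzip.compress(body) + gzip.compress(footer)

members = split_gzip_members(blob)
print(len(members))                                      # 3 separate streams
print(members[0] == header)                              # first stream holds only the header
print(gzip.decompress(blob) == header + body + footer)   # a compliant reader sees it all
```

Python's gzip module reads across all concatenated streams, which is the behavior the specification asks for. Whether a given third-party reader does the same has to be checked case by case.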

Samat added a subscriber: Tgr. · Edited · Mar 18 2019, 9:54 PM

Thank you for your explanation, apparently this is the case. :)

I got a hint from @Tgr that I could try decompressing the original gzip file and then compressing it again (in a word, recompressing it). That trick did indeed help and solved my problem.
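
In case it helps others with a similar tool, the recompression workaround can be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name recompress_single_stream and the output file name used in the usage note are made up, and the original hint did not specify which program to use.

```python
import gzip
import shutil


def recompress_single_stream(src_path: str, dst_path: str) -> None:
    """Rewrite a (possibly multi-stream) gzip file as one single gzip stream.

    Hypothetical helper for illustration only.
    """
    # gzip.open reads across all concatenated streams in the source file,
    # and the writer emits exactly one stream for the whole output.
    with gzip.open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        # Copy in 1 MiB chunks so the multi-gigabyte XML never sits in memory.
        shutil.copyfileobj(src, dst, length=1024 * 1024)
```

It could be called as recompress_single_stream("huwiki-20190301-stub-meta-history.xml.gz", "huwiki-20190301-stub-meta-history.single.xml.gz"), where the second name is just an example. On the command line, piping the output of gzip -dc into gzip and redirecting it to a new file should have the same effect.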