Page MenuHomePhabricator

DCAT-AP: XML produces invalid output with HHVM
Closed, DeclinedPublic

Description

I noticed that the xml output has changed since the code was merged (it now includes a lot of empty things like xmlns:rdf="").

Since the php code itself hasn't changed I assume it is something deeper (maybe XMLWriter?) which has changed.

As a result the xml no longer validates.

Event Timeline

Lokal_Profil raised the priority of this task from to Needs Triage.
Lokal_Profil updated the task description. (Show Details)
Lokal_Profil added subscribers: Lokal_Profil, ArielGlenn, hoo.

Change 250878 had a related patch set uploaded (by Hoo man):
Use Zend to create the DCAT-AP RDF on snapshot1003

https://gerrit.wikimedia.org/r/250878

Change 250878 merged by Dzahn:
Use Zend to create the DCAT-AP RDF on snapshot1003

https://gerrit.wikimedia.org/r/250878

The cause here is that the php command switched from Zend PHP 5.3 to HHVM when snapshot1003 got upgraded. I don't know whether this is a bug in your code or in HHVM or something else. Please note that we don't run the latest HHVM.

For reference:

hoo@snapshot1003:~$ php5 --version
PHP 5.5.9-1ubuntu4.13 (cli) (built: Sep 29 2015 15:24:49) 
Copyright (c) 1997-2014 The PHP Group
Zend Engine v2.5.0, Copyright (c) 1998-2014 Zend Technologies
    with Zend OPcache v7.0.3, Copyright (c) 1999-2014, by Zend Technologies

hoo@snapshot1003:~$ hhvm --version
HipHop VM 3.6.5 (rel)
Compiler: 1442470674_297057398
Repo schema: 599829f574c631684b463e9fe5f9093266631b9c
Extension API: 20150212

I manually recreated the RDF using Zend php now and changed the binary being used to Zend php. I don't know if there's more to do here (like fixing an underlying problem in the script) or whether we can just close this.

Krinkle set Security to None.
Krinkle moved this task from Backlog to Defect on the HHVM board.

I think this issue with HHVM was fixed in this diff

But that patch is part of HHVM 3.6.5 (which we use), so I don't think that's it.

What ever happened here? Is this still an issue?

Just tested it and it still is.

For reference, tested dcat 92ab37d94edefc36107eceb2197252dd9825af8b with:

hoo@snapshot1003:~$ hhvm --version
HipHop VM 3.12.1 (rel)
Compiler: 3.12.1+dfsg-1
Repo schema: 8aa242a1c26f122330b9188732cdd84948fd7706

Tested, per @ArielGlenn's request:

hoo@mwdebug1001:~/dcat$ hhvm --version
HipHop VM 3.18.1 (rel)
Compiler: 3.18.1+dfsg-1+wmf1
Repo schema: 47e86d9e41dac7800783dc589035875b00f7231c

The output still contains various xmlns:rdf="", like <dcterms:license rdf:resource="http://creativecommons.org/publicdomain/zero/1.0/" xmlns:rdf=""/>.

This could probably be worked around as noted in the comment on https://github.com/facebook/hhvm/issues/4336 (look for Workaround).

Seems to be fixed, at least in the current file.

Seems to be fixed, at least in the current file.

IIRC we now generate the dcat files using a special setup which avoids hhvm. This task is about fixing the underlying issue so that that special fix isn't needed.

@ArielGlenn @hoo As this is an effect of hhvm would converting the script to python be a viable solution or is PHP a must for it to fit with other dump related mechanisms? (Haven't checked if it is viable codewise but asking to see what options there are).

It makes no difference to my setup if you hand me a python script or a php one. @hoo may have something to say about it however.

@ArielGlenn @hoo As this is an effect of hhvm would converting the script to python be a viable solution or is PHP a must for it to fit with other dump related mechanisms? (Haven't checked if it is viable codewise but asking to see what options there are).

I'm not as proficient in Python (compared to PHP) and Python is not used very widely within Wikimedia (operations/dumps uses it though). If this makes maintaining/ developing the tool easier, I don't see a strong reason not to go for Python.

The script is not interfering with MediaWiki internally, so it's fine for it to not be in PHP.

Note: When rewriting this, we might also want to use the opportunity to make it customizable so that it works for all of Wikimedia in the end.

Creating RDF/XML by hand is bad practice anyway, no matter in which programming language. The current script could be modified to use Purtle like other Wikimedia software, this would also solve the bug.

Change 391489 had a related patch set uploaded (by Tim Starling; owner: Tim Starling):
[operations/dumps/dcat@master] Work around HHVM bug by using XMLWriter::writeAttribute()

https://gerrit.wikimedia.org/r/391489

Change 391489 merged by Hoo man:
[operations/dumps/dcat@master] Work around HHVM bug by using XMLWriter::writeAttribute()

https://gerrit.wikimedia.org/r/391489

hoo renamed this task from DCAT-AP: XML output no longer valid to DCAT-AP: XML produces invalid output with HHVM.Nov 16 2017, 12:28 PM

Change 391808 had a related patch set uploaded (by Hoo man; owner: Hoo man):
[operations/puppet@production] Snapshot: Use canonical php (= hhvm) for dcat

https://gerrit.wikimedia.org/r/391808

hoo claimed this task.

The new version works on hhvm as well as on Zend. We can switch this back to hhvm now…thanks for tackling this.

Change 391808 merged by ArielGlenn:
[operations/puppet@production] Snapshot: Use canonical php (= hhvm) for dcat

https://gerrit.wikimedia.org/r/391808

This is either not solved or not installed. The latest dcat-ap file from November 22th is broken: https://dumps.wikimedia.org/wikidatawiki/entities/dcatap.rdf (contains xmlns:rdf="")

The parent task has been Declined; is it worth keeping this open given that we're moving away from HHVM to Zend PHP 7?

thiemowmde triaged this task as High priority.
thiemowmde added a project: User-thiemowmde.

I am not sure if I am simply rephrasing what @Jdforrester-WMF asked above, but any way: Are we positive this issue would go away if we updated box the command runs on to PHP 7?

meh. I meant, if we moved the task to run on PHP 7 machine, that would mean we could still get correct XML, right? I.e. it does not require PHP 5?

meh. I meant, if we moved the task to run on PHP 7 machine, that would mean we could still get correct XML, right? I.e. it does not require PHP 5?

That should be, but I didn't check that, yet.

meh. I meant, if we moved the task to run on PHP 7 machine, that would mean we could still get correct XML, right? I.e. it does not require PHP 5?

That should be, but I didn't check that, yet.

Have you had a chance yet? Right now this is the only open task underneath T86081; if it's an on-going issue, should we stash it under T172165 too? That's coming up very quickly…

meh. I meant, if we moved the task to run on PHP 7 machine, that would mean we could still get correct XML, right? I.e. it does not require PHP 5?

That should be, but I didn't check that, yet.

Have you had a chance yet? Right now this is the only open task underneath T86081; if it's an on-going issue, should we stash it under T172165 too? That's coming up very quickly…

This is going to come up relatively soon, as we are going to move the snapshot hosts (and thus this job) to PHP7 this coming quarter.

$ cat test.sh 
#!/bin/bash

tmpDir=`mktemp -d`
touch $tmpDir/foo.json.gz
touch $tmpDir/bar.ttl.gz
touch $tmpDir/bar.ttl.bz2

php DCAT.php --config=./config.example.json --dumpDir="$tmpDir" --outputDir=/tmp &&
rm -rf $tmpDir &&
echo "Find the output at /tmp/dcatap.rdf" &&
exit 0

rm -rf $tmpDir
echo Error

Identical output with PHP 5.5.9-1ubuntu4.24 and PHP 7.1.15 (cli):

$ bash test.sh 
Find the output at /tmp/dcatap.rdf
$ md5sum /tmp/dcatap.rdf 
f488c991f5a7dfd58f05ac2fc48c3706  /tmp/dcatap.rdf

As far as I'm concerned we can 'decline' it since we'll be moving to php7 for dumps this quarter.

Per above: The script works fine with Zend (5.5 and 7), and we no longer really care for HHVM.