
Ask API with XML format produces invalid XML (title tags)
Closed, ResolvedPublic

Description

Author: alj62888

Description:
Child elements under <results> are given the name of the page. It is easy to create pages with titles that result in illegal XML tag names. Just to name a few that I've tried:

4me
Some "quoted" text <- an important one for special purpose wikis
xml

Example query:

http://www.mywikidev.com/wiki/api.php?action=ask&query=[[Modification%20date::%3E4%20February%202013]]%20[[Has property::%2B]]|?Has property&format=xml

<?xml version="1.0"?>
<api>

<query>
  <printrequests>
    <printrequest label="" typeid="_wpg" mode="2" />
    <printrequest label="Has property" typeid="_txt" mode="1" />
  </printrequests>
  <results>
    <some_"quoted"_text fulltext="some &quot;quoted&quot; text" fullurl="http://www.mywikidev.com/wiki/index.php/some_%22quoted%22_text">
      <printouts>
        <Has_property>
          <value>1234</value>
        </Has_property>
      </printouts>
    </some_"quoted"_text>
  </results>
</query>

</api>

Workaround(s): Unknown, but would love to hear of one.


Version: unspecified
Severity: blocker

Details

Reference
bz44696

Event Timeline

bzimport raised the priority of this task from to Needs Triage.Nov 22 2014, 1:36 AM
bzimport set Reference to bz44696.
bzimport added a subscriber: Unknown Object (MLST).

alj62888 wrote:

I would like to suggest that the result tag names (currently set to the page names) be replaced by something simple, such as <result> or <result-<index>>, since the page title and URL are already specified by the fulltext and fullurl attributes.

So, the sample output would instead look like this:

<?xml version="1.0"?>
<api>

<query>
  <printrequests>
    <printrequest label="" typeid="_wpg" mode="2" />
    <printrequest label="Has property" typeid="_txt" mode="1" />
  </printrequests>
  <results>
    <result fulltext="some &quot;quoted&quot; text" fullurl="http://www.mywikidev.com/wiki/index.php/some_%22quoted%22_text">
      <printouts>
        <Has_property>
          <value>1234</value>
        </Has_property>
      </printouts>
    </result>
  </results>
</query>

</api>

Thank you

alj62888 wrote:

Of course, I'm not suggesting to break backwards compatibility with the above suggestion :) So, maybe a new format/query param will be acceptable?

I think it would be acceptable to break BC. Most APIs I know of use <pages><p>, and this should be no different.

Unknown Object (User) added a comment.Feb 6 2013, 3:10 AM

Whoa ... not so fast. We are not going to jump ship here and break things. The SMW\DISerializer provides serialization for the SMWAPI, the JSON format, and the SMW\ApiResultPrinter (since SMW 1.9). Before considering any change, please be aware of the legacy support that comes with the serialization and its content structure.

I am not sure how it has been useful until now; I would find it hard to parse.
Still, if you think there has to be BC support, please add it in a follow-up change or put precise comments in https://gerrit.wikimedia.org/r/#/c/47707/

Unknown Object (User) added a comment.Apr 18 2013, 1:33 AM

[1] was breaking compatibility and therefore abandoned.

This was only important for XML and similar formats. It is therefore suggested to change the output only for these formats, and not for JSON.

https://gerrit.wikimedia.org/r/#/c/47707/

Unknown Object (User) added a comment.Apr 18 2013, 1:44 AM

It is not a tag problem but rather a problem in how 'fulltext' => $title->getFullText() encodes special characters (&, ', etc.). It results in encoded strings like &#039; and &quot; that cause problems in the XML output format.
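One way to check whether the encoded attribute values or the tag names are at fault is to feed both variants to a standard parser. The sketch below uses Python's stdlib xml.etree.ElementTree; the helper name parses and the two trimmed-down documents are illustrative assumptions, not output captured from the API.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """Return True if the stdlib parser accepts xml_text as well-formed."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical reduced sample: the title is carried only in an attribute,
# with the quotes entity-encoded as in the reported output.
attr_doc = '<results><result fulltext="some &quot;quoted&quot; text" /></results>'

# Hypothetical reduced sample: the same title is also used as the element name.
tag_doc = (
    '<results><some_"quoted"_text fulltext="some &quot;quoted&quot; text">'
    '</some_"quoted"_text></results>'
)

print(parses(attr_doc))                            # True
print(parses(tag_doc))                             # False
print(ET.fromstring(attr_doc)[0].get("fulltext"))  # some "quoted" text
```

With this parser, the entity-encoded attribute value is accepted and decoded back to the original title, while the document that uses the title as an element name is rejected.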

Unknown Object (User) added a comment.Apr 18 2013, 2:05 AM

Another issue with XML could be pages in other namespaces. For Property:GG, for example, the XML parser complains that "Namespace prefix Property on ... is not defined".

Example

<Property:GG fulltext="..." namespace="106" exists="1">
<printouts>

<Modification_date>
  <value>1365684120</value>
</Modification_date>

</printouts>
</Property:GG>

alj62888 wrote:

James,

I think trying to create XML tag names that are page titles is just asking for trouble. The XML spec has restrictions on what characters can be in a tag name[1], so any character that can be in a page title will have to be mapped to something legal in an XML element name. It also makes the XML unnecessarily verbose and hard to read... just looks flaky, imo. Finally, it is also redundant information since the page name is provided by the fulltext attribute already.
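To make the restriction concrete, here is a minimal sketch that checks whether a page title could serve as an element name. It is an ASCII-only approximation of the Name production in the XML 1.0 spec, so it will wrongly reject valid non-ASCII names. The function name tag_name_problem and the space-to-underscore mapping are assumptions for illustration, based on the sample output in this report.

```python
import re

# ASCII-only approximation of the XML 1.0 Name production. The real
# production also admits many non-ASCII ranges, which are omitted here.
ASCII_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")

def tag_name_problem(title):
    """Return a short reason why title is unsafe as an element name, or None."""
    # Assumption: spaces become underscores, as in the sample output above.
    name = title.replace(" ", "_")
    if ":" in name:
        # A colon is legal in a plain XML name, but namespace-aware parsers
        # read the part before it as a prefix that must be declared.
        return "colon is read as a namespace prefix separator"
    if name.lower().startswith("xml"):
        return "names beginning with 'xml' are reserved"
    if not ASCII_NAME.fullmatch(name):
        return "not a valid element name"
    return None

for title in ["4me", 'Some "quoted" text', "xml", "Property:GG", "Has property"]:
    print(repr(title), "->", tag_name_problem(title))
```

Of the titles mentioned in this report, only "Has property" passes the check. The others fail on a leading digit, a quote character, the reserved "xml" prefix, or the namespace colon.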

I propose putting in the change just for the XML format if that solves the JSON compatibility conflict.

[1] http://www.w3.org/TR/REC-xml/#NT-NameStartChar

Unknown Object (User) added a comment.Apr 18 2013, 3:24 AM

read... just looks flaky, imo. Finally, it is also redundant information since the page name is provided by the fulltext attribute already.

For more information about SMW related serialization see [1].

PS: I will not take a crack at it in the near future, so feel free to tackle this issue, but please remember to add PHPUnit/QUnit tests to ensure consistency across the output serialization.

[1] http://www.semantic-mediawiki.org/wiki/Serialization_%28JSON%29

alj62888 wrote:

Hi James, what was the reference for? BTW, I'm afraid I'm not qualified to hack on the wiki code myself.

Hey, in case it matters. This is a major pain for me.. I hit it while trying to upgrade my SMW installation and it is a real blocker for downstream code.

Unknown Object (User) added a comment.Apr 19 2013, 5:16 AM

(In reply to comment #11)

Hi James, what was the reference for? BTW, I'm afraid I'm not qualified to hack on the wiki code myself.

It gives some insight into how serialization works in SMW and why [1] wasn't a fit: it only eliminates the problematic tag at the head by replacing

$results[$diWikiPage->getTitle()->getFullText()] = $result;
with
$results[] = $result;

This solves the issue only halfway. If you use a property like "Has_xml'_label" as a printout parameter, it faces the same problem, but at this level you need to know which printout you are referring to, since the label is a reference key into the printrequests array.

While the subject "tag" at the head might seem like redundant information (it isn't, but that's not the issue in this discussion), you clearly can't get away with eliminating the property label from the structure, as it is used as a key whose purpose is to eliminate redundancy by splitting printrequest and result information.
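For illustration only, one way to keep the property label as a reference key without using it as a tag name is to move it into an attribute. The sketch below is not the structure produced by SMW or by any Gerrit change referenced here. The element names result and property, the function results_to_xml, and the input shape are all hypothetical.

```python
import xml.etree.ElementTree as ET

def results_to_xml(results):
    """Serialize {page title: {property label: [values]}} (hypothetical shape)
    using fixed element names, with titles and labels kept in attributes."""
    root = ET.Element("results")
    for title, printouts in results.items():
        # The page title goes into an attribute instead of the tag name.
        result = ET.SubElement(root, "result", {"fulltext": title})
        container = ET.SubElement(result, "printouts")
        for label, values in printouts.items():
            # The label stays available as a key matching printrequest/@label.
            prop = ET.SubElement(container, "property", {"label": label})
            for value in values:
                ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

xml_text = results_to_xml(
    {'some "quoted" text': {"Has xml' label": ["Test"], "Has date": [631152000]}}
)
parsed = ET.fromstring(xml_text)  # parses without error
print(parsed[0].get("fulltext"))
print([p.get("label") for p in parsed.iter("property")])
```

A consumer could then select a printout by label, for example with the path ./result/printouts/property[@label='Has date'], rather than relying on the label being a legal element name.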

XML (pretty-print) output

<?xml version="1.0"?>
<api>

<query>
  <printrequests>
    <printrequest label="" typeid="_wpg" mode="2" format="" />
    <printrequest label="Has date" typeid="_dat" mode="1" format="ISO" />
    <printrequest label="Has xml" typeid="_wpg" mode="1" format="" />
    <printrequest label="Has xml&#039; label" typeid="_wpg" mode="1" format="" />
  </printrequests>
  <results>
    <XML_Example fulltext="XML Example" fullurl=".." namespace="0" exists="1">
      <printouts>
        <Has_date>
          <value>631152000</value>
        </Has_date>
        <Has_xml>
          <value fulltext="Test" fullurl=".." namespace="0" exists="" />
        </Has_xml>
        <Has_xml'_label>
          <value fulltext="Test" fullurl=".." namespace="0" exists="" />
        </Has_xml'_label>
      </printouts>
    </XML_Example>
  </results>
  <meta hash="d3a1a814ff424003d9cfaa9a3ab7221f" count="1" offset="0" />
</query>

</api>

JSON (pretty-print) output

{

"query": {
    "printrequests": [
        {
            "label": "",
            "typeid": "_wpg",
            "mode": 2,
            "format": false
        },
        {
            "label": "Has date",
            "typeid": "_dat",
            "mode": 1,
            "format": "ISO"
        },
        {
            "label": "Has xml",
            "typeid": "_wpg",
            "mode": 1,
            "format": ""
        },
        {
            "label": "Has xml' label",
            "typeid": "_wpg",
            "mode": 1,
            "format": ""
        }
    ],
    "results": {
        "XML Example": {
            "printouts": {
                "Has date": [
                    "631152000"
                ],
                "Has xml": [
                    {
                        "fulltext": "Test",
                        "fullurl": "...",
                        "namespace": 0,
                        "exists": false
                    }
                ],
                "Has xml' label": [
                    {
                        "fulltext": "Test",
                        "fullurl": "...",
                        "namespace": 0,
                        "exists": false
                    }
                ]
            },
            "fulltext": "XML Example",
            "fullurl": "...",
            "namespace": 0,
            "exists": true
        }
    },
    "meta": {
        "hash": "d3a1a814ff424003d9cfaa9a3ab7221f",
        "count": 1,
        "offset": 0
    }
}

}

[1] https://gerrit.wikimedia.org/r/#/c/47707/

Unknown Object (User) added a comment.May 22 2013, 12:13 PM

*** Bug 48705 has been marked as a duplicate of this bug. ***

Related URL: https://gerrit.wikimedia.org/r/65646 (Gerrit Change Icbc92c9e74161c1ec626775bf6f95703a6df8de1)

alj62888 wrote:

I don't see any use for the printrequests element in the XML format other than just confirmation of the output part of the query. Consumers will know what elements they are looking for and their XPath.

Maybe it would be easier to let the XML format diverge from the JSON format by eliminating the printrequests element. I don't think the two formats need to mirror one another element-for-element; the formats are too different. It's issues like this that are already known to cause problems with JSON->XML conversion.

Just my $.02

Unknown Object (User) added a comment.May 26 2013, 11:26 PM

JSON and XML will mirror the available information in order to support interoperability, which means the output formats will stay as close as possible. A content consumer (a custom parser that implements the individual parsing on the client side) can ignore the information if necessary.

alj62888 wrote:

Interoperability between what?

alj62888 wrote:

I see, but perfect interoperability btw JSON and XML is impossible... as you may have noticed. This is a major bug, 5 months old, w/an easy fix by Nischay, but it's been rolled back in an attempt to do the impossible (commendable, but impossible). Google JSON to XML conversion and you'll see that no solution is perfect and will fail exactly like this one does with invalid tags.

Change 65646 merged by jenkins-bot:
(Bug 44696) AskApi to support valid XML using the SMW\ApiQueryResultFormatter

https://gerrit.wikimedia.org/r/65646