
Figure out a way for WDQS example parsing to not rely on Parsoid
Closed, Resolved · Public

Description

Summary:
Originally, the query service UI loaded the examples via Parsoid, which meant that examples could not be loaded from wikis without a VisualEditor/Parsoid setup. This severely limited the usefulness of the query service on third-party installations.

This was ultimately resolved in I46420935e5. The following approaches were explored:

  • The original approach, to ask Parsoid for the HTML of the page (which, unlike the legacy Parser’s output, includes annotations for each template invocation and argument), and extract the query arguments of all SPARQL and SPARQL2 transclusions. Doesn’t work for most third-party installs.
  • Parse wikitext, looking for {{SPARQL}} and {{SPARQL2}} transclusions and their preceding headings. Fragile.
  • Model query examples as structured data (e.g. in Wikidata statements). Didn’t go anywhere.
  • Use the parse tree of the wikitext. Looked promising, ultimately wasn’t implemented.
  • Use the parsed HTML from the legacy parser, extracting the contents of any <syntaxhighlight> block and finding the preceding headings similarly to the Parsoid version. Implemented and deployed.
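
As an illustration of the last approach in the list, here is a minimal Python sketch. It is not the code from I46420935e5. It assumes that the legacy parser renders each highlighted block as a <div> carrying the class mw-highlight and that section titles are plain <h1> to <h6> elements; the class name ExampleExtractor and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ExampleExtractor(HTMLParser):
    """Collect (title, query) pairs from parsed page HTML.

    Assumption: highlighted code sits inside <div class="mw-highlight ...">.
    """

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.examples = []
        self.title = None            # most recent heading seen so far
        self._in_heading = False
        self._heading_parts = []
        self._highlight_depth = 0    # > 0 while inside a highlight <div>
        self._query_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self._heading_parts = []
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # Track nesting so an inner <div> does not end the block early.
            if self._highlight_depth or "mw-highlight" in classes:
                self._highlight_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._in_heading:
            self._in_heading = False
            self.title = "".join(self._heading_parts).strip()
        elif tag == "div" and self._highlight_depth:
            self._highlight_depth -= 1
            if self._highlight_depth == 0:
                query = "".join(self._query_parts).strip()
                self.examples.append({"title": self.title, "query": query})
                self._query_parts = []

    def handle_data(self, data):
        if self._in_heading:
            self._heading_parts.append(data)
        elif self._highlight_depth:
            self._query_parts.append(data)


# Hypothetical sample shaped like legacy parser output.
SAMPLE_HTML = (
    '<h3><span class="mw-headline" id="Cats">Cats</span></h3>\n'
    '<div class="mw-highlight mw-highlight-lang-sparql" dir="ltr"><pre>'
    '<span class="k">SELECT</span> <span class="nv">?item</span> '
    '<span class="k">WHERE</span> { <span class="nv">?item</span> '
    'wdt:P31 wd:Q146. }</pre></div>'
)

extractor = ExampleExtractor()
extractor.feed(SAMPLE_HTML)
extractor.close()
html_examples = extractor.examples
```

Because the highlighter wraps tokens in <span> elements, the sketch joins all text nodes inside the block instead of reading a single element's text.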

Event Timeline


It looks like you’re using the visualeditor API action, so doesn’t this still require Parsoid to be set up?

And it requires VisualEditor, which wouldn't really help with T179262.

@Lucas_Werkmeister_WMDE I don't think it needs Parsoid - OSM wiki doesn't have it as far as I can see, and this approach works there. @Addshore correct, this approach does require the VisualEditor extension. I wonder if it would be possible to use action=parse instead.

action=parse looks pretty good, though I suppose we still want to use the REST API if available for improved caching behavior.

Sounds good!

Hm, but it looks like the data-mw attributes we use to extract the query from the template are Parsoid-specific again :(

Change 388035 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[wikidata/query/gui@master] WIP: optionally load query examples from action API

https://gerrit.wikimedia.org/r/388035

I think we might have to fall back to parsing wikitext – look for {{SPARQL}} and {{SPARQL2}} templates and the preceding === headings.
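
To make the wikitext fallback concrete, here is a rough Python sketch of what such scanning could look like. The pattern, the function name extract_examples_from_wikitext, and the sample wikitext are assumptions made for illustration, not code from the patch.

```python
import re

# One pattern with two alternatives, so matches come back in page order:
#   1. a heading line such as "=== Cats ==="
#   2. a {{SPARQL}} or {{SPARQL2}} transclusion with a query= parameter
TOKEN = re.compile(
    r"^(?P<eq>={2,6})[ \t]*(?P<heading>[^\n]+?)[ \t]*(?P=eq)[ \t]*$"
    r"|\{\{\s*SPARQL2?\s*\|\s*query\s*=(?P<query>.*?)\}\}",
    re.MULTILINE | re.DOTALL,
)


def extract_examples_from_wikitext(wikitext):
    """Pair each SPARQL template with the closest preceding heading."""
    examples = []
    title = None
    for match in TOKEN.finditer(wikitext):
        if match.group("heading") is not None:
            title = match.group("heading")
        else:
            examples.append(
                {"title": title, "query": match.group("query").strip()}
            )
    return examples


# Hypothetical sample wikitext.
WIKITEXT = """=== Cats ===
{{SPARQL2|query=SELECT ?item WHERE { ?item wdt:P31 wd:Q146. }
}}
"""

wikitext_examples = extract_examples_from_wikitext(WIKITEXT)
```

The lazy match stops at the first closing pair of braces, so a query that contains a nested template such as {{!}} is cut off at that point. That is one concrete way in which this approach is fragile.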

@Lucas_Werkmeister_WMDE manually doing regex-style parsing of Wiki markup in JavaScript is a guaranteed path to hell. Trust me on this one :) CCing @Anomie - is there an easy api way to get resolved template parameters on a wiki page via a GET request?
UPDATE: @Anomie, the goal is to parse this page to get the headers and the query= parameter for each SPARQL query.

manually doing regex-style parsing of Wiki markup in JavaScript is a guaranteed path to hell

True in general, but given that we are talking about one (OK, maybe two) templates with known content, on a page over which we can exercise a measure of control, maybe it's still possible?
Theoretically, we could go as far as to require special markup (like translations do) for this particular page, if it's too hard to find templates - but I don't think it's really that hard, is it?

@Smalyshev I suspect it will be relatively easy to do with the standard API - and if so, why not reuse the existing functionality? POST is a very small price to pay for this (think how often this feature is used - not worth creating a special parser just to avoid a few CPU cycles)

manually doing regex-style parsing of Wiki markup in JavaScript is a guaranteed path to hell. Trust me on this one :)

Yes, I guessed as much :) but we would only be doing that for custom installs anyways (I’d definitely stick to Parsoid for wikidata.org).

Is this the point where my workaround becomes a feature?
Maybe this is a chance to get rid of the Wikitext parsing!

We have a property on Wikidata, but unfortunately we also have a very small size limit.
I think it would be really cool to use SPARQL for querying and federating examples.

Maybe we can share some ideas about different approaches.
@Yurik you can explain more about your use case and constraints.

@daniel pointed out that we can use action=parse&prop=parsetree. This returns an XML tree like this:

<!-- ... -->
<h level="3" i="3">=== <translate><comment><!--T:11--></comment> Cats</translate> ===</h>
\n
<template lineStart="1">
<title>SPARQL2</title>
<part>
<name>query</name>
<equals>=</equals>
<value>SELECT ?item ?itemLabel \nWHERE \n{\n  ?item wdt:P31 wd:Q146.\n  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }\n}\n</value>
</part>
</template>
\n\n

This already contains all the templates with parameters, which is mostly the same as what Parsoid gives us. We would still need to do a bit of parsing, but nothing worse than what we already do to extract the adjacent title of a query example.
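
A minimal Python sketch of the "bit of parsing" this would involve, using only the standard library. It assumes the tree is wrapped in a <root> element, that Translate markers appear as escaped text inside the heading, and that templates of interest sit at the top level; the function names and the sample are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

TEMPLATE_NAMES = {"SPARQL", "SPARQL2"}


def heading_text(h):
    """Return the heading's visible text, skipping <comment> children."""
    parts = [h.text or ""]
    for child in h:
        if child.tag != "comment":
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    text = re.sub(r"</?translate>", "", "".join(parts))
    return text.strip().strip("=").strip()


def value_text(value):
    """Flatten a <value>, mapping the {{!}} workaround back to a pipe."""
    parts = [value.text or ""]
    for child in value:
        name = (child.findtext("title") or "").strip()
        if child.tag == "template" and name == "!":
            parts.append("|")
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


def extract_examples_from_parsetree(parsetree_xml):
    """Pair each SPARQL template with the closest preceding heading."""
    examples = []
    title = None
    for node in ET.fromstring(parsetree_xml):  # top-level nodes, in order
        if node.tag == "h":
            title = heading_text(node)
        elif node.tag == "template":
            if (node.findtext("title") or "").strip() not in TEMPLATE_NAMES:
                continue
            for part in node.findall("part"):
                value = part.find("value")
                if (part.findtext("name") or "").strip() == "query" and value is not None:
                    examples.append(
                        {"title": title, "query": value_text(value).strip()}
                    )
    return examples


# Hypothetical sample shaped like the tree quoted above.
SAMPLE = (
    '<root><h level="3" i="3">=== &lt;translate&gt;'
    '<comment>&lt;!--T:11--&gt;</comment> Cats&lt;/translate&gt; ===</h>\n'
    '<template lineStart="1"><title>SPARQL2</title><part><name>query</name>'
    '<equals>=</equals><value>SELECT ?item WHERE { ?item wdt:P31 wd:Q146. }\n'
    '</value></part></template>\n</root>'
)

parsetree_examples = extract_examples_from_parsetree(SAMPLE)
```

Other nested templates inside a query value are only flattened to their text here; resolving them properly would need another round trip to the API.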

@Anomie - is there an easy api way to get resolved template parameters on a wiki page via a GET request?

No. That would involve some particularly deep diving into the parser internals.

You can use action=parse&prop=parsetree to get the wikitext annotated with XML-style tags setting off the templates and their unresolved parameters' wikitext. I suppose you might then pass that wikitext back into action=parse or action=expandtemplates to "resolve" it, if necessary.
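
For reference, a small Python sketch of how such a request might be put together. The endpoint URL and page title are placeholders, the helper names are hypothetical, and the two response shapes handled below reflect my understanding of formatversion=1 versus formatversion=2 output rather than anything stated in this task.

```python
from urllib.parse import urlencode


def parsetree_url(api_base, page):
    """Build a GET URL asking action=parse for the page's parse tree."""
    params = {
        "action": "parse",
        "page": page,
        "prop": "parsetree",
        "format": "json",
        "formatversion": "2",
    }
    return api_base + "?" + urlencode(params)


def parsetree_from_response(data):
    """Pull the XML string out of a decoded JSON response.

    Assumption: formatversion=2 returns the tree as a plain string, while
    the older format nests it under a "*" key.
    """
    tree = data["parse"]["parsetree"]
    return tree["*"] if isinstance(tree, dict) else tree


# Placeholder endpoint and page title, for illustration only.
example_url = parsetree_url("https://wiki.example/w/api.php", "Project:Examples")
```

The resulting XML string can then be handed to any XML parser; no POST is needed for this request.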

You might also be able to use TemplateSandbox to replace the actual template with something that just prints the parameter out in a machine-readable format. I note the specific template you're looking at there seems to already do that, embedding the parameter into a link.

Your best bet for general parsing is to use something like mwparserfromhell; I don't know if there's a JavaScript version of something like that.

TemplateSandbox doesn’t seem to be part of a default MediaWiki installation, so it wouldn’t help in @Addshore’s case. I think the parse tree is our best bet for now – there shouldn’t be any nested templates (other than the {{!}} workaround for | in queries), so I hope another action=parse or something won’t be necessary.

Change 388035 abandoned by Lucas Werkmeister (WMDE):
WIP: optionally load query examples from action API

Reason:
– this now has merge conflicts in all of the files it touches, and can’t work without a lot more changes, as detailed in the linked tasks.

https://gerrit.wikimedia.org/r/388035

Change 548490 had a related patch set uploaded (by Lucas Werkmeister (WMDE); owner: Lucas Werkmeister (WMDE)):
[wikidata/query/gui@master] Load example queries from parsed wikitext

https://gerrit.wikimedia.org/r/548490

Change 548490 merged by jenkins-bot:
[wikidata/query/gui@master] Load example queries from parsed wikitext

https://gerrit.wikimedia.org/r/548490

Announcement made on the example queries talk page, on project chat, and with a different, more third-party-oriented version on the wikidata-tech and Wikibase mailing lists.

@Lucas_Werkmeister_WMDE In my opinion we can probably close this task now?
I for one am already using this code / feature in the wild.
Thoughts?