
Implement JSON-LD selection
Closed, Resolved / Public / Feature

Description

JSON-LD seems to be widely used. It is actually the structured data format recommended by Google Search.

However, it is not yet supported by Zotero or Citoid (as far as I know, although this is pending confirmation from @Mvolz). Hence, supporting this in Web2Cit may make a big difference.

In theory, because JSON-LD is embedded in the HTML, we should be able to select it with the XPath selection step (already implemented). But isolating the relevant metadata afterwards with transformation steps would be tedious and brittle.

A specific JSON-LD selection step may have a configuration value to indicate what root or nested property in the JSON-LD one wants to select.

Because multiple JSON-LD objects may be present in one webpage, we should provide a way to indicate which one we want to select from. Should we concatenate all JSON-LD objects in a single array and select one of them using [n] indexing before indicating which property we want? For example, [1].author.name would select the name property of the author object of the second ([1]) JSON-LD object.
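To make the indexing proposal concrete, here is a minimal sketch of how a path like [1].author.name could be resolved against the concatenated array. The helper name selectByPath and the sample data are made up for illustration; this is not an existing Web2Cit function.

```javascript
// Hypothetical helper (not part of Web2Cit): resolves a path such as
// "[1].author.name" against the array of concatenated JSON-LD objects.
function selectByPath(objects, path) {
  // Split the path into tokens: "[1].author.name" -> [1, "author", "name"].
  const tokens = [];
  const pattern = /\[(\d+)\]|([^.\[\]]+)/g;
  let match;
  while ((match = pattern.exec(path)) !== null) {
    tokens.push(match[1] !== undefined ? Number(match[1]) : match[2]);
  }
  // Walk the structure one token at a time; stop when a value is missing.
  let current = objects;
  for (const token of tokens) {
    if (current === null || typeof current !== "object") return undefined;
    current = current[token];
  }
  return current;
}

// Made-up sample: two JSON-LD objects found on the same page.
const jsonld = [
  { "@type": "WebSite", name: "Example News" },
  { "@type": "NewsArticle", author: { "@type": "Person", name: "Jane Doe" } },
];

console.log(selectByPath(jsonld, "[1].author.name")); // prints: Jane Doe
console.log(selectByPath(jsonld, "[0].author.name")); // prints: undefined
```

The drawback of this approach is that the user has to know the position of the object in the page, which may vary between pages of the same site.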

Alternatively, is there a standard JSON-LD query language, similar to XPath, we can use?

Event Timeline

Alternatively, is there a standard JSON-LD query language, similar to XPath, we can use?

XPath v3.1 supports JSON

I've been experimenting with JSON-LD selection. What I'm doing so far is:

  1. Concatenate multiple JSON-LD objects in a webpage into a single array (some webpages may have more than one JSON-LD object)
  2. Use JSONPath to select nodes.

You can make the following code into a bookmarklet and use it to copy the JSON-LD array described in step 1 from any webpage:

function concatAndCopy() {
  let jsonld = [];
  // Collect every JSON-LD script element on the page
  Array.from(
    document.querySelectorAll('script[type="application/ld+json"]')
  ).forEach((script) => {
    const content = script.innerHTML;
    try {
      const json = JSON.parse(content);
      // concat() appends a single object, or the items of a top-level array
      jsonld = jsonld.concat(json);
    } catch {
      console.log("failed to parse");
    }
  });
  navigator.clipboard.writeText(JSON.stringify(jsonld, undefined, 2)).then(() => {
    alert("JSON-LD copied to clipboard!");
  });
}
concatAndCopy();

Then, paste the JSON-LD array to services like https://jsonpath.com/ to experiment with JSONPath.

For example, for URL https://abcnews.go.com/Politics/anger-outrage-meet-exasperation-math-guns-note/story?id=84977728 (with which we were having some trouble earlier today using XPath), $..author..name and $..datePublished seem to work pretty well to retrieve authors and publication date, respectively. @Pcoombe, FYI.
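For anyone unfamiliar with the .. operator, here is a rough sketch of what recursive descent does. It is a hand-written stand-in covering only that one feature, not a JSONPath library, and the function name descend and the sample data are made up (they are not taken from the page above).

```javascript
// Illustrative stand-in for JSONPath recursive descent ("..key").
// Collects the value of every property called `key`, at any depth.
function descend(node, key, results = []) {
  if (Array.isArray(node)) {
    for (const item of node) descend(item, key, results);
  } else if (node !== null && typeof node === "object") {
    for (const [k, value] of Object.entries(node)) {
      if (k === key) results.push(value);
      descend(value, key, results);
    }
  }
  return results;
}

// Made-up JSON-LD array, shaped like the output of the bookmarklet.
const jsonld = [
  {
    "@type": "NewsArticle",
    datePublished: "2022-05-26T10:00:00Z",
    author: [
      { "@type": "Person", name: "Jane Doe" },
      { "@type": "Person", name: "John Roe" },
    ],
    publisher: { "@type": "Organization", name: "Example News" },
  },
];

// Roughly what $..datePublished selects.
const dates = descend(jsonld, "datePublished");

// Roughly what $..author..name selects: names found below any author.
const authorNames = descend(jsonld, "author").flatMap((author) =>
  descend(author, "name")
);

console.log(dates); // [ '2022-05-26T10:00:00Z' ]
console.log(authorNames); // [ 'Jane Doe', 'John Roe' ]
```

Note that the publisher name is not returned, because the second descent only looks below author values.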

I haven't deployed this as a selection step, yet. Any thoughts?

diegodlh changed the subtype of this task from "Task" to "Feature Request".Aug 31 2022, 2:03 PM

JSONPath supports script expressions. However, this can be a security problem, particularly in our case where expressions are user-defined.

This is handled differently in different JSONPath implementations:

  • dchester's implementation uses static-eval, which would be safer than using eval. However, this would still be insecure, which is the reason why jsonpath-sandbox (below) appeared. In addition, this package seems to be no longer maintained (last commit from Apr 2021), and a static-eval security update seems to be needed already. Unfortunately, there seems to be no way to disable script evaluation, unlike in jsonpath-plus (see below).
  • the jsonpath-sandbox implementation addresses the security problem of using static-eval mentioned above by using Google's V8. However, because this library is not supported in browsers, it may be a problem with Web2Cit-Editor.
  • jsonpath-plus uses less-secure eval instead of static-eval. However, script evaluation may be disabled altogether (via the preventEval option). This seems to be the most promising compromise so far.

JSONPath seems to be quite popular, but it hasn't been standardized (although there seem to be efforts in that direction, see here, here and here).

Alternatively, we may use JSON Pointers, which are standardized already (RFC 6901, https://datatracker.ietf.org/doc/html/rfc6901), and would not pose the security problems of script evaluation. However, they seem to be very basic.
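To show how basic JSON Pointers are, here is a minimal sketch of pointer resolution following the rules in RFC 6901. The function name resolvePointer and the sample data are made up for illustration; this is not a proposal for the actual implementation.

```javascript
// Minimal JSON Pointer (RFC 6901) resolver, for illustration only.
// Returns undefined when the pointer does not match anything.
function resolvePointer(doc, pointer) {
  if (pointer === "") return doc; // the empty pointer is the whole document
  if (!pointer.startsWith("/")) {
    throw new SyntaxError(`Invalid JSON Pointer: ${pointer}`);
  }
  // Split into reference tokens and unescape: "~1" -> "/", then "~0" -> "~".
  const tokens = pointer
    .slice(1)
    .split("/")
    .map((token) => token.replace(/~1/g, "/").replace(/~0/g, "~"));

  let current = doc;
  for (const token of tokens) {
    if (current === null || typeof current !== "object") return undefined;
    // Arrays only accept plain non-negative indices without leading zeros.
    if (Array.isArray(current) && !/^(0|[1-9]\d*)$/.test(token)) {
      return undefined;
    }
    if (!Object.prototype.hasOwnProperty.call(current, token)) {
      return undefined;
    }
    current = current[token];
  }
  return current;
}

// Made-up sample: two JSON-LD objects concatenated into one array.
const jsonld = [
  { "@type": "WebSite", name: "Example News" },
  { "@type": "NewsArticle", author: { "@type": "Person", name: "Jane Doe" } },
];

console.log(resolvePointer(jsonld, "/1/author/name")); // prints: Jane Doe
console.log(resolvePointer(jsonld, "/0/@type")); // prints: WebSite
```

There are no wildcards, filters or recursive descent: every step of the path, including the array index, has to be spelled out, so nothing like $..datePublished can be expressed.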

I found another problem while trying to implement JSON-LD selection using jsonpath-plus. Apparently, there is no way to validate the syntax of an expression before actually trying it against a JSON object. And even then, it may not fail if the point of error is not reached during the query.

As a workaround, I thought of using jsonpath's parse() to check the syntax while setting a configuration value. Although this seems to work, (1) we would be losing the extra features provided by jsonpath-plus (because they wouldn't be valid syntax for jsonpath), and (2) we can't tell it to fail on script evaluation expressions (which will fail on trying the expression, and again only if this point is actually reached).

While trying these, I found out about JMESPath and jq. I'll look into them now.

diegodlh claimed this task.

Done in e5f92da7.

Will be available on Web2Cit-Core v2, which will be used by Web2Cit-Server v1.1. We may make it available for testing on w2c-beta.toolforge.org soon.

Documentation already available on the Templates docs page.

After reading these two posts, I used JMESPath in the end. Although it doesn't support recursive traversal, it seems to be the only JSON query language backed by a specification, and the corresponding JS library on npm seems to be widely popular.

We may make it available for testing on w2c-beta.toolforge.org soon.

Done. You can check an example here https://w2c-beta.toolforge.org/sandbox/Diegodlh/https://www.pagina12.com.ar/365595-mi-reloj-interno-una-app-que-permite-mejorar-el-descanso, where JSON-LD selection is being used to get the publication date.