JSON-LD selection doesn't work
Closed, Invalid | Public | BUG REPORT

Description

Steps to replicate the issue (include links if applicable):
Web page: https://www.tportal.hr/vijesti/clanak/americka-bomba-konacno-stigla-u-ukrajinu-sto-ce-se-sad-dogoditi-na-ratistu-foto-20240224

Author full-name extraction steps:

  • select the element containing JSON-LD via XPath: /html/head/script[@type="application/ld+json"]

Works OK, returns:

{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://www.tportal.hr/vijesti/clanak/americka-bomba-konacno-stigla-u-ukrajinu-sto-ce-se-sad-dogoditi-na-ratistu-foto-20240224"
  },
  "headline": "Američka bomba konačno stigla u Ukrajinu: Što će se sad dogoditi na ratištu?",
  "image": {
    "@type": "ImageObject",
    "url": "https://www.tportal.hr/media/thumbnail/1200x720/2216024.jpeg?cropId=0",
    "height": 720,
    "width": 1200
  },
  "datePublished": "2024-02-24T08:30:45+01:00",
  "dateModified": "2024-02-24T08:30:45+01:00",
  "author": [
    {
      "@type": "Person",
      "name": "Vanja Majetić",
      "url": "https://www.tportal.hr/autor/vanja-majetic"
    }
  ],
  "publisher": {
    "@type": "Organization",
    "name": "tportal.hr",
    "logo": {
      "@type": "ImageObject",
      "url": "https://www.tportal.hr/bundles/tportalpublishing/builds/1.0.204/images/tportal-logo-article-schema.png"
    }
  },
  "description": "GLSDB",
  "keywords": " svijet rat u Ukrajini glsdb bomba malog promjera "
}
  • extract the author's full name via a JSON-LD step: author[0].name
  • Expected: "Vanja Majetić"; returns "Empty array" instead
  • The JMESPath expression was tested on https://jmespath.org/ and works as expected

Template available here: https://meta.wikimedia.org/wiki/User:Ivi104/Web2Cit/data/hr/tportal/www/templates.json

Event Timeline

diegodlh subscribed.

Hi, @Ivi104! Thank you for your interest in Web2Cit and for posting your question here.

The way you tried to use the JSON-LD selection step makes sense, but that is not how it works. Rather than being applied to the JSON-LD output of a previous step (as I think you tried to do), it runs against the entire HTML document, serializing all JSON-LD objects found into an array. According to the documentation, the JMESPath expression is "evaluated against an array including all JSON-LD objects found in the target webpage".
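As a rough illustration of that behaviour, here is a minimal Python sketch that gathers every JSON-LD object in a page into one array. It is not Web2Cit's actual implementation: the `JsonLdCollector` class and the shortened sample HTML are hypothetical, and it assumes the JSON-LD lives in `<script type="application/ld+json">` elements.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Hypothetical helper: collect every JSON-LD object in a document."""

    def __init__(self):
        super().__init__()
        self.objects = []        # all JSON-LD objects, in document order
        self._in_jsonld = False  # True while inside a JSON-LD <script>
        self._buffer = []        # text chunks of the current <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.objects.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


# Shortened, made-up page containing two JSON-LD blocks.
html = """
<html><head>
<script type="application/ld+json">{"@type": "NewsArticle", "author": [{"name": "Vanja Majetić"}]}</script>
<script type="application/ld+json">{"@type": "BreadcrumbList"}</script>
</head><body></body></html>
"""

collector = JsonLdCollector()
collector.feed(html)
collector.close()

# The expression is evaluated against this list, not against a single object.
jsonld_array = collector.objects
```

Because the top-level value is a list, an expression has to pick an element of it before it can reach fields such as `author`.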

I have updated the translation template you sent me to use a JSON-LD selection step directly, rather than chaining it after the XPath step. As you can see, I added [0]. at the beginning of your expression (giving [0].author[0].name) to indicate that I want to select the first JSON-LD object found on the webpage.
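The effect of the [0]. prefix can be sketched in Python. The `evaluate` function below is a deliberately tiny, hypothetical stand-in for a JMESPath engine: it handles only plain field names, `[n]` indexes and `.`, and the sample data is a trimmed copy of the article's JSON-LD. It shows why the original expression works against a single object (as on jmespath.org) but finds nothing when the input is an array.

```python
import re


def evaluate(expression, data):
    """Hypothetical mini-evaluator: supports only `field`, `[n]` and `.`."""
    tokens = re.findall(r"\[(\d+)\]|([A-Za-z_][A-Za-z0-9_]*)", expression)
    current = data
    for index, field in tokens:
        if index:
            # Indexing anything other than a long-enough list yields None.
            if isinstance(current, list) and int(index) < len(current):
                current = current[int(index)]
            else:
                return None
        else:
            # A field lookup on anything other than an object yields None.
            if isinstance(current, dict):
                current = current.get(field)
            else:
                return None
    return current


# Trimmed version of the page's JSON-LD, wrapped in an array as described above.
article = {
    "@type": "NewsArticle",
    "author": [{"@type": "Person", "name": "Vanja Majetić"}],
}
jsonld_array = [article]

against_object = evaluate("author[0].name", article)           # single object
against_array = evaluate("author[0].name", jsonld_array)       # array, no prefix
with_prefix = evaluate("[0].author[0].name", jsonld_array)     # array, with prefix

print(against_object, against_array, with_prefix)
```

In this sketch the unprefixed expression yields `None` against the array because an array has no `author` field, which is consistent with the empty result reported in the description.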

Also note that I have replaced /** with an actual path in the template's "path" field. Translation templates should be based on a specific webpage (indicated by its path). This is important to let other contributors understand the decisions that you have made, and also for the correct functioning of the Web2Cit monitor.

I'm closing this issue for now. Please reopen it if you think I've misunderstood your question :)