{Machine Readability} Parsing Sections
Open, Needs Triage, Public

Description

OKR 4.3

This Sections epic has two goals:

  1. As a user, I'd like to have easy access to pre-parsed sections: headings and plain text below
  2. As a user, I'd like to have the above pre-parsed sections enriched: include paragraphs, links and references
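To make the two goals concrete, here is a minimal sketch of what a pre-parsed section could look like when serialized. The type names (`Section`, `Link`), the field names, and the `encodeSection` helper are all hypothetical and are not taken from the existing Section PoC or from parser.go. Goal 1 is covered by the `Name` and `Text` fields alone; goal 2 adds the optional enriched fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Link is a hypothetical shape for a hyperlink found inside a section.
type Link struct {
	Text string `json:"text"`
	URL  string `json:"url"`
}

// Section is a hypothetical shape for one pre-parsed section.
// Goal 1 needs only Name (the heading) and Text (the plain text below it).
// Goal 2 adds the enriched fields, which are omitted from the JSON when empty.
type Section struct {
	Name       string   `json:"name"`
	Text       string   `json:"text"`
	Paragraphs []string `json:"paragraphs,omitempty"`
	Links      []Link   `json:"links,omitempty"`
	References []string `json:"references,omitempty"`
}

// encodeSection serializes a section to compact JSON.
// It returns an empty string if encoding fails.
func encodeSection(s Section) string {
	b, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Goal 1: heading and plain text only.
	basic := Section{Name: "History", Text: "Founded in 1900."}
	fmt.Println(encodeSection(basic))

	// Goal 2: the same section enriched with a link.
	enriched := basic
	enriched.Links = []Link{{Text: "example", URL: "https://example.org"}}
	fmt.Println(encodeSection(enriched))
}
```

Because the enriched fields use `omitempty`, a consumer that only needs headings and plain text would see the same payload before and after goal 2 ships, which is one possible way to keep the two goals compatible.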

Dependencies

Checklist

  • Review existing plain text PoC, sync with Ruairi
  • Review existing Section PoC, see code in documentation folder MR POC
  • Decide whether to include the first version of list and table parser as part of this effort
  • Investigate schema options (schema.org (hasPart), other options; bring back to team)
  • document approach and implementation suggestions + 2 examples/approach for pm
  • bring outcomes back to team and decide approach
  • other engineer to test
  • add to Structured Contents dev
  • pm sign off
  • add to Structured Contents prod

Notes

  • RTL languages might use the opposite section order (may require its own investigation)

Event Timeline

JArguello-WMF renamed this task from {Machine Readability} Parsing Sections to Parsing Sections. Jul 19 2023, 2:14 PM
JArguello-WMF added a project: Epic.
JArguello-WMF updated the task description.
ROdonnell-WMF renamed this task from Parsing Sections to Parsing Sections - Migrate MR Section demo JSON to new MR API Prototype. Jul 19 2023, 2:31 PM
ROdonnell-WMF updated the task description.
SDelbecque-WMF renamed this task from Parsing Sections - Migrate MR Section demo JSON to new MR API Prototype to {Machine Readability} Parsing Sections - Migrate MR Section demo JSON to new MR API Prototype. Jul 27 2023, 1:25 PM
SDelbecque-WMF renamed this task from {Machine Readability} Parsing Sections - Migrate MR Section demo JSON to new MR API Prototype to {Machine Readability} Parsing Sections. Sep 7 2023, 1:56 PM
SDelbecque-WMF updated the task description.

Waiting on the ticket to move to the current Sprint (parser TXXX).

Apart from migrating "Section" logic, this ticket has more sub-tasks. I need clarification on specific coding tasks:

  • Is there a technical part to "First of all: investigate whether we need to include html (links, references) into sections or just headings and plain text."?
  • Regarding the dependency on "Credibility work": should this ticket add parsed references to the structured-contents API, or is that already underway by Prabhat in other tickets?
  • What is the scope of work with "Investigate schema options"? What options are in and out of scope? What is the customer deliverable from this sub-task?

@prabhat for the "credibility work", there is a bit of overlap here with parsing Wikipedia reference links. Shall we do the reference parsing in this ticket? If we add the reference parsing to parser.go we can re-use it in the structured-contents API.

The outstanding question is the credibility signals' use of reference links. Should we extract HTML references in this ticket or leave it for another ticket? The Section demo code does have reference extraction, so it's not much work to include it in the code migration ticket.

We talked to @SDelbecque-WMF yesterday; my understanding is: if it's low effort to add it, let's do it!

@SDelbecque-WMF can you please update the description with the OKR this one belongs to? Thanks!