Remove references and other "linked list" sections from our MR JSON output
Open, Needs Triage, Public

Description

User Story: “As a PM, I want the sections parser to remove more instances of references, extra links, and sources (which appear at the end of articles and in different languages),
so that I can improve the section output of the MR parser.”

Acceptance criteria

  1. Add more filters to remove sections such as references, portals, sources, "related to", etc. from the last sections of the articles
  2. The header names should be used as section identifiers/labels in our filter rules

ToDo

  • Create a fuzzy string matcher that checks the heading text for "reference"-like words. Keep a lookup table of these words and include more languages so that we catch more of these footer sections
  • Use the fuzzy string matcher to remove sections from the JSON output
  • Check the parser metrics for increases in false negatives (FN) and false positives (FP)
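
A minimal sketch of the fuzzy matcher and the removal step from the ToDo list, in Python using the standard-library difflib. Everything named here is an assumption for illustration, not the parser's actual code: the FOOTER_LABELS table and its entries, the 0.85 similarity threshold, and the shape of a section (a dict with a "name" key holding the heading text).

```python
import difflib
import unicodedata

# Hypothetical lookup table: language code -> footer heading labels.
FOOTER_LABELS = {
    "en": ["references", "external links", "sources", "see also", "further reading"],
    "de": ["einzelnachweise", "weblinks", "literatur", "siehe auch"],
    "es": ["referencias", "enlaces externos", "véase también"],
    "fr": ["références", "liens externes", "voir aussi"],
}


def normalize(heading):
    """Apply NFKC, case-fold, and collapse whitespace so comparisons are stable."""
    text = unicodedata.normalize("NFKC", heading).casefold()
    return " ".join(text.split())


def is_footer_heading(heading, labels=FOOTER_LABELS, threshold=0.85):
    """Return True if the heading is close enough to any known footer label."""
    candidate = normalize(heading)
    for words in labels.values():
        for word in words:
            ratio = difflib.SequenceMatcher(None, candidate, normalize(word)).ratio()
            if ratio >= threshold:
                return True
    return False


def remove_footer_sections(sections):
    """Drop every section whose heading fuzzily matches a footer label.

    Assumes each section is a dict with a "name" key (hypothetical schema).
    """
    return [s for s in sections if not is_footer_heading(s.get("name", ""))]
```

For example, a list with headings "History", "Refrences" (misspelled) and "Weblinks" would keep only "History". The threshold is the main knob for the FN/FP check: lowering it catches more misspelled or variant headings but risks removing real content sections.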

Checklist for testing

  • Continue to use snapshot testing as before, and update the snapshot test folder.

Things to consider

  • We need a configurable way to add more header labels to remove when we extend the parser to more languages
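
One possible way to make the header labels configurable is to keep them in a JSON document keyed by language code and merge in extra labels at load time. This is only a sketch: the DEFAULT_CONFIG content, the load_labels function, and the JSON layout are all hypothetical and not part of the existing parser.

```python
import json

# Hypothetical config: language code -> header labels to remove.
DEFAULT_CONFIG = """
{
  "en": ["References", "External links"],
  "pt": ["Referências", "Ligações externas"]
}
"""


def load_labels(config_text, extra=None):
    """Parse the JSON config and merge optional extra labels per language.

    Labels are case-folded so later lookups are case-insensitive.
    """
    labels = {
        lang: {word.casefold() for word in words}
        for lang, words in json.loads(config_text).items()
    }
    for lang, words in (extra or {}).items():
        labels.setdefault(lang, set()).update(word.casefold() for word in words)
    return labels
```

With this layout, supporting a new language means adding one key to the config (or passing it through extra), for example load_labels(DEFAULT_CONFIG, {"it": ["Note"]}), without touching the filter logic.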