User Story: “As a PM, I want the sections parser to remove more instances of references, extra links, and sources (which appear at the end of articles and in different languages),
so that I can improve the section output of the MR parser.”
Acceptance criteria
- Add more filters to remove trailing sections such as references, portals, sources, "related to", etc. from the end of articles
- The header names should be used as section identifiers/labels in our filter rules
ToDo
- Create a fuzzy string matcher that checks the heading text for "reference"-like words. Keep a lookup table of all these words and include more languages, so that we catch more of these footer sections
- Use the fuzzy string matcher to remove sections from the JSON output
- Check the parser metrics for any increase in false negatives (FN) and false positives (FP)
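A minimal sketch of the matcher and the removal step described in the ToDo items, using only the Python standard library (difflib). Everything named here is an assumption for illustration, not the parser's actual API: the FOOTER_LABELS table and its entries, the 0.85 threshold, the function names, and the shape of a section as a dict with a "heading" key.

```python
import difflib
import unicodedata

# Hypothetical lookup table: language code -> footer heading labels.
# The entries are examples only and would need review per language.
FOOTER_LABELS = {
    "en": ["references", "external links", "sources", "see also", "notes"],
    "de": ["einzelnachweise", "weblinks", "quellen", "siehe auch"],
    "fr": ["références", "liens externes", "voir aussi"],
    "es": ["referencias", "enlaces externos", "fuentes"],
}


def normalize(text):
    """Casefold, apply Unicode NFKC and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(text.split())


def is_footer_heading(heading, labels=FOOTER_LABELS, threshold=0.85):
    """Return True if the heading is close enough to any known footer label."""
    candidate = normalize(heading)
    for lang_labels in labels.values():
        for label in lang_labels:
            matcher = difflib.SequenceMatcher(None, candidate, normalize(label))
            if matcher.ratio() >= threshold:
                return True
    return False


def strip_footer_sections(sections, labels=FOOTER_LABELS, threshold=0.85):
    """Drop matching sections from the end of the list only.

    Assumes each section is a dict with a "heading" key. Walking backwards
    and stopping at the first non-match keeps a body section that happens
    to be titled "Sources" from being removed.
    """
    kept = list(sections)
    while kept and is_footer_heading(kept[-1]["heading"], labels, threshold):
        kept.pop()
    return kept


sections = [
    {"heading": "History"},
    {"heading": "Career"},
    {"heading": "Refrences"},  # misspelled on purpose
    {"heading": "Weblinks"},
]
remaining = [s["heading"] for s in strip_footer_sections(sections)]
```

Stopping at the first non-matching section is one possible reading of "last sections"; an unknown footer heading in between would end the scan early, which is where a larger lookup table helps. The threshold trades FP against FN and would need tuning against the parser metrics.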
Checklist for testing
- Continue to use snapshot testing as before, and update the snapshot test folder.
Things to consider:
- We need a configurable way to add header labels for removal when we extend the parser to more languages
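One way the configurable labels could work, sketched under assumptions: labels live in a JSON document keyed by language code and are merged into built-in defaults at load time. The JSON layout, DEFAULT_LABELS, and load_footer_labels are hypothetical names, not existing parser code.

```python
import json

# Hypothetical built-in defaults; real defaults would live with the parser.
DEFAULT_LABELS = {"en": ["references", "external links", "sources"]}


def load_footer_labels(config_text, defaults=DEFAULT_LABELS):
    """Merge labels from a JSON string into a copy of the defaults."""
    merged = {lang: list(labels) for lang, labels in defaults.items()}
    extra = json.loads(config_text)
    for lang, labels in extra.items():
        bucket = merged.setdefault(lang, [])
        for label in labels:
            if label not in bucket:  # skip duplicates, keep order
                bucket.append(label)
    return merged


# Example config, as it might be read from a file next to the parser.
config_text = '{"de": ["einzelnachweise", "weblinks"], "en": ["see also", "sources"]}'
labels = load_footer_labels(config_text)
```

With this layout, supporting a new language means adding a key to the config rather than changing parser code.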