We need to add an abstract field containing the lead section text to the article object so that we can present it in the snapshots, realtime, realtime batch and on-demand API(s).
Acceptance criteria
The abstract field with lead section text is produced by the structured-data service
ToDo
- add new abstract field to the schema
- create function that will extract lead section text from HTML
- should live in the parser repository in wikimedia-enterprise/general, as a method called GetAbstract
- add extraction of the lead section to the articleupdate handler
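As a rough illustration of the schema item above, the sketch below adds an abstract field to a trimmed-down article type and shows how it would serialize. The `Article` struct, its `Name` field, the `encodeArticle` helper and the use of JSON tags are all assumptions made for this example; the real schema definition in the project may look different.

```go
package main

import "encoding/json"

// Article is a hypothetical, trimmed-down stand-in for the article object.
// Only the Abstract field is the subject of this task; Name is included
// just to give the example some context.
type Article struct {
	Name string `json:"name"`
	// Abstract holds the plain text of the lead section.
	// omitempty keeps the key out of the payload when no abstract was extracted.
	Abstract string `json:"abstract,omitempty"`
}

// encodeArticle (hypothetical helper) serializes an article to a JSON string.
func encodeArticle(a Article) (string, error) {
	data, err := json.Marshal(a)
	if err != nil {
		return "", err
	}
	return string(data), nil
}
```

With `omitempty`, consumers of the APIs would see no `abstract` key for articles where extraction produced nothing; dropping the option would emit an empty string instead. Which behaviour is preferable is left open here.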
Test Strategy
The main strategy for now is unit testing: collect around 10 to 20 articles to run the unit tests against.
Decisions to be made based on technical complexity
- should we exclude everything between curly braces in the lead section, or remove only the noise such as pronunciation? Fall back to removing everything between the braces if cleaning only the noise is too complex
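To help weigh the simpler option in the decision above, here is a minimal sketch of removing everything between a pair of delimiters from already-extracted lead text. The function name `stripEnclosed` and its signature are hypothetical and are not taken from the parser repository or the PoC; the delimiters are parameters so the same code can be tried with curly braces or with parentheses.

```go
package main

import "strings"

// stripEnclosed (hypothetical helper) removes every span that starts with
// opening and ends with closing, including nested spans, and then collapses
// runs of whitespace into single spaces.
func stripEnclosed(text string, opening, closing rune) string {
	var b strings.Builder
	depth := 0 // how many unclosed opening delimiters we are inside
	for _, r := range text {
		switch {
		case r == opening:
			depth++
		case r == closing && depth > 0:
			depth--
		case depth == 0:
			// Outside any enclosed span: keep the character.
			b.WriteRune(r)
		}
	}
	// Removing a span usually leaves a double space behind; normalize it.
	return strings.Join(strings.Fields(b.String()), " ")
}
```

For example, `stripEnclosed("Paris (French: [paʁi]) is the capital of France.", '(', ')')` yields `Paris is the capital of France.`. Known limitations of this sketch: a removed span followed directly by punctuation leaves a space before that punctuation, and line breaks in the input are flattened into spaces.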
Notes
Please refer to the leads PoC for an example of the function and its output.