We want to be able to easily point to or extract leads from the Wiki dataset.
- figure out whether the structured page and leads should be treated as one effort or handled separately (see modelling below)
- figure out workflow approach:
- which datasets can be used to test out lead extraction?
- handpick # articles
- use simple wiki
- 500-1000 articles selected by most viewed, by category, or by some other criterion?
- will the same dataset be used for the structured page?
- testing against existing endpoints: https://www.mediawiki.org/wiki/Page_Content_Service#/page/summary and/or https://www.mediawiki.org/wiki/Extension:TextExtracts (options to return the first N sentences and/or just the introduction, as HTML or plain text)
- using categories or templates (either to build datasets or to test), e.g. https://en.wikipedia.org/wiki/Category:Pages_missing_lead_section and https://en.wikipedia.org/wiki/Template:Lead_missing
- modelling: packaging: separate endpoint, or fit into current APIs? repeat this exercise for all APIs
- modelling: schema: is schema.org enough or do we need to expand it?
- modelling: presentation layer: evaluate which elements, fields, JSON structure, etc.
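One way to assemble the "500-1000 most viewed" candidate set above is the public Wikimedia Pageviews REST API. A minimal sketch, assuming its `top` endpoint; the sample month, project, and the crude namespace filter are illustrative assumptions, not decisions:

```python
"""Sketch: build a candidate test set from the most-viewed articles.

Uses the public Wikimedia Pageviews REST API. The sample month,
project (en.wikipedia), and filtering rules below are assumptions.
"""
import json
import urllib.request

TOP_URL = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/"
           "top/en.wikipedia/all-access/2024/01/all-days")  # assumed sample month

def filter_titles(titles, limit=500):
    """Drop obvious non-articles (Main_Page, Special:, Talk:, ...) - crude filter."""
    keep = [t for t in titles if ":" not in t and t != "Main_Page"]
    return keep[:limit]

def top_articles(limit=500):
    req = urllib.request.Request(
        TOP_URL, headers={"User-Agent": "lead-extraction-eval/0.1 (test)"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # The top endpoint returns one "items" entry with up to 1000 ranked articles.
    return filter_titles([a["article"] for a in data["items"][0]["articles"]], limit)
```

The same call with a different year/month (or a daily granularity) would give alternative samples if one month turns out to be skewed by news events.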
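For the endpoint comparison above, both existing services are public and can be queried side by side for the same title. A rough sketch: the helper names and User-Agent string are mine, while `exintro`, `explaintext`, and `exsentences` are real TextExtracts parameters (used here as alternatives, matching the "first N sentences and/or just the introduction" options):

```python
"""Sketch: fetch the lead for one article from both existing endpoints."""
import json
import urllib.request
from urllib.parse import quote, urlencode

def summary_url(title):
    # Page Content Service summary: plain-text lead lives in the "extract" field.
    return ("https://en.wikipedia.org/api/rest_v1/page/summary/"
            + quote(title, safe=""))

def textextracts_url(title, sentences=None):
    # TextExtracts: exintro = content before the first section; explaintext
    # strips HTML; exsentences returns the first N sentences instead.
    params = {"action": "query", "prop": "extracts", "explaintext": 1,
              "titles": title, "format": "json", "formatversion": 2}
    if sentences:
        params["exsentences"] = sentences
    else:
        params["exintro"] = 1
    return "https://en.wikipedia.org/w/api.php?" + urlencode(params)

def fetch_json(url):
    req = urllib.request.Request(
        url, headers={"User-Agent": "lead-extraction-eval/0.1 (test)"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def leads(title):
    """Return (pcs_summary_extract, textextracts_intro) for one article."""
    pcs = fetch_json(summary_url(title))["extract"]
    tex = fetch_json(textextracts_url(title))["query"]["pages"][0]["extract"]
    return pcs, tex
```

Diffing the two strings per article would show where the services already disagree, which feeds directly into the "what to test against" question below.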
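The maintenance category mentioned above could also seed a negative test set (pages known to lack a usable lead). A sketch using the real `list=categorymembers` API module; the helper names, page limit, and pagination loop are assumptions:

```python
"""Sketch: collect article titles from a maintenance category such as
Category:Pages_missing_lead_section, via the MediaWiki Action API."""
import json
import urllib.request
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def member_titles(response):
    """Pull plain titles out of one categorymembers API response."""
    return [m["title"] for m in response["query"]["categorymembers"]]

def category_members(category="Category:Pages_missing_lead_section", limit=200):
    params = {"action": "query", "list": "categorymembers", "cmtitle": category,
              "cmnamespace": 0,  # main namespace only
              "cmlimit": min(limit, 500), "format": "json", "formatversion": 2}
    titles, cont = [], {}
    while len(titles) < limit:
        url = API + "?" + urlencode({**params, **cont})
        req = urllib.request.Request(
            url, headers={"User-Agent": "lead-extraction-eval/0.1 (test)"})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        titles += member_titles(data)
        cont = data.get("continue", {})  # API continuation token, if any
        if not cont:
            break
    return titles[:limit]
```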
To-Do
- figure out the datasets for testing
- figure out what to test against
- figure out how to present test results (visual)
- create ticket(s)