User Story: “As a PM, I want sections to ignore references, citations, and external links,
so that I can compare our sections to other parsers.”
Acceptance criteria
- remove unnecessary sections and external links
- Our section text is more similar to the Huggingface Wikipedia dataset
ToDo
- remove unnecessary sections and external links
- Our section text is more similar to the Huggingface Wikipedia dataset
- Tech session presentation
Test Strategy
Use the normal parser unit tests
Description (optional)
For complete transparency, these are the exact changes:
- For infoboxes we now extract: .infobox, .infobox_v3, .infobox_v2, .sinottico (Italian infoboxes were added in this update)
- When parsing sections, we remove the DOM nodes matching these CSS selectors:
figure, footer, input, link, nav, noscript, script, style, sub, sup, table, .book, .catlinks, .citation, .gallery, .gallerybox, .hatnote, .listaref, .metadata, .mw-authority-control, .mw-editsection-bracket, .mw-editsection-divider, .mw-editsection-like, .mw-editsection-visualeditor, .mw-editsection, .mw-footer-container, .mw-gallery-packed, .mw-magiclink-isbn, .mw-mf-linked-projects, .mw-redirectedfrom, .NavFrame, .navigation-not-searchable, .noprint, .normdaten, .portal-bar, .printfooter, .refbegin-columns, .refbegin, .references, .reflist, .side-box, .sister-box, .sistersitebox, .vector-body-before-content, .vector-dropdown, .vector-header-container, .vector-page-toolbar, .vector-settings, *[role="navigation"]
- When parsing sections, we remove the nodes matching these CSS selectors together with their parent DOM nodes:
#References, #Explanatory_notes, #Further_reading, #See_also, #External_links, #Notes, #Notable_people, #Primary_sources, #Secondary_sources, #Tertiary_sources, #Citations, #General_and_cited_sources, #Bibliography, #Referencias, #Véase_también, #Bibliografía, #Enlaces_externos, #Ciudades_hermanas, #Referencias_y_notas, #Literatur, #Weblinks, #Einzelnachweise, #Références, #Voir_aussi, #Bibliographie, #Liens_externes, #Notes_et_références, #Notes, #Références, #Galerie_photos, #Liens_externes, #Notes_et_références, #Notes, #Références, #Bibliographie, #Voir_aussi, #Referencias, #Véase_también, #Bibliografía, #Enlaces_externos, #Ciudades_hermanas, #Referencias_y_notas, #Referências, #Ver_também, #Bibliografia, #Ligações_externas, #Notas, #Referências, #Annexes, #Bibliographie, #Liens_externes, #Notes_et_références, #Notes, #Références, #Article_connexe, #Bibliografia, #Voci_correlate, #Altri_progetti, #Collegamenti_esterni, #Note_e_riferimenti, #Note, #Riferimenti
- When parsing sections, we now remove all infoboxes
- When parsing sections, we remove trailing spaces from the end of sentences
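
For reviewers who want to see the clean-up rules above in action, here is a minimal sketch in Python. It is not the parser's actual code: it uses the standard library's xml.etree.ElementTree on a small well-formed snippet, and it only supports bare tag, .class, and #id selectors (so not *[role="navigation"]). The names REMOVE_SELECTORS, REMOVE_WITH_PARENT, matches, detach, remove_matching, and section_text are hypothetical, the selector lists are short subsets of the full lists, and "trailing spaces" is assumed to mean whitespace at the end of each line of text.

```python
import re
import xml.etree.ElementTree as ET

# Illustrative subsets of the selector lists in this description (assumption).
REMOVE_SELECTORS = ["sup", "table", ".reflist", ".hatnote", ".infobox"]
REMOVE_WITH_PARENT = ["#References", "#External_links", "#See_also"]


def matches(el, selector):
    """Match a bare tag name, a .class, or an #id (no compound selectors)."""
    if selector.startswith("#"):
        return el.get("id") == selector[1:]
    if selector.startswith("."):
        return selector[1:] in el.get("class", "").split()
    return el.tag == selector


def detach(holder, target):
    """Remove target from holder, keeping the text that follows it (its tail)."""
    siblings = list(holder)
    idx = siblings.index(target)
    tail = target.tail or ""
    if idx > 0:
        prev = siblings[idx - 1]
        prev.tail = (prev.tail or "") + tail
    else:
        holder.text = (holder.text or "") + tail
    holder.remove(target)


def remove_matching(root, selectors, with_parent=False):
    """Detach every matching element, or its parent when with_parent is set."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for el in list(root.iter()):
        if not any(matches(el, s) for s in selectors):
            continue
        target = parent_of.get(el, el) if with_parent else el
        holder = parent_of.get(target)
        # Skip the root and anything already detached by an earlier match.
        if holder is not None and target in list(holder):
            detach(holder, target)


def section_text(root):
    """Collect the remaining text, dropping trailing spaces and empty lines."""
    text = "".join(root.itertext())
    lines = [re.sub(r"[ \t]+", " ", line).rstrip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line.strip())


HTML = """<div>
<p class="hatnote">For other uses, see Paris (disambiguation).</p>
<table class="infobox"><tr><td>Country</td><td>France</td></tr></table>
<p>Paris<sup>[1]</sup> is the capital of France.   </p>
<h2><span id="References">References</span></h2>
<div class="reflist"><ol><li>A source.</li></ol></div>
</div>"""

root = ET.fromstring(HTML)
remove_matching(root, REMOVE_SELECTORS)
remove_matching(root, REMOVE_WITH_PARENT, with_parent=True)
print(section_text(root))  # prints: Paris is the capital of France.
```

In this sketch, removing the parent of #References drops only the heading element that wraps the anchor; the reference list itself goes away because it matches .reflist. The text after a removed node is preserved on purpose, so dropping a sup footnote marker does not also drop the rest of the sentence.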