
Improve section parsing - remove citations and related sections (Prod)
Closed, Resolved | Public | 3 Estimated Story Points

Description

User Story: “As a PM, I want sections to ignore references, citations, and external links,
so that I can compare our sections to those of other parsers.”

Acceptance criteria

  1. Unnecessary sections and external links are removed
  2. Our section text is more similar to the Hugging Face Wikipedia dataset
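The ticket does not say how "more similar" is measured. As a minimal sketch, assuming a plain character-level ratio is good enough for a first comparison, the second criterion could be checked like this. The function name `section_similarity` and the sample strings are hypothetical, and `difflib` is a stand-in, not the team's actual metric.

```python
from difflib import SequenceMatcher


def section_similarity(ours: str, reference: str) -> float:
    """Return a 0..1 similarity ratio between two section texts.

    Runs of whitespace are collapsed first so that layout differences
    between parsers do not count against the score.
    """
    a = " ".join(ours.split())
    b = " ".join(reference.split())
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical texts: the reference has no citation marker.
reference_text = "Paris is the capital."
before_change = "Paris is the capital.[1]"
after_change = "Paris is the capital."

score_before = section_similarity(before_change, reference_text)
score_after = section_similarity(after_change, reference_text)
```

A higher score after the change than before it would be one piece of evidence that the criterion is met for that section.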

ToDo

  • Remove unnecessary sections and external links
  • Make our section text more similar to the Hugging Face Wikipedia dataset
  • Tech session presentation

Test Strategy

Use the normal parser unit tests.

Description (optional)

For complete transparency, these are the exact changes:

  • For infoboxes we now extract: .infobox, .infobox_v3, .infobox_v2, .sinottico (Italian infoboxes were added in this update)
  • When parsing sections, we remove the DOM nodes that match these CSS selectors:
	figure, footer, input, link, nav, noscript, script, style, sub, sup, table,
	.book, .catlinks, .citation, .gallery, .gallerybox, .hatnote, .listaref, .metadata,
	.mw-authority-control, .mw-editsection-bracket, .mw-editsection-divider,
	.mw-editsection-like, .mw-editsection-visualeditor, .mw-editsection,
	.mw-footer-container, .mw-gallery-packed, .mw-magiclink-isbn, .mw-mf-linked-projects, .mw-redirectedfrom,
	.NavFrame, .navigation-not-searchable, .noprint,
	.normdaten, .portal-bar, .printfooter,
	.refbegin-columns, .refbegin, .references, .reflist,
	.side-box, .sister-box, .sistersitebox,
	.vector-body-before-content, .vector-dropdown, .vector-header-container, .vector-page-toolbar, .vector-settings,
	*[role="navigation"]
  • When parsing sections, we remove the elements that match these CSS ID selectors, together with their parent DOM nodes:
	#References, #Explanatory_notes, #Further_reading, #See_also, #External_links, #Notes, #Notable_people,
	#Primary_sources, #Secondary_sources, #Tertiary_sources, #Citations, #General_and_cited_sources, #Bibliography,
	#Referencias, #Véase_también, #Bibliografía, #Enlaces_externos, #Ciudades_hermanas, #Referencias_y_notas,
	#Literatur, #Weblinks, #Einzelnachweise,
	#Références, #Voir_aussi, #Bibliographie, #Liens_externes, #Notes_et_références, #Notes, #Références,
	#Galerie_photos, #Liens_externes, #Notes_et_références, #Notes, #Références, #Bibliographie, #Voir_aussi,
	#Referencias, #Véase_también, #Bibliografía, #Enlaces_externos, #Ciudades_hermanas, #Referencias_y_notas,
	#Referências, #Ver_também, #Bibliografia, #Ligações_externas, #Notas, #Referências,
	#Annexes, #Bibliographie, #Liens_externes, #Notes_et_références, #Notes, #Références, #Article_connexe,
	#Bibliografia, #Voci_correlate, #Altri_progetti, #Collegamenti_esterni, #Note_e_riferimenti, #Note, #Riferimenti
  • When parsing sections, we now remove all infoboxes
  • When parsing sections, we remove trailing spaces from the ends of sentences
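The removal rules above can be sketched as follows. This is a minimal illustration, not the production parser: it uses Python's standard `xml.etree.ElementTree`, so it assumes well-formed markup (real article HTML would need a proper HTML parser), it carries only small subsets of the selector lists, and it treats "trailing spaces" as trailing whitespace on each output line. All names (`find_doomed`, `collect_text`, `clean_section_text`, `SAMPLE`) are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative subsets only; the full selector lists are in the ticket text.
REMOVE_TAGS = {"figure", "script", "style", "sub", "sup", "table"}
REMOVE_CLASSES = {"hatnote", "infobox", "mw-editsection", "references", "reflist"}
REMOVE_HEADING_IDS = {"References", "See_also", "External_links"}
BLOCK_TAGS = {"div", "h2", "h3", "li", "p", "section"}


def find_doomed(root):
    """Return the set of elements that should be dropped from the output."""
    doomed = set()
    for parent in root.iter():
        for child in parent:
            classes = set(child.get("class", "").split())
            if child.tag in REMOVE_TAGS or classes & REMOVE_CLASSES:
                doomed.add(child)
            elif child.get("id") in REMOVE_HEADING_IDS:
                # ID rule: drop the parent node too (never the root itself).
                doomed.add(parent if parent is not root else child)
    return doomed


def collect_text(el, doomed, out):
    """Append the text of `el` to `out`, skipping doomed subtrees.

    The text that follows a skipped child (its tail) is kept, so removing
    a citation marker does not swallow the rest of the sentence.
    """
    if el in doomed:
        return
    if el.text:
        out.append(el.text)
    for child in el:
        collect_text(child, doomed, out)
        if child.tail:
            out.append(child.tail)
    if el.tag in BLOCK_TAGS:
        out.append("\n")


def clean_section_text(markup):
    """Return plain text with unwanted nodes and trailing whitespace removed."""
    root = ET.fromstring(markup)
    pieces = []
    collect_text(root, find_doomed(root), pieces)
    lines = (line.rstrip() for line in "".join(pieces).splitlines())
    return "\n".join(line for line in lines if line)


SAMPLE = (
    '<div>'
    '<section><h2 id="History">History</h2>'
    '<p>Founded in 1850<sup>[1]</sup> by settlers. </p></section>'
    '<section><h2 id="References">References</h2>'
    '<ol class="references"><li>Some source</li></ol></section>'
    '</div>'
)

cleaned = clean_section_text(SAMPLE)
```

In this sample the `<sup>` citation marker is dropped by the tag rule, the whole References `<section>` is dropped because it is the parent of the `#References` heading, and the trailing space after "settlers." is trimmed.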

Event Timeline

ROdonnell-WMF renamed this task from Improve section parsing - remove citations and related sections to Improve section parsing - remove citations and related sections (Dev). Feb 26 2024, 2:03 PM
ROdonnell-WMF moved this task from MR to QA on the Wikimedia Enterprise (Sprint 56) board.

@ROdonnell-WMF will take a look and decide whether we move ahead or revert the MR.

After reviewing the code changes, I've changed this ticket status to declined.

We can introduce the features individually, as the product needs them.

I'll cancel the MR and revert the dev API.

Tickets are merged to DEV. Will QA test this afternoon.

If all is good, I'll create an MR to merge to PROD.

ROdonnell-WMF renamed this task from Improve section parsing - remove citations and related sections (Dev) to Improve section parsing - remove citations and related sections (Prod). Mar 6 2024, 10:40 AM
ROdonnell-WMF moved this task from QA to MR on the Wikimedia Enterprise (Sprint 56) board.

DEV is QA-tested and passes my shakedown tests. I documented my tests and saved them to Google Drive as a Postman collection.

I've renamed this ticket to deploy to PROD and moved it to MR.

Waiting for MR approvals; I will deploy once they are in.