Make improvements to mwparserfromhtml
Closed, Resolved, Public

Assigned To

Authored By

	Isaac
	May 12 2023, 8:55 PM

Description

There's the start of a Python library (mwparserfromhtml) for parsing Parsoid HTML, meant to make research and technical tasks easier, such as extracting plaintext from articles or identifying whether an article has an infobox. There are still a lot of open issues, though, that would be great to hack through -- some are more technical, but others just require some knowledge of Wikipedia or a willingness to look for edge cases, such as figuring out the best heuristic for identifying infoboxes consistently across languages.

Overview: https://techblog.wikimedia.org/2023/02/24/from-hell-to-html/
Open issues: https://gitlab.wikimedia.org/repos/research/html-dumps/-/issues
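To give a feel for the kind of heuristic in question, here is a minimal sketch of an infobox check. It does not use mwparserfromhtml's own API; it relies only on Python's standard `html.parser`, and the names `InfoboxDetector` and `has_infobox` are hypothetical. It assumes an infobox is any element whose `class` attribute contains a token with "infobox" in it, which is exactly the sort of rule that may not hold across all language editions.

```python
from html.parser import HTMLParser


class InfoboxDetector(HTMLParser):
    """Hypothetical helper: flags any element with an 'infobox'-like class."""

    def __init__(self):
        super().__init__()
        self.has_infobox = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").lower().split()
        # Assumption: infobox templates emit a class token containing "infobox".
        if any("infobox" in token for token in classes):
            self.has_infobox = True


def has_infobox(html: str) -> bool:
    """Return True if the HTML has an element matching the class heuristic."""
    detector = InfoboxDetector()
    detector.feed(html)
    detector.close()
    return detector.has_infobox


sample = '<table class="infobox vcard"><tr><td>Born</td></tr></table><p>Text</p>'
print(has_infobox(sample))
```

A class-based rule like this is cheap but will miss wikis whose infobox templates use different class names, so it is a starting point for edge-case hunting rather than a finished answer.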

Event Timeline

Isaac created this task. May 12 2023, 8:55 PM

Restricted Application added a subscriber: Aklapper. May 12 2023, 8:55 PM

Isaac and I just took a full pass at cleaning up the list of issues: https://gitlab.wikimedia.org/repos/research/html-dumps/-/issues

Thanks for participating in the Hackathon! We hope you had a great time.

If this task was being worked on and resolved at the Hackathon: Please change the task status to resolved via the Add Action... → Change Status dropdown, and make sure that this task has a link to the public codebase.
If this task is still valid and should stay open: Please add another active project tag to this task, so others can find this task (as likely nobody in the future will look back at the Hackathon workboard when trying to find something they are interested in).
In case there is nothing else to do for this task, or nobody plans to work on this task anymore: Please set the task status to declined.

Thank you,
Phabricator housekeeping service

Isaac closed this task as Resolved. Jun 12 2023, 1:01 PM
