Make improvements to mwparserfromhtml
Closed, Resolved (Public)

Description

There's the start of a Python library (mwparserfromhtml) for parsing Parsoid HTML, meant to make research and technical tasks easier, such as extracting plaintext from articles or identifying whether an article has an infobox. There are also a lot of open issues that would be great to hack through. Some are more technical, but others just require some knowledge of Wikipedia or a willingness to look for edge cases, such as figuring out the best heuristic for identifying infoboxes consistently across languages.

Overview: https://techblog.wikimedia.org/2023/02/24/from-hell-to-html/
Open issues: https://gitlab.wikimedia.org/repos/research/html-dumps/-/issues
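
To give a feel for the infobox question mentioned above, here is a minimal sketch of one possible heuristic. It uses only Python's standard-library `html.parser` and does not use or represent the mwparserfromhtml API. The names `InfoboxDetector` and `has_infobox` are hypothetical, and the rule itself (any element whose class name contains "infobox") is an assumption that is likely to miss wikis that name their infobox classes differently, which is exactly the kind of edge case worth looking for.

```python
from html.parser import HTMLParser


class InfoboxDetector(HTMLParser):
    """Flags a document if any element has a class name containing 'infobox'."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        class_value = dict(attrs).get("class") or ""
        # Assumption: infobox markup carries a class such as "infobox" or
        # "infobox vcard". This is a heuristic, not a cross-language rule.
        if any("infobox" in name.lower() for name in class_value.split()):
            self.found = True


def has_infobox(html: str) -> bool:
    """Return True if the HTML string matches the class-name heuristic."""
    detector = InfoboxDetector()
    detector.feed(html)
    return detector.found
```

For example, `has_infobox('<table class="infobox vcard"></table>')` matches, while a page whose infobox uses a localized class name would not, so a more robust approach would probably need per-language rules or other signals.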

Event Timeline

Thanks for participating in the Hackathon! We hope you had a great time.

  • If this task was being worked on and resolved at the Hackathon: Please change the task status to resolved via the Add Action...Change Status dropdown, and make sure that this task has a link to the public codebase.
  • If this task is still valid and should stay open: Please add another active project tag to this task, so others can find this task (as likely nobody in the future will look back at the Hackathon workboard when trying to find something they are interested in).
  • In case there is nothing else to do for this task, or nobody plans to work on this task anymore: Please set the task status to declined.

Thank you,
Phabricator housekeeping service