There is an early-stage Python library for parsing Parsoid HTML, meant to make research and technical tasks easier, such as extracting plaintext from articles or identifying whether an article has an infobox. There are also a lot of open issues that would be great to hack through. Some are more technical, but others just require some knowledge of Wikipedia or a willingness to look for edge cases, such as figuring out the best heuristic for identifying infoboxes consistently across languages.
Overview: https://techblog.wikimedia.org/2023/02/24/from-hell-to-html/
Open issues: https://gitlab.wikimedia.org/repos/research/html-dumps/-/issues
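
To give a feel for the two tasks mentioned above (plaintext extraction and infobox detection), here is a rough sketch using only Python's standard-library `html.parser`. It is not the library's API: the names `InfoboxDetector`, `PlainTextExtractor`, `has_infobox`, and `plain_text` are made up for illustration. The heuristic also assumes infoboxes carry a class starting with `infobox`, which is a common convention on English Wikipedia but may not hold on other language editions, and that is exactly the kind of edge case the open issues are about.

```python
from html.parser import HTMLParser


class InfoboxDetector(HTMLParser):
    """Flags any element with a class token starting with 'infobox'.

    Assumption: the 'infobox' class convention; other wikis may differ.
    """

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # A valueless class attribute is reported as None, hence the `or ""`.
        classes = (dict(attrs).get("class") or "").lower().split()
        if any(c.startswith("infobox") for c in classes):
            self.found = True


class PlainTextExtractor(HTMLParser):
    """Collects text while skipping tables, styles, and scripts."""

    SKIP = {"table", "style", "script"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def has_infobox(html: str) -> bool:
    parser = InfoboxDetector()
    parser.feed(html)
    parser.close()
    return parser.found


def plain_text(html: str) -> str:
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = (
        '<body><table class="infobox vcard"><tr><td>Born</td></tr></table>'
        "<p>Ada <b>Lovelace</b> was a mathematician.</p></body>"
    )
    print(has_infobox(sample))
    print(plain_text(sample))
```

Skipping every `table` is a blunt choice that also drops ordinary data tables; a real implementation would probably want something more selective.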