Please provide the following information.
- Provide a short summary of your proposed post for the Wikimedia Technical Blog. Blog readers will see this as the preview to your post:
- The Wikimedia Enterprise HTML dumps are an exciting new resource for researchers and developers who want to analyze the content of Wikipedia articles at scale. The community has developed invaluable tools such as mwparserfromhell to parse the wikitext version of an article contained in the "traditional" XML dumps, but no comparable tool currently exists for the HTML dumps. We recently built the first version of a Python library that makes it easy to work with Wikimedia's HTML dumps. This will lower the technical barriers for developers and researchers to access this publicly available resource.
- Which topic type does your blog post fall under? See: https://www.mediawiki.org/wiki/Wikimedia_technical_blog_editorial_guidelines#Outlines_for_topics:
- A feature update | explainer
- Which audience or audiences do you think your post is appropriate for?:
- Researchers who want to use the HTML dumps to analyze Wikipedia's content, especially those who do not have the technical expertise to parse HTML or do not know how Wikipedia elements (links, images, etc.) are rendered from wikitext
- Developers who want to use the HTML dumps to build tools for (semi-)automated editing
- Developers who would like to contribute to the library to further improve the tool
- Will you need assistance with writing your blog post, or do you already have a draft? If you have a draft, please provide a link here:
- Does your post need to be published by a certain date?
- No.
- Do you have an image in mind for the featured image? You can learn more here: https://www.mediawiki.org/wiki/Wikimedia_technical_blog_editorial_guidelines#Images_used_in_your_post
- We have some ideas (listed in the draft) but haven't yet made up our minds. Any suggestions are more than welcome.
- Do you have any other questions or comments?
- Not for now.
Once your request is received, a technical blog admin will review it and reach out to you through Phabricator.