Story idea for Blog: From hell to HTML: releasing a Python package to easily work with Wikimedia HTML dumps
Closed, Resolved (Public)

Description

Please provide the following information.

  • Provide a short summary of your proposed post for the Wikimedia Technical Blog. Blog readers will see this as the preview to your post:
    • The Wikimedia Enterprise HTML dumps are an exciting new resource for researchers and developers to analyze the content of Wikipedia articles at scale. While the community has developed invaluable tools such as mwparserfromhell to effectively parse the wikitext version of an article contained in the "traditional" XML dumps, no similar tool currently exists for working with the HTML dumps. We recently built the first version of a Python library to easily work with Wikimedia's HTML dumps. This will lower the technical barriers for developers and researchers to access this publicly available resource.
  • Which audience or audiences do you think your post is appropriate for?:
    • Researchers who want to use the HTML dumps to analyze Wikipedia's content, especially those who don't have the technical expertise to parse HTML or don't know how Wikipedia elements (links, images, etc.) get parsed from wikitext
    • Developers who want to use the HTML dumps to build tools for (semi-)automated editing
    • Developers who would like to contribute to the library to further improve the tool
  • Does your post need to be published by a certain date?
    • No.
  • Do you have any other questions or comments?
    • Not for now.

Once your request is received, a technical blog admin will review it and reach out to you through Phabricator.

Event Timeline

@MGerlach thanks for writing this post! One of the members of the Developer Advocacy team will proofread it this week.

@mseckington I just wanted to check in on this and see when we might be able to get a review. No hard deadline but would be great to get out so we can start sharing with folks. Thanks in advance!

@TBurmeister per T324756#8516907 this draft is ready for review when you can make time to fit it into your schedule.

TBurmeister triaged this task as Low priority.

@apaskulin has kindly agreed to help out with this since Developer Advocacy has low capacity due to Melinda's departure from WMF. Thanks everyone for your patience!

Great post! I've added some comments and suggestions to the Google Doc. Let me know when you've had a chance to review them.

Thanks for taking the time and for your detailed review, this is super helpful : )
I am planning to go over your comments next week and will ping you if I have any follow-up questions.

@apaskulin I revised the blog post to address all of your comments -- they helped a lot to improve the writing. I kept all changes in suggestion mode so you can easily identify the differences from the previous version. I also added two suggestions for illustrations. Let me know if you have any other suggestions.

@MGerlach Looks great! I accepted your suggestions and added a last comment. Feel free to move this task into the "Ready for publication" column whenever you're ready.

@apaskulin Thank you. I resolved all remaining comments and moved the task to "Ready for publication".

apaskulin subscribed.

🎉

@TBurmeister, this is ready to be published.

TBurmeister claimed this task.

This has been published, thanks to @apaskulin and to all the authors! It's a great post :-) https://techblog.wikimedia.org/2023/02/24/from-hell-to-html/