
Extract text from wikimarkup
Closed, Resolved (Public)

Description

Wikipedia articles contain a lot of formatting markup. We need a Python package to extract plain text from them. The extracted text should not contain tables, lists, references, etc.
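
To make the requirement concrete, here is a minimal sketch of what such an extraction could look like when working directly on wikitext. It uses only the standard library `re` module. The function name `strip_wikitext` and the exact set of rules are assumptions for illustration, not an existing package. Regular expressions cannot handle every wikitext construct (deeply nested templates, tables inside tables, parser functions), so treat this as a rough starting point only.

```python
import re


def strip_wikitext(text: str) -> str:
    """Hypothetical helper: crude plain-text extraction from wikitext."""
    # Drop tables, which are delimited by {| ... |}.
    text = re.sub(r"\{\|.*?\|\}", "", text, flags=re.DOTALL)
    # Drop self-closing references first, then paired <ref>...</ref>.
    text = re.sub(r"<ref[^>/]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    # Drop templates; repeat so that simple nesting is peeled inside-out.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Drop list items (lines starting with *, #, : or ;).
    text = re.sub(r"^[*#:;].*$", "", text, flags=re.MULTILINE)
    # Keep only the visible label of internal links.
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)
    # Remove bold/italic quote markup.
    text = re.sub(r"'{2,}", "", text)
    # Collapse the blank lines left behind by the removals.
    return re.sub(r"\n{2,}", "\n", text).strip()


sample = (
    "'''Python''' is a [[programming language|language]].<ref>cite</ref>\n"
    "* item one\n"
    "* item two\n"
    '{| class="wikitable"\n| cell\n|}\n'
    "It is [[open source]]."
)
print(strip_wikitext(sample))
```

For real articles, a proper wikitext parser is likely a safer base than regular expressions.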

Event Timeline

Hi @Pavan91727, thanks for taking the time to report this and welcome to Wikimedia Phabricator! Who is "we" exactly?
@lakshmi: Do you plan to work on this?

See also https://www.mediawiki.org/wiki/Extension:TextExtracts#API (but that has limits). Maybe https://www.mediawiki.org/wiki/API:Parsing_wikitext and https://stackoverflow.com/questions/753052/strip-html-from-strings-in-python provide pointers?
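
As a rough illustration of the second pointer: once an article has been rendered to HTML (for example through the parsing API), the tags can be stripped with the standard library `html.parser` module. The sketch below is an assumption-laden example, not part of any existing tool. The names `PlainTextExtractor`, `html_to_text`, and `SKIP_TAGS` are hypothetical, and skipping every `<sup>` element is a blunt way to drop reference markers that also removes legitimate superscripts.

```python
from html.parser import HTMLParser

# Assumption: content inside these elements is not wanted in the output.
SKIP_TAGS = {"table", "ul", "ol", "dl", "sup", "style", "script"}


class PlainTextExtractor(HTMLParser):
    """Hypothetical parser that collects text outside of skipped elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag == "p" and self._skip_depth == 0:
            # Keep paragraphs from running into each other.
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Normalise all whitespace runs to single spaces.
        return " ".join("".join(self._chunks).split())


def html_to_text(html: str) -> str:
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = (
    '<p><b>Python</b> is a language.<sup class="reference">[1]</sup></p>'
    "<ul><li>item</li></ul>"
    "<table><tr><td>cell</td></tr></table>"
    "<p>It is open source.</p>"
)
print(html_to_text(page))
```

Tracking a skip depth rather than a single flag keeps nested lists and tables from ending the skipped region early.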

@lakshmi: Thanks for participating in the Hackathon! We hope you had a great time.

  • If this task was being worked on and resolved at the Hackathon: Please change the task status to resolved via the Add Action...Change Status dropdown, and make sure that this task has a link to the public codebase.
  • If this task is still valid and should stay open: Please add another active project tag to this task, so others can find this task (as likely nobody in the future will look back at the Hackathon workboard when trying to find something they are interested in).
  • In case there is nothing else to do for this task, or nobody plans to work on this task anymore: Please set the task status to declined.

Thank you,
your Hackathon venue housekeeping service

https://github.com/lakshmi-warrier/wikimarkup-formatter

This is the link to the initial draft of the project. It is not complete and still needs work.

Thanks! Resolving this task as the Hackathon is over.