
Make a plain text dump of Wikipedia available alongside the XML
Open, Needs TriagePublic

Description

Many research applications of Wikipedia would be much easier if the article text were available as plaintext rather than only as XML dumps. From what I am told, plaintext versions of the articles are already available.

These sources should be made available through the same channels as the main XML dumps.

Event Timeline

Restricted Application added a subscriber: Aklapper. · Mar 30 2017, 2:22 AM
Audiodude added a subscriber: Tgr. · Mar 30 2017, 2:23 AM

Many research applications of Wikipedia would be much easier if the article text were available as plaintext rather than only as XML dumps.

@Audiodude: Could you explain why (to better understand why XML can be a problem)?
What is "Wikipedia" here? All articles on all Wikipedias?

In fact, the XML is not the problem; it's the wiki markup. From what I understand, the wiki markup is not defined by any formal grammar, so parsing it is non-trivial. Tools like mwparserfromhell help, but they can be difficult to configure and are not always accurate. There is also, of course, the problem of recursive template expansion.
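
As a small illustration (a sketch, assuming mwparserfromhell is installed; the sample wikitext is made up), even a dedicated parser only gets you part of the way, because template output is dropped rather than expanded:

import mwparserfromhell

wikitext = "'''Example''' sentence with a {{citation needed}} template and a [[wiki link|link]]."
wikicode = mwparserfromhell.parse(wikitext)

# strip_code() removes templates, tags and link markup, keeping the visible text;
# template output is simply dropped, not expanded, so the result is lossy.
print(wikicode.strip_code())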

Yes "Wikipedia" would imply all articles on all Wikipedias.

I believe the XML wrapper would still be useful for delimiting articles and providing metadata, but the payload of such a theoretical dump could be plaintext instead of wikitext.

I thought TextExtracts stores article plaintext in the page_props table, but apparently it uses memcached. So there might be no easy way to dump that. I guess a script could just fetch it from the API, but that might be too slow or fragile.
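
For context, fetching that plaintext per article through the Action API would look roughly like this (a sketch; the parameters are the standard TextExtracts ones and the title is just an example). Doing this for every article on every wiki is what makes it slow and fragile:

import requests

# Fetch the TextExtracts plaintext for one page via the Action API.
API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "prop": "extracts",
    "explaintext": 1,
    "titles": "Phabricator",
    "format": "json",
}
data = requests.get(API, params=params).json()
for page in data["query"]["pages"].values():
    print(page.get("extract", ""))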

Is HTML what's being asked for on this ticket, i.e. a full HTML expansion of the wikitext?

Actually, I believe plaintext would be more useful, but HTML would be a good starting point since it is reasonably easy to strip.

I ask because there are things like infoboxes, which get turned into tables; making those into plain text seems not so useful to me. And you might want to preserve some formatting, e.g. <p> markers turning into blank lines, <br /> into carriage returns, and so on. How exactly that would be done is a task for the plaintext consumer to implement, as they will know their needs best.

Once that determination is made, a filtering script could be written and made available to anyone who wants to use or adapt it for similar purposes. Is this something you would be interested in taking on? You can look at sample HTML output in P6766, generated from the RESTBase endpoint documented at https://el.wiktionary.org/api/rest_v1/#!/Page_content/get_page_html_title
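
For anyone who wants to experiment, fetching that HTML for a single page is roughly the following (a sketch; the domain, title and User-Agent string are placeholders):

import requests

# Fetch the Parsoid HTML for one page from the RESTBase API.
title = "Phabricator"
url = "https://en.wikipedia.org/api/rest_v1/page/html/" + title
response = requests.get(url, headers={"User-Agent": "plaintext-dump-experiment"})
print(response.text[:500])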

MER-C added a subscriber: MER-C. · Edited · May 7 2018, 6:37 PM

How many pages are you trying to fetch? Do you need the live version or can you wait? If you need a small number of pages, you could:

  1. Fetch the unparsed wikitext of an article through the API.
  2. Remove all reference tags using a regex replace.
  3. Send the resulting wikitext to the parser via the API (a rough sketch of steps 1–3 appears below).
  4. Run the following Python code, using the Beautiful Soup parser, on the resulting HTML:
from bs4 import BeautifulSoup

# 'html' is the parser output from step 3
parsetree = BeautifulSoup(html, "html.parser")

# Strip tables (infoboxes, navboxes and wikitables all render as <table>)
tabletags = parsetree.find_all("table")
for tabletag in tabletags:
    tabletag.decompose()

# Strip the table of contents
divtags = parsetree.find_all("div", "toc")
for divtag in divtags:
    divtag.decompose()

plaintext = parsetree.get_text()

Yes, this is dumb and shouldn't be necessary -- I should be able to get plain text through the API -- but it mostly solves the problem.
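
A rough sketch of steps 1–3, assuming the standard Action API (the title is just an example and the ref-stripping regex is deliberately crude):

import re
import requests

API = "https://en.wikipedia.org/w/api.php"   # any wiki's Action API endpoint

def fetch_wikitext(title):
    # Step 1: unparsed wikitext of the current revision
    r = requests.get(API, params={
        "action": "parse", "page": title, "prop": "wikitext",
        "format": "json", "formatversion": 2,
    })
    return r.json()["parse"]["wikitext"]

def strip_refs(wikitext):
    # Step 2: drop self-closing <ref ... /> and paired <ref>...</ref> tags
    wikitext = re.sub(r"<ref[^>]*/>", "", wikitext)
    return re.sub(r"<ref.*?</ref>", "", wikitext, flags=re.DOTALL | re.IGNORECASE)

def parse_to_html(wikitext):
    # Step 3: let the wiki's parser expand what is left (POST, since the text can be long)
    r = requests.post(API, data={
        "action": "parse", "text": wikitext, "contentmodel": "wikitext",
        "prop": "text", "format": "json", "formatversion": 2,
    })
    return r.json()["parse"]["text"]

html = parse_to_html(strip_refs(fetch_wikitext("Phabricator")))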

And you might want to preserve some formatting, e.g. <p> markers turning into blank lines, <br /> into carriage returns, and so on. How exactly that would be done is a task for the plaintext consumer to implement, as they will know their needs best.

This wouldn't be your problem if paragraph breaks were always preserved in the output; they can easily be removed by the user. (And besides, natural language processing software makes it really easy to ignore paragraphs.)
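
If the paragraph structure needs to be kept explicitly when stripping the HTML, a variation on the Beautiful Soup snippet above can join <p> contents with blank lines (a sketch; it ignores headings, lists and tables):

from bs4 import BeautifulSoup

html = "<p>First paragraph with a <a href='#'>link</a>.</p><p>Second paragraph.</p>"
parsetree = BeautifulSoup(html, "html.parser")

# Join <p> contents with blank lines so paragraph boundaries survive stripping.
plaintext = "\n\n".join(p.get_text() for p in parsetree.find_all("p"))
print(plaintext)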