
NetEase/YouDao company seeks guidance for setting up local mirror of wikipedia
Closed, Resolved (Public)

Description

In the context of a larger collaboration with NetEase, we will advise their folks on how best to create a local archive of HTML content from our wiki projects.

They have asked for information about the dumps, OpenZIM, and HTML retrieval. I'm not quite clear which projects they wish to mirror, whether the mirrors need to be up to the minute or not, whether they must have all of the content or just the most frequently read pages, etc.

Jan is our go-between and will be getting more information on their requirements, as well as adding the previous email correspondence to this ticket.

Event Timeline

ArielGlenn claimed this task.
ArielGlenn raised the priority of this task from to Medium.
ArielGlenn updated the task description. (Show Details)
ArielGlenn added a project: acl*sre-team.
ArielGlenn added subscribers: ArielGlenn, mark, faidon, JanWMF.

Quoting from one of those emails:

"As you mentioned in the meeting, you provide three methods for downloading Wiki content(Dump, HTML, OpenZim). We tend to choose the HTML way. Could you kindly give us the API URL and relative instructions? We want to improve existing method that we are using on crawling your website."

If we are talking about anything but a very small subset, retrieving HTML page by page is not going to be viable, e.g. for grabbing all of the English language Wikipedia. In the past, static HTML dumps were typically created with the DumpHTML MediaWiki extension, which may need some love at this point to run on current installations. We would not want to run it here as it would take forever. An OpenZIM wiki page suggests that this approach is what they use to grab HTML content for OpenZIM files: http://www.openzim.org/wiki/Wiki2html It would be good to check with OpenZIM community members to see if this is still accurate; the project has an IRC channel on freenode, #openzim, though I don't know how active it is.

If you are retrieving HTML for a small subset of pages of a wiki project, this can be done by requesting each page separately: http://www.mediawiki.org/wiki/API:FAQ#get_the_content_of_a_page_.28HTML.29.3F If you find it more convenient to grab separate pieces of a page, you can look into calls to the parser: http://www.mediawiki.org/wiki/API:Parsing_wikitext#parse Again, this is viable only for a small subset of pages.
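As a rough illustration (a sketch only, untested; the wiki hostname and page title are placeholders you would substitute), fetching the rendered HTML of a single page via the action API looks roughly like this:

```
import json
import urllib.parse
import urllib.request

# Hypothetical example values; point these at the wiki and page you actually want.
API = "https://en.wikipedia.org/w/api.php"
TITLE = "Earth"

params = urllib.parse.urlencode({
    "action": "parse",   # run the parser on an existing page
    "page": TITLE,
    "prop": "text",      # return the rendered HTML body
    "format": "json",
})

with urllib.request.urlopen(API + "?" + params) as resp:
    data = json.load(resp)

# The rendered HTML of the page body lives under parse -> text -> '*'.
html = data["parse"]["text"]["*"]
print(html[:200])
```

One such request per page is fine for a handful of pages, but as noted above it does not scale to a whole project.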

The OpenZIM file format, should you decide to go that route, is described here: http://www.openzim.org/wiki/ZIM_File_Format It contains the text content rendered as HTML, as you will see if you look at the earlier link describing the wikitext-to-HTML conversion process. There is a full English language Wikipedia package available for OpenZIM, dated July 2014: http://www.kiwix.org/wiki/Wikipedia_in_all_languages/el These files are maintained by the Kiwix offline reader project.

Another option: you may want to set up a local copy of MediaWiki with a database fully loaded with data from one of the recent XML dumps, and grab recent changes every so often via the MediaWiki API and import them. This would require having someone become a local MediaWiki expert at least to a degree, being familiar with installation, upgrades, extensions, and changes we make to our code in production.
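For the "grab recent changes every so often" part, a minimal sketch of the API call might look like the following (assumptions: the source wiki URL and last-sync timestamp are placeholders, and continuation and error handling are omitted):

```
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"   # hypothetical source wiki
SINCE = "2015-02-12T00:00:00Z"               # placeholder: last time the local copy was synced

params = urllib.parse.urlencode({
    "action": "query",
    "list": "recentchanges",
    "rcend": SINCE,                  # stop enumerating once we reach the last-synced timestamp
    "rcprop": "title|ids|timestamp",
    "rclimit": "500",
    "format": "json",
})

with urllib.request.urlopen(API + "?" + params) as resp:
    data = json.load(resp)

# Each entry names a page that changed since the last sync; those titles (or
# revision ids) can then be exported from the live wiki and imported into the
# local MediaWiki copy.
for change in data["query"]["recentchanges"]:
    print(change["timestamp"], change["title"])
```

The harder part is not the polling itself but keeping the local installation (extensions, schema, configuration) close enough to ours that imported content renders correctly, which is why a local MediaWiki expert would be needed.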

As I understand that part of the goal of this collaboration is a focus on translation, it may be useful to have a look at the Wiktionary projects, which are all multilingual, containing words from all languages and translations of those terms into the language of the particular wiki community. One advantage of these projects is that they are much smaller than Wikipedia, which would give you something easier to start out with. There is, however, no common structure to Wiktionary content across projects, which means that digging out translations could be time-consuming to automate.

I hope this is enough to get this conversation going. Jan, go ahead and add the previous emails when you like so we have them on record. Thanks!

Thanks Ariel. "The meeting" in the quote on the top was a breakfast NetEase's delegation had with Erik, Damon, and Amir on 1/28 and this ticket is part of a wider range of cooperation topics being worked on atm with Damon owning it on c-level. I will fill in tech specifications today, if the non-technical correspondence we have had up to this point contains relevant details already (don't expect much) and otherwise intend to invite NetEase's folks to join here directly.

Amire80 renamed this task from "NetEase/YouDa company seeks guidance for setting up local mirror of wikipedia" to "NetEase/YouDao company seeks guidance for setting up local mirror of wikipedia". Feb 12 2015, 1:31 PM
Amire80 set Security to None.
Amire80 subscribed.

Adding content from an email received today from Brent at NetEase:

"We have assigned people who is responsible for direct communicating with your leaders in 3 ways' cooperation.

  1. API for wrapping Wiki content: Development Manager Hu Chen <email deleted> is the main contact and will work closely with Ariel Glenn. He will research and work on the platform Ariel has provided, and he may have detailed questions right after the national holiday ends on Feb 25."