
Improve Wikimedia dumping infrastructure
Open, Medium, Public

Description

https://dumps.wikimedia.org/ is powered by https://gerrit.wikimedia.org/r/#/admin/projects/operations/dumps . I believe that there are several good arguments for core/platform/whatever team to take a look at it:

  1. Lots of tools use these dumps, so getting more eyeballs on them would be good.
  2. Dumps issues are typically handled by a single opsen expert.
  3. Wikidata has some continuing needs to hack on dumps (link in comments please!)
  4. Having a WMF-specific dump infrastructure is a bit counter to our values. Setting operations/dumps up outside of WMF would be very difficult.

I don't think any of these arguments are "omg, stop the presses!" urgent but they seem worth thinking about.
Wiki page for additional discussion: https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Backlog/Improve_dumps

Please also see the wiki dev summit proposal for Jan 2016: T114019: Dumps 2.0 for realz (planning/architecture session), in which the current dumps maintainer has taken the leap and suggested we redo the entire thing from scratch.

Event Timeline


This seems to have some similarities to the state of scap 12 months ago: very heavily used home-grown tech that is only understood by a small number of people. Seems like a great candidate for a "devops" project led by either MediaWiki-Core-Team or Release Engineering.

Nice! This is exactly a conversation I wanted to start in a few weeks when I'm back. :) Yes, it would be really great if we could work with MW Core on improving our dumps infrastructure. There are longstanding issues with it, and having it live isolated inside Ops only is more a relic than a conscious decision that still makes sense today.

Legoktm renamed this task from Hack on dumping infrastructure to Improve Wikimedia dumping infrustructure.Feb 6 2015, 12:22 AM
Legoktm set Security to None.
Legoktm subscribed.

(Every time I see "Hack" I always think of the programming language)

I started adding Wikidata's open dump issues in a subtask.

RobLa-WMF subscribed.

I would like this to be considered as something that the MediaWiki-Core-Team does in concert with the Services team. This seems like the kind of problem that RESTbase should play a role in addressing.

RESTBase can indeed help, especially with HTML dumps. We might try to do one manual pass of HTML dumps after the restbase deploy to satisfy users asking for it (including Kiwix and Google). We could try to automate that as well, but will have to see what the storage space requirements are.

What are the main pain points with the wikitext dumps?

They attempt to serve a number of audiences and so probably satisfy no one, whether in format, speed or contents. They take a long time to run, even allowing for the ability to rerun individual stages, and this is a PITA for things like shuffling db servers around, applying security updates to the hosts where they run or to the one host where they are written, etc. The rolling nature of the dumps is handy for new wikis that are added, but not so handy for automation of runs. Run times will just get longer and longer over time, as there's no easy way to add more computing power for e.g. just the en wikipedia dumps. I could go on and on.

Also please consider this comment as me volunteering to assist in the 'new dumps' as long as folks consider that useful. I should add that I'm not wedded to keeping any piece of the existing infrastructure; it can all be up for discussion.

I think a great place to start on grooming this project for eventual implementation would be to collect user stories for dumps in general.

Some top of the head questions from me:

  • Who uses them (or would like to use them)?
  • What do they want to do with the dump data (populate a wiki, have a data corpus for non-wiki uses, create an index of some sort for a power user tool, ...)?
  • What dump formats would be most universally useful?
  • Are there practical file size limits that should be considered?
  • What transports are most efficient/easiest to use for downloading dumps?
  • Should we supply tools for requesting and processing the dumps?
  • How often should full dumps be produced?
  • How often should incremental dumps be produced?
  • How long should full and incremental dumps be retained?

Here are a few stories I've run across:

A company wants to mirror English Wikipedia with relatively up to the minute changes (or at least up to the hour) and use it to return search results to its customers, with changes to the format, extra capabilities, etc.

A Wiktionary editor wants to update all pages containing definitions for a word in a given language that have a certain template.

A Wiktionary editor wants to update all pages with new interwiki links depending on the pages added or removed on other Wiktionary pages.

A researcher wants to examine reversions of revisions across all articles in the main Wikipedia namespace.

Someone wants to create an offline reader based on a static dump of Wikipedia. It would be nice if the output were easily mungeable HTML (not the skin and all that other stuff). They then want to update it once a week with new content and removal of deleted content.

A bot runner wants to check all recent pages for formatting and other issues once a day, doing corrections as needed.

Someone wants to download all the articles about X topic and dump them into a local wiki, or an offline reader, or serve them as HTML on their laptop.

Someone wants to mirror the contents of all projects in their language, with a homegrown search across all wiki projects in that language returning results.

This is only a start, please help fill in.
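
Several of these stories (the Wiktionary template scan, the daily bot pass over pages) reduce to streaming a pages-articles XML dump and filtering pages. Purely as an illustration of the kind of consumer we'd be designing for, here is a minimal standard-library Python sketch; the command-line arguments, the naive "{{template" substring match and the memory handling are my own simplifications, not part of any existing tooling.

```python
# A rough, standard-library-only sketch (not a recommendation for a particular
# toolchain): stream a pages-articles XML dump and print the titles of pages
# whose wikitext contains a given template. The dump path and template name
# come from the command line; the "{{name" match is deliberately naive.
import bz2
import sys
import xml.etree.ElementTree as ET

dump_path = sys.argv[1]              # e.g. some *-pages-articles.xml.bz2 file
needle = "{{" + sys.argv[2]          # e.g. "en-noun" -> matches "{{en-noun"

def localname(tag):
    """Strip the XML namespace that MediaWiki export files put on every tag."""
    return tag.rsplit("}", 1)[-1]

with bz2.open(dump_path, "rb") as stream:
    title = None
    for _event, elem in ET.iterparse(stream):
        name = localname(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            if elem.text and needle in elem.text:
                print(title)
        elif name == "page":
            elem.clear()             # keep memory flat; these dumps are huge
```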

actually can we start a wiki page for this someplace?


Sure. I started a page: https://www.mediawiki.org/wiki/Wikimedia_MediaWiki_Core_Team/Backlog/Improve_dumps

We can move it somewhere else if anyone thinks this is a horrible place for it.

bd808 renamed this task from Improve Wikimedia dumping infrustructure to Improve Wikimedia dumping infrastructure.Feb 16 2015, 3:57 PM

I've added a first round of notes to that page. I'd like to invite the members of the xmldatadumps-l list to weigh in on user stories and on frequency of the dumps; is that okay? They will likely have some thoughts on format also.


Yes please! Getting requirements from customers is the best thing ever IMO.

They've been invited. If we don't have some feedback in a few days I'll float it past wikitech next.


Some specific customers who are not following xmldatadumps-l are cited in the wiki page. Can you or someone at WMF email them privately (I mean mainly mirrors, Kiwix, Wikia)?

Some comments on process.

Before we actually start to work on this Big Hairy Project aka Grand Redesign: this initiative could well turn into a multi-year project. I know our projects usually don't start out as such, but that's what often happens anyway. Are we confident that the database schema is going to stay for another five years? Because if we're not certain, it would be a waste to start this huge endeavor first and then later decide that the current principle of storing a full copy of the raw text when only one byte of an article has changed should be rethought. Just saying we should check that box.

Based on the assumption that this is going to be a huge project, we should not give up optimizing the current processes. So my suggestion is to find new people to manage and implement this overhaul, while current staff focuses on keeping us afloat, feeding input into the new design and sanity-checking it.

I agree with the "keep the lights on" sentiment and I feel responsible for a possibly-excessive enlargement of the wiki page's scope, so let me apologise if that was beyond the intention. My personal interpretation was that, by inspecting *all* needs, we could find some pattern(s) which give high returns on little investment.

For instance, maybe at the end of this exercise we discover that 50 % of the issues can be solved by adding some tables to the whitelist and increasing the bandwidth of the server. :P Who knows.

A separate point is that we may lack an over-arching goal. If I were to propose one, it would be this: collect requirements for the dumps to satisfy all needs of WMF internal researchers, and abolish all the analytics replicas. Dumps are by definition more transportable than databases, so they ensure that everyone is more or less equal. In an ideal world where dumps satisfy all needs, only one infrastructure needs to be maintained and all research is as easy from the outside as it is from the inside, so we get both lower maintenance costs and more output.

I will certainly be maintaining/fixing up the current dumps while any investigation/scheming goes on towards this project. My immediate goal on this task is to make sure my brain dump makes it onto the wiki page; beyond that, someone else needs to organize and lead this effort.

some thoughts....

currently there is a throttle on download speed for dumps. This makes it a bit slow (~ 2MB/s) to download wikidata json dumps (to linode, digital ocean, etc.), in comparison to downloading dump files from planet.openstreetmap.org (~6MB/s).
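
On the throttled downloads: a long transfer at ~2MB/s is also fragile, so being able to resume a partial download matters. Below is a minimal sketch in standard-library Python, assuming the server honours HTTP Range requests; the dump URL is taken from the command line rather than hard-coding any particular file.

```python
# Hedged sketch only: resume an interrupted dump download by asking the server
# for the bytes we don't have yet. Assumes the server supports HTTP Range
# requests; the dump URL is supplied on the command line, nothing is hard-coded.
import os
import sys
import urllib.request

url = sys.argv[1]
out_path = os.path.basename(url)

offset = os.path.getsize(out_path) if os.path.exists(out_path) else 0
req = urllib.request.Request(url)
if offset:
    req.add_header("Range", "bytes=%d-" % offset)   # request only the remainder

with urllib.request.urlopen(req) as resp:
    # 206 means the server honoured the Range header; anything else restarts.
    mode = "ab" if offset and resp.status == 206 else "wb"
    with open(out_path, mode) as out:
        while True:
            chunk = resp.read(1 << 20)               # 1 MiB per read
            if not chunk:
                break
            out.write(chunk)
```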

also, it would be nice to keep old dump files around longer, at least stuff like "*-pages-meta-current". OSM has dumps, changesets, etc. going back to 2012. :)

It would be nice, for example, to still be able to get the langlinks tables and such from before Wikidata, for analysis purposes, and it would be nice to have the geodata tables in the dumps in the first place (and historical ones) without needing a full revision dump.

also, it would be great to have incremental dumps more fully supported.

finally, some sort of image/file dumps would be nice. I don't know the status of that, but it is important for being able to fully recreate a clone of Wikipedia or another Wikimedia site based on dumps.

apparently it took a month for the full revisions dump of wikidata to finish:

http://dumps.wikimedia.org/wikidatawiki/20150207/

and sometimes it fails:

http://dumps.wikimedia.org/wikidatawiki/20141208/

or doesn't complete (yet?):

http://dumps.wikimedia.org/wikidatawiki/20150113/

it would be nice if that were better, though I'm not sure exactly what the issues are, other than dumps needing more resources.

Another data point: it took 3 weeks and 6 days after Ariel restarted the dumps process on February 12 to complete all dumps. Commons dumps completed on March 11.

> It would be nice, for example, to still be able to get the langlinks tables and such from before Wikidata, for analysis purposes, and it would be nice to have the geodata tables in the dumps in the first place (and historical ones) without needing a full revision dump.

I think this is worth filing separately: there could be a whitelist of files to keep for a longer time and/or blacklist of files to keep for a shorter time (e.g. the bz2 files where a 7z is available?).

In the meantime, whoever has a copy of significant old dumps (like langlinks before the Wikidata migration) should upload them to archive.org and can contact Ariel to get them onto https://dumps.wikimedia.org/archive/

For HTML dumps we are currently considering the best distribution format to use. One option on the table is distributing an lzma-compressed SQLite database. Please let us know if that sounds like a useful format for your application (or whichever other format you'd prefer) in T93396: Decide on format options for HTML and possibly other dumps.
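
If the lzma-compressed SQLite option wins out, consumption could look roughly like the sketch below. The file name and the pages(title, html) table are invented placeholders purely for illustration; the real layout is exactly what T93396 is meant to decide.

```python
# Purely illustrative: reading an lzma-compressed SQLite HTML dump. The file
# name and the pages(title, html) schema are hypothetical placeholders; the
# actual format is being decided in T93396.
import lzma
import shutil
import sqlite3

compressed = "enwiki-html.sqlite.xz"     # hypothetical file name
db_path = "enwiki-html.sqlite"

# SQLite wants a real file on disk, so stream-decompress first.
with lzma.open(compressed, "rb") as src, open(db_path, "wb") as dst:
    shutil.copyfileobj(src, dst)

conn = sqlite3.connect(db_path)
for title, html in conn.execute(
        "SELECT title, html FROM pages WHERE title LIKE ?", ("Z%",)):
    print(title, len(html))
conn.close()
```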

Aklapper triaged this task as Medium priority.Mar 23 2015, 5:42 PM

It is time to promote Wikimedia-Hackathon-2015 activities in the program (training sessions and meetings) and main wiki page (hacking projects and other ongoing activities). Follow the instructions, please. If you have questions about this message, ask here.

Did someone work on this project during Wikimedia-Hackathon-2015? If so, please update the task with the results. If not, please remove the label.

Please confirm and promote this activity by assigning it to its owner, listing it or scheduling it at the Hackathon wiki page and by placing it in the right column at Wikimania-Hackathon-2015. Thank you!

@daniel, do you know whether this activity is confirmed for Wikimania (and who owns it)?

What is the status of this task, now that Wikimania 2015 is over? As this task is in the "Backlog" column of the Wikimania-Hackathon-2015 project's workboard: Did this task take place and was successfully finished? If yes: Please provide an update (and if the task is not completely finished yet, please move the project to the "Work continues after Mexico City" column on the Wikimania-Hackathon-2015 workboard). If no: Please edit this task by removing the Wikimania-Hackathon-2015 project from this task. Thanks for your help and keeping this task updated!

A message to all open tasks related to the Wikimania-Hackathon-2015. What do you need to complete this task? Do you need support from the Wikimedia Foundation to push it forward? Help promoting this project? Finding an intern to work on it? Organizing a developer sprint? Pitching it to WMF teams? Applying for a grant? If you need support, share your request at T107423: Evaluate which projects showcased at the Wikimania Hackathon 2015 should be supported further or contact me personally. Thank you!

Here a small status of the ZIM file generation side:

  • People from Kiwix were at the hackathon, but we did not work on this
  • Work continues anyway, and we now *almost* manage to release new ZIM files for all our projects once a month (one with media files, one without)
  • Generation runs on 3 VMs on labs, but this is a bit short in terms of resources. A few improvements are planned on the software side, but we really need at least one additional x-large instance to be able to run everything @labs.
  • Regarding file serving, it has been the same problem for years: both the ZIM & portable versions of the projects would benefit from being hosted in the datacenter, but this seems to be impossible :( Only Wikipedia ZIM files are hosted at http://dumps.wikimedia.org/other/kiwix/

Just to be clear, is there anything we could do to help you, or are you fine more or less?

Thx Quim for asking. The answer is "yes":
1 - Increase the "mwoffliner" labs project quota a little bit to allow replacing the "mwoffliner3" VM with an xlarge instance: https://wikitech.wikimedia.org/wiki/Nova_Resource:Mwoffliner
2 - Have offline versions of all projects at dumps.wikimedia.org; here is the really old but still open ticket about it: https://phabricator.wikimedia.org/T57503

Steps (1) and (2) are necessary to do step (3), which is implementing an easy solution to allow visitors to download a full offline copy of the project they are currently visiting (for example, by developing an ad-hoc extension).

@Kelson Regarding 1: In the past I got that done by asking in the labs channel on IRC; alternatively, a simple ticket with the Cloud-Services project would probably work.

@Qgil Kelson covered the ZIM part of your question. Regarding the rest: before the reorg I had the impression that someone at the WMF was driving this. Who is that now?

@JanZerebecki
Thx for your comment. I'll do that as soon as I have finished the last optimization of mwoffliner.