
Provide data dumps in the Analytics Data Lake
Open, Needs Triage, Public

Description

As discussed during the 2018 offsite, it would be very useful to have the data dumps, which contain all past and present versions of Wikimedia project content, systematically uploaded to the Analytics Data Lake so they can be queried using Hive, Spark, and other Hadoop ecosystem tools.

Use-cases
There are a variety of use-cases for the result of this task. We list a few of them below:

  • Measurement of AfC improvements (T192515)
  • Short regular-expression searches that can take hours when parsing the dumps locally could be done in minutes. This has use-cases for a variety of research projects, including identifying unsourced statements (T186279).
  • Section recommendation (T171224): here we need to parse section titles across different languages, which takes days with the current tools and might be done in a few hours using Spark (see the sketch after this list).
  • Create a Historical Link Graph for Wikipedia (T186558): we want to keep an updated version of the historical link graph across different languages as a complement to the Clickstream dataset. Again, doing this in Spark might reduce the parsing time from days to hours.
  • Parse the XML dumps with the mwparserfromhell library to figure out infobox and language usage of files on Commons (Part 1 of T177358; Results and code).
  • Calculating historical article counts (part of T194562): to use the standard definition of an article, you have to know which pages contained at least one wiki link at any given time in the past.
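
As a rough illustration of the section-title use-case above, the sketch below runs mwparserfromhell over a Parquet table of revisions with a Spark UDF. The snapshot path matches the one mentioned later in this task, but column names such as revision_text and page_title, and the output path, are assumptions.

```
# Hypothetical sketch: extract section titles from wikitext already stored as
# Parquet. Column names and the output path are illustrative assumptions.
import mwparserfromhell  # must also be installed on the executors
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

spark = SparkSession.builder.appName("section-titles").getOrCreate()

def section_titles(wikitext):
    """Return the section headings found in one revision's wikitext."""
    if wikitext is None:
        return []
    parsed = mwparserfromhell.parse(wikitext)
    return [str(heading.title).strip() for heading in parsed.filter_headings()]

titles_udf = F.udf(section_titles, T.ArrayType(T.StringType()))

revisions = spark.read.parquet(
    "hdfs:///user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2017-05"
)

sections = revisions.select(
    "page_title",                                                # assumed column
    titles_udf(F.col("revision_text")).alias("section_titles"),  # assumed column
)
sections.write.parquet("hdfs:///user/me/section_titles")         # hypothetical output
```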

Event Timeline

diego created this task. Feb 5 2018, 8:07 PM
Restricted Application added a subscriber: Aklapper. Feb 5 2018, 8:07 PM

@JAllemandou I heard you've done this kind of work before. Would you be able to give us some pointers on how to do this? Although systematic upload is nice, I have an immediate need for this.

@bmansurov: The last snapshot I generated was at the beginning of 2017-06 (named 2017-05, since the last full month is May 2017). It's available in two formats:

  • hdfs:///user/joal/wmf/data/raw/mediawiki/xmldumps/20170601 in XML (files are stored in per-wiki folders, formatted as Hive partitions)
  • hdfs:///user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2017-05 in Parquet (same layout: per-wiki folders, accessible as partitions; see the sketch below)
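
A minimal sketch of reading this Parquet snapshot with Spark, assuming the per-wiki folders behave as Hive-style partitions (the partition column name is an assumption, so check the discovered schema first):

```
# Minimal sketch: load the Parquet snapshot and filter to one wiki.
# The partition column name `wiki_db` is an assumption, not confirmed here.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-wikitext-snapshot").getOrCreate()

snapshot = spark.read.parquet(
    "hdfs:///user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2017-05"
)
snapshot.printSchema()                         # verify the actual column names
enwiki = snapshot.where("wiki_db = 'enwiki'")  # assumed partition column
print(enwiki.count())                          # partition pruning scans only that folder
```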

If you need more up-to-date data, the process I follow is:

  • Bash script to copy XML files to HDFS in partition folders (roughly as sketched after this list)
  • Run an XML2Parquet job using the analytics/wikihadoop repo (patch still to be merged)
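
The bash script itself is not included in this task, so here is a hypothetical sketch of that copy step (written in Python, with an illustrative local dump path, wiki list, and Hive-style folder naming):

```
# Illustrative sketch of the "copy XML files to HDFS in partition folders" step.
# Local dump location, wiki list, and folder naming (wiki_db=<wiki>) are assumptions.
import glob
import os
import subprocess

DUMP_DATE = "20170601"
LOCAL_DUMP_DIR = "/mnt/data/xmldatadumps/public"   # hypothetical local path
HDFS_BASE = f"/user/joal/wmf/data/raw/mediawiki/xmldumps/{DUMP_DATE}"

def hdfs(*args):
    """Run an `hdfs dfs` command and fail loudly on error."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

for wiki in ["enwiki", "frwiki", "ruwiki"]:        # wikis to import
    target = f"{HDFS_BASE}/wiki_db={wiki}"         # Hive-style partition folder (assumed naming)
    hdfs("-mkdir", "-p", target)
    pattern = os.path.join(LOCAL_DUMP_DIR, wiki, DUMP_DATE,
                           f"{wiki}-{DUMP_DATE}-pages-meta-history*.xml*.bz2")
    for path in glob.glob(pattern):
        hdfs("-put", "-f", path, target)
```
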
diego added a comment. Feb 9 2018, 10:43 AM

@bmansurov could you try to upload the 20180201 dumps for en, ru, ar, jp, fr, es in Parquet?
This is not urgent but might be useful for the section recommendations project.

Thanks, @JAllemandou!
@diego I'll take a look at it next week. Btw, I think you mean ja, not jp.

diego added a comment. Feb 10 2018, 5:43 PM

True, ja!

Milimetric moved this task from Incoming to Dashiki on the Analytics board. Feb 12 2018, 4:34 PM
Milimetric moved this task from Dashiki to Backlog (Later) on the Analytics board.

@JAllemandou I found some documentation about using Wikihadoop. Is there any other place where I can read up on how to use it to generate data in Parquet format?

@bmansurov : This doc is a bit outdated. It should still work, but there is better tooling now.

In the process of making sure those patches were ok I downloaded a dump from 2018-02 (/user/joal/wmf/data/raw/mediawiki/xmldumps/20180201) and I am now parsing it into parquet (/user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2018-01).
I'll ping here when the job finishes (can be tomorrow or the day after).

Cheers !
Joseph

Took longer than I expected, but it's done: /user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2018-01
Looks like the patches work :)

@JAllemandou thanks! Would you be able to give some instructions on how to run these patches? The first one seems straightforward, but I'm not sure about the Scala one.

@bmansurov : There is an example command line in the header-comment of the XmlConverter file.
Little reminder: these two patches deal with huge datasets (2 TB of bz2-compressed XML and 18 TB of Snappy-compressed Parquet). My wish is really for them to be productionized so that the data they import/compute is not duplicated.

@JAllemandou makes sense.

@diego how soon do you need these files? Can we wait until the patches are productionized?

@bmansurov and @diego : Data is available up to and including 2018-01 at hdfs:///user/joal/wmf/data/wmf/mediawiki/wikitext/snapshot=2018-01.
I think we're not going to put more effort into productionization as of now, but new imports can be done.

That's great. I didn't realize you did it for every wiki!

@bmansurov No worries :) The whole point of these two things is to work for 'every' wiki :)

@JAllemandou : just came back to this. The parquet version is amazing!! Thank you very much!

I don't need newer dumps now, but productionization would be great, especially to encourage everybody on the stats machines to move their work to Spark.

@diego : Thanks :) I'm pushing for productionization to be one of our priorities, but there is a fight for spots among the prioritized items ;)

diego added a comment. Mar 13 2018, 2:15 PM

@JAllemandou : as we discussed on IRC, could you please add the timestamp for each revision?
Also, it would be good to have the data partitioned by revision_id, because this would make future joins to get additional information (e.g. user) easier.
Thanks.

@diego : PR has been submitted (https://gerrit.wikimedia.org/r/#/c/361440/)
Can you explain more about how you'd partition the data here?
Given that revision_id is a unique row identifier (or almost: some nulls and zeros), it's not a good candidate for a partition key.

diego added a comment. Mar 13 2018, 5:06 PM

@JAllemandou : My understanding is that if you partition by a unique id, you sort by that key, and then all the contiguous ids are in the same partition, as explained here: https://hackernoon.com/managing-spark-partitions-with-coalesce-and-repartition-4050c57ad5c4

Then, if you have two dataframes A and B partitioned with the same partitioner, joins are much faster, as explained here:
https://stackoverflow.com/questions/43831387/how-to-avoid-shuffles-while-joining-dataframes-on-unique-keys

Does that make sense to you?
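
For reference, a minimal sketch of the repartition-before-join pattern being discussed, with hypothetical dataframe names and paths; note that this partitioning exists only inside the Spark job and is not persisted by a plain Parquet write:

```
# Minimal sketch: hash-partition both sides on the join key so the join itself
# avoids an extra shuffle. Paths and the partition count are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-on-revision-id").getOrCreate()

wikitext = spark.read.parquet("hdfs:///user/me/wikitext")        # hypothetical
revisions = spark.read.parquet("hdfs:///user/me/revision_meta")  # hypothetical

wikitext_p = wikitext.repartition(1024, "revision_id")
revisions_p = revisions.repartition(1024, "revision_id")

joined = wikitext_p.join(revisions_p, on="revision_id", how="inner")
joined.write.parquet("hdfs:///user/me/wikitext_with_meta")       # hypothetical
```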

@diego: Spark partitioning happens in Spark; there is no such thing as "spark-partitioning" a dataset outside of Spark. Let's discuss this on IRC or a hangout, that should be easier.

diego added a comment. Mar 13 2018, 5:43 PM

@JAllemandou : just for the record, in this case I meant the parquet partitions. See you in IRC

leila updated the task description. May 16 2018, 5:12 PM
leila added subscribers: Nuria, leila.

@JAllemandou @Nuria a heads-up that we discussed this task in Analytics Hangtime today. Given that there is interest in it from Marshall (Audiences) and Research, we decided to compile the use-cases we have and come back to the two of you to work together on how to prioritize this task.

diego updated the task description. May 16 2018, 5:23 PM

I'm all for this project; I just want to chime in to say that we shouldn't be too married to the XML dumps specifically as a means of accomplishing what's in the task description. The dumps are a fairly indirect way to get to the content, and there may be better and more direct ways to do so.

chelsyx updated the task description. May 18 2018, 5:34 AM
Neil_P._Quinn_WMF renamed this task from "Upload XML dumps to hdfs" to "Provide data dumps in the Analytics Data Lake". May 28 2018, 6:54 PM
Neil_P._Quinn_WMF updated the task description.

Q: How does ElasticSearch get the text for indexing?

EBernhardson added a subscriber: EBernhardson. Edited Jun 21 2018, 6:05 PM

Essentially we get it from the MediaWiki ParserOutput object. For the literal text, the rendered HTML is run through a process that removes some specific CSS selectors and then strips all the HTML tags.

Also, if you simply want the text in Hadoop, we could pull it from the Elasticsearch cluster on a weekly basis or some such. That's quite easy to set up with Spark and the Elasticsearch connector. But I'm not sure that really fills many use cases, as it's a snapshot in time.
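
A rough sketch of what such a weekly pull might look like with the elasticsearch-spark connector; the cluster address, index name, field names, and output path below are placeholders, not the actual production values:

```
# Rough sketch: snapshot indexed text from Elasticsearch into HDFS with Spark.
# All hosts, index names, fields, and paths here are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("es-text-snapshot")
    # the elasticsearch-spark connector jar must be on the classpath,
    # e.g. via --packages org.elasticsearch:elasticsearch-spark-20_2.11:<version>
    .getOrCreate()
)

pages = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "elastic.example.internal:9200")  # placeholder host
    .option("es.read.field.include", "title,text")        # placeholder fields
    .load("enwiki_content")                                # placeholder index
)

# As noted above, this is a point-in-time snapshot, not revision history.
pages.write.parquet("hdfs:///user/me/es_text_snapshot")    # hypothetical path
```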

Hm, I wonder how heavy it would be to spin up MediaWiki and iterate over all revisions, generating a ParserOutput object for each one. We'd only need to do that once, store it somewhere in HDFS, and append to it whenever there's a new edit.

Parsing is one of the more expensive things in MediaWiki. Because of that expense, the ParserOutput is serialized into a multi-layer cache (memcached, then MySQL), so for the most part you shouldn't have to actually do the parsing. This could probably be done via the parse API, but we explicitly ask other people not to do that and to download dumps instead (across all wikis it's something like 250-300M pages to parse). Some code could probably be written and run in the production job queue to iterate through all the pages and emit some data to be picked up in analytics, but I can't really guess how long that would run or how much concurrency would be reasonable. You also have alternate datastores like RESTBase/Cassandra that store the rendered HTML in a database. I don't know enough about the storage model to say whether that would be easy to source from or not.

It may be monstrous to run, but that may be something we need to do to extract certain metrics historically, like how categories change. Over this quarter and the next we'll look in depth at the differences between the HTML output stored in Cassandra, doing a parse, getting it from dumps, and various other options. The details matter here: how templates are expanded, and what context the parser would even need if it's asked to render something as if it were 2012 (does it need to start at the beginning so it knows what each template looked like back then?). Thanks for the info, it'll be useful as we learn more about this.

If I understand the problem correctly, changes in templates would change the HTML output, meaning that for one revision you might have different HTML versions, true? Anyhow, I think having at least one rendered HTML version is better than nothing. Reconstructing exactly what users were seeing at a given point in time is one of the most needed, but most difficult, use cases for the dumps. Ignoring templates (which many researchers will do) is the easiest solution, and having a time machine to reconstruct exactly what the HTML was at a given point in time might be too difficult. Having at least one HTML version per revision is not perfect, but it is a good solution.
In any case, I understand that this would be incremental parsing, so the huge parsing work would only be needed once.

nettrom_WMF added a subscriber: nettrom_WMF.
Groceryheist added a subscriber: Groceryheist.