Generate pagecounts-ez data back to 2008
Open, Medium, Public

Description

From https://lists.wikimedia.org/pipermail/analytics/2018-February/006190.html it became apparent that it would be nice to have pagecount data in pagecounts-ez/merged going back to the earliest available pagecounts-raw archives. This might be as easy as running Erik's compression on top of the existing pagecounts-raw dumps, or it might be more complicated; a closer look is needed.

Event Timeline

I'm coming from T193759, I can help with this. Is the script doing the merge available? I can run it on one of my machines and let it run even for several days.

@CristianCantoro, sorry for the delay. I think the goals for us are:

  1. serve per-article stats from an API so they can be incorporated into @MusikAnimal's tool.
  2. publish more complete dumps in the pagecounts-ez format for easy download.

For 2, the scripts are available, yes: https://github.com/wikimedia/analytics-wikistats/tree/master/pageviews_reports (but I think Erik might have some stuff not committed yet or there may be another folder, and I don't really understand the Perl). But it might be easier to just write something up that conforms to the same format: https://dumps.wikimedia.org/other/pagecounts-ez/. (From that page: Example: 33 views on day 2, hour 4, and 155 views on day 3, hour 7 are coded as 'BE33,CH155')
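If it helps, the letter scheme seems to follow directly from that example (day letters start at A = day 1, hour letters at A = hour 0). A minimal Python sketch, with an illustrative function name rather than anything taken from Erik's scripts:

def encode_hourly(counts):
    # counts: iterable of (day, hour, views) with day >= 1 and 0 <= hour <= 23
    return ",".join(
        chr(ord("A") + day - 1) + chr(ord("A") + hour) + str(views)
        for day, hour, views in counts
    )

# reproduces the example from the pagecounts-ez page
assert encode_hourly([(2, 4, 33), (3, 7, 155)]) == "BE33,CH155"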

For 1, we just don't have the space to host the full dump, but I just thought of something that might work: if we cut the very long tail of pages that have a very small number of views, would we dramatically reduce the size of the data? I did some checks of the form:

-- Hive: how many page titles had more than N total views on 2018-05-10
select count(*)
  from (select page_title
          from pageview_hourly
         where year=2018 and month=5 and day=10
         group by page_title
        having sum(view_count) > N
       ) with_more_than_N;

N = 0: 57,018,908
N = 5: 10,649,948
N = 10: 5,476,529
N = 100: 807,495 (looks like too much of a cut)

So it looks like with N = 10 the data size decreases to roughly 1/10th of the original. And with daily aggregation instead of hourly, maybe we can load this into the API after all. And just call it "abridged" or something, making sure people know some pages are missing. What do you both think, would people/you be happy with something like this?

@Milimetric, no problem.

About 2.:

Ok, I am already rewriting the script in Python (I do not really know Perl).

About 1.:

I think that for @MusikAnimal's tool the data are already aggregated daily, and that would reduce the size of the dataset; about selecting just a fraction of the pages, I am not sure.

> just call it "abridged" or something, making sure people know some pages are missing. What do you both think, would people/you be happy with something like this?

Better than nothing, for sure! :)

For pages that have zero pagecounts, I think it's fine to omit them entirely. Pageviews Analysis queries the MediaWiki API too, so it will know the page exists and the pagecounts are zero. For those with 1-9 pagecounts, maybe we could return a specific error saying those pages were intentionally omitted? If not, that's okay, as I said I'm sure people will be thrilled to have at least some data available. I can add a note in the FAQ that not all pages are included.

> I think that for @MusikAnimal's tool the data are already aggregated daily

Yes, daily and monthly are the only options. Would we be able to add an endpoint for monthly pagecounts? If not, I can just do the math using the daily endpoint.

Thanks so much Cristian, I'll wait for your update. Now, I thought of another thing. If you transform the data on your machine, you'd have to then upload it which may take a really long time.

Instead of that, do you want to run the script(s) on our machines? It would duplicate the processing but eliminate the transfer, which is more of a bottleneck. Either way is ok with me, whatever you prefer. If you do want to run it on our machines, I can help debug the logic and launch it myself, or we can get you access to do it yourself.

I am using (with permission ;-) ) my university's machines. They have lots of RAM and connectivity, so it should not be too much of a problem to upload the data once it is processed.

I am using a Jupyter Notebook and pandas for my tests. The problem is that I think I am doing something wrong, because I am using a ton of RAM (~40 GB and counting) just to process one day's worth of data, and it is taking quite a long time.

The code is in this repo if anybody cares to take a look:
https://github.com/CristianCantoro/merge-pageviews

(sorry for the lack of README, I just wanted to put the code out there)

Any help is appreciated.

I only took a very quick look, so apologies if I missed some nuance, but I think you might be trying to crunch all the data in memory before writing anything out, which would definitely use up all your memory (pandas is not very light and the files are big). Here's a somewhat involved writeup, https://indico.io/blog/fast-method-stream-data-from-big-data-sources/, that essentially boils down to "use generators (yield)": keep only what you need in memory and write it back out to your output before moving on to the next chunk.
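A minimal sketch of that idea for these files, assuming the usual space-separated "project page_title count bytes" lines of pagecounts-raw; the helper names and the aggregation are illustrative, not code from the merge-pageviews repo:

import gzip
from collections import defaultdict

def read_counts(path):
    # stream (project, page, views) tuples from one gzipped hourly dump,
    # one line at a time, so the raw file never sits in memory
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.rstrip("\n").split(" ")
            if len(parts) >= 3 and parts[2].isdigit():
                yield parts[0], parts[1], int(parts[2])

def daily_totals(hourly_files):
    # fold one day's hourly files into per-page totals; only the running
    # totals (not the raw lines) are kept in memory
    totals = defaultdict(int)
    for path in hourly_files:
        for project, page, views in read_counts(path):
            totals[(project, page)] += views
    return totals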

But of course, this is mostly solved for you in Hadoop, where you can just write some logic and the parallelism is Hadoop's job, with no work from you. You can even put a Hive table on top of all the source files and then do the transformation with a simple SQL statement.
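For what it's worth, the Spark version of the same aggregation is only a few lines; the HDFS paths and column names below are placeholders, not our actual layout:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pagecounts-daily").getOrCreate()

# read all hourly pagecounts-raw files for one day (placeholder path)
raw = (spark.read
       .option("sep", " ")
       .csv("hdfs:///path/to/pagecounts-raw/2007-12/pagecounts-20071210-*.gz")
       .toDF("project", "page_title", "views", "bytes"))

# per-page daily totals; Spark takes care of splitting the work
daily = (raw.groupBy("project", "page_title")
            .agg(F.sum(F.col("views").cast("long")).alias("daily_views")))

daily.write.option("sep", " ").csv("hdfs:///path/to/output/2007-12-10")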

It is true that I am reading the data into memory, but it is just one day's worth (only 2007-12-10, in the code) and the compressed files are only 650 MB; I don't know why the aggregation is taking that much memory (reading takes only a few minutes).

I will switch processing systems and move to Spark so that I can parallelize everything (I thought that would not be necessary and that I could analyze one day's worth of data at a time, but I am evidently wrong).

Yeah, I wouldn't be surprised if pandas balloons the data that much, though 100x does seem a little odd.

I worked on this during the Wikimedia hackathon and now I have a final version of the code that computes the daily total and the compact string representation for hourly views from the pagecounts-raw data.

For easier reproducibility I have decided to use the original pagecounts-raw data and not the dataset of pageviews sorted by page, which I have.

The script is written in Python 3 and uses Spark; it is available in the repo:
https://github.com/CristianCantoro/merge-pageviews

I have launched the script to process one day's worth of data; I will report here how much time it takes.

C

P.S.: I really feel the urge to point out this bug report, which has cost me several hours of head-scratching: https://issues.apache.org/jira/browse/SPARK-24324

@CristianCantoro this is very useful, thank you very much. When we do this task (which may be next quarter), we can just use your code directly because we can run pyspark. Once we do that we'll upload the results to dumps and this data will be available for download.

And by the way that bug looks crazy, I'll try to play with it when I get around to running your code.

I have run the script over one day's worth of data (2007-12-11); it took a little more than 8 hours (484 minutes) and around 34 GB of RAM on a single machine with 8 cores. I am testing it on another day (2007-12-12).

I was experiencing a memory leak with the data for 2007-12-10; I am investigating why.

I have run other tests and they took between 8 and 9.5 hours, using between 34 GB and 36.5 GB of RAM, on a single machine with 8 cores. Also, I have narrowed the problem with the data from 2007-12-10 down to a few files (I suspect the root of the problem may be a corrupt file).

I have to say that I am a little disappointed by the performance of the script, since processing all the remaining data (from Dec 2007 to Nov 2011) on the current machine would take me a year and a half.

On the brighter side, the Spark bug I reported (SPARK-24324) turned out to be a major issue and it is now being fixed.

So, I am basically writing another script that does not use Spark but simply processes the data in a streaming fashion (the basic idea of the algorithm is: take one day's worth of data, sort it by page, and then process the stream one line at a time).
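Roughly, the streaming pass looks like this (a sketch, not the final script; the intermediate "project page_title hour count" line format and the output layout here are assumptions):

import sys
from itertools import groupby

def parse(line):
    # assumed intermediate record: "project page_title hour count"
    project, page, hour, count = line.rstrip("\n").split(" ")
    return project, page, int(hour), int(count)

def merge_sorted(lines):
    # the input must already be sorted by (project, page_title), e.g. with GNU sort,
    # so groupby only ever keeps one page's records in memory
    records = (parse(line) for line in lines)
    for (project, page), group in groupby(records, key=lambda r: (r[0], r[1])):
        hours = [(hour, count) for _, _, hour, count in group]
        total = sum(count for _, count in hours)
        yield project, page, total, hours

for project, page, total, hours in merge_sorted(sys.stdin):
    # hour letters as in pagecounts-ez (A = hour 0); output layout is illustrative
    hourly = ",".join(chr(ord("A") + hour) + str(count) for hour, count in hours)
    print(project, page, total, hourly)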

To be honest, this is what my gut told me to do in the first place, but I wanted to try pandas + Spark. From my initial tests this seems a much more efficient method (~60x faster, using 1/15 of the RAM).

Ok, I am done writing the new "streaming" script. It takes ~70 minutes to process one day, using just one core. As for RAM, it peaks at 20 GB (when reading the input data and sorting the rows) and then settles at ~4 GB.

So, in a little more than 10 days I should be able to process everything; I just need to write a little bash script to automate launching the script and compressing the results.

Ok, I have written a script that uses GNU Parallel to process multiple days at the same time. Using 6 cores, I was able to process 23 days' worth of data in a little more than 4 hours, as expected.

You can check out the results (from 2007-12-09 to 2007-12-31) here:
https://drive.google.com/drive/folders/1PvdoJcYzw1uTFzPZja1tXX5TwSqXwn1N?usp=sharing

This is great, thanks very much Cristian. I've been out with an injury for a few weeks. Now I have to catch up with other stuff, but I'll get back to this before too long.

I'm really sorry to hear that, @Milimetric, take care!

As a general update: I have processed data up to 2010-11-17, so I still have 363 days to go; this should take around 4 more days of processing (I am only using 5 cores on my machine).

(ping @Milimetric)

I am done with the computation; I have processed all the data up to 2011-11-15. I have 1432 files averaging ~400 MB in size, for a total of 581 GB. I can transfer them to a WMF server if you tell me where.

@CristianCantoro where do you have access? The files have to end up on terbium at some point, but I can move them to the right place if you put them anywhere.

@Milimetric:

I have put them on Google Drive (I have space with my university account):
https://drive.google.com/drive/folders/1we4l2nVxFt7PyBJXZneJSAlxNkRoHmwE?usp=sharing

Hope this is ok for you.

That works, let me know if you need to take them down before I get to copy them, and I'll try to squeeze it in.

I'm in no particular hurry, nor do I have storage concerns, so take the time you need.

Nuria raised the priority of this task from Low to High (Sep 26 2018, 7:17 PM).
Milimetric lowered the priority of this task from High to Medium (Oct 18 2018, 5:39 PM).
Milimetric added a project: Analytics-Kanban.

I'm so sorry I delayed so long; I will work on this next week if I can.

Hi,

It may be of interest that I have published the sorted pagecounts-raw dataset. You can find it at: http://cricca.disi.unitn.it/datasets/pagecounts-raw-sorted/.
There is more info on this page: http://disi.unitn.it/~consonni/datasets/wikipedia-pagecounts-raw-sorted/.

For the moment you can download the dataset, if you wish, via HTTP. I am also setting up a dat share (see https://datproject.org/).

I have also published the pagecounts-ez files for the period from 2007-12-09 to 2011-11-15; these are the same files that are available through the Google Drive link above, but hosted by my university:
http://cricca.disi.unitn.it/datasets/pagecounts-ez/.

More info at this page: http://disi.unitn.it/~consonni/datasets/wikipedia-pagecounts-ez/.

Also in this case, you can download the dataset via HTTP for the moment, and I am setting up a dat share as well.

mforns moved this task from Smart Tools for Better Data to Mentoring on the Analytics board.