From https://lists.wikimedia.org/pipermail/analytics/2018-February/006190.html it became apparent that it would be nice to have pagecount data in pagecounts-ez/merged going back to the earliest available pagecounts-raw archives. This might be as easy as running Erik's compression on top of the existing pagecounts-raw dumps, or it might be more complicated; a closer look is needed.
|Open||fdans||T251777 Creation of canonical pageview dumps for users to download|
|Open||fdans||T192474 Migrate pagecounts-ez generation to hadoop|
|Open||None||T188041 Generate pagecounts-ez data back to 2008|
@CristianCantoro, sorry for the delay. I think the goals for us are:
- serve per-article stats from an API so they can be incorporated into @MusikAnimal's tool.
- publish more complete dumps in the pagecounts-ez format for easy download.
For 2, the scripts are available, yes: https://github.com/wikimedia/analytics-wikistats/tree/master/pageviews_reports (but I think Erik might have some stuff not committed yet, or there may be another folder, and I don't really understand the Perl). But it might be easier to just write something up that conforms to the same format: https://dumps.wikimedia.org/other/pagecounts-ez/. (From that page: "Example: 33 views on day 2, hour 4, and 155 views on day 3, hour 7 are coded as 'BE33,CH155'")
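That day/hour coding can be sketched like this (a minimal illustration of the example quoted from the docs, assuming the first letter indexes the day starting at A=1 and the second the hour starting at A=0):

```python
def ez_encode(day, hour, count):
    # Day 1 -> 'A', day 2 -> 'B', ...; hour 0 -> 'A', hour 4 -> 'E', ...
    return chr(ord("A") + day - 1) + chr(ord("A") + hour) + str(count)

def ez_decode(token):
    day = ord(token[0]) - ord("A") + 1
    hour = ord(token[1]) - ord("A")
    return day, hour, int(token[2:])

# 33 views on day 2, hour 4 and 155 views on day 3, hour 7:
print(",".join(ez_encode(*t) for t in [(2, 4, 33), (3, 7, 155)]))  # BE33,CH155
```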
For 1, we just don't have the space to host the full dump, but I just thought of something that might work. Maybe if we cut off the long, long tail of pages that have a very small number of views, we would dramatically reduce the size of the data? I ran some checks of the form:
select count(*) from (
    select page_title
    from pageview_hourly
    where year=2018 and month=5 and day=10
    group by page_title
    having sum(view_count) > N
) with_more_than_N;
N = 0: 57,018,908
N = 5: 10,649,948
N = 10: 5,476,529
N = 100: 807,495 (looks like too much of a cut)
So it looks like with N = 10 the data size decreases to roughly 1/10th of the original. And with daily aggregation instead of hourly, maybe we can load this into the API after all. And just call it "abridged" or something, making sure people know some pages are missing. What do you both think, would people/you be happy with something like this?
> just call it "abridged" or something, making sure people know some pages are missing. What do you both think, would people/you be happy with something like this?
Better than nothing, for sure! :)
For pages that have zero pagecounts, I think it's fine to omit them entirely. Pageviews Analysis queries the MediaWiki API too, so it will know the page exists and the pagecounts are zero. For those with 1-9 pagecounts, maybe we could return a specific error saying those pages were intentionally omitted? If not, that's okay, as I said I'm sure people will be thrilled to have at least some data available. I can add a note in the FAQ that not all pages are included.
I think that for @MusikAnimal's tool the data are already aggregated daily.
Yes daily and monthly are the only options. Would we be able to add an endpoint for monthly pagecounts? If not I can just do the math using the daily endpoint.
Thanks so much Cristian, I'll wait for your update. Now, I thought of another thing: if you transform the data on your machine, you'd then have to upload it, which may take a really long time.
Instead of that, do you want to run the script(s) on our machines? It would duplicate the processing but eliminate the transfer, which is more of a bottleneck. Either way is ok with me, whatever you prefer. If you do want to run it on our machines, I can help debug the logic and launch it myself, or we can get you access to do it yourself.
I am using (with permission ;-) ) my University's machines. They have lots of RAM and connectivity, so it shouldn't be too much of a problem to upload the data once processed.
I am using a Jupyter Notebook and pandas for my tests. The problem is that I think I am doing something wrong, because I am using a ton of RAM (~40GB and counting) just to process one day's worth of data, and it is taking quite a long time.
The code is in this repo if anybody cares to take a look:
(sorry for the lack of README, I just wanted to put the code out there)
Any help is appreciated.
> The code is in this repo if anybody cares to take a look:
I only took a very quick look, so apologies if I missed some nuance, but I think you might be trying to crunch all the data in memory before outputting, which would definitely use up all your memory (pandas is not very light and the files are big). So here's a complicated writeup, https://indico.io/blog/fast-method-stream-data-from-big-data-sources/, that essentially boils down to "use yield": hold only what you need in memory and dump it back out to your output before going on to the next thing.
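The yield idea boils down to something like this (a minimal sketch, not the actual code under discussion; the gzip layout and field order of pagecounts lines are assumed, and `parse_pagecounts`/`daily_totals` are made-up names):

```python
import gzip
from collections import defaultdict

def parse_pagecounts(path):
    # A generator: yields one parsed line at a time, so only the line
    # currently being processed is held in memory, never the whole file.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 3:
                # project, page_title, view_count (bytes field ignored)
                yield parts[0], parts[1], int(parts[2])

def daily_totals(paths):
    # Fold the hourly files into per-page daily totals; only the
    # running totals dict lives in memory, not the raw rows.
    totals = defaultdict(int)
    for path in paths:
        for project, title, views in parse_pagecounts(path):
            totals[(project, title)] += views
    return totals
```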
But of course, this is mostly solved for you in Hadoop where you can just write some logic and the parallelism is HDFS's job, with no work from you. You can even put a Hive table on top of all the source files and then do the transformation with a simple SQL statement.
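For instance, roughly like this (a hypothetical sketch: the table name, columns, and HDFS path are all made up, and the real files would need their space-delimited layout declared):

```sql
-- External table over the raw space-delimited dump files.
CREATE EXTERNAL TABLE pagecounts_raw (
  project     STRING,
  page_title  STRING,
  view_count  BIGINT,
  bytes_sent  BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/wmf/data/raw/pagecounts';

-- The per-page daily aggregation is then a single statement:
SELECT project, page_title, SUM(view_count) AS daily_views
FROM pagecounts_raw
GROUP BY project, page_title;
```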
It is right that I am reading the data into memory, but it is just one day's worth (only 2007-12-10, in the code) and the compressed files are just 650 MB; I don't know why the aggregation is taking that much memory (reading takes only a few minutes).
I will change processing systems and move to Spark so that I can parallelize everything (I thought it would not be necessary, since I analyze one day's worth of data at a time, but I am evidently wrong).
I worked on this during the Wikimedia hackathon and now I have a final version of the code that computes the daily total and the compact string representation for hourly views from the pagecounts-raw data.
For easier reproducibility I have decided to use the original pagecounts-raw data and not the dataset of pageviews sorted by page, which I have.
The script is written in Python 3 and uses Spark; it is available through the repo:
I have launched the script to process one day's worth of data; I will report here how much time it takes.
p.s.: I really feel the urge to point out this bug report, which has cost me several hours of head-scratching: https://issues.apache.org/jira/browse/SPARK-24324
@CristianCantoro this is very useful, thank you very much. When we do this task (which may be next quarter), we can just use your code directly because we can run pyspark. Once we do that we'll upload the results to dumps and this data will be available for download.
And by the way that bug looks crazy, I'll try to play with it when I get around to running your code.
I have run the script over one day's worth of data (2007-12-11); it took a little more than 8 hours (484 minutes) and around 34GB of RAM on a single machine with 8 cores. I am testing on another day (2007-12-12).
I was experiencing a memory leak with the data for 2007-12-10; I am investigating why.
I have run other tests, and they took between 8 and 9.5 hours, using between 34GB and 36.5GB of RAM on a single machine with 8 cores. Also, I have narrowed the problem with the 2007-12-10 data down to a few files (I suspect the root of the problem may be a corrupt file).
I have to say that I am a little disappointed by the performance of the script: processing all the remaining data (from Dec-2007 to Nov-2011) with the current machine would take me a year and a half.
So, I am basically writing another script that does not use Spark but simply processes the data in a streaming fashion (the basic idea of the algorithm: take one day's worth of data, sort it by page, and then process the stream one line at a time).
To be honest, this is what my gut told me to do, but I wanted to try Pandas+Spark. From my initial tests this seems a much more efficient method (~60x faster, using 1/15th of the RAM).
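The sort-then-stream idea can be sketched like this (a simplified illustration, not the actual script: input lines are assumed already sorted by project and title, with a made-up `project title hour count` layout, and the hourly string is restricted to the hours of a single day):

```python
import itertools

def aggregate_sorted(lines):
    # One pass over pre-sorted lines; only the current page's counts
    # are ever held in memory.
    page_of = lambda line: tuple(line.split()[:2])
    for (project, title), group in itertools.groupby(lines, key=page_of):
        total = 0
        hourly = []
        for line in group:
            _, _, hour, count = line.split()
            total += int(count)
            # Hour 0 -> 'A', hour 4 -> 'E', ... as in the ez format.
            hourly.append(chr(ord("A") + int(hour)) + count)
        yield project, title, total, ",".join(hourly)
```

Because the input is sorted, `itertools.groupby` sees each page as one contiguous run, so a record can be emitted (daily total plus compact hourly string) as soon as the run ends.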
Ok, I am done writing the new "streaming" script. It takes ~70 minutes on a single core to process one day. As for RAM, it takes 20GB at peak (when reading the input data and sorting the rows), but then it uses ~4GB, and it uses just one core.
So, in a little more than 10 days I should be able to process everything; I just need to write a little bash script to automate launching the script and compressing the results.
Ok, I have written a script that uses GNU Parallel to process multiple days at the same time. Using 6 cores, I was able to process 23 days' worth of data in a little more than 4 hours, as expected.
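A driver like that can be sketched roughly as follows (hypothetical: the `process_day.py` name and output file names are made up, and the real script surely differs; the GNU Parallel invocation is shown as a comment so the sketch is self-contained):

```shell
#!/usr/bin/env bash
# Emit every day from 2007-12-09 to 2007-12-31, one per line.
start=2007-12-09
end=2007-12-31

days() {
  d="$start"
  stop=$(date -I -d "$end + 1 day")
  while [ "$d" != "$stop" ]; do
    echo "$d"
    d=$(date -I -d "$d + 1 day")
  done
}

# With GNU Parallel, fan the 23 days out over 6 cores, compressing each
# day's output as its job finishes:
#   days | parallel -j 6 'python3 process_day.py {} && bzip2 pagecounts-ez-{}.txt'
days
```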
You can check out the results (from 2007-12-09 to 2007-12-31) here:
I have put them on Google Drive (I have space with my university account):
Hope this is ok for you.
It may be of interest that I have published the sorted pagecounts-raw dataset. You can find it at: http://cricca.disi.unitn.it/datasets/pagecounts-raw-sorted/.
There is more info on this page: http://disi.unitn.it/~consonni/datasets/wikipedia-pagecounts-raw-sorted/.
For the moment, you can download the dataset via HTTP, if you wish. I am also setting up a dat share (see https://datproject.org/).
I have also published the pagecounts-ez files for the period from 2007-12-09 to 2011-11-15. These are the same files as are available through the Google Drive link above, but hosted by my University.
More info on this page: http://disi.unitn.it/~consonni/datasets/wikipedia-pagecounts-ez/.
Also in this case, for the moment you can download the dataset via HTTP, and I am also setting up a dat share.