Mar 27 2023
Nov 21 2022
Ok, I have just realized that the right page is https://research.wikimedia.org/index.html, but the README in the repo links to GitHub.
Aug 5 2022
We are done with the upgrade, so the resources can be scaled back.
Aug 3 2022
Jul 30 2022
Jul 28 2022
Mar 11 2022
Additional context on why we need this.
Aug 19 2021
Is there anything we can do to help with this? Any info we can provide?
Let us know and thanks in advance.
Aug 16 2021
Aug 4 2021
Jul 5 2021
Thanks for the pointers. However, let me clarify the background of our request a little:
- we are already working with mwxml and we are already processing the dumps in a streaming fashion (also, let me point out that all the code we are writing is public in this GitHub organization);
- the main problem in terms of storage is the WikiConv dataset, which is quite large, as it is a collection of all discussions from a given Wikipedia. We can try to find another solution for at least preprocessing this dataset so that we are able to do everything with less data;
- at this moment, it is quite urgent for us to be able to deploy the servers and test our code, so we would like to find a reasonable compromise to be able to deploy our system.
Jun 18 2021
Thanks for your question. At the moment we are processing the following datasets:
- MediaWiki history dumps
- Wikipedia XML dumps
- WikiConv dataset
The analyses that we do are applied to all languages, with the exception of WikiConv, for which we focus only on ca, en, es, it.
Jun 9 2021
Mar 1 2021
Feb 25 2021
I have created an instance of a VPS within this project, for which I have an internal IP address (172.16.3.146).
Feb 11 2021
My 2 cents: I created a new query by mistake. It is a draft, and the fact that I cannot delete it is super annoying. I am ok with the idea of not deleting published queries.
Nov 3 2020
Nov 2 2020
Thanks, @srodlund, I think this bug can be closed now.
Oct 29 2020
Oct 28 2020
Jun 1 2020
May 15 2020
My 2 cents:
- Redirects are very useful and something to be taken into account when working on Wikipedia - the paper linked by Isaac is the prime reference in that regard - however, handling them is quite tricky. I have some experience in that, having built snapshots of the graph of Wikilinks for several Wikipedias over several years ([shameless plug] see the paper, the code is on GitHub). I have also worked with redirects and the pageviews data (code on GitHub). In short, it would be useful to have the redirects resolved, but it is a project on its own, IMHO.
- Page ids are very useful, but beware that all kinds of quirky things can happen to them over time with page moves, deletions and re-creations, etc. My fear is that selecting a page by id would not be exactly equivalent to selecting it by title and, since these data come from web requests, using the page title would be the "right" way to do it. Note that this is a subproblem of the redirect problem in some sense. I would like to put together some concrete examples of what I mean, but I really do not have the time to do that at the moment.
Sep 13 2019
It seems that I have had a problem with Caddy's cache plugin. Now, the server is back up.
Sep 12 2019
I've fixed this. It seems that Caddy implemented a directive to handle this case; here's the pull request: caddy#2144.
Sep 7 2019
Sep 6 2019
Apr 30 2019
I am taking on, together with @Geofrizz and @Alessandro.palmas, the administration of these machines.
Dec 8 2018
I have also published the pagecounts-ez files for the period from 2007-12-09 to 2011-11-15. These are the same files that are available through the Google Drive link above, but hosted by my university.
It may be of interest that I have published the sorted pagecounts-raw dataset. You can find it at: http://cricca.disi.unitn.it/datasets/pagecounts-raw-sorted/.
There is more info on this page: http://disi.unitn.it/~consonni/datasets/wikipedia-pagecounts-raw-sorted/.
Nov 14 2018
Jul 23 2018
Jul 22 2018
Jul 12 2018
I am done with the computation. I have processed all pages until 2011-11-15. I have 1432 files averaging ~400MB in size, for a total of 581GB. I can transfer them to a WMF server if you tell me where.
Jul 3 2018
Sibling iOS bug T198693
Jun 19 2018
Jun 14 2018
May 28 2018
Ok, I have written a script that uses GNU Parallel to process multiple days at the same time. Using 6 cores I was able to process 23 days' worth of data in a little more than 4 hours, as expected.
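As a rough sanity check on the "as expected" above, here is a back-of-the-envelope sketch. It assumes each day is an independent single-core job, that jobs run in rounds of one per core, and that one day takes the ~70 minutes reported for the streaming script; the function name is hypothetical and I/O contention is ignored.

```python
import math


def estimated_wall_clock_minutes(days, cores, minutes_per_day):
    """Estimate the wall-clock time when independent one-day jobs run one per core."""
    # With more days than cores, the jobs run in successive rounds.
    rounds = math.ceil(days / cores)
    return rounds * minutes_per_day


# Figures taken from the comments: 23 days, 6 cores, ~70 minutes per day.
minutes = estimated_wall_clock_minutes(days=23, cores=6, minutes_per_day=70)
hours = minutes / 60
print(f"{minutes} minutes, i.e. about {hours:.1f} hours")
```

Under these assumptions the estimate falls between 4 and 5 hours, which is in line with the measured "little more than 4 hours".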
May 26 2018
Ok, I am done writing the new "streaming" script. It takes ~70 minutes on a single core to process one day. As for RAM, it takes 20GB at peak (when reading the input data and sorting the rows) and ~4GB after that.
May 25 2018
@Samwilson, thanks for the heads up. I have added you as a maintainer of the tool wscontest.
So, I am basically writing another script that does not use Spark but simply processes the data in a streaming fashion (the basic idea of the algorithm is: take one day's worth of data, sort it by page and then process the data stream one line at a time).
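To make the idea concrete, here is a minimal sketch of the sort-then-stream approach. It is not the actual script: the `hour page views` line format, the function names and the sample data are all assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def parse_line(line):
    """Parse a hypothetical 'hour page views' line into (page, hour, views)."""
    hour, page, views = line.split()
    return page, int(hour), int(views)


def daily_totals(lines):
    """Yield (page, total_views, hourly_views) for one day's worth of data."""
    # Sorting is the memory-heavy step: the whole day is held in memory at once.
    rows = sorted(
        (parse_line(line) for line in lines if line.strip()),
        key=itemgetter(0, 1),
    )
    # After sorting, all rows of a page are adjacent, so a single pass suffices.
    for page, group in groupby(rows, key=itemgetter(0)):
        hourly = {}
        for _, hour, views in group:
            hourly[hour] = hourly.get(hour, 0) + views
        yield page, sum(hourly.values()), hourly


# Hypothetical sample: rows from different hourly files, in arbitrary order.
sample = ["0 Rome 3", "0 Paris 5", "1 Rome 4", "13 Paris 2"]
results = list(daily_totals(sample))
for page, total, hourly in results:
    print(page, total, hourly)
```

Only the sort needs the whole day at once; after that, memory use depends on the rows of a single page.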
I have run other tests and they took between 8 and 9.5 h, using between 34GB and 36.5GB of RAM on a single machine with 8 cores. Also, I narrowed the problem with the data from 2007-12-10 down to a few files (I suspect the root of the problem may be some corrupt file).
May 23 2018
I have run the script over 1 day's worth of data (2007-12-11). It took a little more than 8 hours (484 minutes) and around 34GB of RAM on a single machine with 8 cores. I am testing on another day (2007-12-12).
May 22 2018
I worked on this during the Wikimedia hackathon and now I have a final version of the code that computes the daily total and the compact string representation for hourly views from the pagecounts-raw data.
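For readers unfamiliar with the idea, here is a small sketch of what a compact string for hourly views could look like. The exact format produced by the code is not given in this comment; the sketch assumes the convention used by the pagecounts-ez files, where each hour is a letter (A for hour 0 up to X for hour 23) followed by that hour's count. The function names are hypothetical.

```python
import string

# Assumed encoding: hour 0 -> 'A', hour 1 -> 'B', ..., hour 23 -> 'X'.
HOUR_LETTERS = string.ascii_uppercase[:24]


def encode_hourly(hourly):
    """Encode {hour: views} as a compact string, skipping hours with no views."""
    parts = []
    for hour in sorted(hourly):
        if hourly[hour] > 0:
            parts.append(f"{HOUR_LETTERS[hour]}{hourly[hour]}")
    return "".join(parts)


def daily_total(hourly):
    """Return the total number of views over the day."""
    return sum(hourly.values())


# Hypothetical page with views at hours 0, 13 and 23.
views = {0: 5, 13: 2, 23: 1}
encoded = encode_hourly(views)
total = daily_total(views)
print(total, encoded)
```

Skipping the hours with zero views is what keeps the string short for rarely visited pages.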
May 16 2018
May 15 2018
@Milimetric, no problem.
May 8 2018
I'm coming from T193759, I can help with this. Is the script doing the merge available? I can run it on one of my machines and let it run even for several days.
Anyway, I am totally ok with uploading these data. I think I just need a server on which to save them.
May 3 2018
May 2 2018
Thank you! I was able to log in!
I was able to:
- create a Wikitech account named "CristianCantoro SUL": https://wikitech.wikimedia.org/w/index.php?title=User:CristianCantoro_SUL&redirect=no
- create a MediaWiki account named "CristianCantoro SUL": https://www.mediawiki.org/w/index.php?title=User:CristianCantoro_SUL&redirect=no
Ok, if I go (logged in as CristianCantoro_SUL) to https://phabricator.wikimedia.org/settings/user/CristianCantoro_SUL/page/external/ and I try to disconnect it I get the following message:
I see that most of my activity is done with CristianCantoro, and I would like to keep this. I think I can disconnect CristianCantoro_SUL and re-connect my MediaWiki accounts to CristianCantoro. I will try it now.
Commenting here with the other account to confirm the request.
Nov 28 2017
I second what Legoktm is saying: adding the option (or default) of using Orbot to route the traffic of the Wikipedia is completely independent of creating a Tor hidden service.
Oct 31 2017
Oct 29 2017
I would also suggest adding a field or a drop-down menu with the motivation for the rejection. I think it would be useful to know.
Some reasons that I could think of: