Thu, Oct 29
Wed, Oct 28
Jun 1 2020
May 15 2020
My 2 cents:
- Redirects are very useful and something to take into account when working on Wikipedia - the paper linked by Isaac is the prime reference in that regard - however, handling them is quite tricky. I have some experience with that, having built snapshots of the Wikilink graph for several Wikipedias over several years ([shameless plug] see the paper; the code is on GitHub). I have also worked with redirects and the pageviews data (code on GitHub). In short, it would be useful to have the redirects resolved, but it is a project of its own, IMHO.
- Page ids are very useful, but beware that all kinds of quirky things can happen to them over time with page moves, deletions and re-creations, etc. My fear is that selecting a page by id would not be exactly equivalent to selecting it by title, and since these data come from web requests, using the page title would be the "right" way to do it. Note that this is, in some sense, a subproblem of the redirect problem. I would like to put together some concrete examples of what I mean, but I really do not have the time to do that at the moment.
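To make the redirect point concrete, here is a minimal sketch of collapsing per-title view counts through a redirect map. The `redirects` table and the `raw_views` numbers are invented for illustration; a real run would build the map from a redirect dump:

```python
from collections import Counter

# Hypothetical redirect map (source title -> target title); in practice
# this would be built from a redirect dump of the wiki.
redirects = {
    "Obama": "Barack_Obama",
    "Barack_H._Obama": "Barack_Obama",
}

# Hypothetical per-title view counts, as they might come from raw requests.
raw_views = Counter({"Obama": 10, "Barack_Obama": 100, "Barack_H._Obama": 5})

def resolve(title, redirects, max_hops=5):
    """Follow redirects up to max_hops, guarding against redirect cycles."""
    hops = 0
    while title in redirects and hops < max_hops:
        title = redirects[title]
        hops += 1
    return title

# Collapse the counts of every redirect onto its target page.
resolved_views = Counter()
for title, count in raw_views.items():
    resolved_views[resolve(title, redirects)] += count

print(resolved_views["Barack_Obama"])  # 115
```

Even this toy version hints at the tricky parts: redirect chains, cycles, and the fact that the redirect map itself changes over time.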
Sep 13 2019
It seems that I had a problem with Caddy's cache plugin. The server is now back up.
Sep 12 2019
I've fixed this; it seems that Caddy implemented a directive to handle this case, see pull request caddy#2144.
Sep 7 2019
Sep 6 2019
Apr 30 2019
I am taking on, together with @Geofrizz and @Alessandro.palmas, the administration of these machines.
Dec 8 2018
I have also published the pagecounts-ez files for the period from 2007-12-09 to 2011-11-15. These are the same files available through the Google Drive link above, but hosted by my university.
It may be of interest that I have published the sorted pagecounts-raw dataset. You can find it at: http://cricca.disi.unitn.it/datasets/pagecounts-raw-sorted/.
More information is available at this page: http://disi.unitn.it/~consonni/datasets/wikipedia-pagecounts-raw-sorted/.
Nov 14 2018
Jul 23 2018
Jul 22 2018
Jul 12 2018
I am done with the computation; I have processed all pages until 2011-11-15. I have 1432 files averaging ~400MB in size, for a total of 581GB. I can transfer them to a WMF server if you tell me where.
Jul 3 2018
Sibling iOS bug T198693
Jun 19 2018
Jun 14 2018
May 28 2018
Ok, I have written a script that uses GNU Parallel to process multiple days at the same time. Using 6 cores, I was able to process 23 days' worth of data in a little more than 4 hours, as expected.
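The idea of farming one day out to each of 6 workers can be sketched in Python as well (the actual setup used GNU Parallel around a shell script; this stand-in uses a thread pool, since in the real case each worker would just launch the per-day processing script, e.g. via subprocess). The `process_day` body and the date range are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def process_day(day):
    """Hypothetical per-day worker; the real one would invoke the
    single-day pagecounts processing script for `day`."""
    return f"done {day}"

# 23 days starting from 2007-12-09, the first day of the dataset.
days = [f"2007-12-{d:02d}" for d in range(9, 9 + 23)]

# 6 workers, one per core, mirroring the GNU Parallel run above.
with ThreadPoolExecutor(max_workers=6) as pool:
    results = list(pool.map(process_day, days))

print(len(results))  # 23
```

With 6 workers and ~70 minutes per day, 23 days in a bit over 4 hours is exactly the expected ceil(23/6) * 70 minutes.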
May 26 2018
Ok, I am done writing the new "streaming" script. It takes ~70 minutes on a single core to process one day. As for RAM, it takes 20GB at peak (when reading the input data and sorting the rows), but then it uses ~4GB, and it uses just one core.
May 25 2018
@Samwilson, thanks for the heads up. I have added you as a maintainer of the tool wscontest.
So, I am basically writing another script that does not use Spark but simply processes the data in a streaming fashion (the basic idea of the algorithm is: take one day's worth of data, sort it by page, and then process the data stream one line at a time).
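The sort-then-stream idea can be sketched in a few lines: once the rows are sorted by page, each page's records are contiguous, so one pass with constant per-page state is enough. The rows below are invented toy data, not the real pagecounts format:

```python
import itertools

# Hypothetical rows for one day: (page, hour, count), initially unsorted.
rows = [
    ("Foo", 3, 7),
    ("Bar", 0, 2),
    ("Foo", 15, 1),
    ("Bar", 5, 4),
]

# Step 1: sort by page -- the memory-heavy step (the 20GB peak above).
rows.sort(key=lambda r: r[0])

# Step 2: stream over the sorted rows, one contiguous page group at a time.
daily_totals = {}
for page, group in itertools.groupby(rows, key=lambda r: r[0]):
    daily_totals[page] = sum(count for _, _, count in group)

print(daily_totals)  # {'Bar': 6, 'Foo': 8}
```

After the sort, memory usage drops to whatever a single page group needs, which matches the observed ~4GB steady state.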
I have run other tests and they took between 8 and 9.5 h, using between 34GB and 36.5GB, on a single machine with 8 cores. Also, I narrowed the problem with the data from 2007-12-10 down to a few files (I suspect the root of the problem may be some corrupt file).
May 23 2018
I have run the script over 1 day's worth of data (2007-12-11); it took a little more than 8 hours (484 minutes) and around 34GB of RAM on a single machine with 8 cores. I am testing on another day (2007-12-12).
May 22 2018
I worked on this during the Wikimedia hackathon and now I have a final version of the code that computes the daily total and the compact string representation for hourly views from the pagecounts-raw data.
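As I understand the pagecounts-ez convention, the compact hourly representation writes hour h as the letter chr(ord('A') + h) (so A is hour 0 and X is hour 23) followed by that hour's count, omitting hours with zero views. A small sketch of that encoding, with invented counts:

```python
def compact_hourly(hourly_counts):
    """Encode 24 hourly view counts into a compact string: hour h becomes
    the letter chr(ord('A') + h) followed by its count; zero hours are
    omitted (the pagecounts-ez convention, as I understand it)."""
    parts = []
    for hour, count in enumerate(hourly_counts):
        if count > 0:
            parts.append(f"{chr(ord('A') + hour)}{count}")
    return "".join(parts)

# Invented example: 5 views at hour 0, 17 at hour 2, 1 at hour 23.
counts = [0] * 24
counts[0], counts[2], counts[23] = 5, 17, 1
print(compact_hourly(counts))  # A5C17X1
print(sum(counts))             # the daily total: 23
```

The daily total is then just the sum over the 24 hourly counts, which is the other value the script computes.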
May 16 2018
May 15 2018
@Milimetric, no problem.
May 8 2018
I'm coming from T193759; I can help with this. Is the script doing the merge available? I can run it on one of my machines and let it run, even for several days.
Anyway, I am totally ok with uploading these data; I think I just need a server where I can save them.
May 3 2018
May 2 2018
Thank you! I was able to login!
I was able to:
- create a Wikitech account named "CristianCantoro SUL": https://wikitech.wikimedia.org/w/index.php?title=User:CristianCantoro_SUL&redirect=no
- create a Mediawiki account named "CristianCantoro SUL": https://www.mediawiki.org/w/index.php?title=User:CristianCantoro_SUL&redirect=no
Ok, if I go (logged in as CristianCantoro_SUL) to https://phabricator.wikimedia.org/settings/user/CristianCantoro_SUL/page/external/ and try to disconnect it, I get the following message:
I see that most of my activity has been done with CristianCantoro, and I would like to keep it. I think I can disconnect CristianCantoro_SUL and re-connect my Mediawiki accounts to CristianCantoro. I will try it now.
Commenting here with the other account to confirm the request.
Nov 28 2017
I second what Legoktm is saying: adding the option (or default) of using Orbot to route Wikipedia traffic is completely independent of creating a Tor hidden service.
Oct 31 2017
Oct 29 2017
I would also suggest adding a field or a drop-down menu with the reason for rejection; I think it would be useful to know.
Some reasons that I could think of:
Aug 9 2017
Jun 19 2017
Thanks Tim for filing this bug report. Two further considerations:
I would also add that this feature would be particularly useful when visiting from mobile. At the moment, users visiting a mirror of Wikipedia (say en.vikiansiklopedi.org) get redirected to en.m.wikipedia.org regardless of the original domain (instead of en.m.vikiansiklopedi.org).
Jun 17 2017
This bug affects the project wikimirror.
Jun 6 2017
Nov 3 2015
We have received some questions today from Italian WMF donors who noticed the links. They were asking whether the emails were real, suspecting a phishing attempt.
Oct 20 2015
Any updates about this?
Oct 15 2015
Oct 14 2015
Sorry for re-opening but I wanted to keep track of requests here.