Hello,
I am opening this task to describe my current struggle with pagelinks and to see whether more experienced people have a better solution to the problem. Some bottlenecks appear, especially when I try my 'worst-case scenario', which is the English Wikipedia. I am blocked and cannot progress in my research because of this.
PROBLEM: I want to do some calculations on the number of inlinks and outlinks of articles. This can be done quite easily with the pagelinks table from the MySQL wikireplicas by grouping the rows and counting them by pl_from (outlinks) and by the target pl_namespace/pl_title (inlinks). The difficult part comes when I want to do these two calculations taking into account specific groups of articles.
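To give an idea, the basic counts without any selection of articles are just a GROUP BY over pagelinks, roughly like this (only a sketch in Python with pymysql; the connection details are placeholders):

import pymysql

# Placeholder connection to the enwiki replica.
conn = pymysql.connect(host='...', db='enwiki_p')

with conn.cursor() as cur:
    # Outlinks per article: one row in pagelinks per link, grouped by the source id.
    cur.execute("SELECT pl_from, COUNT(*) FROM pagelinks "
                "WHERE pl_from_namespace = 0 AND pl_namespace = 0 "
                "GROUP BY pl_from")
    outlinks = dict(cur.fetchall())

    # Inlinks per article: the same rows, grouped by the target title.
    cur.execute("SELECT pl_title, COUNT(*) FROM pagelinks "
                "WHERE pl_from_namespace = 0 AND pl_namespace = 0 "
                "GROUP BY pl_title")
    inlinks = dict(cur.fetchall())

(On enwiki these result sets are already very large, but this is the easy part.)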
I have selected over two million articles from the English Wikipedia, and for each article I want to calculate the number of inlinks coming from those two million articles and the number of outlinks going to them. Here I have several options:
a) To pass these two million articles to the query as parameters.
a1. To query the entire table and receive the results for all the articles at once. (This would be ideal.)
a2. To query article by article and receive the result.
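To make a2 concrete, this is roughly what I have in mind (just a sketch; counts_for_article, selected_ids and selected_titles are illustrative names, conn is assumed to be an open pymysql or MySQLdb connection to enwiki_p, and the titles would have to be in the same underscore format as pl_title):

# Sketch of a2: one query per article, intersecting the results with my
# selection in Python instead of passing the whole selection to MySQL.
# Note: pl_title comes back as bytes from the replicas, so selected_titles
# would have to hold bytes too (or the values be decoded here).

def counts_for_article(conn, page_id, page_title, selected_ids, selected_titles):
    with conn.cursor() as cur:
        # Outlinks of this article that point into my selection.
        cur.execute("SELECT pl_title FROM pagelinks "
                    "WHERE pl_from = %s AND pl_namespace = 0",
                    (page_id,))
        outlinks = sum(1 for (title,) in cur if title in selected_titles)

        # Inlinks to this article that come from my selection.
        cur.execute("SELECT pl_from FROM pagelinks "
                    "WHERE pl_namespace = 0 AND pl_title = %s "
                    "AND pl_from_namespace = 0",
                    (page_title,))
        inlinks = sum(1 for (source,) in cur if source in selected_ids)
    return inlinks, outlinks

The obvious downside is that this means two million pairs of queries.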
b) To query the entire table almost as it is (keeping only links whose source and target are in namespace 0), just sorted by pl_from or pl_title, and perform the calculation in memory. This means I have to detect where each article's rows begin and end and intersect them with my list of 2 million articles. The problem is that sorting by pl_from or pl_title may not be possible. This dies...
USE enwiki_p;
SELECT pl_from, pl_title FROM pagelinks WHERE pl_namespace=0 AND pl_from_namespace=0 ORDER BY pl_from;
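What I mean by "perform the calculation in memory" is roughly the following (a sketch; I use pymysql here, but MySQLdb's SSCursor behaves the same, streaming rows instead of buffering them; selected_ids and selected_titles are illustrative names for my 2 million articles and the connection details are placeholders). This only shows the outlink side; the inlink side would be the same idea with the query ordered by pl_title:

import pymysql
import pymysql.cursors

selected_ids = set()       # my 2 million page_ids (loaded elsewhere)
selected_titles = set()    # the same articles as titles (bytes, underscore format)

# Placeholder connection; SSCursor streams the rows instead of buffering them all.
conn = pymysql.connect(host='...', db='enwiki_p',
                       cursorclass=pymysql.cursors.SSCursor)

outlink_counts = {}
current_from, current_count = None, 0

with conn.cursor() as cur:
    cur.execute("SELECT pl_from, pl_title FROM pagelinks "
                "WHERE pl_namespace = 0 AND pl_from_namespace = 0 "
                "ORDER BY pl_from")
    for pl_from, pl_title in cur:
        if pl_from != current_from:
            # A new article begins: store the finished one if it is in my selection.
            if current_from in selected_ids:
                outlink_counts[current_from] = current_count
            current_from, current_count = pl_from, 0
        if pl_title in selected_titles:
            current_count += 1
    # Do not forget the last article of the stream.
    if current_from in selected_ids:
        outlink_counts[current_from] = current_count

So the in-memory part is cheap; it is the ORDER BY on the server that seems to be the killer.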
c) To copy the entire table almost as it is (again only namespace-0 links) and recreate it in a local SQLite3 database, which is actually really fast even though it reads from the hard disk. Once there, I could either do the inner join in SQLite3 or do something like b). The problem is that copying this table from the wikireplicas to the VPS takes ages. I had to create the connection with cursorclass=mdb_cursors.SSCursor; otherwise the process would get killed after a few minutes.
I tried copying the Catalan Wikipedia pagelinks table: it took 5 hours and came to about 10 GB. Considering it has over 500k articles, I can estimate what it would take to do this with the English Wikipedia.
This would be quite a good option if it were not so slow. As soon as the calculations were over, the script would delete the copy of pagelinks.
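To make c) concrete, this is the shape of it (a sketch; file, table and column names are illustrative, the replica connection is a placeholder, and the copy loop is the part that takes hours):

import sqlite3
import pymysql
import pymysql.cursors

# Local SQLite copy of the namespace-0 part of pagelinks, plus a table with my selection.
lite = sqlite3.connect('pagelinks_copy.db')
lite.execute("CREATE TABLE IF NOT EXISTS pagelinks (pl_from INTEGER, pl_title TEXT)")
lite.execute("CREATE TABLE IF NOT EXISTS selected (page_id INTEGER PRIMARY KEY, "
             "page_title TEXT)")

repl = pymysql.connect(host='...', db='enwiki_p',           # placeholders
                       cursorclass=pymysql.cursors.SSCursor)
with repl.cursor() as cur:
    cur.execute("SELECT pl_from, pl_title FROM pagelinks "
                "WHERE pl_namespace = 0 AND pl_from_namespace = 0")
    while True:
        rows = cur.fetchmany(100000)
        if not rows:
            break
        # pl_title arrives as bytes; store it as text so it can be joined on later.
        lite.executemany("INSERT INTO pagelinks VALUES (?, ?)",
                         ((f, t.decode('utf-8')) for f, t in rows))
        lite.commit()

# ... fill 'selected' with the 2 million articles (titles in pl_title format), then:
lite.execute("CREATE INDEX IF NOT EXISTS idx_title ON pagelinks(pl_title)")
outlinks = lite.execute(
    "SELECT pl.pl_from, COUNT(*) "
    "FROM pagelinks AS pl "
    "JOIN selected AS src ON src.page_id = pl.pl_from "
    "JOIN selected AS tgt ON tgt.page_title = pl.pl_title "
    "GROUP BY pl.pl_from").fetchall()
# Inlinks within the selection: the same join, grouped by pl.pl_title instead.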
d) To copy the entire pagelinks table to another database (associated with my user), where I would create a second table, insert the page_ids of the group of articles I previously selected, and do the inner join entirely in MySQL. This was my solution two years ago, when I had to iterate over several selections of articles (first one million, then two million, ...). This iteration over several selections may not be required today (I hope I will not need it, as one calculation may be enough).
As far as I have been told, this d) solution is no longer possible, as user databases are not in the same location as the wikireplicas.
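For completeness, d) was essentially this (a sketch from memory; names and the connection are illustrative, and pagelinks here is the copy living next to my own table in the user database):

import pymysql

# Placeholder connection to the user database.
conn = pymysql.connect(host='...', db='...')
with conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS selected ("
                "page_id INT UNSIGNED PRIMARY KEY, "
                "page_title VARBINARY(255), KEY (page_title))")
    # ... insert the 2 million selected page_ids/titles, copy pagelinks in, then:
    cur.execute("SELECT pl.pl_from, COUNT(*) AS outlinks "
                "FROM pagelinks AS pl "
                "JOIN selected AS src ON src.page_id = pl.pl_from "
                "JOIN selected AS tgt ON tgt.page_title = pl.pl_title "
                "GROUP BY pl.pl_from")
    results = cur.fetchall()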
What would be the solution to the problem? For me the best would be a) (I still need to try a2) or b), but you probably know a better option.
Thank you very much.