
HELP: Calculate the number of inlinks/outlinks from/to a large group of articles / Bottlenecks
Closed, ResolvedPublic

Description

Hello,

I am opening this task to expose my current struggle with pagelinks and see whether more experienced people might have an optimal solution to the problem. Some bottlenecks appear, especially when I try my 'worst-case scenario', which is the English Wikipedia. I am blocked and cannot progress in my research because of this.

PROBLEM: I want to do some calculations on the number of inlinks and outlinks for articles. This can be done quite easily with the pagelinks table from the MySQL wikireplicas by grouping the results and counting pl_from and pl_to. The difficult part comes when I want to do these two calculations for specific groups of articles.

I have selected over two million articles from the English Wikipedia, and for each article I want to calculate the number of inlinks coming from those two million and the number of outlinks going to them. Here I have several options:

a) Pass these two million articles as parameters.
a1. Query the entire table and receive the results for all articles at once. (This would be ideal.)
a2. Query article by article and receive each result.

b) Query the entire table almost as-is (keeping only links between namespace-0 pages), sorted by pl_from or pl_to, and perform the calculation in memory. This means I have to detect where each article's links begin and end and compute the intersection of page_ids with my list of 2 million articles. The problem is that sorting by pl_from or pl_to may not be possible; this query dies...

USE enwiki_p;
SELECT pl_from, pl_title FROM pagelinks WHERE pl_namespace=0 AND pl_from_namespace=0 ORDER BY pl_from;
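The in-memory pass that b) describes could be sketched roughly as follows (a sketch only: count_outlinks is a hypothetical helper name, and sorted_rows stands in for a server-side cursor streaming the query above; pl_from/pl_title are the pagelinks columns):

```python
def count_outlinks(sorted_rows, selected_titles):
    """Stream (pl_from, pl_title) rows sorted by pl_from and count, per
    source article, how many of its outlinks land in the selected set.

    sorted_rows: iterable of (pl_from, pl_title), already sorted by pl_from
    selected_titles: set with the titles of the 2M selected articles
    """
    counts = {}
    current = None   # pl_from of the article currently being scanned
    n = 0
    for pl_from, pl_title in sorted_rows:
        if pl_from != current:          # boundary: previous article finished
            if current is not None:
                counts[current] = n
            current, n = pl_from, 0
        if pl_title in selected_titles:  # intersection with the selection
            n += 1
    if current is not None:              # flush the last article
        counts[current] = n
    return counts
```

Because only one article's running count is held at a time, memory stays flat no matter how large the result set is, as long as the rows really arrive sorted.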

c) Copy the entire table almost as-is (keeping only links between namespace-0 pages) and recreate it in a local SQLite3 database, which is actually really fast since it reads from local disk. Once there, I could either do the inner join in SQLite3 or do something like b). The problem is that copying this table from the wikireplicas to the VPS takes ages. I had to create a connection with cursorclass=mdb_cursors.SSCursor; otherwise the process would get killed after a few minutes.
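A rough sketch of this copy step (copy_pagelinks is an illustrative name; in practice `rows` would come from the server-side SSCursor so the full result set is never buffered client-side):

```python
import sqlite3

def copy_pagelinks(rows, db_path=":memory:", batch=10000):
    """Bulk-load (pl_from, pl_title) rows into a local SQLite table.

    rows: iterable of (pl_from, pl_title), e.g. a streaming MySQL cursor
    batch: rows per executemany() call, to amortize insert overhead
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE pagelinks (pl_from INTEGER, pl_title TEXT)")
    buf = []
    for row in rows:
        buf.append(row)
        if len(buf) >= batch:
            conn.executemany("INSERT INTO pagelinks VALUES (?, ?)", buf)
            buf.clear()
    if buf:                              # flush the final partial batch
        conn.executemany("INSERT INTO pagelinks VALUES (?, ?)", buf)
    # index after loading: much faster than maintaining it during inserts
    conn.execute("CREATE INDEX idx_pl_from ON pagelinks(pl_from)")
    conn.commit()
    return conn
```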

I tried to copy the Catalan Wikipedia pagelinks table: it took 5 hours and copied 10 GB. Considering that it has over 500k articles, I can estimate what it would take to do this with the English Wikipedia.

This would be quite a good option if it were not so slow. As soon as the calculations were finished, the script would delete the copy of pagelinks.

d) Copy the entire pagelinks table to another database (associated with the user), create a second table there, insert the page_ids for the group of articles I previously selected, and do the inner join entirely in MySQL. This was my solution two years ago, when I iterated several times over the selected groups of articles (first one million, then two million, ...). This iteration over several selections may not be required today (I hope I will not need it, as one calculation may be enough).
As far as I am told, this solution d) is no longer possible, as user databases are not on the same servers as the wikireplicas.

What would be the solution to the problem? The best option would be a) (I still need to try a2) or b), but you may well know a better one.

Thank you very much.

Event Timeline

Restricted Application added a subscriber: Aklapper. Apr 23 2018, 5:52 PM

I have selected over two million articles from English Wikipedia

Is there any pattern to these two million articles? Can the list be generated by, e.g., a subquery or join?

I previously selected these two million articles using several criteria. I have them in an SQLite3 database on the VPS.

I could do an INNER JOIN with the English Wikipedia pagelinks table if I could insert these 2 million articles into a table near the replica. Either I move them there, I move pagelinks away, or I do the join in code logic. The problem is that if I do it in code, I cannot group pagelinks by article, as the query dies... (this is what I tried to explain in b)).

in a table nearby the replica

Not possible. User databases cannot be joined with replicas unfortunately.

I do the join by logics

Can you try this?

I cannot group pagelinks by article.

GROUP BY is ultra-expensive. Your query SELECT pl_from, pl_title FROM pagelinks WHERE pl_namespace=0 AND pl_from_namespace=0 GROUP BY pl_from; has this EXPLAIN:

+------+-------------+-----------+------+-------------------------------------+--------------+---------+-------+-----------+----------------------------------------------+
| id   | select_type | table     | type | possible_keys                       | key          | key_len | ref   | rows      | Extra                                        |
+------+-------------+-----------+------+-------------------------------------+--------------+---------+-------+-----------+----------------------------------------------+
|    1 | SIMPLE      | pagelinks | ref  | pl_namespace,pl_backlinks_namespace | pl_namespace | 4       | const | 512300149 | Using where; Using temporary; Using filesort |
+------+-------------+-----------+------+-------------------------------------+--------------+---------+-------+-----------+----------------------------------------------+

Can you write out your whole query logic and I'll try to optimize it?

  1. OK about it not being possible. I had already assumed that I could not do that join.
  2. If I cannot obtain the links grouped by article (because that is ultra-expensive), I would need to do one query per article, retrieving all its pagelinks, and then do the logic join. Is 5 million queries (one per WP article) a viable solution?
  3. Thanks for trying to optimize it, but I cannot think of any other query in this case... Look, these are the queries I used some time ago, when I could take the pagelinks table to a personal database and have everything together.

'SELECT pl_title,count(pl_from) FROM pagelinks WHERE pl_from IN (SELECT page_id FROM u3532__.'+item+'_page_titlesdict) AND pl_namespace=0 GROUP by pl_title' -> INLINKS

'SELECT pa.page_title, count(p.pl_title) FROM pagelinks p INNER JOIN page pa ON p.pl_from=pa.page_id WHERE p.pl_title IN (SELECT page_title FROM u3532__.'+item+'_page_titleswithredirects) AND p.pl_namespace=0 GROUP by pa.page_title' -> OUTLINKS.

And this is when I tried to pass the 2 M articles by parameter:

mysql_cur = queryreconnect(mysql_cur,'SELECT pl_title,count(pl_from) FROM u3532__.'+item+'_pagelinks WHERE pl_title LIKE %s AND pl_namespace=0',(page_title_wp,),0);

The last test (and the one I measured really well) was c) from my first description. I can try the others again, see if one is somewhat viable, and report back. I preferred to expose the problem first, to see whether there was another perspective I had not taken into account.

If I cannot obtain the links grouped by article (because that is ultra-expensive), I would need to do one query per article, retrieving all its pagelinks, and then do the logic join. Is 5 million queries (one per WP article) a viable solution?

It depends on how much time consumption you are willing to tolerate. Assuming each query takes 10 milliseconds, 5 million queries in series take about 14 hours. If they were executed as one gigantic query it would be killed long before it finishes, but as many separate queries it's mostly fine. Still, if one can make a single query and let the database server do its thing, without all the client-server round-trip latency, it can be much faster.
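For reference, the back-of-envelope arithmetic behind that figure (the 10 ms per query is an assumed average round trip; real latency varies):

```python
# 5 million serial queries at an assumed 10 ms round trip each
queries = 5_000_000
per_query_s = 0.010                    # assumption, not a measurement
hours = queries * per_query_s / 3600   # total serial wall-clock time
```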

I'm working on your queries.

Uh, what is the logic you use to generate u3532__.???_page_titlesdict & u3532__.???_page_titleswithredirects & u3532__???_pagelinks?

p.pl_title IN (SELECT page_title FROM u3532__.'+item+'_page_titleswithredirects)

page_titleswithredirects is the selection of 2 million articles that I have chosen according to multiple criteria. It is a process that uses geolocation and some Wikidata properties (I process the entire Wikidata dump), among others. I cannot simplify this; it is the process I am designing.

page_titlesdict is the same as the previous one, just from an earlier phase. That query dates from when I did the process in multiple iterations: first inlinks from one million articles, then, according to some criteria, extending to 1.5 million, then 2 million... so it made sense to have two selections.

a) Isn't there a way to obtain all the pagelinks sorted and grouped by article? This would be a solution.
b) Is there a limit on the number of parameters? I could try to pass the 2 million page_ids as tuples.

zhuyifei1999 added a comment (edited). Apr 24 2018, 9:57 PM

a) Isn't there a way to obtain all the pagelinks sorted and grouped by article? This would be a solution.

That will take a long time, because GROUP BY builds a temporary table (Using temporary) and sorts it (Using filesort), and the query will most likely be killed by the query killer.

b) Is there a limit of parameters? I could try to put the 2 M page_ids in tuples.

Possibly, but I'm not aware of one that we specifically configured. You may encounter weird issues if the query is very long, though; I have not seen any documentation on this.

that I have selected according to multiple criteria. It is a process that uses geolocation, some wikidata properties (I run the entire wikidata dump) among others.

I'm not asking you to simplify it; I'm asking if you can write out your logic/criteria explicitly, preferably as SQL. We could make the logic a subquery and join against it.

a) OK. Bad news.

b) I should try it, then see... The query is not long. It is:

'SELECT pl_title, count(pl_from) FROM pagelinks WHERE pl_title LIKE %s AND pl_namespace=0 AND pl_from_namespace=0', then the tuple for the %s.

c) Unfortunately I cannot. It is a process with many steps, it is not just SQL that I use to create this list of 2 M articles.

marcmiquel added a comment (edited). Apr 24 2018, 10:24 PM

The size/number of parameters depends on the max_allowed_packet value.
I checked it with SELECT @@max_allowed_packet; and it is 33554432 (32M).
It can go as high as 1G. The good news is that this buffer is only allocated as needed, so setting it to 1G is fairly harmless, and 1G would allow passing all those 2 million tuples at once.

Could we change this max_allowed_packet value?

Thanks.
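Whatever the configured packet limit, one way to stay safely under it is to split the parameter list into chunks and issue one parameterized query per chunk. A sketch (chunked_in_queries is a hypothetical helper; table and column names follow the queries discussed above):

```python
def chunked_in_queries(ids, chunk_size=10000):
    """Split a huge IN (...) parameter list into chunks, yielding one
    (sql, params) pair per chunk so no single statement approaches
    max_allowed_packet.

    ids: list of page ids / titles to filter on
    chunk_size: parameters per statement (tune to stay under the limit)
    """
    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i + chunk_size]
        placeholders = ",".join(["%s"] * len(chunk))
        sql = ("SELECT pl_title, COUNT(pl_from) FROM pagelinks "
               "WHERE pl_namespace = 0 AND pl_from_namespace = 0 "
               f"AND pl_title IN ({placeholders}) GROUP BY pl_title")
        yield sql, chunk
```

The per-chunk counts would then need to be summed client-side, since the same pl_title can of course only appear within its own chunk here but partial results still have to be merged into one dictionary.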

marcmiquel updated the task description. Apr 26 2018, 7:59 PM
marcmiquel renamed this task from Calculate the number of inlinks/outlinks from/to a large group of articles / Bottlenecks to HELP: Calculate the number of inlinks/outlinks from/to a large group of articles / Bottlenecks. Apr 27 2018, 6:49 PM
marcmiquel updated the task description.

The solution was to do the join in code logic, even though this meant the script had to run for a longer period of time.
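A minimal sketch of what such a code-logic join can look like, assuming the 2-million-article selection is held in memory as Python sets (inlink_counts is an illustrative name; pl_from/pl_title are the pagelinks columns, with namespace filtering assumed to have happened in SQL):

```python
from collections import Counter

def inlink_counts(rows, selected_ids, selected_titles):
    """Count inlinks restricted to the selection: for each selected
    target title, how many links arrive from selected source pages.

    rows: iterable of (pl_from, pl_title) streamed from pagelinks
    selected_ids: set of page_ids of the selected articles (sources)
    selected_titles: set of titles of the selected articles (targets)
    """
    counts = Counter()
    for pl_from, pl_title in rows:
        # keep only links whose both endpoints are in the selection
        if pl_from in selected_ids and pl_title in selected_titles:
            counts[pl_title] += 1
    return counts
```

Set membership tests are O(1), so the cost is one pass over the streamed rows; the trade-off mentioned above is simply that streaming half a billion rows takes a while.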

bd808 moved this task from Backlog to Wiki replicas on the Data-Services board. May 10 2018, 3:25 PM
Vvjjkkii renamed this task from HELP: Calculate the number of inlinks/outlinks from/to a large group of articles / Bottlenecks to 2eeaaaaaaa. Jul 1 2018, 1:14 AM
Vvjjkkii triaged this task as High priority.
Vvjjkkii updated the task description.
Vvjjkkii removed a subscriber: Aklapper.
AfroThundr3007730 renamed this task from 2eeaaaaaaa to HELP: Calculate the number of inlinks/outlinks from/to a large group of articles / Bottlenecks. Jul 1 2018, 6:01 AM
AfroThundr3007730 raised the priority of this task from High to Needs Triage.
AfroThundr3007730 updated the task description.
AfroThundr3007730 added a subscriber: Aklapper.
bd808 closed this task as Resolved. Mar 26 2019, 8:17 PM
bd808 assigned this task to marcmiquel.
bd808 added a subscriber: bd808.
